Honors Year Project - Scalable
Integration of XML Schemas |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 Experimental Study 5.1 Platform and Language All implementation is in Java2SE 1.4 using JFC/Swing for GUI components. The implementation is platform independent however it has only been tested on Windows and with minor file system adjustments if necessary, should work on all other platforms. Windows libraries of WordNet [WN] have been used; implementations for other platforms are also available. 5.2 System Components Broadly the XClustXS system can be classified into the following modules:
Appendix C gives a complete class diagram of the XClustXS system. A scalable system design has been considered so that new modules may be easily added. Besides a few components needed to be created for experimental verification and not directly related to the system. 5.3 Test Set The test set consists of 127 XML Schema Definition (XSD) files. They are grouped into domains as follows
These domains reflect a wide range of businesses commonly operating on the Internet using XML and serve as a good base for our tests. 5.4 XST Viewer This is a tool developed to create a visual tree representation of XML Schema based on a XST to enable efficient manual comparison among the XML Schemas. It can create a visual representation of a XML Schema as per the XST model at runtime or save it as a GIF file and has been used extensively in comparing XML Schemas manually for creating a human perception similarity benchmark.
5.5 Similarity Benchmark Before proceeding to tests on XClustXS we do a manual study of the XML Schemas from the human point of view of similarity. Aim: Create a similarity benchmark to gauge the effectiveness of methods employed by XClustXS by creating a similarity ranking of the XML Schemas under the various domains. Method: Using the XSTViewer the number of matching nodes and the maximum of nodes of the trees is counted and their ratio is used to rank them. Definition of a match: We rely on the “human perception” of similarity that is different from methods of computer computed similarity. The primary factors are:
However there are exceptional cases as discussed later. Consider the following examples
Normal Cases: Matching by name
Matching in context
Exceptional Cases: Matching by name
Ranking Algorithm Matching as illustrated by these examples creates a list of number of matching nodes and the maximum of total nodes in both trees. A ranking is then generated by using their ratio and compared against XClustXS. Using the ranking algorithm a list of out of rank pairs is created and this can be used to give an idea of the validity of the XClustXS algorithm. Step 1 – Ranking Based on the similarity and ratio values a ranking table is generated. If values are the same the same rank may be given to two or more pairs. Creating such a ranking enables direct comparison of Manual and XClustXS values. Step 2 – Clustering The Manual and XClustXS pairs are clustered based on the ranks Step 3 – Out of Rank Comparing within the same clusters (same rank) pairs which occur in both the manual and XClustXS clusters are removed, leaving a list out of rank pairs. As an example consider the following two graphs created for Schemas A, B, C and D. The XClustXS represents similarity values as generated automatically and the Manual represents ratio values computed manually.
We manually benchmarked four domains: Car Rentals, Property, Hospitals and Hotels. A direct comparison for Hospital is shown in the following charts and helps conclude an important aspect as seen all the other experiments. We follow it with an exhaustive comparison using each XClustXS component as compared to the manual benchmark by calculating the percentage of out of rank pairs.
Deductions/Observations: A seen from the chart above, ratio values below 0.2 (less than 20% matching nodes) generally result in zero or insignificant similarity values by XClustXS and manually they cannot be ranked accurately. Intuitively this shows the obvious that trying to match XML Schemas with fewer than 20% matching nodes cannot be reliable. Results Generally the results by XClustXS follow the pattern of ratios computed manually. Exceptional cases are described in the limitations (Section 7.2). In this table we show the affect of using individual components of XClustXS and their combinations. Results are presented for the domains manually benchmarked. The percentage is calculated as the number of out-of-rank pairs to the total pairs for that domain.
Conclusion: As we see from the results above all the components play a vital role in similarity computation. The broad spectrum of differences from 10% to 30% can be explained since the XML Schemas for Hospital domain are large and deep while those of Car present almost a flat structure. This absence of context information for Car schemas results in poor matching, on the other hand hospital schemas are considerably deep providing sufficient context to match. 5.6 Comparing XML Schema and DTD similarity We now focus on our important motivation of using XML Schemas by comparing results for DTDs vs. XML Schemas. The DTDs for each domain have been obtained by converting XML Schemas that partly restructures the schema and eliminates data types and other XML Schema recommendations.
As evident from the table above, our experiments show that additional information present in XML Schemas and does prove helpful presenting better matches. Since XClustXS algorithms are based on XClust the conclusions derived in Section 5.5 and the limitations in Section 7.2 apply to XClust also. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||