5.1 Platform and Language
All implementation is in Java2SE 1.4 using JFC/Swing for GUI components. The implementation is platform independent however it has only been tested on Windows and with minor file system adjustments if necessary, should work on all other platforms. Windows libraries of WordNet [WN] have been used; implementations for other platforms are also available.
5.2 System Components
Broadly the XClustXS system can be classified into the following modules:
- Parser – Apache Xerces2-J [XER2]
- XST Modeler/Viewer - as proposed in Section 3, a complete class diagram of the XST module is given in Appendix B.
- XST Simplifier - All referenced objects can be duplicated as local objects; during this process some objects may have to be renamed since local names need not be globally unique. This process will affect the Schema as follows:
- References are removed
- Global declarations are reduced to the root element only. As given in the table of global declaration in Section 4.2, the possible global objects that affect similarity computation can be transformed into local declarations; attribute, attributeGroup, complexType and simpleType can be directly resolved using their local declaration specification [W3CXS1].
- Those Elements that are orphaned can be removed from the Schema - only one element which is the root remains as a global declaration, the other global elements MUST be referenced from within the root otherwise they can be removed once orphaned since they cannot be part of a valid instance document.
- All named global objects are reduced to anonymous objects
- Conversion of graph to tree
Figure 5.2-1: Example of Simplification
- Similarity Computation – as described in Section 4.3
- Dynamic Weights and Thresholds – as described in Section 4.4
- Clustering – as described in Section 4.5
Appendix C gives a complete class diagram of the XClustXS system. A scalable system design has been considered so that new modules may be easily added.
Besides a few components needed to be created for experimental verification and not directly related to the system.
5.3 Test Set
The test set consists of 127 XML Schema Definition (XSD) files. They are grouped into domains as follows
Table 5.3-1 Test Set
These domains reflect a wide range of businesses commonly operating on the Internet using XML and serve as a good base for our tests.
5.4 XST Viewer
This is a tool developed to create a visual tree representation of XML Schema based on a XST to enable efficient manual comparison among the XML Schemas. It can create a visual representation of a XML Schema as per the XST model at runtime or save it as a GIF file and has been used extensively in comparing XML Schemas manually for creating a human perception similarity benchmark.
Figure 5.4-1 XST Viewer Example
5.5 Similarity Benchmark
Before proceeding to tests on XClustXS we do a manual study of the XML Schemas from the human point of view of similarity.
Aim: Create a similarity benchmark to gauge the effectiveness of methods employed by XClustXS by creating a similarity ranking of the XML Schemas under the various domains.
Method: Using the XSTViewer the number of matching nodes and the maximum of nodes of the trees is counted and their ratio is used to rank them.
Definition of a match: We rely on the “human perception” of similarity that is different from methods of computer computed similarity.
The primary factors are:
- Matching name
- Matching context
However there are exceptional cases as discussed later. Consider the following examples
Figure 5.5-1 Example Schema Fragments
Matching by name
- Example 1 and 2 - <car> = <car>
- Example 3 and 4 - <hotel> = <hotel>
Matching in context
- Example 1 and 2 - <location><from/></location> = <car><pickupLocation/></car>
- However in different context (Example 1, 2 and 5) -<location><from/></location> ¹<destination><from/></destination>
- and <destination><from/></destination> ¹ <car><pickupLocation/></car>
- Example 1 and 4 - <car><location/></car> ¹ <hotel><location/></hotel>
Matching by name
- Example 3 and 4 - <hotel><description/></hotel> ¹ <location><description/></location> though they are in the same domain.
Matching as illustrated by these examples creates a list of number of matching nodes and the maximum of total nodes in both trees. A ranking is then generated by using their ratio and compared against XClustXS. Using the ranking algorithm a list of out of rank pairs is created and this can be used to give an idea of the validity of the XClustXS algorithm.
Step 1 – Ranking
Based on the similarity and ratio values a ranking table is generated. If values are the same the same rank may be given to two or more pairs. Creating such a ranking enables direct comparison of Manual and XClustXS values.
Step 2 – Clustering
The Manual and XClustXS pairs are clustered based on the ranks
Step 3 – Out of Rank
Comparing within the same clusters (same rank) pairs which occur in both the manual and XClustXS clusters are removed, leaving a list out of rank pairs.
As an example consider the following two graphs created for Schemas A, B, C and D. The XClustXS represents similarity values as generated automatically and the Manual represents ratio values computed manually.
Figure 5.5-2 Example of Ranking
Step 1: Create Rankings
Step 2: Cluster into Ranks
Step 3: Out of Rank – The ones highlighted are in rank leaving AC, BD and AD out of rank.
We manually benchmarked four domains: Car Rentals, Property, Hospitals and Hotels. A direct comparison for Hospital is shown in the following charts and helps conclude an important aspect as seen all the other experiments.
We follow it with an exhaustive comparison using each XClustXS component as compared to the manual benchmark by calculating the percentage of out of rank pairs.
Figure 5.5-3 Direct Comparison of Manual and XClustXS similarity
Deductions/Observations: A seen from the chart above, ratio values below 0.2 (less than 20% matching nodes) generally result in zero or insignificant similarity values by XClustXS and manually they cannot be ranked accurately. Intuitively this shows the obvious that trying to match XML Schemas with fewer than 20% matching nodes cannot be reliable.
Generally the results by XClustXS follow the pattern of ratios computed manually. Exceptional cases are described in the limitations (Section 7.2). In this table we show the affect of using individual components of XClustXS and their combinations. Results are presented for the domains manually benchmarked. The percentage is calculated as the number of out-of-rank pairs to the total pairs for that domain.
Table 5.5-1 XClustXS Components Importance
Percent of pairs out of Rank using XClustXS components
Conclusion: As we see from the results above all the components play a vital role in similarity computation. The broad spectrum of differences from 10% to 30% can be explained since the XML Schemas for Hospital domain are large and deep while those of Car present almost a flat structure. This absence of context information for Car schemas results in poor matching, on the other hand hospital schemas are considerably deep providing sufficient context to match.
5.6 Comparing XML Schema and DTD similarity
We now focus on our important motivation of using XML Schemas by comparing results for DTDs vs. XML Schemas. The DTDs for each domain have been obtained by converting XML Schemas that partly restructures the schema and eliminates data types and other XML Schema recommendations.
Table 5.6-1 DTD vs. XML Schema
As evident from the table above, our experiments show that additional information present in XML Schemas and does prove helpful presenting better matches. Since XClustXS algorithms are based on XClust the conclusions derived in Section 5.5 and the limitations in Section 7.2 apply to XClust also.