Wednesday, January 1, 2003

Scalable Integration of XML Schemas - Chapter 2


Related Work
2.1 Related Work
There are a number of proposed algorithms for schema similarity computation and each tries to exploit schema structures and information differently. To the best of our knowledge this is the first work focusing on XML Schemas and thus presents a XML Schema recommendation specific algorithm to best exploit these features. The other algorithms are briefly described followed by a comparison.
2.2 XClust and XClustXS
XClust is an algorithm for DTDs and our solution is based on it for XML Schemas.
XClustXS is largely similar in concept and design and follows the same conventions for naming consistency. Many of the new features such as Dynamic weights and thresholds and improved Linguistics can be used in XClust. Using DTDs in the XClustXS system shows a decent improvement in matching ability. This is shown below as % out of rank using the old and new methods.
Using new Linguistics and Dynamic Weights/Thresholds for DTDs
2.3 Analytical Comparison of other algorithms
We consider the following four:
1. [XYLEME]: Xyleme is a XML data repository on the Internet and supports query evaluation. The goals of the Xyleme project are to creating efficient storage mechanisms, query processing with indexing at element level, data acquisition to build repository and to keep it up-to-date and semantic data integration to free users from having to deal with many specific DTDs expressing queries. 
Its architecture is shown here; the repository uses a manually created mediated schema called an “abstract DTD” onto which the schemas it crawls have to be mapped to. This manual creation seems to be one big bottleneck in the ambitious project of indexing the internet. There is a system called SAMAG (the semantic module as shown) to create abstract path and concrete path mappings that are essentially mappings from real DTDs to this abstract DTD. This system may be considered an attempt at similarity computation among DTDs. It uses syntactic/semantic and structural matching that compare to our Linguistic and Path Context. The system is only able to generate partial mappings and a human user is required to “guarantee the semantic consistency of such mappings.” They also mention that in their experiments even more than 50% of automatically generated mappings are irrelevant.
2. [SPL]: Using DTD element graphs it tries to determine syntactically similar graphs. The SPL system is based only on linguistic similarity so it may be considered a rather weak approach. After a simplification transformation on the DTDs creating directed acyclic graphs (DAG) it uses a bottom up approach using children nodes and linguistics only to establish a similarity. It effectively considers recursive nodes though lacking a context driven approach. Its experimental results as show satisfactory result using renaming of nodes (unaltered structure) so it would be unwise to predict how the algorithm would behave given complex structured schemas.
3. Cupid: [MRB01, RB01] is the closest algorithm to ours and presents very convincing solutions for DTDs. Their experiments using real world purchase order DTDs demonstrates quite accurate results. Cupid matches schemas using semantics and structures by discovering mappings based on names, data types (not explained), constraints and schema structure. It has automated linguistic based matching, it is both element and structure based, it considers leaves independently capturing much of the schema semantics, it tries to explore internal structure however does not consider a path context, it used ID/IDREFS exploiting keys and tries to make a context dependant computation. A schema tree is used to model the DTDs and a tree matching is carried out.
4. MOMIS: [BCV, BCVMV] is for both structured and semi-structured data. It is a larger system for query across heterogeneous systems containing structured and semi-structured data. Using schema inputs it creates name affinities using WordNet and its ARTEMIS component calculates the structural affinity. So it uses both linguistics and structural information though not considering the path context from the root. Its architecture shown here shows the ARTEMIS system that helps create the Global Schema. These “affinity” values are used as it tries to create a global schema based on human intervention. 
2.4 Summary
A Summary of the above discussion is presented in the table below.
Table 2.4-1 Comparison with other algorithms
 
XClustXS
XClust
Xyleme
SPL
Cupid
MOMIS
XML Schema (XS)/DTD
XS
DTD
DTD
DTD
DTD
Semi/Structured data
Simplification
Yes
Yes
No
Yes
No
No
Match Names
Yes
Yes
Yes
Yes
Yes
Yes
Handle Linguistics
Yes
Yes
Yes
Yes
Yes
Yes
Domain Specific
Yes
No
No
No
Yes
No
Consider Cardinality
Yes
Yes
Yes
Yes
Yes
No
Consider Type
Yes
No
No
No
Partly
No
Consider Namespace
Yes
No
No
No
No
No
Handle Recursion
Yes
Yes
No
Yes
No
No
Consider children
Yes
Yes
No
Yes
No
Yes
Consider path context
Yes
Yes
Yes
No
Partly
No
Consider Leaf nodes independently
Yes
Yes
No
No
Yes
Yes
Consider both linguistics and structure
Yes
Yes
Yes
No
Yes
Yes

No comments: