home personal workZone travel about
Honors Year Project - Scalable Integration of XML Schemas

2

Related Work

2.1 Related Work

There are a number of proposed algorithms for schema similarity computation and each tries to exploit schema structures and information differently. To the best of our knowledge this is the first work focusing on XML Schemas and thus presents a XML Schema recommendation specific algorithm to best exploit these features. The other algorithms are briefly described followed by a comparison.

2.2 XClust and XClustXS

XClust is an algorithm for DTDs and our solution is based on it for XML Schemas.

XClustXS is largely similar in concept and design and follows the same conventions for naming consistency. Many of the new features such as Dynamic weights and thresholds and improved Linguistics can be used in XClust. Using DTDs in the XClustXS system shows a decent improvement in matching ability. This is shown below as % out of rank using the old and new methods.

Using new Linguistics and Dynamic Weights/Thresholds for DTDs

2.3 Analytical Comparison of other algorithms

We consider the following four:

1. [XYLEME]: Xyleme is a XML data repository on the Internet and supports query evaluation. The goals of the Xyleme project are to creating efficient storage mechanisms, query processing with indexing at element level, data acquisition to build repository and to keep it up-to-date and semantic data integration to free users from having to deal with many specific DTDs expressing queries.

Its architecture is shown here; the repository uses a manually created mediated schema called an “abstract DTD” onto which the schemas it crawls have to be mapped to. This manual creation seems to be one big bottleneck in the ambitious project of indexing the internet. There is a system called SAMAG (the semantic module as shown) to create abstract path and concrete path mappings that are essentially mappings from real DTDs to this abstract DTD. This system may be considered an attempt at similarity computation among DTDs. It uses syntactic/semantic and structural matching that compare to our Linguistic and Path Context. The system is only able to generate partial mappings and a human user is required to “guarantee the semantic consistency of such mappings.” They also mention that in their experiments even more than 50% of automatically generated mappings are irrelevant.

2. [SPL]: Using DTD element graphs it tries to determine syntactically similar graphs. The SPL system is based only on linguistic similarity so it may be considered a rather weak approach. After a simplification transformation on the DTDs creating directed acyclic graphs (DAG) it uses a bottom up approach using children nodes and linguistics only to establish a similarity. It effectively considers recursive nodes though lacking a context driven approach. Its experimental results as show satisfactory result using renaming of nodes (unaltered structure) so it would be unwise to predict how the algorithm would behave given complex structured schemas.

3. Cupid: [MRB01, RB01] is the closest algorithm to ours and presents very convincing solutions for DTDs. Their experiments using real world purchase order DTDs demonstrates quite accurate results. Cupid matches schemas using semantics and structures by discovering mappings based on names, data types (not explained), constraints and schema structure. It has automated linguistic based matching, it is both element and structure based, it considers leaves independently capturing much of the schema semantics, it tries to explore internal structure however does not consider a path context, it used ID/IDREFS exploiting keys and tries to make a context dependant computation. A schema tree is used to model the DTDs and a tree matching is carried out.

4. MOMIS: [BCV, BCVMV] is for both structured and semi-structured data. It is a larger system for query across heterogeneous systems containing structured and semi-structured data. Using schema inputs it creates name affinities using WordNet and its ARTEMIS component calculates the structural affinity. So it uses both linguistics and structural information though not considering the path context from the root. Its architecture shown here shows the ARTEMIS system that helps create the Global Schema. These “affinity” values are used as it tries to create a global schema based on human intervention.

2.4 Summary

A Summary of the above discussion is presented in the table below.

Table 2.4-1 Comparison with other algorithms

 

XClustXS

XClust

Xyleme

SPL

Cupid

MOMIS

XML Schema (XS)/DTD

XS

DTD

DTD

DTD

DTD

Semi/Structured data

Simplification

Yes

Yes

No

Yes

No

No

Match Names

Yes

Yes

Yes

Yes

Yes

Yes

Handle Linguistics

Yes

Yes

Yes

Yes

Yes

Yes

Domain Specific

Yes

No

No

No

Yes

No

Consider Cardinality

Yes

Yes

Yes

Yes

Yes

No

Consider Type

Yes

No

No

No

Partly

No

Consider Namespace

Yes

No

No

No

No

No

Handle Recursion

Yes

Yes

No

Yes

No

No

Consider children

Yes

Yes

No

Yes

No

Yes

Consider path context

Yes

Yes

Yes

No

Partly

No

Consider Leaf nodes independently

Yes

Yes

No

No

Yes

Yes

Consider both linguistics and structure

Yes

Yes

Yes

No

Yes

Yes