Wednesday, January 1, 2003

Scalable Integration of XML Schemas - Chapter 7

7.1 Summary
During this project we closely examined the W3C XML Schema recommendations, modeled the additional information they make available, and implemented algorithms that compute the similarity of schemas using these new features together with dynamic weights and thresholds, and that finally cluster the schemas. We created a similarity benchmark based on human judgment, and the matching results produced by XClustXS against it have been very encouraging. We also showed the usefulness of our approach for real-world applications in Supply Chain Management and eBusiness.
7.2 Limitations
The observed limitations of the XClustXS algorithm are described here.
  • Trees of small height cannot provide sufficient context information for matching. Matching flat schema structures is therefore difficult, because it is reduced mostly to linguistic matching.
  • From the examples in Figure 5.5.-1 we see that XClustXS wrongly matches
<hotel><description/></hotel> = <location><description/></location>
This is linguistic matching at work, and given the use of PCC the effect may be nullified. However, this is subjective and also depends on the weights assigned.
  • XClustXS will not recognize a match for
<location><from/></location> = <car><pickupLocation/></car>
The domain dictionary cannot help in this context, since treating “from” as equivalent to “pickupLocation” cannot be justified.
These cases present some problems in schema matching; however, the results are very encouraging when considered in a larger context.
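The limitations above can be illustrated with a small sketch. It is not the XClustXS implementation: the function names, the exact-name comparison used as a stand-in for linguistic matching, and the context weight `w_ctx` are all assumptions made purely for illustration. The sketch scores two elements by name alone and then by a blend of name and ancestor-path similarity, which shows why identical names under unrelated parents score high linguistically, why a context term can dampen that score depending on the weight, and why differently named elements score zero when no dictionary entry links them.

```python
def name_sim(a, b):
    """Toy linguistic similarity (assumption): 1.0 for equal names, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def path_sim(p, q):
    """Average name similarity of ancestors, aligned from the parent upward."""
    if not p or not q:
        return 0.0
    pairs = zip(reversed(p), reversed(q))
    return sum(name_sim(a, b) for a, b in pairs) / max(len(p), len(q))


def element_sim(path_a, path_b, w_ctx=0.5):
    """Blend leaf-name similarity with ancestor-context similarity.

    Each path lists element names from the root down to the element itself.
    w_ctx is a hypothetical weight for the context term.
    """
    leaf = name_sim(path_a[-1], path_b[-1])
    if len(path_a) == 1 and len(path_b) == 1:
        # Flat schema: no ancestors, so only the linguistic score is left.
        return leaf
    ctx = path_sim(path_a[:-1], path_b[:-1])
    return (1 - w_ctx) * leaf + w_ctx * ctx


# Name-only matching: identical leaf names give a full (false) match.
print(element_sim(["hotel", "description"], ["location", "description"], w_ctx=0.0))  # 1.0
# With a context term, the differing parents pull the score down.
print(element_sim(["hotel", "description"], ["location", "description"]))  # 0.5
# Related elements with unrelated names are not matched at all.
print(element_sim(["location", "from"], ["car", "pickupLocation"]))  # 0.0
```

How far the context term reduces the false match depends entirely on the chosen weight, which mirrors the remark above that the outcome is subjective and depends on the weights assigned.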
7.3 Future Work
Future work can aim at refining the algorithm to overcome the observed limitations, and at finding ways to map between DTDs and XML Schemas, since DTDs will continue to be in use. We also need to develop algorithms that perform schema integration automatically, preferably with no user intervention. This would be very helpful in fully realizing the goals of applications in Supply Chain Management and in querying across multiple XML repositories.

