Wednesday, January 1, 2003

Honors Year Project - Scalable Integration of XML Schemas

Honors Year Project Report
Scalable Integration of XML Schemas
By
Rahul Agarwal
School of Computing
National University of Singapore
2001/2002
Project No: H37090
Advisor: Dr Lee Mong Li

Abstract
The eXtensible Markup Language (XML) is increasingly being used for the dissemination and exchange of information over the World Wide Web, making data retrieval and integration critical. The XML Schema recommendation is the emerging standard, gradually replacing the popular Document Type Definition (DTD). We examine new ways to improve automatic similarity computation among XML Schemas using the enhanced features and information they make available. This similarity computation is an important step towards large-scale integration of schemas. XML sources can then be grouped into clusters that are similar in structure and semantics. Reconciling similar schemas within a cluster is easier since it involves less restructuring. We develop an algorithm called XClustXS to efficiently compute similarity among XML Schemas, and perform experiments to show that the system satisfactorily matches human perception of similarity among XML Schemas. We then discuss applications of XClustXS in Supply Chain Management and eBusiness.
Subject Descriptors
H.2.1 Logical Design
H.3.1 Content Analysis and Indexing
H.3.3 Information Search and Retrieval
Keywords
XML Schema, Similarity measure, Similarity perception, Schema trees, Supply Chain Management, eBusiness
Implementation Hardware and Software
Pentium III, 256 MB RAM
Java 2 SDK, Apache Xerces-J2, WordNet 1.7.1, Microsoft Office Suite

Acknowledgement
I am grateful to my project advisor, Dr Lee Mong Li, for her help, patience and guidance throughout the project, and to Dr Wynne Hsu, who has always provided useful insights into the project.
Dr Yang Liang Huai also helped immensely by providing reading material from his paper and, as an initial impetus, some Java code.
I would like to thank the Apache Foundation, Sun Microsystems and WordNet (Princeton University) for providing free software used in this project.

Table of Contents
Title
Abstract
Acknowledgement
List of Figures
List of Tables
1. Introduction
            1.1 Why XML
            1.2 DTD vs. XML Schema
            1.3 Motivation
            1.4 Contribution
            1.5 Thesis Organisation
2. Related Work
            2.1 Related Work
            2.2 XClustXS and XClust
            2.3 Analytical Comparison of other Algorithms
            2.4 Summary
3. Modeling XML Schema
            3.1 DTDs and DTD Trees
            3.2 XML Schema
            3.3 Proposed XST model
4. Computing Similarity in XML Schema
            4.1 Similarity Perception
            4.2 Similarity Computation
            4.3 Similarity Measures
            4.4 Dynamic Weights and Thresholds
            4.5 Clustering
5. Experimental Study
            5.1 Platform and Language
            5.2 System Components
            5.3 Test Set
            5.4 XST viewer
            5.5 Similarity Benchmark
            5.6 Comparing XML Schema and DTD similarity
6. Schema Integration, Supply Chain Management and eBusiness
            6.1 Schema Integration
            6.2 Supply Chain
            6.3 eBusiness and XClustXS
7. Conclusion
            7.1 Summary
            7.2 Limitations
            7.3 Future Work
References
Appendix A Built-in Datatype Hierarchy
Appendix B XST UML Class Diagram
Appendix C XClustXS UML Class Diagram

List of Figures
1.1-1  Timeline of XML Development
3.1-1  Example DTD modeled as a tree
3.2-1  Example of XML Schema Definition (XSD)
3.2-2  XSD Representation in proposed XST model
3.3-1  Modeling as [ORASS] Schema Diagram
3.3-2  Modeling as XST
4.1-1  Symmetric Distribution
4.1-2  Skewed Distribution
4.3-1  Similarity Measures
4.3-2  Match Ambiguity
4.3-3  Type Organisation
4.5-1  Example Schema Graph
4.5-2  Dendrogram
5.2-1  Example of Simplification
5.4-1  XST Viewer Example
5.5-1  Example Schema Fragments
5.5-2  Example of Ranking
5.5-3  Direct comparison of Manual and XClustXS Similarity
6.1-1  Why Schema Integration
6.1-2  Integration Process
6.2-1  Supply Chain Example
6.2-2  Supply Chain Decisions
6.3-1  A2Ai Architecture
6.3-2  XML Native Applications
6.3-3  XML Non-Native Applications
6.3-4  B2Bi Example
6.3-5  Information Subscriber/Distributor Business Process Model
6.3-6  Data Translation
6.3-7  Example Schema Samples and Integration

List of Tables
2.4-1  Comparison with other algorithms
3.3-1  Summary of XST Notation
3.3-2  XST Links/Edges
4.2-1  Assumptions
4.3-1  Cardinality Lookup
4.3-2  Datatype Hierarchy
4.3-3  Primitive TypeSim values
4.5-1  Agglomerative Clustering Example
5.3-1  Test Set
5.5-1  XClustXS Components Importance
5.6-1  DTD vs. XML Schema
Available at the School of Computing, National University of Singapore, Digital Library.
Reproduction is Prohibited without permission.
All rights reserved. No part of this Thesis or Technical Report may be reproduced, stored in a retrieval system or transmitted in any form or by any means, without the prior written permission of School of Computing, National University of Singapore, except in the case of brief quotations embodied in critical articles or reviews.
