Honours Year Project - Scalable Integration of XML Schemas

Honours Year Project Report

Scalable Integration of XML Schemas

By

Rahul Agarwal

School of Computing

National University of Singapore

2001/2002

Project No: H37090

Advisor: Dr Lee Mong Li


Abstract

The eXtensible Markup Language (XML) is increasingly being used for the dissemination and exchange of information over the World Wide Web, making data retrieval and integration critical. The new XML Schema recommendation is the emerging standard, slowly replacing the popular Document Type Definitions (DTDs). We look at new ways to improve automatic similarity computation among XML Schemas using the enhanced features and information they make available. This similarity computation is an important step towards large-scale integration of schemas. XML sources can then be grouped into clusters that are similar in structure and semantics. Reconciling similar schemas within a cluster is easier since it involves less restructuring. We develop an algorithm called XClustXS to efficiently compute similarity among XML Schemas and perform experiments to show that the system satisfactorily matches human perception of similarity among XML Schemas. We then discuss applications of XClustXS in Supply Chain Management and eBusiness.

Subject Descriptors

H.2.1 Logical Design

H.3.1 Content Analysis and Indexing

H.3.3 Information Search and Retrieval

Keywords

XML Schema, Similarity measure, Similarity perception, Schema trees, Supply Chain Management, eBusiness

Implementation Hardware and Software

Pentium III, 256 MB RAM

Java 2 SDK, Apache Xerces-J2, WordNet 1.7.1, Microsoft Office Suite


Acknowledgement

I am grateful to my project advisor, Dr Lee Mong Li, for her help, patience and guidance throughout the project, and to Dr Wynne Hsu, who consistently provided useful insights into the project.

Dr Yang Liang Huai also helped immensely by providing reading material from his paper and by supplying some Java code that gave the project its initial impetus.

I would like to thank the Apache Foundation, Sun Microsystems and WordNet (Princeton University) for providing free software used in this project.


Table of Contents

Title

Abstract

Acknowledgement

List of Figures

List of Tables

1. Introduction

            1.1 Why XML

            1.2 DTD vs. XML Schema

            1.3 Motivation

            1.4 Contribution

            1.5 Thesis Organisation

2. Related Work

            2.1 Related Work

            2.2 XClustXS and XClust

            2.3 Analytical Comparison of other Algorithms

            2.4 Summary

3. Modeling XML Schema

            3.1 DTDs and DTD Trees

            3.2 XML Schema

            3.3 Proposed XST model

4. Computing Similarity in XML Schema

            4.1 Similarity Perception

            4.2 Similarity Computation

            4.3 Similarity Measures

            4.4 Dynamic Weights and Thresholds

            4.5 Clustering

5. Experimental Study

            5.1 Platform and Language

            5.2 System Components

            5.3 Test Set

            5.4 XST viewer

            5.5 Similarity Benchmark

            5.6 Comparing XML Schema and DTD similarity

6. Schema Integration, Supply Chain Management and eBusiness

            6.1 Schema Integration

            6.2 Supply Chain

            6.3 eBusiness and XClustXS

7. Conclusion

            7.1 Summary

            7.2 Limitations

            7.3 Future Work

References

Appendix A Built-in Datatype Hierarchy

Appendix B XST UML Class Diagram

Appendix C XClustXS UML Class Diagram


List of Figures

1.1-1    Timeline of XML Development

3.1-1    Example DTD modeled as a tree

3.2-1    Example of XML Schema Definition (XSD)

3.2-2    XSD Representation in proposed XST model

3.3-1    Modeling as [ORASS] Schema Diagram

3.3-2    Modeling as XST

4.1-1    Symmetric Distribution

4.1-2    Skewed Distribution

4.3-1    Similarity Measures

4.3-2    Match Ambiguity

4.3-3    Type Organisation

4.5-1    Example Schema Graph

4.5-2    Dendrogram

5.2-1    Example of Simplification

5.4-1    XST Viewer Example

5.5-1    Example Schema Fragments

5.5-2    Example of Ranking

5.5-3    Direct comparison of Manual and XClustXS Similarity

6.1-1    Why Schema Integration

6.1-2    Integration Process

6.2-1    Supply Chain Example

6.2-2    Supply Chain Decisions

6.3-1    A2Ai Architecture

6.3-2    XML Native Applications

6.3-3    XML Non-Native Applications

6.3-4    B2Bi Example

6.3-5    Information Subscriber/Distributor Business Process Model

6.3-6    Data Translation

6.3-7    Example Schema Samples and Integration


List of Tables

2.4-1    Comparison with other algorithms

3.3-1    Summary of XST Notation

3.3-2    XST Links/Edges

4.2-1    Assumptions

4.3-1    Cardinality Lookup

4.3-2    Datatype Hierarchy

4.3-3    Primitive TypeSim values

4.5-1    Agglomerative Clustering Example

5.3-1    Test Set

5.5-1    XClustXS Components Importance

5.6-1    DTD vs. XML Schema

Available at the Digital Library, School of Computing, National University of Singapore.


Reproduction is Prohibited without permission.

All rights reserved. No part of this Thesis or Technical Report may be reproduced, stored in a retrieval system or transmitted in any form or by any means, without the prior written permission of School of Computing, National University of Singapore, except in the case of brief quotations embodied in critical articles or reviews.