Honors Year Project - Scalable Integration of XML Schemas
| Honours Year Project Report
Scalable Integration of XML Schemas
By Rahul Agarwal
School of Computing
National University of Singapore
2001/2002
Project No: H37090
Advisor: Dr Lee Mong Li

Abstract
The eXtensible Markup Language (XML) is increasingly being used for the dissemination and exchange of information over the World Wide Web, making data retrieval and integration critical. The new XML Schema recommendation is the emerging standard, slowly replacing the popular Document Type Definitions (DTDs). We look at new ways to improve automatic similarity computation among XML Schemas using the enhanced features and information available. This similarity computation is an important step towards large-scale integration of schemas. XML sources can then be grouped into clusters that are similar in structure and semantics. Reconciling similar schemas within a cluster is easier since it involves less restructuring. We develop an algorithm called XClustXS to efficiently compute similarity among XML Schemas, and perform experiments to show that the system satisfactorily matches human levels of similarity perception involving XML Schemas. We then discuss applications of XClustXS in Supply Chain Management and eBusiness.

Subject Descriptors
H.2.1 Logical Design
H.3.1 Content Analysis and Indexing
H.3.3 Information Search and Retrieval

Keywords
XML Schema, Similarity measure, Similarity perception, Schema trees, Supply Chain Management, eBusiness

Implementation Hardware and Software
Pentium III, 256 MB RAM
Java 2 SDK, Apache Xerces-J2, WordNet 1.7.1, Microsoft Office Suite

Acknowledgement
I am grateful to my project advisor, Dr Lee Mong Li, for her help, patience and guidance throughout the project, and to Dr Wynne Hsu, who has consistently provided useful insights into the project. Dr Yang Liang Huai also helped immensely by providing reading material from his paper and some Java code that gave the project its initial impetus.
I would like to thank the Apache Foundation, Sun Microsystems and WordNet (Princeton University) for providing free software used in this project.

Table of Contents
Title
Abstract
Acknowledgement
List of Figures
List of Tables
1.1 Why XML
1.2 DTD vs. XML Schema
1.3 Motivation
1.4 Contribution
1.5 Thesis Organisation
2.1 Related Work
2.2 XClustXS and XClust
2.3 Analytical Comparison of other Algorithms
2.4 Summary
3.1 DTDs and DTD Trees
3.2 XML Schema
3.3 Proposed XST model
4. Computing Similarity in XML Schema
4.1 Similarity Perception
4.2 Similarity Computation
4.3 Similarity Measures
4.4 Dynamic Weights and Thresholds
4.5 Clustering
5.1 Platform and Language
5.2 System Components
5.3 Test Set
5.4 XST viewer
5.5 Similarity Benchmark
5.6 Comparing XML Schema and DTD similarity
6. Schema Integration, Supply Chain Management and eBusiness
6.1 Schema Integration
6.2 Supply Chain
6.3 eBusiness and XClustXS
7.1 Summary
7.2 Limitations
7.3 Future Work
Appendix A Built-in Datatype Hierarchy

List of Figures
List of Tables
Available at the School of Computing, National University of Singapore, Digital Library (here). Also available for download here.

Reproduction is prohibited without permission. All rights reserved. No part of this Thesis or Technical Report may be reproduced, stored in a retrieval system or transmitted in any form or by any means without the prior written permission of the School of Computing, National University of Singapore, except in the case of brief quotations embodied in critical articles or reviews. |