Wednesday, January 1, 2003

Scalable Integration of XML Schemas - Chapter 1

1.1  Why XML
HTML has been around for years and has created a new economy and culture, so why XML? XML is not designed to replace HTML; rather, it aims to provide a solid and flexible foundation for any kind of document and to power the ‘e’ buzzwords. A few key reasons for the growing popularity of XML:
  • More flexible than HTML
  • Fewer optional features than SGML
  • Based on Unicode, which supports text in virtually any writing system
  • Links more flexible than HTML
  • Standardized stylesheets – same data may be presented in any format
  • Can be easily generated from relational databases, and almost all new database systems provide native support
  • Provides a standardized data interchange method for applications and across businesses where heterogeneous relational databases are prevalent
  • Creates self-describing data
  • Open standard not restricted by proprietary ownership or licenses
Figure 1.1-1 Timeline of XML development
The timeline shows the tremendous growth in new technologies focusing on B2B applications since the introduction of XML; we return to this later when applying our work to Supply Chain Management systems. An XML document is usually associated with a file that defines its structure and semantics, typically a Document Type Definition (DTD) or the more recent XML Schema.
1.2  DTD vs. XML Schema
Document Type Definitions (DTDs) have been used since SGML to define document structure and enforce semantics, so their use in XML was a logical step. DTDs help standardize information exchange between applications and provide a popular way of validating data. However, their limited abilities gave rise to XML Schema [W3CXS0], which is new and largely seen as an alternative to DTDs. There are a few other recommendations as well, but XML Schema is the emerging standard: it is powerful and is itself written as an XML document, which makes it self-documenting and simpler to write. On the other hand, XML Schemas are much more complex than DTDs, and representing their new semantic and structural relations is a new challenge. XML Schema was designed to take on complex applications and issues that DTDs cannot address. As per the W3C XML Schema Requirements Note [], adapted by [GUL02], an XML Schema should be:
  1. more expressive than XML DTDs;
  2. expressed in XML;
  3. self-describing;
  4. usable by a wide variety of applications that employ XML;
  5. straightforwardly usable on the Internet;
  6. optimized for interoperability;
  7. simple enough to implement with modest design and runtime resources;
  8. coordinated with relevant W3C specs (XML Information Set, Links, Namespaces, Pointers, Style and Syntax, as well as DOM, HTML, and RDF Schema)
This results in the design of XML Schemas that are powerful and simple and provide the following:
  1. mechanisms for constraining document structure (namespaces, elements, attributes) and content (datatypes, entities, notations);
  2. mechanisms to enable inheritance for element, attribute, and datatype definitions;
  3. mechanism for URI reference to standard semantic understanding of a construct;
  4. mechanism for embedded documentation;
  5. mechanism for application-specific constraints and descriptions;
  6. mechanisms for addressing the evolution of schemata;
  7. mechanisms to enable integration of structural schemas with primitive data types.
With respect to datatypes, the schema language must:
  1. provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.;
  2. define a type system that is adequate for import/export from database systems (e.g., relational, object, OLAP);
  3. distinguish requirements relating to lexical data representation vs. those governing an underlying information set;
  4. allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of its properties (e.g., range, precision, length, mask)
However, both DTDs and XML Schemas share the functional goal of providing a framework for the structure of data and its validation. Our similarity computations make use of the features that are specific to XML Schemas.
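To make two of the listed properties concrete (a schema is "expressed in XML", and it allows user-defined datatypes that constrain a range), here is a minimal Python sketch. The schema, the element names and the `QuantityType` datatype are invented for illustration and do not come from the thesis data. Because the schema is ordinary XML, a standard XML parser can read its declarations; a DTD can only declare the same element as `<!ELEMENT quantity (#PCDATA)>` and has no way of stating the numeric range.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace

# Hypothetical schema, invented purely for illustration.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="QuantityType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="1000"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string"/>
        <xs:element name="quantity" type="QuantityType"/>
        <xs:element name="shipDate" type="xs:date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_elements(schema_text):
    """Return (name, type) for every xs:element, in document order."""
    root = ET.fromstring(schema_text)
    return [(e.get("name"), e.get("type")) for e in root.iter(f"{{{XS}}}element")]


def derived_types(schema_text):
    """Map each named simple type to (base type, {facet name: value})."""
    root = ET.fromstring(schema_text)
    result = {}
    for simple_type in root.iter(f"{{{XS}}}simpleType"):
        restriction = simple_type.find(f"{{{XS}}}restriction")
        if restriction is None:
            continue  # e.g. list or union types, not handled in this sketch
        # Strip the "{namespace}" prefix from each facet tag.
        facets = {child.tag.split("}")[1]: child.get("value") for child in restriction}
        result[simple_type.get("name")] = (restriction.get("base"), facets)
    return result


if __name__ == "__main__":
    print(declared_elements(SCHEMA))
    print(derived_types(SCHEMA))
```

The sketch only reads declarations; it does not validate instance documents against the schema.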
1.3  Motivation
DTDs and XML Schemas come from diverse sources and are designed by different people, which gives rise to personal choices in the use of names and structures. For instance, two competing companies offering similar products and services may well use entirely different structures, and in the event of a merger much work is needed to merge their data sources. A more important case is Supply Chain Management and eBusiness, where large corporations must communicate efficiently across diverse and often proprietary systems to conduct business. This brings XML into the picture, and the capability to map one business’s schemas to another’s becomes important.
Such semantic and structural heterogeneity will always be present: applying standards to millions of ever-changing sources all over the world is next to impossible, and so is coming up with standards acceptable to all.
1.4  Contribution
In this thesis, we developed an algorithm to compute the similarity of XML Schemas. Our XClustXS algorithm is based on [XCLUST], but takes into consideration the following features that are specific to XML Schema:
  • Computation of Linguistic Similarity using a Domain Dictionary
  • Computation of Type Similarity and Namespace Similarity
  • Use of Dynamic Weights and Thresholds
We have also developed XML Schema Trees (XST) and a Similarity Benchmark to gauge the effectiveness of our algorithm against human perception, and we evaluated the application of our work to Supply Chain Management and eBusiness.
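As a rough illustration of how linguistic, type and namespace similarity might be combined under weights and a threshold, consider the Python sketch below. It is not the XClustXS algorithm, which later chapters describe: the dictionary entries, the scores 0.8 and 0.5, the weights and the threshold are all assumptions chosen only to make the idea concrete, and the weights here are simply passed in by the caller rather than adjusted dynamically.

```python
# Hypothetical domain dictionary: each set groups names treated as synonyms.
DOMAIN_DICTIONARY = [
    {"quantity", "qty", "amount"},
    {"customer", "client", "buyer"},
]

# A few built-in XML Schema numeric types, treated here as one family.
NUMERIC_TYPES = {"xs:integer", "xs:int", "xs:decimal", "xs:float", "xs:double"}


def linguistic_similarity(name_a, name_b, dictionary=DOMAIN_DICTIONARY):
    """1.0 for equal names, an assumed 0.8 for dictionary synonyms, else 0.0."""
    a, b = name_a.lower(), name_b.lower()
    if a == b:
        return 1.0
    if any(a in group and b in group for group in dictionary):
        return 0.8  # assumed score, not taken from the thesis
    return 0.0


def type_similarity(type_a, type_b):
    """1.0 for identical types, an assumed 0.5 within the numeric family."""
    if type_a == type_b:
        return 1.0
    if type_a in NUMERIC_TYPES and type_b in NUMERIC_TYPES:
        return 0.5  # assumed score, not taken from the thesis
    return 0.0


def namespace_similarity(ns_a, ns_b):
    """1.0 when both elements belong to the same namespace, else 0.0."""
    return 1.0 if ns_a == ns_b else 0.0


def element_similarity(elem_a, elem_b, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of the three component similarities (weights are assumptions)."""
    w_name, w_type, w_ns = weights
    return (
        w_name * linguistic_similarity(elem_a["name"], elem_b["name"])
        + w_type * type_similarity(elem_a["type"], elem_b["type"])
        + w_ns * namespace_similarity(elem_a["ns"], elem_b["ns"])
    )


def is_match(elem_a, elem_b, threshold=0.6, weights=(0.6, 0.3, 0.1)):
    """Two elements are considered matching when their score reaches the threshold."""
    return element_similarity(elem_a, elem_b, weights) >= threshold


if __name__ == "__main__":
    a = {"name": "Qty", "type": "xs:int", "ns": "urn:example:a"}
    b = {"name": "quantity", "type": "xs:integer", "ns": "urn:example:b"}
    print(element_similarity(a, b), is_match(a, b))
```

Keeping the weights and the threshold as parameters means the same comparison can be made stricter or looser without touching the component measures.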
1.5  Thesis Organisation
The rest of this thesis is organized as follows:
Chapter 2, Related Work, briefly describes other algorithms and how they compare with XClustXS.
Chapter 3, Modeling XML Schema, gives an overview of the proposed XML Schema modeling.
Chapter 4, Computing Similarity in XML Schema, describes how we judge similarity and explains the algorithms developed with specific details and assumptions made.
Chapter 5, Experimental Study, describes the implementation using Java and various other tools freely available on the Internet, the experimental data, the tests undertaken, the results that led to a Similarity Benchmark, and a comparison of XClustXS with previous similar work.
Chapter 6, Schema Integration, Supply Chain Management and eBusiness, gives an overview of the properties of Schema Integration, briefly explains the concepts of supply chain systems and eBusiness, and investigates how XClustXS fits into B2Bi.
Chapter 7, Conclusion, contains a summary and concludes the report.
In addition, the appendixes contain some factual data, diagrams and other information.
