Wednesday, January 1, 2003

Scalable Integration of XML Schemas - Appendix C


Appendix C
XClustXS UML Class Diagram

Scalable Integration of XML Schemas - Appendix B



Appendix B
XST UML Class Diagram
XSTViewer UML Class Diagram

Scalable Integration of XML Schemas - Appendix A


Appendix A
Built-in Datatype Hierarchy

Scalable Integration of XML Schemas - Chapter 7


Conclusion
7.1 Summary
In this project we examined the W3C XML Schema recommendations closely, modeled the additional information they make available, and implemented algorithms that compute schema similarity using these new features together with dynamic weights and thresholds, and that finally cluster the schemas. We also created a similarity benchmark based on human perception; judged against it, the matching results of XClustXS have been very encouraging. We showed the usefulness of our approach for real-world applications in Supply Chain Management and eBusiness.
7.2 Limitations
The observed limitations of the XClustXS algorithm are described here.
  • Shallow trees do not provide sufficient context information for matching. Matching flat schema structures is therefore difficult, as it is reduced mostly to linguistic matching.
  • From the examples in Figure 5.5-1 we see that XClustXS wrongly matches <hotel><description/></hotel> = <location><description/></location>
This is linguistic matching at work; given the use of PCC, the effect may be nullified. However, this is subjective and also depends on the weights assigned.
  • XClustXS will not recognize a match for
<location><from/></location> = <car><pickupLocation/></car>
The domain dictionary cannot help in this context, since treating “from” as equivalent to “pickupLocation” cannot be justified.
These cases present some problems in schema matching; however, the results are very encouraging when considered in a larger context.
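The two problem cases can be illustrated with a minimal sketch. It is not the XClustXS similarity computation: the functions `name_sim` and `element_sim`, the synonym table and the 0.5 context weight are all hypothetical and chosen only to show how a purely linguistic match behaves with and without parent context.

```python
# Illustrative only: NOT the XClustXS formulas. The function names, the
# synonym table and the context weight are hypothetical.

SYNONYMS = {frozenset({"hotel", "accommodation"})}  # stand-in domain dictionary


def name_sim(a: str, b: str) -> float:
    """Purely linguistic similarity: 1.0 for equal or synonymous names, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return 0.0


def element_sim(path_a, path_b, context_weight=0.5):
    """Blend leaf-name similarity with the similarity of the parent names.

    path_a and path_b are lists of element names from root to leaf.
    """
    leaf = name_sim(path_a[-1], path_b[-1])
    parents_a, parents_b = path_a[:-1], path_b[:-1]
    if not parents_a or not parents_b:
        # Flat schema: no context is available, so only the linguistic score remains.
        return leaf
    context = name_sim(parents_a[-1], parents_b[-1])
    return (1 - context_weight) * leaf + context_weight * context


# Case 1: identical leaf names under unrelated parents.
hotel_desc = ["hotel", "description"]
location_desc = ["location", "description"]
leaf_only = name_sim(hotel_desc[-1], location_desc[-1])
with_context = element_sim(hotel_desc, location_desc)

# Case 2: related meaning, but neither the names nor the parents agree.
missed = element_sim(["location", "from"], ["car", "pickupLocation"])

print(leaf_only, with_context, missed)
```

In this toy setting the score for the two description elements drops once the parent names are taken into account, but how far it drops depends entirely on the chosen weight, which is the subjectivity noted above.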
7.3 Future Work
Future work can aim at refining the algorithm and overcoming the observed limitations, and at finding ways to map between DTDs and XML Schemas, since DTDs will continue to be in use. We also need algorithms that perform schema integration automatically, preferably without user intervention. This would be very helpful in fully realizing the goals of applications in Supply Chain Management and in querying across multiple XML repositories.

References
[BCV] Bergamaschi S, Castano S, Vincini M. Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record 28(1): 54-59 (1999)
[BCVMV] Bergamaschi S, Castano S, De Capitani di Vimercati, Montanari S, Vincini M. Exploiting Schema Knowledge for Integration of Heterogeneous Sources. Sistemi Evoluti per Basi di Dati (SEBD): 103-120 (1998)
[CRO79] John S. Croucher, Eddie Oliver. Statistics: An Introduction, McGraw-Hill Australia 1979.
[EV00] Eric van der Vlist. W3C Schema Structure Reference, http://www.xml.com/pub/a/2000/11/29/schemas/structuresref.html
[EVE74] Everitt, Brian. Cluster Analysis. Heinemann Educational Books Ltd, London, 1974.
[GH] Ganeshan, Ram and Harrison, Terry P. An Introduction to Supply Chain Management. Pennsylvania State University.
[GUL02] Gulbransen, David. Using XML Schema, Que 2002.
[HSTAT] http://davidmlane.com/hyperstat/
[MRB01] Jayant Madhavan, PA Bernstein, Erhard Rahm. Generic Schema Matching with Cupid. VLDB 2001, Italy: 49-58
[NJ] Andrew Nierman and HV Jagadish. Evaluating Structural Similarity in XML Documents. WebDB Workshop 2002
[ORASS] Gillian Dobbie, Xiaoying Wu, Tok Wang Ling, Mong Li Lee. A Semantically Richer Data Model for Semi-structured Data, Technical Report TR21/00 National University of Singapore
[RB01] Erhard Rahm, PA Bernstein. On Matching Schemas Automatically. MSR-TR-2001-17. VLDB Journal 10(4): 334-350 (2001)
[RNET] http://www.rosettanet.org
[RT97] Rune Teigen. Information Flow in Supply Chain Management System. University of Toronto 1997.
[SPL] Hong Su, Padmanabhan S, Ming-Ling Lo. Identification of Syntactically Similar DTD Elements for Schema Matching. WAIM 2001: 145-159
[STG02] David Durand, Paul Caton. Semantic Heterogeneity Among Document Encoding Schemes. Scholarly Technology Group, Brown University
[W3CNS] Namespaces in XML, http://www.w3.org/TR/REC-xml-names/
[W3CXS0] XML Schemas, http://www.w3.org/TR/xmlschema-0/
[W3CXS1] XML Schemas Structures, http://www.w3.org/TR/xmlschema-1/
[W3CXS2] XML Schemas Datatypes, http://www.w3.org/TR/xmlschema-2/
[WN] WordNet, Cognitive Science Laboratory, Princeton University. http://www.cogsci.princeton.edu/~wn/
[XCLUST] Mong Li Lee, LiangHuai Yang, Wynne Hsu, Yang Xia. XClust: Clustering XML Schemas for Effective Integration, CIKM 2002.
[XER2] Apache Xerces2 Parser, http://xml.apache.org/xerces2-j/index.html
[XYLEME] Reynaud C, Sirot JP, Vodislav D. Semantic Integration of XML Heterogeneous Data Sources. International Database Engineering and Applications Symposium (IDEAS), IEEE 2001

Scalable Integration of XML Schemas - Chapter 6


Schema Integration, Supply Chain Management and eBusiness
6.1 Schema Integration
As discussed in our motivation, information integration is essential for companies that deal with other companies, and processing such information individually for every partner is not practical. For example, hundreds of airlines are in operation, and ticketing agents need an integrated view of their ever-changing information to process real-time transactions. This may be only part of a bigger picture, which could for instance also involve credit verification and another vendor providing a delivery service.
Figure 6.1-1 Why Schema Integration
Another example is meteorological information and satellite imagery services, and the use of this data by aviation and media organisations. Such large-scale transactional processes are rapidly moving towards global integration using XML. Using XML for communication has many implications: because XML provides open-standard, extensible definitions, a diverse and changing group of companies can easily plug into existing systems. XML Schemas provide a way to enforce data definitions that conform to standards, and by computing XML Schema similarity we can cluster similar Schemas, which makes integration within clusters easier. The process is shown schematically in the following diagram.
Figure 6.1-2 Integration Process
Ideally we may obtain one integrated schema when working on a specific domain. This integrated schema can then be used in place of the individual source schemas, helping an organisation provide better services to its partners.
Integration process for XML Schemas
A cluster contains semantically and structurally similar Schemas. This is very useful, since the conflicts present will be minimal and easy to resolve. An integration process involves the following steps:
  1. Tabulation of matching pairs of objects using the computed similarity values
  2. Resolving name conflicts: Different sources may use different names to express the same object in the real world
  3. Resolving cardinality conflicts: Different sources may have different cardinalities for the same object
  4. Resolving source importance: Some sources are more important than others. Users want the final global schema to be more similar to the more important sources. User queries are likely to relate mostly to those sources, so the query rewriting cost may be reduced this way.
  5. Resolving element/attribute conflicts: Information may be expressed as an attribute or as an element by different authors, and its final representation in the integrated schema needs thought
  6. Resolving datatype conflicts: Resolving simple primitive types will be easier, while complex types involve structural changes.
  7. Resolving structural conflicts: This is probably the most important and difficult issue. Different sources may use totally different structures to express the same relationship.
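A few of the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the XClustXS implementation: the function names, the similarity values, the 0.8 threshold and the source importance scores are all made up for the example, and widening the cardinality range is only one possible resolution policy.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]
# (minOccurs, maxOccurs); None stands for maxOccurs="unbounded"
Cardinality = Tuple[int, Optional[int]]


def tabulate_matches(sim: Dict[Pair, float], threshold: float) -> List[Tuple[str, str, float]]:
    """Step 1: keep the pairs whose similarity reaches the threshold, best first."""
    rows = [(a, b, s) for (a, b), s in sim.items() if s >= threshold]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def resolve_name(name_a: str, name_b: str, importance_a: float, importance_b: float) -> str:
    """Steps 2 and 4: on a name conflict, keep the name used by the more important source."""
    return name_a if importance_a >= importance_b else name_b


def resolve_cardinality(card_a: Cardinality, card_b: Cardinality) -> Cardinality:
    """Step 3: widen the range so that instances from both sources remain valid."""
    low = min(card_a[0], card_b[0])
    if card_a[1] is None or card_b[1] is None:
        high = None
    else:
        high = max(card_a[1], card_b[1])
    return (low, high)


# Hypothetical similarity values for element pairs from two flight schemas.
similarity = {
    ("departing", "depart_dtm"): 0.90,
    ("returning", "return_dtm"): 0.88,
    ("adults", "infants"): 0.30,
}
matches = tabulate_matches(similarity, threshold=0.8)
merged_name = resolve_name("departing", "depart_dtm", importance_a=1.0, importance_b=2.0)
merged_card = resolve_cardinality((1, 1), (0, None))

print(matches)
print(merged_name, merged_card)
```

Steps 5 to 7 are not covered here, since they change the structure of the schema and generally need a decision by the person doing the integration.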
Schema integration transformations must be information preserving [STG02].
Definition: A transformation f meets the round-trip criterion if there exists an inverse transformation f⁻¹ such that f⁻¹(f(x)) = x for all documents x in the domain of f, where = represents equivalence of XML information sets.
Definition: A transformation is an information-preserving transformation iff it meets the round-trip criterion.
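The round-trip criterion can be tried out on a sample document with the sketch below. It makes several assumptions: the transformation is a simple element renaming, information set equivalence is approximated by comparing canonical XML (`xml.etree.ElementTree.canonicalize`, available from Python 3.8), and the element names and helper functions are illustrative. A test on one document can only refute the criterion, which quantifies over all documents in the domain.

```python
import xml.etree.ElementTree as ET


def rename(xml_text: str, mapping: dict) -> str:
    """Rename elements according to mapping; unmapped names pass through unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


def equivalent(a: str, b: str) -> bool:
    """Approximate information set equivalence by comparing canonical XML."""
    return ET.canonicalize(a, strip_text=True) == ET.canonicalize(b, strip_text=True)


def meets_round_trip(doc: str, f, f_inv) -> bool:
    """Check f_inv(f(doc)) = doc for a single sample document."""
    return equivalent(f_inv(f(doc)), doc)


doc = (
    "<flights_search>"
    "<depart_dtm>2003-01-01</depart_dtm>"
    "<return_dtm>2003-01-08</return_dtm>"
    "</flights_search>"
)

# An invertible renaming: each source name has its own target name.
forward = {"depart_dtm": "departure", "return_dtm": "return"}
backward = {target: source for source, target in forward.items()}
ok = meets_round_trip(doc, lambda d: rename(d, forward), lambda d: rename(d, backward))

# A lossy renaming: both names collapse to "date", so no inverse can tell them apart.
lossy = {"depart_dtm": "date", "return_dtm": "date"}
guess = {"date": "depart_dtm"}
broken = meets_round_trip(doc, lambda d: rename(d, lossy), lambda d: rename(d, guess))

print(ok, broken)
```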
Let us look at Supply Chain first and then apply XClustXS to the problem.
6.2 Supply Chain
A supply chain is a network of facilities and distribution options that performs the functions of procurement of materials, transformation of these materials into intermediate and finished products, and the distribution of these finished products to customers [GH]. Figure 6.2-1 shows an example of a supply chain. Materials flow downstream, from raw material sources through a manufacturing level that transforms the raw materials into intermediate products, which are assembled at the next level to form products. The products are shipped to distribution centers and from there on to retailers and customers.
Traditionally, the marketing, distribution, planning, manufacturing, and purchasing organizations along the supply chain operated independently. These organizations had their own objectives, which often conflicted. As a result there was no single, integrated plan for the organization; there were as many plans as businesses.
Supply chain management is a strategy through which such integration can be achieved. It is typically viewed as lying between fully vertically integrated firms, where the entire material flow is owned by a single firm, and those where each channel member operates independently.
Figure 6.2-1 Supply Chain Example
Coordination between the various players in the chain is the key to its effective management. The classic objective of logistics is to have the right products in the right quantities at the right place at the right moment at minimal cost. This gives rise to Just In Time (JIT) SCM. Maintaining a Supply Chain System requires decision making, which is classified into three levels: strategic, tactical and operational [RT97].
Figure 6.2-2 Supply Chain Decisions
Strategic decisions are typically made over a longer time horizon; they are closely linked to the corporate strategy and guide supply chain policies from a design perspective. Operational decisions, on the other hand, are short term and focus on day-to-day activities.
6.3 eBusiness and XClustXS
All effective Supply Chain Management decisions are data driven. This brings in computers and, with the growing volume of global data, gives rise to the term Electronic Business (eBusiness). Fully vertically integrated firms are rare, which makes B2B (business to business) communication necessary and creates the supply chains. Traditional purchase orders, invoicing and inventory management are being replaced by large database systems (for example SAP systems) that even include employee management, adding an “e” to every system. More recently, mobile users and their needs have also become increasingly important. These systems are specific to each organization and are often proprietary. XML comes in as a key facilitator of B2B communication. B2B communication involves PAIN:
  • Privacy: Ensuring that unauthorized parties cannot read the information that is being transmitted.
  • Authentication: Ensures the identity of a user or source of a transaction to prevent fraudulent use.
  • Integrity: Ensuring that the message content cannot be changed (intentionally or accidentally) or if it is changed, that change can be detected.
  • Non-repudiation: Ensures that the sender cannot deny having sent a transaction, and the recipient cannot deny having received the transaction.
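As a small illustration of the integrity and authentication requirements, the sketch below attaches a keyed hash (HMAC) to an XML message using Python's standard `hmac` and `hashlib` modules. It assumes the two partners already share a secret key; the key and the order message are invented for the example. It does not provide privacy, which needs encryption, or non-repudiation, which needs a signature made with a key held only by the sender.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message, key), tag)


key = b"shared-secret-for-illustration-only"  # hypothetical shared key
order = b"<order><item>widget</item><qty>10</qty></order>"
tag = sign(order, key)

# The receiver accepts the untouched message and rejects a modified one.
accepted = verify(order, key, tag)
tampered = order.replace(b"<qty>10</qty>", b"<qty>1000</qty>")
rejected = not verify(tampered, key, tag)

print(accepted, rejected)
```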
Numerous XML standards have been proposed for B2B communication, such as:
  • BizTalk Framework – Microsoft
  • Commerce XML (cXML) – Ariba, HP, Microsoft, webMethods and Sterling Commerce
  • Electronic Business XML (ebXML) – UN/CEFACT and OASIS
  • XML Common Business Library (xCBL) – Commerce One, OASIS, Microsoft, and UN/CEFACT
  • RosettaNet Partner Interface Processes (PIP) – Cisco, Intel, IBM, Dell, FedEx, Ericsson and Dun & Bradstreet
  • Financial Information Exchange Markup Language (FixML) – Goldman Sachs, Salomon Smith Barney and State Street Global Advisors
  • News Industry Text Format (NITF) – International Press Telecommunication Council
Each has its own proprietary methods and tools and its own specific areas of use, and describing them individually is beyond our scope. The underlying principle in setting up standards, as with EDI (Electronic Data Interchange), is that businesses can communicate effectively and increase productivity. However, a standard is never adopted by all industry players, since new technologies are constantly emerging; integration methods for diverse proprietary structures are therefore always needed.
Application to Application Integration (A2Ai)
Application to Application Integration (A2Ai) can be described as the integration of heterogeneous systems within a corporation's firewall. The transformation and communication middleware converts data from the form suited to one application into the form suited to the other. The generic architecture is represented in the diagram below:
Figure 6.3-1 A2Ai Architecture
In this context XML enables easy transformation of data (with XClustXS providing the transformation mappings), leading to EAI (Enterprise Application Integration) followed by B2Bi (Business to Business Integration) across corporations and geographies. XML-native applications, which use XML for all communications, represent a newer trend. Communication between two such systems can be accomplished by a single transformation process. Older systems use proprietary communication protocols. These non-native XML applications require additional transformations to convert between these proprietary formats. The two cases are shown in the diagrams below:
Figure 6.3-2 XML Native Applications
 
Figure 6.3-3 XML Non-Native Applications
Business to Business Integration (B2Bi)
B2Bi can be thought of as similar to A2Ai, except that it operates outside a corporation's firewall and integrates applications in different corporations. It raises a number of issues related to security and data integrity, which are addressed by XML Signature and XML Encryption.
Figure 6.3-4 B2Bi example
XClustXS
XClustXS can serve as a transformation tool that maps one XML source to another as part of the data translation process. Let us continue with our airlines example from Section 6.1. The business process model can be seen as a high-volume information transaction system with Information Subscribers (like Expedia) and Information Distributors (like Singapore Airlines).
Figure 6.3-5 Information Subscriber/Distributor Business Process Model [RNET]
If there were only one airline, things would be very simple, and the parties could easily align their schemas. However, there are hundreds of information providers in this case, which brings Data Translation into the picture. Suppose myOrg represents an organisation like Expedia, with Org1 … Org n being information providers such as the airlines.
Figure 6.3-6 Data Translation
myOrg needs to connect seamlessly to the providers and give accurate results to its customers. XClustXS can be used to map and cluster a diverse set of schemas and ultimately integrate them, for instance to enable faster queries. It can also be used to monitor changes as they occur in an information provider's schema; for example, if an airline changes its Schemas, the subscriber needs to update them as well. Such integrated clusters may be resolved into a single Schema, or each cluster may be handled individually. For example, consider the following Schema fragments:
Figure 6.3-7 Example Schema Samples and Integration
Schema Sample 1
   <xs:element name="flights_search">
      <xs:complexType>
         <xs:sequence>
           <xs:element name="from" type="xs:string"/>
           <xs:element name="to" type="xs:string"/>
           <xs:element name="departing" type="xs:date"/>
           <xs:element name="returning" type="xs:date"/>
           <xs:element name="adults" type="xs:int"/>
           <xs:element name="children" type="xs:int"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
Schema Sample 2
   <xs:element name="flights_search">
      <xs:complexType>
         <xs:sequence>
           <xs:element name="from" type="xs:string"/>
           <xs:element name="to" type="xs:string"/>
           <xs:element name="depart_dtm" type="xs:date"/>
           <xs:element name="return_dtm" type="xs:date"/>
           <xs:element name="who">
              <xs:complexType>
                 <xs:sequence>
                    <xs:element name="adults" type="xs:int"/>
                    <xs:element name="children" type="xs:int"/>
                    <xs:element name="infants" type="xs:int"/>
                 </xs:sequence>
              </xs:complexType>
           </xs:element>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
Schema Sample 3
   <xs:element name="flight_search">
      <xs:complexType>
         <xs:sequence>
           <xs:element name="per_person" type="xs:decimal"/>
           <xs:element name="airline" type="xs:string"/>
           <xs:element name="when_where">
              <xs:complexType>
                 <xs:sequence>
                    <xs:element name="from" type="xs:string"/>
                    <xs:element name="to" type="xs:string"/>
                    <xs:element name="departureDate" type="xs:date"/>
                    <xs:element name="returnDate" type="xs:date"/>
                 </xs:sequence>
              </xs:complexType>
           </xs:element>
           <xs:element name="stop" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
 
Integrated Schema
   <xs:element name="flight_search">
      <xs:complexType>
         <xs:sequence>
           <xs:element name="per_person" type="xs:decimal"/>
           <xs:element name="airline" type="xs:string"/>
           <xs:element name="when_where">
              <xs:complexType>
                 <xs:sequence>
                    <xs:element name="from" type="xs:string"/>
                    <xs:element name="to" type="xs:string"/>
                    <xs:element name="departure" type="xs:date"/>
                    <xs:element name="return" type="xs:date"/>
                 </xs:sequence>
              </xs:complexType>
           </xs:element>
           <xs:element name="who">
              <xs:complexType>
                 <xs:sequence>
                    <xs:element name="adults" type="xs:int"/>
                    <xs:element name="children" type="xs:int"/>
                    <xs:element name="infants" type="xs:int"/>
                 </xs:sequence>
              </xs:complexType>
           </xs:element>
           <xs:element name="stop" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
These examples (from our flight domain test cases) can be easily reconciled into a larger Schema. There can thus be numerous applications of XClustXS, creating value for semi-structured data storage, communication, query and retrieval systems.
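To see which conflicts have to be reconciled, the first two schema fragments of Figure 6.3-7 can be compared with a short Python sketch. This is not how XClustXS works internally; it only lists the leaf elements of each fragment and compares them by exact name. The helper functions are hypothetical, the fragments are wrapped in an xs:schema element because they carry no namespace declaration, and leaf names are assumed to be unique within a fragment.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SAMPLE_1 = """
<xs:element name="flights_search"><xs:complexType><xs:sequence>
  <xs:element name="from" type="xs:string"/>
  <xs:element name="to" type="xs:string"/>
  <xs:element name="departing" type="xs:date"/>
  <xs:element name="returning" type="xs:date"/>
  <xs:element name="adults" type="xs:int"/>
  <xs:element name="children" type="xs:int"/>
</xs:sequence></xs:complexType></xs:element>
"""

SAMPLE_2 = """
<xs:element name="flights_search"><xs:complexType><xs:sequence>
  <xs:element name="from" type="xs:string"/>
  <xs:element name="to" type="xs:string"/>
  <xs:element name="depart_dtm" type="xs:date"/>
  <xs:element name="return_dtm" type="xs:date"/>
  <xs:element name="who"><xs:complexType><xs:sequence>
    <xs:element name="adults" type="xs:int"/>
    <xs:element name="children" type="xs:int"/>
    <xs:element name="infants" type="xs:int"/>
  </xs:sequence></xs:complexType></xs:element>
</xs:sequence></xs:complexType></xs:element>
"""


def leaf_elements(xsd_fragment: str) -> dict:
    """Map the path of every typed (leaf) element to its declared type."""
    root = ET.fromstring(f'<xs:schema xmlns:xs="{XS}">{xsd_fragment}</xs:schema>')
    leaves = {}

    def walk(node, path):
        for child in node:
            if child.tag == f"{{{XS}}}element":
                child_path = path + [child.get("name")]
                if child.get("type") is not None:
                    leaves["/".join(child_path)] = child.get("type")
                walk(child, child_path)
            else:
                # complexType and sequence: descend without extending the path
                walk(child, path)

    walk(root, [])
    return leaves


def by_name(leaves: dict) -> dict:
    """Index leaf paths by element name (assumes names are unique per fragment)."""
    return {path.rsplit("/", 1)[-1]: path for path in leaves}


n1 = by_name(leaf_elements(SAMPLE_1))
n2 = by_name(leaf_elements(SAMPLE_2))

same_name = sorted(n1.keys() & n2.keys())                 # candidates for a direct match
moved = sorted(n for n in same_name if n1[n] != n2[n])    # structural conflicts
only_1 = sorted(n1.keys() - n2.keys())                    # need name matching
only_2 = sorted(n2.keys() - n1.keys())

print(same_name, moved, only_1, only_2)
```

The elements that appear only in one fragment, such as departing and depart_dtm, are exactly the pairs that exact name comparison misses and that a similarity computation has to recover.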