Honors Year Project - Scalable
Integration of XML Schemas |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 Computing Similarity in XML Schema 4.1 Similarity Perception Similarity is the quality of resemblance between objects though not necessarily being identical objects. For example in the context of geometry there is a precise definition for similar triangles which can be applied to any triangle with a universally acceptable answer. In context of comparing an apple, orange and a tomato we may face a dilemma, comparison may be based on size, shape, color, taste or even personal fancies leading to a subjective similarity measure. In the context of DTDs/XML Schema we try to identify facets of similarity which may lead to a consensus in identifying similar data definitions. It mainly focuses on data and its structure, where the data is of prime importance since it can be represented using numerous structures. Similarity Facets The above points provide useful insights into the underlying framework of design of a schema. All documents may be defined as containing data and structure for that data: Document = Data + Structure This leads to a document similarity based on the weighted average of similarity of its components. Data Similarity facets
Structural Similarity facets
Suppose there are two documents p1=d1+s1 and p2=d2+s2. Borrowing notation from geometry Case 1: Congruent (identical) p1 @ p2 if d1=d2 and s1=s2 and Case 2: Similar p1 ~ p2 if | sim(d1,d2) – sim(s1,s2) | > 0 where the function sim computes a similarity measure based on the facets described. The degree of similarity in this case is determined by the value and given appropriate weights a higher value indicates a higher degree of similarity. The taxonomy of matching techniques [MRB01, RB01] provides general guide to techniques available and these have numerous adaptations, and [MRB01] uses a number of them. Structural similarity computation by [NJ] and works cited by them provide a reasonable measure of similarity, but they fail to consider nodes in path context. The similarity computation in XClustXS is based on all the facets described and we experimentally prove our results in Section 6. Similarity among Schema Objects Schema Objects refer to Elements, Attributes, Groups and other structural constructs found in DTDs and particularly in XML Schemas. When calculating schema similarity many cases can be encountered. It is possible that most of the objects are generally similar presenting a symmetric distribution [HSTAT]
or a particular section is a subset of the other presenting a skewed distribution of object similarity values.
The Skew can be calculated as:
4.2 Similarity Computation Our algorithm to compute the similarity between XML Schema considers the following aspects:
Using <cat> for <dog> or <horse> in an instance document will be valid and similarity value should be identical in this case, though linguistic matching would have failed.
We make the following assumptions
These cases can be nested to any level however that would not affect our strategy in any way. To explain more clearly, let us consider the following example. The first use of OR is a trivial case and while the second case demonstrates enumeration. Writing as DTD
Writing as a XML SchemaUsing the W3C XML Schema recommendation, the trivial case be represented as <all></all> while an enumeration using <choice></choice> or the <enumeration></enumeration> facet of a simpleType. The following constrains are applicable to the <all> grouping element [W3CXS0] [W3CXS1]
The grouping element <choice> can be used to represent enumeration, and only one element within it can occur in an instance document. The trivial use of OR can be easily overlooked. The elements address and phone can occur in any order and this is irrelevant. The enumeration aspect is better captured in the enumeration facet by using User-defined Simple Type
Thus we assume that XML Schemas under consideration do not use <all> which represents a trivial case of OR. 4.3 Similarity Measures The similarity measures are based on the XML Schema and similarity facets discussed previously. They consider both data and structural similarity and attempt to resemble human perception of similarity. We now describe the algorithm in detail. Schematically the hierarchy of similarity computation for XML Schemas may be represented as follows.
This figure shows the relative organisation of the Similarity components. In computing the similarity the objects under consideration are elements and attributes of XML Schemas. A XML Schema is defined as a set of objects: XS = {p1, p2, p3….pn} where p1,p2….pn are the objects. 4.3.1 LinguisticSim and Domain Dictionaries The first phase of matching elements is based on correspondence of object name. This linguistic matching must be accurate and domain specific and can even be multi-lingual. A domain specific thesaurus provides a kind linguistic context matching especially when names involve highly polysemous words or even cultural differences. For example the name ‘address’ may mean ‘email address’, ‘IP address’, ‘memory address’ and of course ‘postal address’. Also such matching may not be specific (address ↔ address or address ↔ homeAddress or address ↔ officeAddress) in which case context is need.
Cultural differences of designers are common (bill ↔ cheque) and also need to be considered. A Domain Dictionary is a list of domain dependent pairs of names. For example in the context of flight bookings date ↔ travel_date. Such domain specific dictionaries need to be manually created that can then be applied during Linguistic Matching. Linguistic matching involves:
DDMatch(string1, string2){ for(each pair in DD){ if(pair.equals(pair(string1,string2))) return 1; } }
matchAcronymSubtring(string1,string2){ if(string1==substring(string2) || string2==substring(string1) return 1; if(string1==acronym(string2) || string2==acronym(string1) return 1; return 0; } The acronym matching is domain specific and based on a domain dictionary.
matchWords(string1,string 2){ if(string1=string2) return 1; depth=1; T=Æ;U=Æ; Set S=Synonym(string1); Set H=HypernymHyponym(string1); if (string2 Î S ) return 0.8; if (string2 Î H ) return 0.6; while(depth<maxDepth){ depth++; for (each str Î S){ if (string2 Î Synonym(str) return 0.8depth else T=T È Synonym(str); } for (each str Î S){ if (string2 Î HypernymHyponym(str) return 0.6depth else U=U È HypernymHyponym(str); } S=T;H=U;T=Æ;U=Æ; } } The result of varying linguistic synonym value over 0.2, 0.5, 0.6, 0.7, 0.8 and 0.9 is shown (the other values are not plotted for clarity) in the following graph as a frequency of similarity value occurrence when using only Linguistic similarity and scaling it to make up for the narrow range of values obtained.
As we see from this graph using a value 0.7 to 0.8 gives more of higher occurring Sim Values so we use the value of 0.8 for synonyms computations.
resolve(p1.name,p2.name,depth) { //identify type NOTE: maybe multiple for (each type identified){ switch(type){ case notation: depth++ if(depth>maxDepth) return permutedTokenSet1=permute(tokenize(p1.name)) permutedTokenSet2=permute(tokenize(p2.name)) for (each permutation perm1 Î set1, perm2 Î set2){ sim=wt1*matchWord(perm1,perm2) + wt2*matchAcronymSubstring(perm1,perm2) if(sim > threshold) return sim resolve(permutedTokenSet1, permutedTokenSet2, depth) } case plural: p1.name=makeSingular(p1.name) p2.name=makeSingular(p2.name) case vogue: p1.name=makeGeneric(VOGUE, p1.name) p2.name=makeGeneric(VOGUE, p2.name) case designerSpecific: p1.name=makeGeneric(DSGN, p1.name) p2.name=makeGeneric(DSGN, p2.name) case appSpecific: p1.name=makeGeneric(APP, p1.name) p2.name=makeGeneric(APP, p2.name) } } return 0 } Linguistic Similarity is thus combination of the above strategies. 4.3.2 CardinalitySim Cardinality for attributes is fixed at minOccurs=0 and maxOccurs=1. This is implicitly fixed and cannot be changed by users. Cardinality for other elements is arbitrary and can be any value, default values are minOccurs=0 and maxOccurs=1. Cardinality is sometimes useful is identifying matching objects and thus the similarity function is defined as the following CardinalitySim(p1,p2) = table_lookup Consider two objects p1 and p2 such that a1 ≤ |p1| ≤ b1 and a2 ≤ |p2| ≤ b2 where a1 ≤ b1 and a2 ≤ b2 Thus there are five possibilities which are assigned the similarity values. These values are empirically determined.
4.3.3 TypeSim A simple organization of types can be represented as follows
Comparing data types: Different approaches are necessary to compare between complex, built-in and user-defined simple types. This is necessary because complex types can be considered as nested XML Schemas in themselves and involves considering type information of the child nodes also. Simple Types Built-in types represent the primitive types for XML Schema and are organized as given in Appendix A [W3CXS2]. They are summarized in the table below. The Base type indicates the primitive type from which that particular type has been derived, and for TypeSim measurement we treat it as the base type. This is necessary as there can be infinite derived types. The values obtained in Table 3.3-3 are determined by experiment but beginning with values that appeal to common sense since the small effect attributed to them cannot be very accurately judged using statistical means.
Using this hierarchy the following similarity values are assigned for primitive types NOTE: Table is symmetric, and for clarity values not filled are zero. The numbers of columns and rows correspond to Table 3.2.3-1
To summarize, the built-in derived types are considered same as the primitive they use as base. User-defined Simple types are derived from these built-in types, they are treated the same as the type they use as base. The values used are empirically determined. Example TypeSim(float,double)=0.8 TypeSim(string,int)=0 Complex Types These are groups of objects which have their own types. These types may be other complex types, however they must all conclude with a Simple Type. TypeSim(p1,p2)=ImmediateDescSim(p1,p2) This becomes a recursive definition. Similarity between Simple and Complex type is set to zero.
4.3.4 NamespaceSim A partial similarity based on namespace cannot be derived since a Namespace is globally unique and generally is only an indication that the Schemas under consideration from the same source/author or the same application. NamespaceSim(p1,p2)
= It is advisable to keep the weight for namespace similarity low since there can be limitless objects with the same namespace. 4.3.5 BasicSim This is the weighted similarity of Linguistic, Cardinality, Type and Namespace Similarity and used for computation of the other similarity measures. BasicSim(p1,p2) = wt1*OntologySim(p1,p2) + wt2*CardinalitySim(|p1|,|p2|) + wt3* TypeSim(p1,p2) + wt4 * NamespaceSim(p1,p2) where wt1 + wt2 + wt3 + wt4 = 1 4.3.6 PathContextCoefficient This concept measures how similar the paths are for two objects. Consider <owner><name/></owner> and <owner><pet><name/></pet></owner> which in the absence of path context can be considered the same leading to incorrect notion of <name/> and similarity measure. Given objects p and q the path from p to q is given by: q.path(p) ={p, p1, p2, … , pm, q} This path can be in context to the ancestor if traversing towards root or descendant context if traversing towards leaves. In computing the PCC (Path Context Coefficient) these path nodes must be considered. A similarity matrix (SimMat) is computed for each pair of objects in the path by creating a triplets {p, q, Sim(p, q)} The Sim function here is used as a generic reference for all the similarity measures described in Sections 4.3.6 through 4.3.11. After this a statistical measure of relative location is applied to get a standard score for the similarity value. [CRO79]
The SimMat created describes a m:n (many-many) relation for the objects and then a function LocalMatch is used to reduce it to a 1:1 (one-one) relation. LocalMatch is defined by first eliminating triplets with Sim value below the threshold and then transforming the SimMat to a 1:1 relation. LocalMatch introduced here serves as a generic function and is used in all the following similarity components to reduce SimMat. LocalMatch(SimMat){ while (SimMat ¹ Æ ) { MatchList = MatchList È {max(p1,p2,v) | v > threshold} SimMat = SimMat – {(p1,any,any)} – {(any, p2, any)} } return MatchList; } PCC is then computed as follows using the similarity matrix and the LocalMatch function: PCC(p1,q1,p2,q2,threshold){ for (each p1i Î q1.path(p1)){ for (each p2j Î q2.path(p2)){ SimMat=SimMat È (p1i, p2j, BasicSim(p1i,p2j)); } } MatchList=LocalMatch(SimMat, |q1.path(p1)|, |q2.path(p2)|, threshold) return
} 4.3.7 SemanticSim The semantic similarity is a measure of how similar two objects are in name, constraints and context information. It captures the overall semantics of comparing two objects and is one of the top level similarity measures for calculating the object similarity. SemanticSim (p1, p2, threshold) = PCC (p1, p1.root, p2, p2.root, threshold) * BasicSim(p1,p2) 4.3.8 ImmediateDescendantSim By considering the child nodes (immediate descendants) this measure captures the vicinity context similarity. This is another top level measure for computing object similarity. ImmediateDescSim(p1,p2,threshold){ for (each ci Î p1.child){ for (each cj Î p2.child){ SimMat=SimMat È (ci, cj, BasicSim(ci,cj)); } } MatchList=LocalMatch(SimMat, |p1.children|, |p2.children|, threshold) return
} 4.3.9 LeafContextSim It measures the semantic and context similarity for leaves of nodes rooted at p1and p2. This is another top level measure for computing object similarity. LeafContextSim(p1, p2, threshold){ for (each li Î p1.leaf){ for (each lj Î p2.leaf){ SimMat=SimMat È (li, lj, LeafSim(li,p1,lj,p2, threshold)); } } MatchList=LocalMatch(SimMat, |p1.leaves|, |p2.leaves|, threshold) return
} where LeafSim(li,p1,lj,p2, threshold) = PCC(li,p1,lj,p2, threshold) * BasicSim(li, lj) which is the measure of semantic similarity. 4.3.10 ObjectSim This is the weighted measure of similarity of Descendant Context, Semantic Context and Leaf Context Similarity. Special Cases (i) Recursion:
(ii) Comparing Leaf and Internal objects: For two objects p1, p2 if they are both leaf nodes then they have no leaves or descendants so LeafContextSim(p1, p2) = 0 and ImmediateDescSim(p1, p2) = 0. If only one is a leaf node then its context is established from the root. ObjectSim (p1, p2, threshold, wt1, wt2, wt3) { //Recursion special case if (either p1, p2 recursive) return 0; else if (p1 && p2 recursive) return ObjectSim(p1.ref, p2.ref); //Leaf and Internal Nodes if(p1 && p2 leaves) return SemanticSim(p1,p2,threshold); else if (either p1 or p2 a leaf) LeafContextSim = SematicSim return wt1 * SemanticSim (p1, p2, threshold) + wt2 * ImmediateDescSim (p1, p2, threshold) + wt3 * LeafContextSim (p1, p2, threshold) 4.3.11 SchemaSim Overall similarity of two schemas is given as a normalized ObjectSim measure. For two Schemas s1, s2 the similarity is computed as follows SchemaSim(s1,s2){ for (each pi Î s1){ for (each pj Î s2){ SimMat=SimMat È ObjectSim(pi,pj,threshold,wt1,wt2,wt3) } } MatchList=LocalMatch(SimMat) return
} 4.4 Dynamic Weights and Thresholds The algorithms described above use a number of weights and thresholds. These values play an important role in the similarity computation and could be statically assigned or dynamically generated for each set of Schemas under consideration. Dynamic weights and thresholds will help customize the algorithm for each set automatically based on an algorithm devised as shown below and employs statistical measures such as mean, median etc. However this will result in considerable overhead as similarity computation cannot start until these values have been determined. Results show that using dynamic values we get a broader spectrum of values though the relative ranking for static and dynamic values is almost the same. Thresholds For determining a threshold a suitable measure of central tendency is used using all the similarity values. We use the following measures:
By calculating the skew first as described in Section 4.1 the appropriate method can then be applied to calculate the threshold.
Weights In determining the weights or ranking of similarity components, the size (number of objects) and the skew value of the Schemas are used. The weights for DTD/XML Schema Similarity for Semantic, Immediate Descendant and Leaf Context components are calculated as:
countLeaf //total number of leaves countInternal //total number of internal nodes leafPercent //percentage of leaf nodes internalPercent // percentage of internal nodes if (skew >= MEDIUM) WT_SEMANTIC=0.4 else WT_SEMANTIC=0.3 if (leafPercent – internalPercent > 55%) { WT_LEAF= WT_IMMEDDESC=1 – WT_LEAF } else{ WT_IMMEDDESC= WT_LEAF=1 – WT_IMMEDDESC } If the summation of weights is not 1 it may be adjusted relatively to make the sum 1.
The threshold value at 55% is an empirically determined as seen above and is the Median value calculated using data in the test set as described in Section 5.3.
As we see from above both follow the same trend however using dynamic measures we get a broader spread as seen below (actual numbers):
This is a significant advantage when comparing a smaller range of values. The weights for Basic Similarity for Linguistic Similarity, Cardinality Similarity, Type Similarity and Namespace Similarity cannot be determined dynamically in practice. This is because these weights are needed for every pair of objects that are compared and even for relatively small schemas this number is very large! Also empirically determined static weights would give acceptable results for most schema sets. 4.5 Clustering Integrating large numbers of XML Schemas is difficult, however we believe that similar XML Schemas grouped together would help in the integration process. Designers performing integration would then have to focus within these clusters helping in the integration process. These clusters can be generated by the Hierarchical Agglomerative Clustering methods [EVE74] using the similarity value we calculate using XClustXS. A dendrogram can also be drawn for visual representation of the clusters. Agglomerative methods proceed by a series of successive fusions of the entities into groups and at each step those with the highest similarity are fused together. Using the complete linkage method we give the following example with a dendrogram to show how the clustering technique works. Consider the Schema similarity graph below, it shows Schemas A, B, C, and D and their similarity values.
Listing the pairs and similarities as an array and clustering:
Representing the above as a dendrogram it shows the clusters {A, B}, {{A, B}, C}, {{{A, B}, C}, D}
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||