Wednesday, January 1, 2003

Scalable Integration of XML Schemas - Chapter 4


Computing Similarity in XML Schema
4.1  Similarity Perception
Similarity is the quality of resemblance between objects that are not necessarily identical. In geometry, for example, there is a precise definition of similar triangles that can be applied to any pair of triangles with a universally acceptable answer. When comparing an apple, an orange and a tomato, however, we face a dilemma: the comparison may be based on size, shape, color, taste or even personal fancy, leading to a subjective similarity measure. In the context of DTDs/XML Schemas we try to identify facets of similarity that can lead to a consensus in identifying similar data definitions. Our approach focuses mainly on data and its structure, where the data is of prime importance since it can be represented using numerous structures.
Similarity Facets
These observations provide useful insight into the underlying framework of schema design. Every document may be defined as containing data and a structure for that data:
Document = Data + Structure
This leads to a document similarity based on the weighted average of similarity of its components.
Data Similarity facets
  1. Linguistics: Use of synonyms, this can also use domain specific thesauri
  2. Type: Datatype information
  3. Lexical Representations: For example 100, “one hundred” and 1.0E2 all mean the same.
  4. Namespace: This serves to uniquely identify the source of the data.
Structural Similarity facets
  1. Structures:  Elements, Attributes, Groups, use of inheritance and other structural constructs.
  2. Cardinality: *, +, ? for DTD and minOccurs, maxOccurs for XSD
  3. Path/Path Context: Context based on path from root or another node
Suppose there are two documents p1 = d1 + s1 and p2 = d2 + s2. Borrowing notation from geometry:
Case 1: Congruent (identical)
p1 ≅ p2 if d1 = d2 and s1 = s2
Case 2: Similar
p1 ~ p2 if w1·sim(d1,d2) + w2·sim(s1,s2) > 0, where the function sim computes a similarity measure based on the facets described and w1, w2 are appropriate weights.
The degree of similarity in the second case is determined by the value: given appropriate weights, a higher value indicates a higher degree of similarity.
The taxonomy of matching techniques [MRB01, RB01] provides a general guide to the techniques available, which have numerous adaptations; [MRB01] uses a number of them. The structural similarity computation of [NJ] and the works cited therein provide a reasonable measure of similarity, but they fail to consider nodes in their path context. The similarity computation in XClustXS is based on all the facets described, and we validate our results experimentally in Section 6.
Similarity among Schema Objects
Schema Objects refer to Elements, Attributes, Groups and other structural constructs found in DTDs and, in particular, XML Schemas. When calculating schema similarity many cases can be encountered. It is possible that most of the objects are broadly similar, giving a symmetric distribution of object similarity values [HSTAT],
Figure 4.1-1 Symmetric Distribution
or that a particular section is a subset of the other, giving a skewed distribution of object similarity values.
Figure 4.1-2 Skewed Distribution
The skew can be calculated as:
skew = (1/N) · Σᵢ ((xᵢ − m)/s)³
where xᵢ is an object similarity value, m is the mean, s is the standard deviation and N is the total number of objects. This measure is very useful and is used subsequently in our algorithms.
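The skew measure above can be sketched in a few lines; this is a minimal Python sketch of the standard third-moment formula, using a population standard deviation (the function name is ours):

```python
def skewness(values):
    """Moment-based skew: mean of ((x - m)/s)**3 over the N similarity values."""
    n = len(values)
    m = sum(values) / n                                   # mean
    s = (sum((x - m) ** 2 for x in values) / n) ** 0.5    # std deviation
    if s == 0:
        return 0.0                                        # all values identical
    return sum(((x - m) / s) ** 3 for x in values) / n
```

A symmetric distribution yields a skew near zero, while a distribution with a long upper tail yields a positive value.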
4.2 Similarity Computation
Our algorithm to compute the similarity between XML Schema considers the following aspects:
  • Namespaces: When comparing Schemas, objects having the same namespace were most probably created by the same author/group. Since namespaces [W3CNS] are globally unique, this distinction can serve as a measure of similarity, though it must be weighted appropriately in the overall similarity computation.
  • Elements, Attributes: Depending on the schema designer’s preference, information may be represented as an element or an attribute; when comparing across Schemas these may be used interchangeably, and upon integration we may need to resolve one into the other.
  • Derived Elements: Elements can be derived from abstract elements. This together with substitution groups creates the need for a different approach.
  • Substitution Groups: These provide groups of elements that can be substituted for other elements. Suppose the following Schema snippet
<xs:element name="animal" type="xs:string"/>
<xs:element name="cat" type="xs:string" substitutionGroup="animal"/>
<xs:element name="dog" type="xs:string" substitutionGroup="animal"/>
<xs:element name="horse" type="xs:string" substitutionGroup="animal"/>
Using <cat> for <dog> or <horse> in an instance document is valid, so the similarity value should be identical in this case, even though linguistic matching would have failed.
  • Recursion: Expanding recursive nodes is impractical, so the similarity value of recursive nodes is determined by the similarity of their referenced objects.
  • Cardinality Constraints: The cardinality of elements in XML Schemas can be arbitrarily defined (minOccurs, maxOccurs). Since these can take any values, the weights attached to them in computing the overall similarity must be appropriately adjusted.
  • Type information: Data type information will affect similarity computation and integration. Complex types and user-defined simple types need to be resolved separately.
  • Derived Types: Types can be derived by extension, restriction, union and lists from existing simple types. Since there can be limitless possibilities derived types need to be treated as their base types. Complex types are derived types and a different approach needs to be applied to them.
  • Structural Context Information: Context information is very important, since designers can use different levels of abstraction to represent information. This aspect generally depends on the nature of the application or on the designer; for example, one can model a name simply as <name> while another can have <name><first/><last/></name>. This shows the need to consider the context of descendants and leaves.
  • Referenced Objects: Referenced objects (elements, types, attributes and groups) give rise to a graph since multiple objects can refer to one object. Resolving this many-one relation creates new challenges.
  • Global and Local Declaration: Global declarations create reusable objects within a schema that can be referenced by other objects. Global and local objects can have the same name, so the context of declaration is important in computing the similarity.
Possible Global Structure   Affects Similarity Computation   Remark
annotation                  NO                               Comments for humans/bots
attribute                   YES
attributeGroup              YES
complexType                 YES
element                     YES
group                       YES
import                      NO                               Imports a W3C Schema (new namespace)
include                     NO                               Includes a W3C Schema (same namespace)
notation                    NO                               Declares a notation
redefine                    NO                               Similar to include
simpleType                  YES
We make the following assumptions:
  • Schemas under consideration do not have import or include statements or any other reference to external sources. This is workable, since externally referenced sources can easily be accommodated within an XML Schema.
  • Exclusion of OR. Trivial occurrences of OR are excluded from consideration in XML Schemas, since OR can be better represented in other available ways and its use merely for convenience may be overlooked:
    • Using OR to avoid ordering is a trivial case (convenience)
    • OR can be better represented as a data type (enumeration) in XML Schema (better representation)
    • The W3C XML Schema Definition is very restrictive concerning OR [W3CXS0] [EV00] (restrictions in the XML Schema recommendation)
Table 4.2-1: Assumptions

Trivial Case
  (a|b)* ↔ (a*,b*) ↔ (b*,a*)
  (a|b)+ ↔ (a+,b+) ↔ (b+,a+)
  ((a|b)*)+ ↔ ((a|b)+)* ↔ (a*,b*) ↔ (b*,a*)
  Such cases only affect ordering, have no impact on the structural/semantic relations of the Schema, and are thus not taken into consideration.

Enumeration
  (a|b) ↔ (a|b)?
  ((a|b)*)? ↔ ((a|b)?)*
  ((a|b)+)? ↔ ((a|b)?)+
  Only one of the alternatives applies and a choice has to be made. This can be represented using <choice> or an enumeration as available in the XML Schema recommendation.
These cases can be nested to any level; however, that does not affect our strategy in any way. To explain more clearly, consider the following example. The first use of OR is a trivial case, while the second demonstrates enumeration.

Writing as DTD

A DTD using OR ( | )
<!ELEMENT contact (name, (address | phone)*) >
<!ELEMENT name (title, first, last)>
<!ELEMENT title (mr | mrs | dr | prof )>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT mr EMPTY>
<!ELEMENT mrs EMPTY>
<!ELEMENT dr EMPTY>
<!ELEMENT prof EMPTY>
<!ELEMENT address (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
A better alternative
<!ELEMENT contact (name, (address | phone)*) >
<!ELEMENT name (first, last)>
<!ATTLIST name title (mr | mrs | dr | prof ) #REQUIRED>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT phone (#PCDATA)>

Writing as a XML Schema

Using the W3C XML Schema recommendation, the trivial case can be represented as <all></all>, while an enumeration can be represented using <choice></choice> or the <enumeration></enumeration> facet of a simpleType.
The following constraints apply to the <all> grouping element [W3CXS0] [W3CXS1]:
  • Every element inside it must occur once or not at all (minOccurs=0, maxOccurs=1 is fixed); thus the cases (a|b)* and (a|b)+ cannot arise directly in XML Schemas
  • It must be at the top level of the content model (complexType, group etc.)
  • No sub-groups can occur inside it
The grouping element <choice> can be used to represent enumeration: only one element within it can occur in an instance document. The trivial use of OR can safely be overlooked; the elements address and phone can occur in any order, and the order is irrelevant. The enumeration aspect is better captured by the enumeration facet of a user-defined simple type.
A Valid XML Schema (contacts.xsd)
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="contact">
    <xs:complexType>
       <xs:sequence>
         <xs:element name="name">
           <xs:complexType>
             <xs:sequence>
                <xs:element name="first" type="xs:string"/>
                <xs:element name="last" type="xs:string"/>
             </xs:sequence>
             <xs:attribute name="title">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:enumeration value="mr"/>
                    <xs:enumeration value="mrs"/>
                    <xs:enumeration value="dr"/>
                    <xs:enumeration value="prof"/>
                  </xs:restriction>
                </xs:simpleType>
             </xs:attribute>
           </xs:complexType>
         </xs:element>
         <xs:element name="address" type="xs:string"/>
         <xs:element name="phone" type="xs:string"/>
       </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Thus we assume that XML Schemas under consideration do not use <all>, which represents a trivial case of OR.
4.3 Similarity Measures
The similarity measures are based on the XML Schema and similarity facets discussed previously. They consider both data and structural similarity and attempt to resemble human perception of similarity. We now describe the algorithm in detail. Schematically the hierarchy of similarity computation for XML Schemas may be represented as follows.
Figure 4.3-1: Similarity Measures
This figure shows the relative organisation of the Similarity components.
In computing the similarity the objects under consideration are elements and attributes of XML Schemas.
A XML Schema is defined as a set of objects:
XS = {p1, p2, p3, …, pn}, where p1, p2, …, pn are the objects.
4.3.1 LinguisticSim and Domain Dictionaries
The first phase of matching elements is based on the correspondence of object names. This linguistic matching must be accurate and domain specific, and can even be multilingual. A domain-specific thesaurus provides a kind of linguistic context matching, especially when names involve highly polysemous words or even cultural differences. For example, the name ‘address’ may mean ‘email address’, ‘IP address’, ‘memory address’ and of course ‘postal address’. Such matching may also be ambiguous (address ↔ address, address ↔ homeAddress or address ↔ officeAddress), in which case context is needed.
Figure 4.3-2: Match ambiguity – where is the match?
Cultural differences between designers are common (bill ↔ cheque) and also need to be considered. A Domain Dictionary is a list of domain-dependent pairs of names; for example, in the context of flight bookings, date ↔ travel_date. Such domain-specific dictionaries need to be created manually and can then be applied during linguistic matching. Linguistic matching involves:
  • Domain Dictionary: This list of human identified pairs in domain context is considered first and if a successful match is found the rest of the strategies can be ignored.
DDMatch(string1, string2){
     for(each pair in DD){
          if(pair.equals(pair(string1,string2)))
               return 1;
     }
     return 0;
}
  • Acronyms and Substrings: Acronyms are generally well recognized, while substring use is arbitrary; for example card_no ↔ card_num. In the context of schema names, however, there is a thin line between the two.
matchAcronymSubstring(string1,string2){
     if(string1==substring(string2) || string2==substring(string1))
          return 1;
     if(string1==acronym(string2) || string2==acronym(string1))
          return 1;
     return 0;
}
The acronym matching is domain specific and based on a domain dictionary.
  • Synonyms and Hypernyms/Hyponyms: Synonym matching (address ↔ residence) is the most common and at times can be extended to hypernyms (for example, cat → animal and dog → animal ⇒ cat ↔ dog, though not in this case) and, conversely, to hyponyms.
matchWords(string1, string2){
     if(string1==string2) return 1;
     depth=1; T=∅; U=∅;
     Set S=Synonym(string1);
     Set H=HypernymHyponym(string1);
     if (string2 ∈ S) return 0.8;
     if (string2 ∈ H) return 0.6;
     while(depth<maxDepth){
          depth++;
          for (each str ∈ S){
               if (string2 ∈ Synonym(str)) return 0.8^depth;
               else T=T ∪ Synonym(str);
          }
          for (each str ∈ H){
               if (string2 ∈ HypernymHyponym(str)) return 0.6^depth;
               else U=U ∪ HypernymHyponym(str);
          }
          S=T; H=U; T=∅; U=∅;
     }
     return 0;
}
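The breadth-first search above can be sketched in Python. The thesaurus tables here are hypothetical stand-ins for a real synonym/hypernym source such as WordNet; the decay bases 0.8 and 0.6 follow the pseudocode, and `0.8^depth` is read as exponentiation:

```python
# Hypothetical thesaurus tables; a real system would query e.g. WordNet.
SYNONYMS = {"address": {"residence"}, "residence": {"address"},
            "bill": {"cheque"}, "cheque": {"bill"}}
HYPERNYMS = {"cat": {"animal"}, "dog": {"animal"}}

def match_words(w1, w2, max_depth=3):
    """Breadth-first search over synonym/hypernym links; score decays with depth."""
    if w1 == w2:
        return 1.0
    syn_frontier = SYNONYMS.get(w1, set())
    hyp_frontier = HYPERNYMS.get(w1, set())
    for depth in range(1, max_depth + 1):
        if w2 in syn_frontier:
            return 0.8 ** depth
        if w2 in hyp_frontier:
            return 0.6 ** depth
        # expand each frontier one step outward
        syn_frontier = set().union(*(SYNONYMS.get(w, set()) for w in syn_frontier))
        hyp_frontier = set().union(*(HYPERNYMS.get(w, set()) for w in hyp_frontier))
    return 0.0
```

Note that cat and dog share the hypernym animal but neither is in the other's frontier, so, as in the text, cat ↔ dog does not match here.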
The result of varying the linguistic synonym value over 0.2, 0.5, 0.6, 0.7, 0.8 and 0.9 is shown in the following graph (the other values are not plotted for clarity) as the frequency of similarity value occurrence when using only linguistic similarity, scaled to make up for the narrow range of values obtained.
As the graph shows, values between 0.7 and 0.8 yield more of the higher-occurring similarity values, so we use the value 0.8 for synonym computations.
  • Resolution: The names under consideration, for example user_id, userID and user.id, are the same though written in different notation conventions (underscore, camel case and dot), and need to be resolved before applying any comparison method. The names are tokenized, followed by a permuted comparison; for example, id_user and user_id need permutation. Other cases are the use of plurals (link ↔ links), vogue spellings (mail2Addr ↔ mailToAddr), designer-specific names (myUserID ↔ userID) or even application-specific ones (XClustUID ↔ userID). In many such cases it is impossible to recognize a match automatically; using the domain dictionary in this scenario is very useful.
resolve(p1.name, p2.name, depth){
     //identify type NOTE: there may be multiple types
     for (each type identified){
          switch(type){
               case notation:
                    depth++;
                    if(depth>maxDepth) return 0;
                    permutedTokenSet1=permute(tokenize(p1.name));
                    permutedTokenSet2=permute(tokenize(p2.name));
                    for (each perm1 ∈ permutedTokenSet1, perm2 ∈ permutedTokenSet2){
                         sim=wt1*matchWords(perm1,perm2) +
                             wt2*matchAcronymSubstring(perm1,perm2);
                         if(sim > threshold) return sim;
                         resolve(permutedTokenSet1, permutedTokenSet2, depth);
                    }
               case plural:
                    p1.name=makeSingular(p1.name);
                    p2.name=makeSingular(p2.name);
               case vogue:
                    p1.name=makeGeneric(VOGUE, p1.name);
                    p2.name=makeGeneric(VOGUE, p2.name);
               case designerSpecific:
                    p1.name=makeGeneric(DSGN, p1.name);
                    p2.name=makeGeneric(DSGN, p2.name);
               case appSpecific:
                    p1.name=makeGeneric(APP, p1.name);
                    p2.name=makeGeneric(APP, p2.name);
          }
     }
     return 0;
}
Linguistic similarity is thus a combination of the above strategies.
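The tokenization and permuted comparison used by the resolution step can be sketched as follows; this is an illustrative Python sketch (the function names are ours), and a full resolver would combine it with the dictionary, acronym and synonym strategies above:

```python
import re
from itertools import permutations

def tokenize(name):
    """Split underscore, dot and camelCase notation into lowercase tokens."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # break camelCase
    return [t.lower() for t in re.split(r"[_.\s]+", name) if t]

def same_token_set(name1, name2):
    """Names match if some permutation of one's tokens equals the other's
    (so id_user and user_id resolve to the same name)."""
    t1, t2 = tokenize(name1), tokenize(name2)
    return any(list(p) == t2 for p in permutations(t1))
```

With this, user_id, userID and user.id all tokenize to the same token list.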
4.3.2 CardinalitySim
Cardinality for attributes is fixed at minOccurs=0 and maxOccurs=1; this is implicit and cannot be changed by users. Cardinality for elements is arbitrary and can take any value; the default values are minOccurs=0 and maxOccurs=1. Cardinality is sometimes useful in identifying matching objects, and the similarity function is thus defined as a table lookup:
CardinalitySim(p1,p2) = table_lookup
Consider two objects p1 and p2 such that a1 ≤ |p1| ≤ b1 and a2 ≤ |p2| ≤ b2, where a1 ≤ b1 and a2 ≤ b2.
This gives five possibilities, which are assigned the following empirically determined similarity values.
Table 4.3-1: Cardinality Lookup

Possibility            Value
a1 = a2 AND b1 = b2    1
a1 ≤ a2 AND b1 < b2    0.9
a1 ≤ a2 AND b2 < b1    0.7
a2 < a1 AND b2 ≤ b1    0.9
a2 < a1 AND b1 ≤ b2    0.7
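The table lookup can be sketched directly over the occurrence ranges [a1, b1] and [a2, b2]; the exact row conditions are our reading of Table 4.3-1, whose subscripts are partly garbled in the source:

```python
def cardinality_sim(a1, b1, a2, b2):
    """Table lookup over occurrence ranges a1 <= |p1| <= b1, a2 <= |p2| <= b2."""
    if a1 == a2 and b1 == b2:
        return 1.0          # identical ranges
    if a1 <= a2 and b1 < b2:
        return 0.9
    if a1 <= a2 and b2 < b1:
        return 0.7
    if a2 < a1 and b2 <= b1:
        return 0.9
    if a2 < a1 and b1 <= b2:
        return 0.7
    return 0.0
```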
4.3.3 TypeSim
A simple organization of types can be represented as follows
Figure 4.3-3: Type Organisation
Comparing data types: Different approaches are necessary to compare complex, built-in and user-defined simple types. This is because complex types can be considered nested XML Schemas in themselves, and comparing them involves considering the type information of the child nodes as well.
Simple Types
Built-in types represent the primitive types of XML Schema and are organized as given in Appendix A [W3CXS2]; they are summarized in the table below. The Base column indicates the primitive type from which a particular type has been derived, and for the TypeSim measurement we treat the derived type as its base type. This is necessary since there can be infinitely many derived types.
The values in Table 4.3-3 are determined by experiment, starting from values that appeal to common sense, since the small effect attributed to them cannot be judged very accurately by statistical means.
Table 4.3-2: Datatype Hierarchy

Primitive:
 #   Datatype        Base
 1   string          -
 2   boolean         -
 3   base64Binary    -
 4   hexBinary       -
 5   float           -
 6   decimal         -
 7   double          -
 8   anyURI          -
 9   QName           -
10   NOTATION        -
11   gMonth          -
12   gDay            -
13   gMonthDay       -
14   gYear           -
15   gYearMonth      -
16   date            -
17   time            -
18   dateTime        -
19   duration        -

Derived:
     Datatype            Base (Top Level)
     normalizedString    string
     token               string
     language            string
     IDREFS              string
     ENTITIES            string
     NMTOKEN             string
     NMTOKENS            string
     Name                string
     NCName              string
     ID                  string
     IDREF               string
     ENTITY              string
     integer             decimal
     nonPositiveInteger  decimal
     negativeInteger     decimal
     long                decimal
     int                 decimal
     short               decimal
     byte                decimal
     nonNegativeInteger  decimal
     unsignedLong        decimal
     unsignedInt         decimal
     unsignedShort       decimal
     unsignedByte        decimal
     positiveInteger     decimal
Using this hierarchy, the following similarity values are assigned for primitive types.
NOTE: The table is symmetric; every type has similarity 1 with itself, and any pair not listed has the value zero. The numbers refer to the rows of Table 4.3-2.

Table 4.3-3: Primitive Type Sim Values

string (1) – boolean (2)            0.5
float (5) – decimal (6)             0.2
float (5) – double (7)              0.8
decimal (6) – double (7)            0.2
anyURI (8) – QName (9)              0.2
gMonth (11) – gMonthDay (13)        0.2
gDay (12) – gMonthDay (13)          0.2
gMonth (11) – gYearMonth (15)       0.2
gMonthDay (13) – gYearMonth (15)    0.2
gYear (14) – gYearMonth (15)        0.2
gMonthDay (13) – date (16)          0.2
gYearMonth (15) – date (16)         0.2
date (16) – dateTime (18)           0.8
time (17) – dateTime (18)           0.8
duration (19) – gDay, gMonthDay, gYear, gYearMonth, date (12–16)   0.4
duration (19) – time, dateTime (17, 18)                            0.2
To summarize, the built-in derived types are treated the same as the primitive type they use as base.
User-defined simple types are in turn derived from these built-in types and are likewise treated the same as their base types. The values used are empirically determined.
Example
TypeSim(float,double)=0.8
TypeSim(string,int)=0
Complex Types
These are groups of objects that have their own types. These types may themselves be complex; however, they must all ultimately resolve to simple types.
TypeSim(p1,p2)=ImmediateDescSim(p1,p2)
This is therefore a recursive definition.
Similarity between Simple and Complex type is set to zero.
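For simple types, the base-type resolution and pairwise lookup can be sketched as follows; the `BASE` and `PRIM_SIM` tables here are small excerpts of Tables 4.3-2 and 4.3-3, not the complete hierarchy:

```python
# Excerpt of Table 4.3-2: derived type -> primitive base.
BASE = {"int": "decimal", "long": "decimal", "token": "string", "ID": "string"}

# Excerpt of Table 4.3-3: symmetric pair values (unlisted pairs are 0).
PRIM_SIM = {frozenset(["float", "double"]): 0.8,
            frozenset(["float", "decimal"]): 0.2,
            frozenset(["decimal", "double"]): 0.2,
            frozenset(["string", "boolean"]): 0.5}

def type_sim(t1, t2):
    """Resolve derived types to their primitive base, then look up the pair."""
    t1, t2 = BASE.get(t1, t1), BASE.get(t2, t2)
    if t1 == t2:
        return 1.0
    return PRIM_SIM.get(frozenset([t1, t2]), 0.0)
```

This reproduces the examples above: TypeSim(float, double) = 0.8 and TypeSim(string, int) = 0, while int and long both resolve to decimal and so match fully.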
4.3.4 NamespaceSim
A partial similarity based on namespace cannot be derived, since a namespace is globally unique and in general only indicates that the Schemas under consideration come from the same source/author or the same application:
NamespaceSim(p1,p2) = 1 if namespace(p1) = namespace(p2), and 0 otherwise
It is advisable to keep the weight for namespace similarity low, since there can be limitless objects with the same namespace.
4.3.5 BasicSim
This is the weighted combination of the Linguistic, Cardinality, Type and Namespace similarities, and it is used in the computation of the other similarity measures.
BasicSim(p1,p2) = wt1*LinguisticSim(p1,p2) + wt2*CardinalitySim(|p1|,|p2|) + wt3*TypeSim(p1,p2) + wt4*NamespaceSim(p1,p2), where wt1 + wt2 + wt3 + wt4 = 1
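The weighted combination can be sketched as a function over precomputed component scores; the weight values shown are purely illustrative, not the empirically determined ones:

```python
def basic_sim(ling, card, typ, ns, wts=(0.4, 0.1, 0.3, 0.2)):
    """Weighted sum of the four component similarities; the weights sum to 1."""
    assert abs(sum(wts) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(wts, (ling, card, typ, ns)))
```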
4.3.6 PathContextCoefficient
This concept measures how similar the paths of two objects are. Consider <owner><name/></owner> and <owner><pet><name/></pet></owner>: in the absence of path context the two <name/> elements would be considered the same, leading to an incorrect similarity measure. Given objects p and q, the path from p to q is given by:
q.path(p) ={p, p1, p2, … , pm, q}
This path can be taken in ancestor context, traversing towards the root, or in descendant context, traversing towards the leaves. In computing the PCC (Path Context Coefficient) these path nodes must be considered. A similarity matrix (SimMat) is computed for each pair of objects in the paths by creating triplets {p, q, Sim(p, q)}. The Sim function here is a generic reference to the similarity measures described in Sections 4.3.6 through 4.3.11. A statistical measure of relative location is then applied to obtain a standard score for the similarity value [CRO79]:
z = (x − m)/s
where x is a sim value, m is the mean and s is the standard deviation. This is necessary since, if we were matching a document having few objects against one having a large number of objects, the match would not be accurate, as described by the skew measure in Section 4.1.
The SimMat thus describes an m:n (many-many) relation between the objects; a function LocalMatch is then used to reduce it to a 1:1 (one-one) relation. LocalMatch first eliminates triplets with a Sim value below the threshold and then transforms the SimMat into a 1:1 relation. LocalMatch serves as a generic function and is used in all the following similarity components to reduce SimMat.
LocalMatch(SimMat, threshold){
     while (SimMat ≠ ∅) {
          MatchList = MatchList ∪ {max(p1,p2,v) | v > threshold}
          SimMat = SimMat – {(p1,any,any)} – {(any,p2,any)}
     }
     return MatchList;
}
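The greedy reduction can be sketched in Python over a list of (p, q, v) triplets; the triplet representation and the tie-breaking by maximum value are our assumptions:

```python
def local_match(sim_mat, threshold):
    """Greedily reduce the m:n similarity triplets to a 1:1 match list."""
    triplets = [(p, q, v) for (p, q, v) in sim_mat if v > threshold]
    matches = []
    while triplets:
        p, q, v = max(triplets, key=lambda t: t[2])   # best remaining pair
        matches.append((p, q, v))
        # drop every other triplet mentioning p or q
        triplets = [t for t in triplets if t[0] != p and t[1] != q]
    return matches
```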
PCC is then computed as follows using the similarity matrix and the LocalMatch function:
PCC(p1,q1,p2,q2,threshold){
     for (each p1i ∈ q1.path(p1)){
          for (each p2j ∈ q2.path(p2)){
               SimMat=SimMat ∪ (p1i, p2j, BasicSim(p1i,p2j));
          }
     }
     MatchList=LocalMatch(SimMat, |q1.path(p1)|, |q2.path(p2)|, threshold);
     return sum(v : (p,q,v) ∈ MatchList) / max(|q1.path(p1)|, |q2.path(p2)|);
}
4.3.7 SemanticSim
The semantic similarity is a measure of how similar two objects are in name, constraints and context information.
It captures the overall semantics of comparing two objects and is one of the top level similarity measures for calculating the object similarity.
SemanticSim (p1, p2, threshold) = PCC (p1, p1.root, p2, p2.root, threshold) * BasicSim(p1,p2)
4.3.8 ImmediateDescendantSim
By considering the child nodes (immediate descendants) this measure captures the vicinity context similarity. This is another top level measure for computing object similarity.
ImmediateDescSim(p1,p2,threshold){
     for (each ci ∈ p1.child){
          for (each cj ∈ p2.child){
               SimMat=SimMat ∪ (ci, cj, BasicSim(ci,cj));
          }
     }
     MatchList=LocalMatch(SimMat, |p1.children|, |p2.children|, threshold);
     return sum(v : (p,q,v) ∈ MatchList) / max(|p1.children|, |p2.children|);
}
4.3.9 LeafContextSim
This measures the semantic and context similarity of the leaves of the subtrees rooted at p1 and p2. It is another top-level measure for computing object similarity.
LeafContextSim(p1, p2, threshold){
     for (each li ∈ p1.leaf){
          for (each lj ∈ p2.leaf){
               SimMat=SimMat ∪ (li, lj, LeafSim(li,p1,lj,p2,threshold));
          }
     }
     MatchList=LocalMatch(SimMat, |p1.leaves|, |p2.leaves|, threshold);
     return sum(v : (p,q,v) ∈ MatchList) / max(|p1.leaves|, |p2.leaves|);
}
where
LeafSim(li,p1,lj,p2, threshold) = PCC(li,p1,lj,p2, threshold) * BasicSim(li, lj)
which is the measure of semantic similarity.
4.3.10 ObjectSim
This is the weighted measure of similarity of Descendant Context, Semantic Context and Leaf Context Similarity.
Special Cases
(i) Recursion:
Comparing a recursive node with a non-recursive node yields a zero similarity value.
Comparing two recursive nodes is the same as comparing their referenced nodes, for example ObjectSim(sub-part, sub-section) ≡ ObjectSim(part, section).
(ii) Comparing Leaf and Internal objects: For two objects p1, p2 that are both leaf nodes there are no further leaves or descendants, so LeafContextSim(p1, p2) = 0 and ImmediateDescSim(p1, p2) = 0. If only one is a leaf node then its context is established from the root.
ObjectSim(p1, p2, threshold, wt1, wt2, wt3){
     //Recursion special case
     if (p1 && p2 recursive) return ObjectSim(p1.ref, p2.ref);
     else if (either p1 or p2 recursive) return 0;
     //Leaf and internal nodes
     if (p1 && p2 leaves) return SemanticSim(p1,p2,threshold);
     else if (either p1 or p2 a leaf) LeafContextSim = SemanticSim;
     return wt1 * SemanticSim(p1, p2, threshold) + wt2 *
            ImmediateDescSim(p1, p2, threshold) + wt3 *
            LeafContextSim(p1, p2, threshold);
}
4.3.11 SchemaSim
The overall similarity of two schemas is given as a normalized ObjectSim measure. For two Schemas s1, s2 the similarity is computed as follows:
SchemaSim(s1,s2){
     for (each pi ∈ s1){
          for (each pj ∈ s2){
               SimMat=SimMat ∪ (pi, pj, ObjectSim(pi,pj,threshold,wt1,wt2,wt3));
          }
     }
     MatchList=LocalMatch(SimMat, threshold);
     return sum(v : (p,q,v) ∈ MatchList) / max(|s1|, |s2|);
}
4.4 Dynamic Weights and Thresholds
The algorithms described above use a number of weights and thresholds. These values play an important role in the similarity computation and can be either statically assigned or dynamically generated for each set of Schemas under consideration. Dynamic weights and thresholds customize the algorithm for each set automatically, based on the algorithm devised below, which employs statistical measures such as the mean and median. However, this results in considerable overhead, since similarity computation cannot start until these values have been determined. Results show that dynamic values give a broader spectrum of values, though the relative ranking for static and dynamic values is almost the same.
Thresholds
To determine a threshold, a suitable measure of central tendency is computed over all the similarity values. We use the following measures:
  • Arithmetic Mean
  • Geometric Mean
  • Median
  • Mode: This is not usable, since with 64-bit precision values practically no two values are the same except zero, which is not an acceptable threshold.
By first calculating the skew as described in Section 4.1, the appropriate method can then be applied to calculate the threshold.
Skew                  Use               Reason
Low (< 0.5)           Arithmetic Mean   The distribution is almost symmetric
Medium (0.5 – 0.8)    Median            Being slightly skewed, the median picks a value close to and less than where the skew is
High (> 0.8)          Geometric Mean    The geometric mean is less affected by extreme values and thus serves as a better measure
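The skew-driven choice of central tendency can be sketched as follows; taking the geometric mean over only the non-zero values is our assumption, since similarity values of zero would annihilate the product:

```python
def dynamic_threshold(sim_values):
    """Pick a central-tendency measure according to the skew of the values."""
    n = len(sim_values)
    m = sum(sim_values) / n
    s = (sum((x - m) ** 2 for x in sim_values) / n) ** 0.5
    skew = 0.0 if s == 0 else sum(((x - m) / s) ** 3 for x in sim_values) / n
    if abs(skew) < 0.5:                       # low skew: arithmetic mean
        return m
    if abs(skew) <= 0.8:                      # medium skew: median
        vals = sorted(sim_values)
        mid = n // 2
        return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2
    # high skew: geometric mean (over non-zero values, an assumption)
    nz = [x for x in sim_values if x > 0]
    if not nz:
        return 0.0
    prod = 1.0
    for x in nz:
        prod *= x
    return prod ** (1 / len(nz))
```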
Weights
In determining the weights, i.e. the ranking of the similarity components, the size (number of objects) and the skew value of the Schemas are used. The weights of the Semantic, Immediate Descendant and Leaf Context components of DTD/XML Schema similarity are calculated as follows:
  • Semantic: Generally the most important measure, as it considers the object both as a whole and in context; its weight is a factor of the degree of skew.
  • Immediate Descendant: Useful for schemas with a relatively large number of internal nodes. Leaf nodes always outnumber internal nodes, but in relative terms a binary tree has more internal nodes than, say, a tree with five siblings for each node.
  • Leaf Context: Useful for schemas with a relatively large number of leaf nodes.
countLeaf //total number of leaves
countInternal //total number of internal nodes
leafPercent //percentage of leaf nodes
internalPercent // percentage of internal nodes
if (skew >= MEDIUM) WT_SEMANTIC=0.4 else WT_SEMANTIC=0.3
if (leafPercent – internalPercent > 55%) {
     WT_LEAF=
     WT_IMMEDDESC=1 – WT_LEAF
}
else{
     WT_IMMEDDESC=
     WT_LEAF=1 – WT_IMMEDDESC
}
If the summation of weights is not 1 it may be adjusted relatively to make the sum 1.
Experimental Values in Determining the 55% Cutoff Value

Range (%)   Count
   10         336
   20         487
   30         612
   40         710
   50        1270
   60        1372
   70        1899
   80        1277
   90          38
  100           0

MEAN 50.93906   MAX 84   MEDIAN 55.55556

The cutoff value of 55% is empirically determined as seen above; it is the median value calculated using the data in the test set described in Section 5.3.
Dynamic vs. Static
As seen above, both follow the same trend; however, using dynamic measures we get a broader spread, as shown below (actual counts):

Range   Dynamic   Static
0.1      4451      5878
0.2       988      1079
0.3       620       558
0.4       466       248
0.5       406       129
0.6       362        64
0.7       356        32
0.8       225         7
0.9       117         4
1.0        10         2

This is a significant advantage when comparing a smaller range of values. The weights for Basic Similarity (Linguistic, Cardinality, Type and Namespace Similarity) cannot be determined dynamically in practice, because these weights are needed for every pair of objects compared, and even for relatively small schemas this number is very large. Moreover, empirically determined static weights give acceptable results for most schema sets.
4.5 Clustering
Integrating large numbers of XML Schemas is difficult; however, we believe that grouping similar XML Schemas together helps the integration process, since designers performing integration can then focus within these clusters. The clusters can be generated by hierarchical agglomerative clustering methods [EVE74] using the similarity value calculated by XClustXS, and a dendrogram can be drawn as a visual representation of the clusters.
Agglomerative methods proceed by a series of successive fusions of the entities into groups; at each step the entities or groups with the highest similarity are fused together. Using the complete linkage method, we give the following example with a dendrogram to show how the clustering technique works.
Consider the Schema similarity graph below; it shows Schemas A, B, C and D and their similarity values.
Figure 4.5-1 Example Schema Graph
Listing the pairs and similarities as a matrix and clustering:
Table 4.5-1 Agglomerative Clustering Example

      A     B     C     D
A    0.0   0.9   0.8   0.5
B    0.9   0.0   0.7   0.6
C    0.8   0.7   0.0   0.4
D    0.5   0.6   0.4   0.0
The maximum similarity is between A and B, so fusing them:
(AB)C=max{AC,BC}=AC=0.8
(AB)D=max{AD,BD}=BD=0.6

       AB    C     D
AB    0.0   0.8   0.6
C     0.8   0.0   0.4
D     0.6   0.4   0.0
The maximum similarity is between (AB) and C, so fusing them:
((AB)C)D=max{(AB)D,CD}=(AB)D=0.6

         (AB)C   D
(AB)C     0.0   0.6
D         0.6   0.0

This is the final clustering result.
This is the final clustering result.
Representing the above as a dendrogram shows the clusters {A, B}, {{A, B}, C} and {{{A, B}, C}, D}.
Figure 4.5-2 Dendrogram
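The fusion loop of the worked example can be sketched as follows; cluster labels are concatenated strings, and the between-cluster score uses the max rule applied in the tables above:

```python
def agglomerate(sims):
    """Repeatedly fuse the pair of clusters with the highest similarity.
    sims maps (cluster1, cluster2) pairs to similarity values."""
    merges = []
    while sims:
        (c1, c2), v = max(sims.items(), key=lambda kv: kv[1])
        fused = c1 + c2
        merges.append((c1, c2, v))
        new = {}
        for (a, b), w in sims.items():
            if c1 in (a, b) or c2 in (a, b):
                other = b if a in (c1, c2) else a
                if other in (c1, c2):
                    continue                    # the fused pair itself
                key = (fused, other)
                new[key] = max(new.get(key, 0.0), w)   # max-similarity rule
            else:
                new[(a, b)] = w
        sims = new
    return merges
```

Note that, strictly speaking, fusing on the maximum between-cluster similarity is single-linkage behaviour; complete linkage would take the minimum similarity, which for this example yields the same merge order with fusion levels 0.9, 0.7 and 0.4.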
