On the Impact of sameAs on Schema Matching

JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH
Knowledge Representation & Reasoning Group
KCAP 2019 - November 20, 2019 - Marina del Rey

More and more Linked Open Data...
(crawled from ~650K datasets in 2015)

...More and more (overlapping) Schemas

Schema Matching is Inevitable
• It is neither possible nor desirable to have a single schema covering all domains
• To exploit this wealth of available knowledge and enhance knowledge-based
systems (e.g., search engines, virtual assistants), we need to match these
overlapping schemas
• Schema Matching: finding relationships between entities of different schemas
• equivalence relations
• subsumption
• disjointness
• ....

Schema Matching over the years
• Active area of research in several communities, including the Semantic Web (SW)
• Ontology Alignment Evaluation Initiative (OAEI) ongoing for 15 years
• [Euzenat and Shvaiko, 2013] reviews ~100 schema-matching systems:
• 50% rely mostly on schema-level information (i.e. schema-based approaches)
• 25% rely mostly on instance-level information (i.e. instance-based approaches)
• 25% rely on schema + instance-level information (i.e. mixed approaches)

Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
1. The semantics of a concept is better determined by its members than by its annotations
Concepts refer to sets that possibly have named instances as members
• ext(C) refers to the set of instances which are explicitly stated as members of C
ext(foaf:Person) = {ex:i1, ex:i2}
• ext⊑(C) refers to the set of instances which are explicitly or implicitly
stated as members of C
(i.e. either explicit members or members derived through concept subsumption)
ext⊑(foaf:Person) = {ex:i1, ex:i2, ex:i3}
[Figure: ex:i1 and ex:i2 are rdf:type foaf:Person; ex:i3 is rdf:type dbo:Scientist; dbo:Scientist is rdfs:subClassOf foaf:Person]
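The two extensions defined above can be sketched in a few lines of Python. This is only an illustration on the toy graph from this slide, not the authors' implementation: the names `TYPE_TRIPLES`, `SUBCLASS_TRIPLES`, `ext`, `subconcepts` and `ext_sub` are hypothetical, and the triples are hard-coded instead of being read from an RDF store.

```python
from collections import defaultdict

# Toy data mirroring the slide (assumed, hard-coded for illustration).
TYPE_TRIPLES = [  # (instance, concept) pairs for rdf:type
    ("ex:i1", "foaf:Person"),
    ("ex:i2", "foaf:Person"),
    ("ex:i3", "dbo:Scientist"),
]
SUBCLASS_TRIPLES = [("dbo:Scientist", "foaf:Person")]  # (sub, super)


def ext(concept, type_triples):
    """Instances explicitly typed with `concept`."""
    return {inst for inst, c in type_triples if c == concept}


def subconcepts(concept, subclass_triples):
    """`concept` plus every concept below it via rdfs:subClassOf."""
    children = defaultdict(set)
    for sub, sup in subclass_triples:
        children[sup].add(sub)
    seen, stack = {concept}, [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ext_sub(concept, type_triples, subclass_triples):
    """Explicit members plus members inherited from subconcepts."""
    members = set()
    for c in subconcepts(concept, subclass_triples):
        members |= ext(c, type_triples)
    return members


print(sorted(ext("foaf:Person", TYPE_TRIPLES)))
print(sorted(ext_sub("foaf:Person", TYPE_TRIPLES, SUBCLASS_TRIPLES)))
```

The first call yields only ex:i1 and ex:i2, while the second also picks up ex:i3 through dbo:Scientist, matching ext and ext⊑ on the slide.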

Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
2. The more significant the overlap between two concepts’ members is, the more related
these concepts are
• Multiple techniques to measure the overlap between concepts’ members
• Formal concept analysis techniques
• Machine learning
• Jaccard index
• ....

Instance-based Schema Matching using Jaccard Index
• The Jaccard index is a commonly used measure to score the similarity between two sets
• The higher the similarity of two sets is, the greater the Jaccard index
J(A, B) = |A ∩ B| / |A ∪ B|
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person]
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
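As a quick check of the arithmetic on this slide, here is a minimal Python sketch of the Jaccard index applied to the two example extensions. The function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are scored 0
    return len(a & b) / len(union)


foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}

# Intersection {ex:i3, ex:i4} has 2 members, union has 5.
print(jaccard(foaf_person, schema_person))  # 2 / 5 = 0.4
```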

With more than 558 million explicitly asserted owl:sameAs statements [Beek et al., ESWC 2018]
(or 35 billion after transitive closure), the reality in the Web of Data looks more like this:
Scenario 1, where J increases
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person; ex:i2 owl:sameAs ex:i5]
J(A, B) = |A ∩ B| / |A ∪ B|
After the owl:sameAs closure (owl:sameAs*), eq{2,5} is the equivalence class containing ex:i2 and ex:i5:
ext~(foaf:Person) = {ex:i1, eq{2,5}, ex:i3, ex:i4}
ext~(schema:Person) = {ex:i3, ex:i4, eq{2,5}}
J(ext~(foaf:Person), ext~(schema:Person)) = 3/4 = 0.75 (instead of 2/5 = 0.4)

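Or possibly like this:
Scenario 2, where J decreases
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person; ex:i3 owl:sameAs ex:i4]
J(A, B) = |A ∩ B| / |A ∪ B|
After the owl:sameAs closure (owl:sameAs*), eq{3,4} is the equivalence class containing ex:i3 and ex:i4:
ext~(foaf:Person) = {ex:i1, ex:i2, eq{3,4}}
ext~(schema:Person) = {eq{3,4}, ex:i5}
J(ext~(foaf:Person), ext~(schema:Person)) = 1/4 = 0.25 (instead of 2/5 = 0.4)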
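The two scenarios can be reproduced with a small Python sketch that collapses each owl:sameAs equivalence class to one representative before computing the Jaccard index. It uses a union-find structure as one possible way to obtain the closure; this is an assumption for illustration and not necessarily how the authors computed it. The names `make_find`, `jaccard` and `jaccard_with_same_as` are hypothetical.

```python
def make_find(same_as_pairs):
    """Return a function mapping an IRI to its equivalence-class representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller IRI becomes the representative (arbitrary but deterministic).
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return find


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_with_same_as(ext_a, ext_b, same_as_pairs):
    """Jaccard index after replacing each instance by its sameAs representative."""
    find = make_find(same_as_pairs)
    return jaccard({find(i) for i in ext_a}, {find(i) for i in ext_b})


foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema_person = {"ex:i3", "ex:i4", "ex:i5"}

# Scenario 1: ex:i2 owl:sameAs ex:i5, the overlap grows to 3 / 4.
print(jaccard_with_same_as(foaf_person, schema_person, [("ex:i2", "ex:i5")]))
# Scenario 2: ex:i3 owl:sameAs ex:i4, the overlap shrinks to 1 / 4.
print(jaccard_with_same_as(foaf_person, schema_person, [("ex:i3", "ex:i4")]))
```

Because the representatives are computed transitively, chains such as a sameAs b, b sameAs c end up in a single class, which is what the owl:sameAs* notation on the slides denotes.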

Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?
1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?

Why should we care?
• Providing empirical evidence for schema-matching designers on whether exploiting a large
external collection of instance-level interlinks (e.g. from the LOD Cloud) is beneficial for
improving the accuracy of schema-matching techniques
• In particular, showing the risk/benefit of using owl:sameAs, after a number of studies suggesting
that a large* number of the existing owl:sameAs links in the Web are actually erroneous
* 3% of evaluated owl:sameAs are erroneous [Hogan et al., JWS 2012]
* 4% of evaluated owl:sameAs are erroneous [Raad et al., ISWC 2018]
* 20% of evaluated owl:sameAs are erroneous [Halpin et al., ISWC 2010]

Research Question 2
How does the quality of the instance-level interlinks impact the quality of the resulting
schema alignments?
2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of equivalent concepts?
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?

Dataset
(crawled from ~650K datasets in 2015)

Dataset
Statistic                                                Count
# triples                                                28,362,198,927
# rdf:type statements                                    3,321,354,308
# rdfs:subClassOf statements                             4,461,717
# owl:equivalentClass statements                         1,051,979
# explicit owl:sameAs statements                         558,943,116
# implicit owl:sameAs statements                         35,201,120,188
# equivalence classes (after closure of owl:sameAs)      48,999,148
# concepts with explicit members |C|                     833,232
# concepts with explicit or implicit members |C⊑|        976,674

Size distribution of the Concepts’ members
• 23% of the concepts have one explicit member
• 92% of the concepts have ≤ 100 explicit members
• 618 concepts have more than 100M explicit or implicit members
• 5 concepts have more than 100M explicit members

Concepts with more than 100M explicit members
Concept                                                                Cardinality      %
http://purl.org/linked-data/cube#Observation                           1,306,389,396    39.3
http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry             304,878,654      9.2
http://geovocab.org/geometry#Geometry                                  167,808,111      5
http://knoesis.wright.edu/ssw/ont/sensor-observation.owl#MeasureData   144,044,989      4.3
http://xmlns.com/foaf/0.1/Person                                       132,919,327      4
Total                                                                  2,056,040,477    61.9

Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?


1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
• 1,051,979 owl:equivalentClass statements in the LOD-a-lot
• Hypothesis: all these statements are correct alignments
• Only 972 owl:equivalentClass statements where both concepts have explicit members
• 208 reflexive alignments (C1, owl:equivalentClass, C1)
• 22 duplicate symmetric alignments (C1, owl:equivalentClass, C2) and (C2, owl:equivalentClass, C1)
• 742 remaining alignments between 1,357 distinct concepts (i.e. the gold standard)
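The filtering steps listed above (keep pairs where both concepts have explicit members, drop reflexive statements, merge symmetric duplicates) can be sketched as follows. This is a hedged illustration on made-up concept names, not the authors' pipeline; `build_gold_standard` and the `has_members` callback are hypothetical names.

```python
def build_gold_standard(equivalences, has_members):
    """Filter (c1, c2) owl:equivalentClass pairs into a set of unordered alignments."""
    gold = set()
    for c1, c2 in equivalences:
        if not (has_members(c1) and has_members(c2)):
            continue  # at least one concept has no explicit members
        if c1 == c2:
            continue  # reflexive statement
        gold.add(tuple(sorted((c1, c2))))  # (C1, C2) and (C2, C1) collapse to one entry
    return gold


# Toy input (assumed): concept "D" has no explicit members.
equivalences = [("A", "A"), ("A", "B"), ("B", "A"), ("C", "D"), ("E", "F")]
populated = {"A", "B", "C", "E", "F"}

gold = build_gold_standard(equivalences, lambda c: c in populated)
print(sorted(gold))
```

On the real data, the same three steps account for the numbers on the slide: 972 - 208 - 22 = 742 alignments.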

Size distribution of the Concepts’ members of our Gold Standard

Jaccard Index distribution for the 742 alignments
• Compare J(ext⊑(C1), ext⊑(C2)) with J(ext⊑~(C1), ext⊑~(C2)) such that (C1, owl:equivalentClass, C2)
• Runtime: <4 hours on 64GB SSD disk

Jaccard Index variation for the 742 alignments
• When owl:sameAs is considered, the Jaccard index:
• increases for 381 / 742 of the correct alignments (52%)
• out of these 381 cases, Jaccard increases from 0 to 1 in 44 cases (7%)
• decreases for 25 / 742 of the correct alignments (3%)
• Slight drop in impact when only explicit members are considered
• owl:sameAs increases the overlap of two equivalent concepts in half of the cases
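The counts reported on this slide come from comparing, for each alignment, the Jaccard index without and with owl:sameAs. A minimal sketch of that tally is given below; the function name `variation_counts` and the score pairs are made up for illustration and are not results from the study.

```python
from collections import Counter


def variation_counts(score_pairs):
    """Tally how the Jaccard index changes for (before, after) score pairs."""
    counts = Counter()
    for before, after in score_pairs:
        if after > before:
            counts["increase"] += 1
            if before == 0.0 and after == 1.0:
                counts["zero_to_one"] += 1  # subset of the increases
        elif after < before:
            counts["decrease"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Hypothetical (before, after) scores for four alignments.
scores = [(0.0, 1.0), (0.4, 0.75), (0.4, 0.25), (0.5, 0.5)]
print(variation_counts(scores))
```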

Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?


1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
• 833,232 concepts with explicit members in the LOD-a-lot
• Created a random alignment for each concept such that each concept is paired only once
• Hypothesis: all these random alignments are erroneous
• 416,616 random alignments
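One simple way to pair every concept exactly once is to shuffle the concept list and zip its two halves. The sketch below illustrates this under that assumption; the slides do not say how the random pairing was actually drawn, and `random_alignments` and the `seed` parameter are hypothetical.

```python
import random


def random_alignments(concepts, seed=0):
    """Pair concepts at random so that each concept appears in at most one pair."""
    shuffled = list(concepts)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    half = len(shuffled) // 2
    # With an odd number of concepts, the last one stays unpaired.
    return list(zip(shuffled[:half], shuffled[half:2 * half]))


concepts = [f"ex:C{i}" for i in range(10)]  # toy concept IRIs
pairs = random_alignments(concepts)
print(len(pairs))
```

With the 833,232 concepts of the LOD-a-lot this scheme yields 833,232 / 2 = 416,616 pairs, the number on the slide. A random pair could in principle be a true equivalence, which is why the slide states the erroneousness of all pairs as a hypothesis.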

Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs is considered, the Jaccard index:
• increases for only 94 / 416,616 of the random alignments (0.02%)
• decreases for 3 / 416,616 of the random alignments (~0%)
• owl:sameAs rarely increases the overlap of two non-equivalent concepts

Research Question 2
How does the quality of the instance-level interlinks affect the quality of the resulting
schema alignments?


2. a. Does the inclusion of a higher quality subset of owl:sameAs
increase the Jaccard Index of equivalent concepts?
owl:sameAs quality indicator:
• [Raad et al., ISWC 2018] computed an error degree for each owl:sameAs link
• Error degree in [0.0, 1.0], based on the community structure of the owl:sameAs network
and the symmetry of the links
• Manual evaluation shows that owl:sameAs links with error degree >0.99 have a high
probability of being erroneous
• Manual evaluation shows that owl:sameAs links with error degree ≤0.4 have a high
probability of being correct
• Transitive closure (a): after discarding the ~1M owl:sameAs links with error degree >0.99
• Transitive closure (b): after discarding the ~150M owl:sameAs links with error degree >0.4
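The two closures differ only in the error-degree threshold applied before the owl:sameAs closure is computed. A minimal sketch of that filtering step, assuming each link is available as a (subject, object, error degree) tuple, is shown below; the function name `keep_links` and the error degrees in the toy data are invented for illustration.

```python
def keep_links(links, max_error):
    """Keep the owl:sameAs links whose error degree is at most `max_error`."""
    return [(a, b) for a, b, err in links if err <= max_error]


# Hypothetical links with made-up error degrees.
links = [
    ("ex:a", "ex:b", 0.10),
    ("ex:b", "ex:c", 0.60),
    ("ex:c", "ex:d", 1.00),
]

closure_a_input = keep_links(links, 0.99)  # discards links with error degree > 0.99
closure_b_input = keep_links(links, 0.4)   # discards links with error degree > 0.4
print(closure_a_input)
print(closure_b_input)
```

The filtered pairs would then be fed to whatever procedure computes the transitive closure, so a stricter threshold produces fewer and smaller equivalence classes.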

Jaccard Index variation for the 742 alignments
• When owl:sameAs links with error degree >0.99 are discarded, there is
a slight decrease in performance:
• Jaccard increases in 51% of the cases (instead of 52%)
• Jaccard increases from 0 to 1 in 37 cases (instead of 44)
• When owl:sameAs links with error degree >0.4 are discarded, there is
a large decrease in performance:
• Jaccard increases in only 13% of the cases (instead of 52%)
• Jaccard increases from 0 to 1 in two cases (instead of 44)
• Jaccard decreases in 39 cases (instead of 25)

Research Question 2
How does the quality of the instance-level interlinks affect the quality of the resulting
schema alignments?

Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs links with error degree >0.99 are discarded,
there is an increase in performance:
• Jaccard increases in 27 cases (instead of 94)
• When owl:sameAs links with error degree >0.4 are discarded,
there is a substantial increase in performance:
• Jaccard increases in 2 cases (instead of 94)

Take away message
This work provides an empirical study of the impact of including instance-level interlinks
on the overlap between concepts' members
• Including instance-level interlinks can enhance the performance of instance-based schema alignments
• Increases the overlap for 52% of the existing alignments in the LOD-a-lot
• Increases the overlap for less than 0.3% of randomly created alignments
• Inference does positively impact instance-based schema alignments
• Considering also the implicit members enhances the results on the Gold Standard by 3 pp
• Discarding only isolated owl:sameAs links in the network can increase the quality of instance-based schema
alignments (owl:sameAs links are probably not as bad as we first thought)
• Reduces the cases where the Jaccard index increases for non-equivalent concepts by 71%

Future Work
• Use a larger gold standard to validate the results presented here
• Investigate alternative, better-tailored instance-based measures to detect new schema
alignments at the scale of the Web of Data, exploiting:
• the higher quality subset of the owl:sameAs links;
• both the explicit and implicit members of the concepts

Thank you for your attention...
@joeraad_
Code + Results
https://github.com/raadjoe/impact-sameAs-schema-matching
