On the Impact of sameAs
on Schema Matching
JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH
Knowledge Representation & Reasoning Group
KCAP 2019 - November 20, 2019 - Marina del Rey
More and more Linked Open Data...
(crawled from ~650K datasets in 2015)
...More and more (overlapping) Schemas
Schema Matching is Inevitable
• It is neither possible nor desirable to have a single schema covering all domains
• In order to exploit this wealth of available knowledge and enhance knowledge-based
systems (e.g., search engines, virtual assistants, etc.), we need to match these
overlapping schemas
• Schema Matching: finding relationships between entities of different schemas
• equivalence relations
• subsumption
• disjointness
• ....
Schema Matching over the years
• Active area of research in several communities, including the Semantic Web (SW)
• Ontology Alignment Evaluation Initiative (OAEI) ongoing for 15 years
• [Euzenat and Shvaiko, 2013] reviews ~100 schema-matching systems
• 50% rely mostly on schema-level information (i.e. schema-based approaches)
• 25% rely mostly on instance-level information (i.e. instance-based approaches)
• 25% rely on schema + instance-level information (i.e. mixed approaches)
Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
1. The semantics of a concept is better determined by its members than by its annotations
Concepts refer to sets that possibly have named instances as members
• ext(C) refers to the set of instances that are explicitly stated as members of C
ext(foaf:Person) = {ex:i1, ex:i2}
• ext⊑(C) refers to the set of instances that are explicitly or implicitly
stated as members of C
(i.e. either explicit members or members derived through concept subsumption)
ext⊑(foaf:Person) = {ex:i1, ex:i2, ex:i3}
[Figure: ex:i1 and ex:i2 are rdf:type foaf:Person; ex:i3 is rdf:type dbo:Scientist, which is rdfs:subClassOf foaf:Person]
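The two extensions defined above can be sketched in a few lines. This is a minimal illustration, assuming the data is available as in-memory lists of (instance, concept) and (sub-concept, super-concept) pairs; the function names and the pair representation are hypothetical and are not taken from the authors' code.

```python
def explicit_extension(type_pairs, concept):
    """ext(C): instances explicitly typed with `concept`."""
    return {inst for inst, cls in type_pairs if cls == concept}


def subsumed_concepts(subclass_pairs, concept):
    """`concept` plus every concept below it via rdfs:subClassOf."""
    children = {}
    for sub, sup in subclass_pairs:
        children.setdefault(sup, set()).add(sub)
    seen, stack = {concept}, [concept]
    while stack:
        for sub in children.get(stack.pop(), ()):
            if sub not in seen:
                seen.add(sub)
                stack.append(sub)
    return seen


def implicit_extension(type_pairs, subclass_pairs, concept):
    """ext⊑(C): explicit members plus members of subsumed concepts."""
    concepts = subsumed_concepts(subclass_pairs, concept)
    return {inst for inst, cls in type_pairs if cls in concepts}


# The example from the slide.
type_pairs = [
    ("ex:i1", "foaf:Person"),
    ("ex:i2", "foaf:Person"),
    ("ex:i3", "dbo:Scientist"),
]
subclass_pairs = [("dbo:Scientist", "foaf:Person")]

ext = explicit_extension(type_pairs, "foaf:Person")
ext_sub = implicit_extension(type_pairs, subclass_pairs, "foaf:Person")
print(sorted(ext))      # ['ex:i1', 'ex:i2']
print(sorted(ext_sub))  # ['ex:i1', 'ex:i2', 'ex:i3']
```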
Instance-based Schema Matching
• All instance-based schema-matching approaches share two essential ideas:
2. The more significant the overlap between two concepts’ members is, the more related
these concepts are
• Multiple techniques to measure the overlap between concepts’ members
• Formal concept analysis techniques
• Machine learning
• Jaccard index
• ....
Instance-based Schema Matching using Jaccard Index
• The Jaccard index is a commonly used measure to score the similarity between two sets
• The higher the similarity of two sets is, the greater the Jaccard index
J(A, B) = |A ∩ B| / |A ∪ B|
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person]
ext(foaf:Person) = {ex:i1, ex:i2 , ex:i3 , ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
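The computation on this slide can be reproduced with Python sets. This is a small sketch only; returning 0.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets score 0
    return len(a & b) / len(union)


ext_foaf_person = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
ext_schema_person = {"ex:i3", "ex:i4", "ex:i5"}

# Two shared instances (ex:i3, ex:i4) out of five distinct ones.
score = jaccard(ext_foaf_person, ext_schema_person)
print(score)  # 0.4
```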
With more than 558 million explicitly asserted owl:sameAs [Beek et al., ESWC 2018]
(or 35 billion after transitive closure), the reality in the Web of Data looks more like this:
J(A, B) = |A ∩ B| / |A ∪ B|
Without owl:sameAs:
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
With owl:sameAs* (ex:i2 and ex:i5 form the equivalence class eq{2,5}):
ext~(foaf:Person) = {ex:i1, eq{2,5}, ex:i3, ex:i4}
ext~(schema:Person) = {ex:i3, ex:i4, eq{2,5}}
J(ext~(foaf:Person), ext~(schema:Person)) = 3/4 = 0.75
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person; owl:sameAs* links ex:i2 and ex:i5]
Scenario 1 where J increases
Or possibly like this:
J(A, B) = |A ∩ B| / |A ∪ B|
Without owl:sameAs:
ext(foaf:Person) = {ex:i1, ex:i2, ex:i3, ex:i4}
ext(schema:Person) = {ex:i3, ex:i4, ex:i5}
J(ext(foaf:Person), ext(schema:Person)) = 2/5 = 0.4
With owl:sameAs* (ex:i3 and ex:i4 form the equivalence class eq{3,4}):
ext~(foaf:Person) = {ex:i1, ex:i2, eq{3,4}}
ext~(schema:Person) = {eq{3,4}, ex:i5}
J(ext~(foaf:Person), ext~(schema:Person)) = 1/4 = 0.25
[Figure: ex:i1, ex:i2, ex:i3 and ex:i4 are rdf:type foaf:Person; ex:i3, ex:i4 and ex:i5 are rdf:type schema:Person; owl:sameAs* links ex:i3 and ex:i4]
Scenario 2 where J decreases
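Both scenarios can be reproduced by collapsing each owl:sameAs equivalence class to one representative before computing the Jaccard index. The sketch below uses a small union-find structure for the closure; this is one common way to compute it and is an assumption here, not necessarily how the authors computed the closure at scale. All function names are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def same_as_closure(links):
    """Merge the endpoints of every owl:sameAs link (reflexive, symmetric, transitive)."""
    parent = {}
    for a, b in links:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return parent


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_with_same_as(ext_a, ext_b, links):
    """Jaccard index over ext~, i.e. after replacing instances by class representatives."""
    parent = same_as_closure(links)
    canon_a = {find(parent, i) for i in ext_a}
    canon_b = {find(parent, i) for i in ext_b}
    return jaccard(canon_a, canon_b)


foaf = {"ex:i1", "ex:i2", "ex:i3", "ex:i4"}
schema = {"ex:i3", "ex:i4", "ex:i5"}

baseline = jaccard_with_same_as(foaf, schema, [])
scenario_1 = jaccard_with_same_as(foaf, schema, [("ex:i2", "ex:i5")])
scenario_2 = jaccard_with_same_as(foaf, schema, [("ex:i3", "ex:i4")])
print(baseline, scenario_1, scenario_2)  # 0.4 0.75 0.25
```

In scenario 1 the link merges a non-shared member of each concept, so the intersection grows; in scenario 2 it merges two already shared members, so the intersection shrinks faster than the union.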
Research Questions
Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?
1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
Why should we care?
• Providing empirical evidence for schema-matching designers on whether exploiting a large
external collection of instance-level interlinks (e.g. from the LOD Cloud) is beneficial for
improving the accuracy of schema-matching techniques
• In particular, showing the risks and benefits of using owl:sameAs, given that several studies
suggest that a large* number of the existing owl:sameAs links on the Web are actually erroneous
* 3% of evaluated owl:sameAs are erroneous [Hogan et al., JWS 2012]
* 4% of evaluated owl:sameAs are erroneous [Raad et al., ISWC 2018]
* 20% of evaluated owl:sameAs are erroneous [Halpin et al., ISWC 2010]
Research Question 2
How does the quality of the instance-level interlinks impact the quality of the resulting
schema alignments?
2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of equivalent concepts?
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
Dataset Description
(crawled from ~650K datasets in 2015)
# triples 28,362,198,927
# rdf:type statements 3,321,354,308
# rdfs:subClassOf statements 4,461,717
# owl:equivalentClass statements 1,051,979
# explicit owl:sameAs statements 558,943,116
# implicit owl:sameAs statements 35,201,120,188
# equivalence classes (after closure of owl:sameAs) 48,999,148
# concepts with explicit members |C| 833,232
# concepts with explicit or implicit members |C⊑| 976,674
Size distribution of the Concepts’ members
• 23% of the concepts have one explicit member
• 92% of the concepts have ≤ 100 explicit members
• 618 concepts have more than 100M explicit or implicit members
• 5 concepts have more than 100M explicit members
Concepts with more than 100M explicit members
Concept | Cardinality | % of rdf:type statements
http://purl.org/linked-data/cube#Observation | 1,306,389,396 | 39.3
http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry | 304,878,654 | 9.2
http://geovocab.org/geometry#Geometry | 167,808,111 | 5
http://knoesis.wright.edu/ssw/ont/sensorobservation.owl#MeasureData | 144,044,989 | 4.3
http://xmlns.com/foaf/0.1/Person | 132,919,327 | 4
Total | 2,056,040,477 | 61.9
Experiments
Research Question 1
Does the inclusion of instance-level interlinks positively impact instance-based schema
alignments?
1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
1. a. Does the inclusion of owl:sameAs increase the Jaccard
Index of equivalent concepts?
• 1,051,979 owl:equivalentClass statements in the LOD-a-lot
- Hypothesis: all these statements are correct alignments
- Only 972 owl:equivalentClass statements where both concepts have explicit members
- 208 reflexive alignments (C1, owl:equivalentClass, C1)
- 22 duplicate symmetric alignments (C1, owl:equivalentClass, C2) and (C2, owl:equivalentClass, C1)
- 742 alignments between 1,357 distinct concepts (i.e. gold standard)
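The filtering that takes the 972 statements down to the 742 gold-standard alignments (972 - 208 - 22 = 742) can be sketched as follows. The function name and the toy concept names are hypothetical; note that keying on an unordered pair also drops exact repeats, which the slide does not mention.

```python
def clean_alignments(equivalent_class_pairs):
    """Drop reflexive pairs and keep one pair per unordered {C1, C2}."""
    seen = set()
    cleaned = []
    for c1, c2 in equivalent_class_pairs:
        if c1 == c2:
            continue  # reflexive alignment (C1, owl:equivalentClass, C1)
        key = frozenset((c1, c2))
        if key in seen:
            continue  # symmetric duplicate (or exact repeat)
        seen.add(key)
        cleaned.append((c1, c2))
    return cleaned


# Toy input, not real LOD-a-lot data.
pairs = [
    ("ex:A", "ex:A"),  # reflexive
    ("ex:A", "ex:B"),
    ("ex:B", "ex:A"),  # symmetric duplicate of the previous pair
    ("ex:C", "ex:D"),
]
gold = clean_alignments(pairs)
print(gold)  # [('ex:A', 'ex:B'), ('ex:C', 'ex:D')]
```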
Size distribution of the Concepts’ members of our Gold Standard
Jaccard Index distribution for the 742 alignments
Compare J(ext⊑(C1), ext⊑(C2)) with J(ext⊑~(C1), ext⊑~(C2))
such that (C1, owl:equivalentClass, C2)
Runtime: <4 hours on 64GB SSD disk
Jaccard Index variation for the 742 alignments
• When owl:sameAs is considered, the Jaccard index increases for 381/742 of the correct alignments (52%)
• Out of these 381 cases, Jaccard increases from 0 to 1 in 44 cases (7%)
• When owl:sameAs is considered, the Jaccard index decreases for 25/742 of the correct alignments (3%)
• Slight drop in impact when only explicit members are considered
owl:sameAs increases the overlap of two equivalent concepts in half of the cases
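Counts such as "increases", "decreases" and "from 0 to 1" come from comparing, per alignment, the Jaccard index before and after the owl:sameAs closure. A minimal sketch of that bookkeeping, using made-up scores rather than the study's data, with hypothetical function names:

```python
from collections import Counter


def classify_variation(before, after):
    """Label how the Jaccard index of one alignment changed."""
    if after > before:
        return "increase"
    if after < before:
        return "decrease"
    return "unchanged"


def tally_variations(score_pairs):
    """Count the labels over (before, after) Jaccard pairs."""
    return Counter(classify_variation(b, a) for b, a in score_pairs)


# Illustrative (before, after) scores only.
score_pairs = [(0.4, 0.75), (0.4, 0.25), (0.0, 1.0), (0.5, 0.5)]
counts = tally_variations(score_pairs)
zero_to_one = sum(1 for b, a in score_pairs if b == 0.0 and a == 1.0)

print(counts["increase"], counts["decrease"], counts["unchanged"])  # 2 1 1
print(zero_to_one)  # 1
```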
1. b. Does the inclusion of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
• 833,232 concepts with explicit members in the LOD-a-lot
- Created a random alignment for each concept such that each concept is paired only once
- Hypothesis: all these random alignments are erroneous
- 416,616 random alignments
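One simple way to pair every concept exactly once is to shuffle the concept list and pair neighbours, which yields 833,232 / 2 = 416,616 alignments. This is a sketch under that assumption; the slides do not say how the random pairing was actually generated, and the seed and function name are illustrative.

```python
import random


def random_alignments(concepts, seed=0):
    """Shuffle the concepts and pair consecutive ones, so each appears at most once."""
    shuffled = list(concepts)
    random.Random(seed).shuffle(shuffled)
    # With an odd number of concepts, the last one stays unpaired.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


concepts = [f"ex:C{i}" for i in range(6)]  # toy concept names
pairs = random_alignments(concepts)
print(len(pairs))  # 3
```

A random pairing can, by chance, put two genuinely equivalent concepts together, which is why the slide states the erroneousness of these alignments as a hypothesis.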
Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs is considered, the Jaccard index increases
for only 94 / 416,616 of the random alignments (0.02%)
• When owl:sameAs is considered, the Jaccard index
decreases for 3 / 416,616 of the random alignments (~0%)
owl:sameAs rarely increases the overlap of two non-equivalent concepts
Research Question 2
How does the quality of the instance-level interlinks affect the quality of the resulting
schema alignments?
2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of equivalent concepts?
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
2. a. Does the inclusion of a higher quality subset of owl:sameAs
increase the Jaccard Index of equivalent concepts?
owl:sameAs quality indicator:
• [Raad et al., ISWC 2018] computed an error degree for each owl:sameAs link
• Error degree in [0.0, 1.0], based on the community structure of the owl:sameAs network
and the symmetry of the links
• Manual evaluation shows that owl:sameAs links with error degree >0.99 have a high
probability of being erroneous
• Manual evaluation shows that owl:sameAs links with error degree ≤0.4 have a high
probability of being correct
• Transitive closure (a): after discarding the ~1M owl:sameAs links with error degree >0.99
• Transitive closure (b): after discarding the ~150M owl:sameAs links with error degree >0.4
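The two higher-quality subsets amount to thresholding the links on their error degree before the transitive closure is computed. A minimal sketch, assuming each link comes as a (subject, object, error degree) tuple; the tuple layout, the function name and the example scores are hypothetical.

```python
def keep_links(scored_links, max_error_degree):
    """Keep owl:sameAs links whose error degree does not exceed the threshold."""
    return [(a, b) for a, b, err in scored_links if err <= max_error_degree]


# Made-up links and error degrees, for illustration only.
scored_links = [
    ("ex:a", "ex:b", 0.10),
    ("ex:b", "ex:c", 0.60),
    ("ex:c", "ex:d", 1.00),
]

links_a = keep_links(scored_links, 0.99)  # closure (a): discard error degree > 0.99
links_b = keep_links(scored_links, 0.4)   # closure (b): discard error degree > 0.4
print(links_a)  # [('ex:a', 'ex:b'), ('ex:b', 'ex:c')]
print(links_b)  # [('ex:a', 'ex:b')]
```

The surviving links would then be fed to the closure step in place of the full link set.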
Jaccard Index variation for the 742 alignments
• When owl:sameAs links with error degree >0.99 are discarded, there is
a slight decrease in performance:
• Jaccard increases in 51% of the cases (instead of 52%)
• Jaccard increases from 0 to 1 in 37 cases (instead of 44)
• When owl:sameAs links with error degree >0.4 are discarded, there is
a big decrease in performance:
• Jaccard increases in only 13% of the cases (instead of 52%)
• Jaccard increases from 0 to 1 in two cases (instead of 44)
• Jaccard decreases in 39 cases (instead of 25)
2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard
Index of non-equivalent concepts?
Jaccard Index variation for the 416,616 random alignments
• When owl:sameAs links with error degree >0.99 are discarded,
there is an increase in performance:
• Jaccard increases in 27 cases (instead of 94)
• When owl:sameAs links with error degree >0.4 are discarded,
there is a substantial increase in performance:
• Jaccard increases in 2 cases (instead of 94)
Take away message
This work provides an empirical study of the impact of including instance-level interlinks
on the overlap between concepts’ members
• Including instance-level interlinks can enhance the performance of instance-based schema alignments
- Increases the overlap for 52% of the existing alignments in the LOD-a-lot
- Increases the overlap for less than 0.3% of randomly created alignments
• Inference positively impacts instance-based schema alignments
- Considering the implicit members as well improves the results on the gold standard by 3 percentage points
• Discarding only isolated owl:sameAs links in the network can increase the quality of instance-based schema
alignments (owl:sameAs links are probably not as bad as we first thought)
- Reduces the cases where the Jaccard index increases for non-equivalent concepts by 71%
Future Work
• Use a larger gold standard to validate the results presented here
• Investigate alternative, better-tailored instance-based measures to detect new schema
alignments at the scale of the Web of Data, exploiting:
- the higher quality subset of the owl:sameAs links;
- both the explicit and implicit members of the concepts
Thank you for your attention...
@joeraad_
Code + Results
https://github.com/raadjoe/impact-sameAs-schema-matching

On the Impact of sameAs on Schema Matching

  • 1. On the Impact of sameAs on Schema Matching JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH Knowledge Representation & Reasoning Group KCAP 2019 - November 20, 2019 - Marina del Rey
  • 2. More and more Linked Open Data... 2 (crawled from ~650K datasets in 2015)
  • 3. ...More and more (overlapping) Schemas 3
  • 4. Schema Matching is Inevitable 4 • It is not possible (neither desired) to have a unique schema covering all domains • In order to exploit this wealth of available knowledge and enhance knowledge-based systems (e.g., search engines, virtual assistants, etc.), we need to match these overlapping schemas • Schema Matching: finding relationships between entities of different schemas • equivalence relations • subsumption • disjointness • ....
  • 5. Schema Matching over the years 5 • Active area of research from several communities, including the SW • Ontology Alignment Evaluation Initiative (OAEI) ongoing for 15 years • [Euzenat and Shvaiko, 2013] reviews ~100 schema-matching systems 50% rely mostly on schema-level information (i.e. schema-based approaches) 25% rely mostly on instance-level information (i.e. instance-based approaches) 25% rely on schema + instance-level information (i.e. mixed approaches)
  • 6. 6 Instance-based Schema Matching • All instance-based schema-matching approaches share two essential ideas: 1. The semantics of a concept is better determined by its members rather by its annotations Concepts refer to sets that possibly have named instances as members • ext(C) refer to the set of instances which are explicitly stated as members of C ext(foaf:Person) = {ex:i1, ex:i2} • ext⊑(C) refer to the set of instances which are explicitly or implicitly stated as members of C (i.e. either explicit members or derived through concept subsumption) ext⊑(foaf:Person) = {ex:i1, ex:i2, ex:i3} foaf:Person ex:i1 ex:i2 ex:i3 rdf:type dbo:Scientist rdfs:subClassOf rdf:type
  • 7. 7 Instance-based Schema Matching • All instance-based schema-matching approaches share two essential ideas: 2. The more significant the overlap between two concepts’ members is, the more related these concepts are • Multiple techniques to measure the overlap between concepts’ members • Formal concept analysis techniques • Machine learning • Jaccard index • ....
  • 8. Instance-based Schema Matching using Jaccard Index 8 • The Jaccard index is a commonly used measure to score the similarity between two sets • The higher the similarity of two sets is, the greater the Jaccard index 𝐽(𝐴, 𝐵) = | * ∩ , | | * ∪ , | foaf:Person ex:i1 ex:i2 schema:Person ex:i3 ex:i4 rdf:type ex:i5 rdf:type ext(foaf:Person) = {ex:i1, ex:i2 , ex:i3 , ex:i4} ext(schema:Person) = {ex:i3, ex:i4, ex:i5} J(ext(foaf:Person), ext(schema:Person)) = . / = 0.4
  • 9. Instance-based Schema Matching using Jaccard Index 9 With more than 558 million explicitly asserted owl:sameAs [Beek et al., ESWC 2018] (or 35 billion after transitive closure), the reality in the Web of Data looks more like this: 𝐽(𝐴, 𝐵) = | * ∩ , | | * ∪ , | owl:sameAs* ext~(foaf:Person) = {ex:i1, eq{2,5}, ex:i3, ex:i4} ext~(schema:Person) = {ex:i3, ex:i4, eq{2,5}} J(ext~(foaf:Person), ext~(schema:Person)) = 3 4 = 0.75 ext(foaf:Person) = {ex:i1, ex:i2 , ex:i3 , ex:i4} ext(schema:Person) = {ex:i3, ex:i4, ex:i5} J(ext(foaf:Person), ext(schema:Person)) = . / = 0.4 foaf:Person ex:i1 ex:i2 schema:Person ex:i3 ex:i4 rdf:type ex:i5 rdf:type Scenario 1 where J increases
  • 10. Instance-based Schema Matching using Jaccard Index 10 Or possibly like this: 𝐽(𝐴, 𝐵) = | * ∩ , | | * ∪ , | owl:sameAs* ext~(foaf:Person) = {ex:i1, ex:i2, eq{3,4}} ext~(schema:Person) = {eq{3,4}, ex:i5} J(ext~(foaf:Person), ext~(schema:Person)) = 7 4 = 0.25 ext(foaf:Person) = {ex:i1, ex:i2 , ex:i3 , ex:i4} ext(schema:Person) = {ex:i3, ex:i4, ex:i5} J(ext(foaf:Person), ext(schema:Person)) = . / = 0.4 foaf:Person ex:i1 ex:i2 schema:Person ex:i3 ex:i4 rdf:type ex:i5 rdf:type Scenario 2 where J decreases
  • 12. Research Question 1 Does the inclusion of instance-level interlinks positively impact instance-based schema alignments ? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 12
  • 13. Why should we care? • Providing empirical evidences for schema-matching designers on whether exploiting a large external collection of instance-level interlinks (e.g. from the LOD Cloud) is beneficial for improving the accuracy of schema-matching techniques • Particularly showing the risk/interest of using owl:sameAs after a number of studies suggesting that a large* number of the existing owl:sameAs links in the Web are actually erroneous * 3% of evaluated owl:sameAs are erroneous [Hogan et al., JWS 2012] * 4% of evaluated owl:sameAs are erroneous [Raad et al., ISWC 2018] * 20% of evaluated owl:sameAs are erroneous [Halpin et al., ISWC 2010] 13
  • 14. Research Question 2 How does the quality of the instance-level interlinks impact the quality of the resulting schema-alignments ? 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts? 2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 14
  • 16. Dataset 16 (crawled from ~650K datasets in 2015)
  • 17. Dataset 17 # triples 28,362,198,927 # rdf:type statements 3,321,354,308 # rdfs:subClassOf statements 4,461,717 # owl:equivalentClass statements 1,051,979 # explicit owl:sameAs statements 558,943,116 # implicit owl:sameAs statements 35,201,120,188 # equivalence classes (after closure of owl:sameAs) 48,999,148 # concepts with explicit members |C| 833,232 # concepts with explicit or implicit members |C⊑| 976,674
  • 18. Size distribution of the Concepts’ members 18 • 23% of the concepts have one explicit member • 92% of the concepts have ≤ 100 explicit members • 618 concepts have more than 100M explicit or implicit members • 5 concepts have more than 100M explicit members
  • 19. Concepts with more than >100M explicit members 19 Concept Cardinality % http://purl.org/linked-data/cube#Observation 1,306,389,396 39.3 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry 304,878,654 9.2 http://geovocab.org/geometry#Geometry 167,808,111 5 http://knoesis.wright.edu/ssw/ont/ sensorobservation.owl#MeasureData 144,044,989 4.3 http://xmlns.com/foaf/0.1/Person 132,919,327 4 Total 2,056,040,477 61.9
  • 21. Research Question 1 Does the inclusion of instance-level interlinks impact positively instance-based schema alignments ? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 21
  • 22. Dataset 22 # triples 28,362,198,927 # rdf:type statements 3,321,354,308 # rdfs:subClassOf statements 4,461,717 # owl:equivalentClass statements 1,051,979 # explicit owl:sameAs statements 558,943,116 # implicit owl:sameAs statements 35,201,120,188 # equivalence classes (after closure of owl:sameAs) 48,999,148 # concepts with explicit members |C| 833,232 # concepts with explicit or implicit members |C⊑| 976,674
  • 23. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? • 1,051,979 owl:equivalentClass statements in the LOD-a-lot ­ Hypothesis: all these statements are correct alignments ­ Only 972 owl:equivalentClass statements where both concepts have explicit members ­ 208 reflexive alignments (C1, owl:equivalentClass, C1) ­ 22 duplicate symmetric alignments (C1, owl:equivalentClass, C2) and (C2, owl:equivalentClass, C1) ­ 742 alignments between 1,357 distinct concepts (i.e. gold standard) 23
  • 24. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 24 Size distribution of the Concepts’ members of our Gold Standard
  • 25. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 25 Jaccard Index distribution for the 742 alignments Compare J(ext⊑(C1), ext⊑(C2)) with J(ext⊑ ~(C1), ext⊑ ~(C2)) such that (C1, owl:equivalentClass, C2) Runtime: <4 hours on 64GB SSD disk
  • 26. 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 26 Jaccard Index variation for the 742 alignments • When owl:sameAs is considered, the Jaccard index increases for 381/ 742 of the correct alignments (52%) • Out of these 381 cases, Jaccard increases from 0 to 1 in 44 cases (7%) • When owl:sameAs is considered, the Jaccard index decreases for 25/ 742 of the correct alignments (3%) • Slight drop in impact when only explicit members are considered owl:sameAs increases the overlap of two equivalent concepts in half of the cases
  • 27. Research Question 1 Does the inclusion of instance-level interlinks impact positively instance-based schema alignments ? 1. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts? 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 27
  • 28. Dataset 28 # triples 28,362,198,927 # rdf:type statements 3,321,354,308 # rdfs:subClassOf statements 4,461,717 # owl:equivalentClass statements 1,051,979 # explicit owl:sameAs statements 558,943,116 # implicit owl:sameAs statements 35,201,120,188 # equivalence classes (after closure of owl:sameAs) 48,999,148 # concepts with explicit members |C| 833,232 # concepts with explicit or implicit members |C⊑| 976,674
  • 29. 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? • 833,232 concepts with explicit members in the LOD-a-lot ­ Created a random alignment for each concept such that each concept is paired only once ­ Hypothesis: all these random alignments are erroneous ­ 416,616 random alignments 29
  • 30. 1. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts? 30 Jaccard Index variation for the 416,616 random alignments • When owl:sameAs is considered, the Jaccard index increases for only 94 / 416,616 of the random alignments (0.02%) • When owl:sameAs is considered, the Jaccard index decreases for 3 / 416,616 of the random alignments (~0%) owl:sameAs rarely increases the overlap of two non- equivalent concepts
  • 31. Research Question 2: How does the quality of the instance-level interlinks affect the quality of the resulting schema alignments?
      2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts?
      2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
  • 32. Dataset
      # triples: 28,362,198,927
      # rdf:type statements: 3,321,354,308
      # rdfs:subClassOf statements: 4,461,717
      # owl:equivalentClass statements: 1,051,979
      # explicit owl:sameAs statements: 558,943,116
      # implicit owl:sameAs statements: 35,201,120,188
      # equivalence classes (after closure of owl:sameAs): 48,999,148
      # concepts with explicit members |C|: 833,232
      # concepts with explicit or implicit members |C⊑|: 976,674
  • 33. 2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts?
      owl:sameAs quality indicator:
      • [Raad et al., ISWC 2018] computed an error degree for each owl:sameAs link
      • The error degree lies in [0.0, 1.0] and is based on the community structure of the owl:sameAs network and the symmetry of the links
      • Manual evaluation shows that owl:sameAs links with error degree >0.99 have a high probability of being erroneous
      • Manual evaluation shows that owl:sameAs links with error degree ≤0.4 have a high probability of being correct
      • Transitive closure (a): after discarding the ~1M owl:sameAs links with error degree >0.99
      • Transitive closure (b): after discarding the ~150M owl:sameAs links with error degree >0.4
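The two closures described above amount to filtering links by a threshold on the error degree and then computing the equivalence classes of what remains. The following sketch does this with a small union-find structure; the function name `closure_classes`, the triple layout `(subject, object, error_degree)`, and the toy links are assumptions for illustration and do not reproduce the authors' implementation or the actual error-degree computation.

```python
def closure_classes(links, max_error):
    """Keep owl:sameAs links with error degree <= max_error and return the
    equivalence classes of their symmetric, transitive closure."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for subject, obj, error_degree in links:
        if error_degree <= max_error:
            parent[find(subject)] = find(obj)  # merge the two classes

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


# Hypothetical links with made-up error degrees.
links = [("a", "b", 0.1), ("b", "c", 0.5), ("c", "d", 1.0)]

closure_a = closure_classes(links, max_error=0.99)  # discards only the link c-d
closure_b = closure_classes(links, max_error=0.4)   # keeps only the link a-b
print(closure_a, closure_b)
```

Instances that occur only in discarded links do not appear in the result; they can be treated as singleton classes by the caller.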
  • 34. 2. a. Does the inclusion of owl:sameAs increase the Jaccard Index of equivalent concepts?
      Jaccard Index variation for the 742 alignments:
      • When owl:sameAs links with error degree >0.99 are discarded, there is a slight decrease in performance:
        - Jaccard increases in 51% of the cases (instead of 52%)
        - Jaccard increases from 0 to 1 in 37 cases (instead of 44)
      • When owl:sameAs links with error degree >0.4 are discarded, there is a big decrease in performance:
        - Jaccard increases in only 13% of the cases (instead of 52%)
        - Jaccard increases from 0 to 1 in two cases (instead of 44)
        - Jaccard decreases in 39 cases (instead of 25)
  • 35. Research Question 2: How does the quality of the instance-level interlinks affect the quality of the resulting schema alignments?
      2. a. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of equivalent concepts?
      2. b. Does the inclusion of a higher quality subset of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
  • 36. 2. b. Does the inclusion of owl:sameAs increase the Jaccard Index of non-equivalent concepts?
      Jaccard Index variation for the 416,616 random alignments:
      • When owl:sameAs links with error degree >0.99 are discarded, there is an increase in performance:
        - Jaccard increases in 27 cases (instead of 94)
      • When owl:sameAs links with error degree >0.4 are discarded, there is a substantial increase in performance:
        - Jaccard increases in 2 cases (instead of 94)
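The counts on this slide can be turned into relative reductions with a one-line calculation. The sketch below only redoes that arithmetic on the reported numbers (94, 27 and 2 cases); the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, remaining):
    """Share of the baseline cases that disappear, as a fraction between 0 and 1."""
    return (baseline - remaining) / baseline


# Counts reported for the 416,616 random alignments.
all_links = 94          # Jaccard increases when all owl:sameAs links are used
without_gt_099 = 27     # after discarding links with error degree > 0.99
without_gt_04 = 2       # after discarding links with error degree > 0.4

reduction_a = round(100 * relative_reduction(all_links, without_gt_099))
reduction_b = round(100 * relative_reduction(all_links, without_gt_04))
print(reduction_a, reduction_b)
```

The first value corresponds to the reduction quoted in the take-away slide for discarding only the links with error degree >0.99.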
  • 38. Take-away message
      This work provides an empirical study of the impact of including instance-level interlinks on the overlap between concept members
      • Including instance-level interlinks can enhance the performance of instance-based schema alignments
        - Increases the overlap for 52% of the existing alignments in the LOD-a-lot
        - Increases the overlap for less than 0.3% of randomly created alignments
      • Inference does positively impact instance-based schema alignments
        - Considering also the implicit members enhances the results on the Gold Standard by 3 pp
      • Discarding only isolated owl:sameAs links in the network can increase the quality of instance-based schema alignments (owl:sameAs links are probably not as bad as we first thought)
        - Reduces the cases where the Jaccard index increases for non-equivalent concepts by 71%
  • 39. Future Work
      • Use a larger gold standard to validate the results presented here
      • Investigate alternative, better-tailored instance-based measures to detect new schema alignments at the scale of the Web of Data, exploiting:
        - the higher quality subset of the owl:sameAs links;
        - both the explicit and implicit members of the concepts
  • 40. On the Impact of sameAs on Schema Matching
      JOE RAAD ◉ ERMAN ACAR ◉ STEFAN SCHLOBACH
      Knowledge Representation & Reasoning Group
      Thank you for your attention...
      @joeraad_
      Code + Results: https://github.com/raadjoe/impact-sameAs-schema-matching