A Statistical and Schema Independent
Approach to Identify Equivalent Properties
on Linked Data
†Kno.e.sis Center
Wright State University
Dayton OH, USA
‡IBM T J Watson Research Center
Yorktown Heights
New York NY, USA
Kalpa Gunaratna†, Krishnaprasad Thirunarayan†, Prateek Jain‡, Amit Sheth†, and Sanjaya Wijeratne†
{kalpa,tkprasad,amit,sanjaya}@knoesis.org, jainpr@us.ibm.com
iSemantics 2013, Graz, Austria
09/06/2013 2
Motivation
Why do we need property alignment, and why is it so important?
Many datasets are available, and we can query them.
The same information appears under different names.
Therefore, data integration is required for better presentation.
Roadmap
Background
Statistical Equivalence of properties
Evaluation
Discussion, interesting facts, and future directions
Conclusion
Background
Existing techniques for property alignment fall into three categories.
I. Syntactic/dictionary based
– Use string manipulation techniques, external dictionaries, and lexical databases such as WordNet.
II. Schema dependent
– Use schema information such as domain, range, and definitions.
III. Schema independent
– Use instance-level information for the alignment.
• Our approach is schema independent.
Properties capture the meaning of triples and hence are complex in nature.
Syntactic or dictionary-based approaches analyze property names for equivalence, but property names are heterogeneous across LOD datasets.
Therefore, syntactic or dictionary-based approaches have limited coverage in property alignment.
Schema-dependent approaches, which process domain, range, and class-level tags, do not capture the semantics of properties well.
Statistical Equivalence of properties
• Statistical Equivalence is based on analyzing owl:equivalentProperty.
• owl:equivalentProperty relates properties that have the same property extension.
Example 1:
Property P is defined by the triples { a P b, c P d, e P f }
Property Q is defined by the triples { a Q b, c Q d, e Q f }
P and Q are owl:equivalentProperty because they have the same extension, { (a,b), (c,d), (e,f) }
Example 2:
Property P is defined by the triples { a P b, c P d, e P f }
Property Q is defined by the triples { a Q b, c Q d, e Q h }
Then P and Q are not owl:equivalentProperty because their extensions differ, but the pairs they share provide statistical evidence in support of equivalence.
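The two examples can be checked mechanically. The following is a minimal sketch, not part of the original slides; the helper name `extension` and the representation of triples as Python tuples are assumptions made for illustration.

```python
def extension(triples, prop):
    """Return the extension of `prop`: the set of (subject, object) pairs it relates."""
    return {(s, o) for (s, p, o) in triples if p == prop}


# Example 1: P and Q relate exactly the same pairs.
example1 = [("a", "P", "b"), ("c", "P", "d"), ("e", "P", "f"),
            ("a", "Q", "b"), ("c", "Q", "d"), ("e", "Q", "f")]

# Example 2: the last Q triple has object h instead of f.
example2 = [("a", "P", "b"), ("c", "P", "d"), ("e", "P", "f"),
            ("a", "Q", "b"), ("c", "Q", "d"), ("e", "Q", "h")]

same_1 = extension(example1, "P") == extension(example1, "Q")
same_2 = extension(example2, "P") == extension(example2, "Q")
shared_2 = extension(example2, "P") & extension(example2, "Q")

print(same_1)            # True
print(same_2)            # False
print(sorted(shared_2))  # [('a', 'b'), ('c', 'd')]
```

In Example 2 the extensions differ, yet two of the three pairs coincide, which is the kind of partial overlap the statistical notion is meant to exploit.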
Intuition
• A higher rate of subject-object matches between two property extensions is stronger evidence that the properties are equivalent. In practice, matching properties rarely have exactly the same extension, because
– Datasets are incomplete.
– The same instance may be modelled differently in different datasets.
• Therefore, we analyze the property extensions to identify equivalent properties between datasets.
• We define the following notions. The assumption below holds for all the definitions.
Let S1 P1 O1 and S2 P2 O2 be two triples in datasets D1 and D2, respectively.
Definition 1: Candidate Match
The two properties P1 and P2 are a candidate match iff $S_1 \xrightarrow{ECR^*} S_2$ and $O_1 \xrightarrow{ECR^*} O_2$.
Two instances are connected by an ECR* link if there is a path between them consisting of ECR links (* is the Kleene star). ECR links are Entity Co-reference Relationships, such as those formalized using owl:sameAs and skos:exactMatch.
Example
Consider the two datasets DBpedia (d) and Freebase (f):
d:Arthur Purdy Stout d:place of birth d:New York City
f:Arthur Purdy Stout f:place of death f:New York City
• The above is a candidate match, but the properties are not equivalent because their intensions differ (a coincidental match).
• We need further analysis to decide on equivalence.
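The candidate match test can be sketched with a union-find structure that groups instances connected by ECR link paths. This is an illustrative sketch under stated assumptions, not the authors' implementation: ECR links are treated as symmetric, the function names `ecr_closure` and `is_candidate_match` are hypothetical, and the identifiers and the two ECR links in the example data are assumed for the demonstration.

```python
def ecr_closure(links):
    """Build a `find` function that maps each instance to a representative
    of its ECR* group (instances joined by a path of ECR links)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two groups
    return find


def is_candidate_match(triple1, triple2, find):
    """True when subjects and objects of the two triples are ECR*-connected."""
    s1, _, o1 = triple1
    s2, _, o2 = triple2
    return find(s1) == find(s2) and find(o1) == find(o2)


# Assumed ECR links (e.g. owl:sameAs) between the two datasets.
links = [("d:Arthur_Purdy_Stout", "f:Arthur_Purdy_Stout"),
         ("d:New_York_City", "f:New_York_City")]
find = ecr_closure(links)

t1 = ("d:Arthur_Purdy_Stout", "d:place_of_birth", "d:New_York_City")
t2 = ("f:Arthur_Purdy_Stout", "f:place_of_death", "f:New_York_City")
t3 = ("f:Arthur_Purdy_Stout", "f:place_of_death", "f:Boston")

print(is_candidate_match(t1, t2, find))  # True
print(is_candidate_match(t1, t3, find))  # False
```

The first pair is reported as a candidate match even though birth place and death place mean different things, which is why a single match is not enough to decide equivalence.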
Match Count μ(P1,P2) – Number of triple pairs for P1 and P2 that participate in candidate matches.
$$\mu(P_1, P_2) = \left|\{\, S_1 P_1 O_1 \in D_1 \mid \exists\, S_2 P_2 O_2 \in D_2 \,\land\, S_1 \xrightarrow{ECR^*} S_2 \,\land\, O_1 \xrightarrow{ECR^*} O_2 \,\}\right|$$
Co-appearance Count λ(P1,P2) – Number of triple pairs for P1 and P2 that have matching subjects.
$$\lambda(P_1, P_2) = \left|\{\, S_1 P_1 O_1 \in D_1 \mid \exists\, S_2 P_2 O_2 \in D_2 \,\land\, S_1 \xrightarrow{ECR^*} S_2 \,\}\right|$$
Definition 2: Statistically Equivalent Properties
The pair of properties P1 and P2 are statistically equivalent to degree (α, k) iff
$$F = \frac{\mu(P_1, P_2)}{\lambda(P_1, P_2)} \ge \alpha,$$
where $\mu(P_1, P_2) \ge k$, $0 < \alpha \le 1$, and $k > 1$.
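The counts and the test in Definition 2 can be sketched directly from the set-builder definitions. This is a naive quadratic illustration rather than the algorithm used in the experiments; `match_counts`, `statistically_equivalent`, the predicate `same`, and the toy data are all assumptions introduced here.

```python
def match_counts(d1, d2, p1, p2, same):
    """Return (mu, lam) for property p1 in d1 and p2 in d2.

    `same(x, y)` is an assumed predicate that is True when x and y are
    connected by ECR* links.
    """
    pairs2 = [(s, o) for (s, p, o) in d2 if p == p2]
    mu = lam = 0
    for s1, p, o1 in d1:
        if p != p1:
            continue
        # Objects of p2 triples whose subject co-refers with s1.
        objects = [o2 for (s2, o2) in pairs2 if same(s1, s2)]
        if objects:
            lam += 1  # subject matches: counts toward lambda
            if any(same(o1, o2) for o2 in objects):
                mu += 1  # object matches too: counts toward mu
    return mu, lam


def statistically_equivalent(mu, lam, alpha=0.5, k=2):
    """Definition 2: F = mu / lam >= alpha and mu >= k."""
    return lam > 0 and mu >= k and mu / lam >= alpha


# Toy data: identifiers co-refer when they agree after the dataset prefix.
def same(x, y):
    return x.split(":", 1)[1] == y.split(":", 1)[1]

d1 = [("d:a", "P", "d:b"), ("d:c", "P", "d:d"),
      ("d:e", "P", "d:f"), ("d:g", "P", "d:h")]
d2 = [("f:a", "Q", "f:b"), ("f:c", "Q", "f:d"), ("f:e", "Q", "f:x")]

mu, lam = match_counts(d1, d2, "P", "Q", same)
print(mu, lam)                                       # 2 3
print(statistically_equivalent(mu, lam))             # True
print(statistically_equivalent(mu, lam, alpha=0.7))  # False
```

With μ = 2 and λ = 3, F is about 0.67, so the pair passes at α = 0.5 but not at α = 0.7.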
Candidate Matching Algorithm Process
[Figure: candidate matching between Dataset 1 and Dataset 2. Step 1: instance I1 = d1:Willis_Lamb is linked by owl:sameAs to I2 = d2:willis_lamb. Step 2: the triples of I1 and I2 are considered (triple 1 to triple 5), including P1 = d1:doctoralStudent with object d1:Theodore_Maiman and P2 = d2:education.academic.advisees with object d2:theodore_harold_maiman. Step 3: the two objects are matching resources, so property P1 and property P2 are a candidate match.]
Complexity:
Let x be the average number of properties per entity and j the average number of objects per property. For n subjects, the algorithm requires $n \cdot j^2 \cdot x^2 + 2n$ comparisons. Since n > j, n > x, and x and j are independent of n, the complexity is O(n).
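To make the linear growth concrete, the worked arithmetic below plugs hypothetical values into the comparison count. The values n = 5000, x = 10, and j = 2 are illustrative assumptions, not figures reported in the slides.

```latex
% Hypothetical values: n = 5000 subjects, x = 10 properties per entity,
% j = 2 objects per property.
\begin{align*}
n \cdot j^2 \cdot x^2 + 2n &= 5000 \cdot 2^2 \cdot 10^2 + 2 \cdot 5000 \\
                           &= 2{,}000{,}000 + 10{,}000 = 2{,}010{,}000
\end{align*}
% Doubling n to 10000 while x and j stay fixed doubles the count:
\begin{align*}
10000 \cdot 2^2 \cdot 10^2 + 2 \cdot 10000 = 4{,}020{,}000 = 2 \cdot 2{,}010{,}000
\end{align*}
```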
Parallel computation (Map-Reduce implementation)
• Candidate matches can be generated for each instance independently. Hence, we implemented the algorithm on the Hadoop 1.0.3 framework.
• Candidate match generation is distributed among mappers, and each mapper outputs μ and λ for property pairs to the reducer.
– Map phase: let X be the set of subject instances in dataset D1 and ns the namespace of dataset D2. For each subject i ∈ X, start a mapper job for GenerateCandidateMatches(i, ns). Each mapper outputs (key, value) pairs of the form (p:q, μ(p,q):λ(p,q)), where p ∈ D1 and q ∈ D2.
• The reducer collects all μ and λ values and aggregates them for the final analysis.
– Reduce phase: collect the output from the mappers and aggregate μ(p,q) and λ(p,q) for each key p:q.
• The Map-Reduce version on a 14-node cluster achieved a speedup of 833% compared to the desktop version.
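The map and reduce phases can be imitated in plain Python to show the data flow. This sketch does not use Hadoop and is not the authors' code: `mapper`, `reducer`, the tuple encoding of the value, and the per-subject counts are assumptions, with each input dictionary standing in for the output of GenerateCandidateMatches for one subject.

```python
from collections import defaultdict


def mapper(subject_counts):
    """Emit (key, value) pairs for one subject.

    `subject_counts` maps a property pair (p, q) to this subject's
    contribution (mu, lam) to the two counts.
    """
    for (p, q), (mu, lam) in subject_counts.items():
        yield f"{p}:{q}", (mu, lam)


def reducer(pairs):
    """Sum the mu and lam contributions for each key p:q."""
    totals = defaultdict(lambda: [0, 0])
    for key, (mu, lam) in pairs:
        totals[key][0] += mu
        totals[key][1] += lam
    return {key: tuple(value) for key, value in totals.items()}


# Made-up contributions from two subjects.
per_subject = [
    {("p1", "q1"): (1, 1)},
    {("p1", "q1"): (0, 1), ("p2", "q2"): (1, 1)},
]

totals = reducer(kv for counts in per_subject for kv in mapper(counts))
print(totals)  # {'p1:q1': (1, 2), 'p2:q2': (1, 1)}
```

Because each subject is processed without reference to any other, the mapper calls can run in any order or in parallel, and only the per-key sums need to be brought together.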
Evaluation
Objectives of the evaluation
– Show the effectiveness of the approach on linked datasets
– Compare with existing alignment techniques
We selected 5000 instance samples from the DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB Explorer datasets.
These datasets have
– Complete data for instances from different viewpoints
– Many inter-links
– Complex properties
Experiment details
– α = 0.5 for all experiments (works for LOD), except the DBpedia and Freebase movie alignment, where it was 0.7.
– k was set to 14, 6, 2, 2, and 2, respectively, for Person, Film, and Software between DBpedia and Freebase, Film between LinkedMDB and DBpedia, and Article between the DBLP datasets.
– k can be estimated from the data as follows:
  – Set α = 0.5 and k = 2 (lowest positive values).
  – Get the property pairs with exactly matching property names that the algorithm did not identify, together with their μ values.
  – Take the average of those μ values.
– Threshold of 0.92 for the string similarity algorithms.
– Threshold of 0.8 for WordNet similarity.
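The estimation procedure for k can be sketched as follows. This is one reading of the three steps, not the authors' code: `local_name`, `estimate_k`, the prefix convention, and the counts are all hypothetical, and the slides do not say how the average is rounded.

```python
def local_name(prop):
    """Strip an assumed 'prefix:' from a property identifier."""
    return prop.split(":", 1)[-1]


def estimate_k(counts, alpha=0.5, k=2):
    """Average mu over same-named property pairs that the algorithm misses.

    `counts` maps (p, q) to (mu, lam). Returns None when no such pair exists.
    """
    missed = [
        mu
        for (p, q), (mu, lam) in counts.items()
        if local_name(p) == local_name(q)
        and not (lam > 0 and mu >= k and mu / lam >= alpha)
    ]
    return sum(missed) / len(missed) if missed else None


# Made-up counts for illustration only.
counts = {
    ("db:nationality", "fb:nationality"): (40, 50),  # identified, ignored
    ("db:religion", "fb:religion"): (3, 10),         # same name, missed
    ("db:genre", "fb:genre"): (5, 20),               # same name, missed
    ("db:occupation", "fb:profession"): (1, 30),     # names differ, ignored
}

print(estimate_k(counts))  # 4.0
```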
Alignment results

Algorithm                  Measure     DBpedia –    DBpedia –    DBpedia –    DBpedia –    DBLP_RKB –   Average
                                       Freebase     Freebase     Freebase     LinkedMDB    DBLP_L3S
                                       (Person)     (Film)       (Software)   (Film)       (Article)
Extension Based Algorithm  Precision   0.8758       0.9737       0.6478       0.7560       1.0000       0.8427
                           Recall      0.8089*      0.5138       0.4339       0.8157       1.0000       0.7145
                           F measure   0.8410*      0.6727       0.5197       0.7848       1.0000       0.7656
WordNet Similarity         Precision   0.5200       0.8620       0.7619       0.8823       1.0000       0.8052
                           Recall      0.4140*      0.3472       0.3018       0.3947       0.3333       0.3582
                           F measure   0.4609*      0.4950       0.4324       0.5454       0.5000       0.4867
Dice Similarity            Precision   0.8064       0.9666       0.7659       1.0000       0.0000       0.7078
                           Recall      0.4777*      0.4027       0.3396       0.3421       0.0000       0.3124
                           F measure   0.6000*      0.5686       0.4705       0.5098       0.0000       0.4298
Jaro Similarity            Precision   0.6774       0.8809       0.7755       0.9411       0.0000       0.6550
                           Recall      0.5350*      0.5138       0.3584       0.4210       0.0000       0.3656
                           F measure   0.5978*      0.6491       0.4903       0.5818       0.0000       0.4638

* marks estimated values for experiment 1, where the comparisons were too numerous to check manually.
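The F measure rows follow from precision and recall as their harmonic mean. The sketch below uses the standard formula; the function name `f_measure` is an assumption, and the two checks use values from the Extension Based Algorithm rows of the results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DBpedia – Freebase (Person), extension based algorithm.
print(f"{f_measure(0.8758, 0.8089):.4f}")  # 0.8410
# DBpedia – Freebase (Film), extension based algorithm.
print(f"{f_measure(0.9737, 0.5138):.4f}")  # 0.6727
# Zero precision and recall give an F measure of zero.
print(f_measure(0.0, 0.0))  # 0.0
```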
Example identifications

Property pair type                 Dataset 1 (DBpedia)   Dataset 2 (Freebase)
Simple string similarity matches   db:nationality        fb:nationality
                                   db:religion           fb:religion
Synonymous matches                 db:occupation         fb:profession
                                   db:battles            fb:participated_in_conflicts
Complex matches                    db:screenplay         fb:written_by
                                   db:doctoralStudent    fb:advisees

WordNet similarity failed to identify any of these.
Discussion, interesting facts, and future directions
• Our experiments covered multi-domain to multi-domain, multi-domain to specific-domain, and specific-domain to specific-domain dataset property alignment.
• In every experiment, the extension-based algorithm outperformed the others on F measure. The F measure gain is in the range of 57% to 78%.
• Some identified property pairs are intensionally different, e.g., db:distributor vs fb:production_companies.
– This is because many companies both produce and distribute their films.
• Some identified pairs are incorrect due to errors in data modeling.
– For example, db:issue and fb:children.
• owl:sameAs linking issues in LOD (links between things that are not exactly the same), e.g., linking London and Greater London.
– We believe a few misused links won't affect the algorithm, as it decides on a match only after analyzing many matches for a pair.
• Low number of interlinks.
– These evolve over time.
– Look for other possible types of ECR links (e.g., rdfs:seeAlso).
• Properties are not uniformly distributed in a dataset.
– Hence, some properties do not have enough matches or appearances.
– This is due to the rare classes and domains they belong to.
– We can iteratively run the algorithm on the instances in which these less frequent properties appear.
Current limitations:
– Requires ECR links
– Requires overlapping datasets
– Limited to object-type properties
– Cannot identify property – sub-property relationships
Conclusion
We approximate owl:equivalentProperty using Statistical Equivalence of properties, a schema-independent approach that analyzes property extensions.
This novel extension-based approach works well with interlinked datasets.
The extension-based approach outperforms syntactic and dictionary-based approaches, with an F measure gain in the range of 57% to 78%.
It requires many comparisons but can be easily parallelized, as evidenced by our Map-Reduce implementation.
Thank You
http://knoesis.wright.edu/researchers/kalpa
kalpa@knoesis.org
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Questions?

More Related Content

PPTX
Property Alignment on Linked Open Data
PPT
R-programming-training-in-mumbai
PDF
Query Optimization - Brandon Latronica
PDF
R Programming For Beginners | R Language Tutorial | R Tutorial For Beginners ...
PPTX
Mapping Graph Queries to PostgreSQL
PPTX
Neural Learning to Rank
PPT
Query optimization
PDF
record_linking
Property Alignment on Linked Open Data
R-programming-training-in-mumbai
Query Optimization - Brandon Latronica
R Programming For Beginners | R Language Tutorial | R Tutorial For Beginners ...
Mapping Graph Queries to PostgreSQL
Neural Learning to Rank
Query optimization
record_linking

What's hot (20)

PPTX
Datastructures using c++
PPS
Data Structure
PDF
UNIT I LINEAR DATA STRUCTURES – LIST
PDF
Data structure using c++
PPT
Data structures using C
PPTX
Data Structure
PPTX
Rattle Graphical Interface for R Language
PDF
Incomplete Information in RDF
PDF
Exchanging more than Complete Data
PPTX
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
PDF
Introduction to Data Structure
PPT
Introduction of data structure
PDF
Data Structures (BE)
PDF
Data Structures
PDF
What's next in Julia
PDF
R training2
PPT
Data structures using c
PDF
Elementary data structure
PDF
Introduction to R Graphics with ggplot2
PDF
Data Structures Notes 2021
Datastructures using c++
Data Structure
UNIT I LINEAR DATA STRUCTURES – LIST
Data structure using c++
Data structures using C
Data Structure
Rattle Graphical Interface for R Language
Incomplete Information in RDF
Exchanging more than Complete Data
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
Introduction to Data Structure
Introduction of data structure
Data Structures (BE)
Data Structures
What's next in Julia
R training2
Data structures using c
Elementary data structure
Introduction to R Graphics with ggplot2
Data Structures Notes 2021
Ad

Similar to A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data (20)

PPT
Data Integration Ontology Mapping
PDF
Information Retrieval using Semantic Similarity
PDF
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
PDF
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
PDF
Conceptual similarity measurement algorithm for domain specific ontology[
PDF
Conceptual Similarity Measurement Algorithm For Domain Specific Ontology
PDF
CONCEPTUAL SIMILARITY MEASUREMENT ALGORITHM FOR DOMAIN SPECIFIC ONTOLOGY
PDF
On the Impact of sameAs on Schema Matching
PPTX
Towards a Distributional Semantic Web Stack
PPT
Identifying Value Mappings for Data Integration_PVERConf_May2011
PPTX
Logical Detection of Invalid SameAs Statements in RDF Data
PDF
Semantic Analysis Using MapReduce
PDF
PDF
Document Retrieval System, a Case Study
PDF
Hybrid approach for generating non overlapped substring using genetic algorithm
PDF
Measure Term Similarity Using a Semantic Network Approach
PDF
Exploiting Distributional Semantic Models in Question Answering
PDF
Pay-as-you-go Reconciliation in Schema Matching Networks
PDF
A Primer on Entity Resolution
PDF
Semantic Similarity Measures for Semantic Relation Extraction
Data Integration Ontology Mapping
Information Retrieval using Semantic Similarity
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
Conceptual similarity measurement algorithm for domain specific ontology[
Conceptual Similarity Measurement Algorithm For Domain Specific Ontology
CONCEPTUAL SIMILARITY MEASUREMENT ALGORITHM FOR DOMAIN SPECIFIC ONTOLOGY
On the Impact of sameAs on Schema Matching
Towards a Distributional Semantic Web Stack
Identifying Value Mappings for Data Integration_PVERConf_May2011
Logical Detection of Invalid SameAs Statements in RDF Data
Semantic Analysis Using MapReduce
Document Retrieval System, a Case Study
Hybrid approach for generating non overlapped substring using genetic algorithm
Measure Term Similarity Using a Semantic Network Approach
Exploiting Distributional Semantic Models in Question Answering
Pay-as-you-go Reconciliation in Schema Matching Networks
A Primer on Entity Resolution
Semantic Similarity Measures for Semantic Relation Extraction
Ad

Recently uploaded (20)

PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PPTX
Spectroscopy.pptx food analysis technology
PDF
cuic standard and advanced reporting.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Approach and Philosophy of On baking technology
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Encapsulation theory and applications.pdf
PDF
KodekX | Application Modernization Development
NewMind AI Weekly Chronicles - August'25 Week I
Spectral efficient network and resource selection model in 5G networks
Network Security Unit 5.pdf for BCA BBA.
The Rise and Fall of 3GPP – Time for a Sabbatical?
sap open course for s4hana steps from ECC to s4
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
MYSQL Presentation for SQL database connectivity
Dropbox Q2 2025 Financial Results & Investor Presentation
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Spectroscopy.pptx food analysis technology
cuic standard and advanced reporting.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Encapsulation_ Review paper, used for researhc scholars
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Per capita expenditure prediction using model stacking based on satellite ima...
Unlocking AI with Model Context Protocol (MCP)
Approach and Philosophy of On baking technology
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Encapsulation theory and applications.pdf
KodekX | Application Modernization Development

A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data

  • 1. A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data †Kno.e.sis Center Wright State University Dayton OH, USA ‡IBM T J Watson Research Center Yorktown Heights New York NY, USA Kalpa Gunaratna†, Krishnaprasad Thirunarayan†, Prateek Jain‡, Amit Sheth†, and Sanjaya Wijeratne† {kalpa,tkprasad,amit,sanjaya}@knoesis.org, jainpr@us.ibm.com iSemantics 2013, Graz, Austria
  • 2. 09/06/2013 2 Motivation Why we need property alignment and it is so important? iSemantics 2013
  • 3. 09/06/2013 3 Many datasets. We can query! iSemantics 2013
  • 5. 09/06/2013 3 Same information in different names. Therefore, data integration for better presentation is required. iSemantics 2013
  • 6. Background Statistical Equivalence of properties Evaluation Discussion, interesting facts, and future directions Conclusion 09/06/2013 4 Roadmap iSemantics 2013
  • 7. Existing techniques for property alignment fall into three categories. I. Syntactic/dictionary based – Uses string manipulation techniques, external dictionaries and lexical databases like WordNet. II. Schema dependent – Uses schema information such as, domain and range, definitions. III. Schema independent – Uses instance level information for the alignment.  Our approach falls under schema independent. 09/06/2013 5 Background iSemantics 2013
  • 8. Properties capture meaning of triples and hence they are complex in nature.    09/06/2013 6iSemantics 2013
  • 9. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist.   09/06/2013 6iSemantics 2013
  • 10. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist. Therefore, syntactic or dictionary based approaches have limited coverage in property alignment.  09/06/2013 6iSemantics 2013
  • 11. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist. Therefore, syntactic or dictionary based approaches have limited coverage in property alignment. Schema dependent approaches including processing domain and range, class level tags do not capture semantics of properties well. 09/06/2013 6iSemantics 2013
  • 12. Background Statistical Equivalence of properties Evaluation Discussion, interesting facts, and future directions Conclusion 09/06/2013 7 Roadmap iSemantics 2013
  • 13.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions.          09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 14.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions. Example 1: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q f } P and Q are owl:equivalentProperty, because they have the same extension, { {a,b}, {c,d}, {e,f} }     09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 15.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions. Example 1: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q f } P and Q are owl:equivalentProperty, because they have the same extension, { {a,b}, {c,d}, {e,f} } Example 2: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q h } Then, P and Q are not owl:equivalentProperty, because their extensions are not the same. But they provide statistical evidence in support of equivalence. 09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 16. Intuition  Higher rate of subject-object matches in extensions leads to equivalent properties. In practice, it is hard to have exact same extensions for matching properties. Because, – Datasets are incomplete. – Same instance may be modelled differently in different datasets.  Therefore, we analyze the property extensions to identify equivalent properties between datasets.  We define the following notions. Let the statement below be true for all the definitions. S1P1O1 and S2P2O2 be two triples in Dataset D1 and D2 respectively. 09/06/2013 iSemantics 2013 9
  • 17. Definition 1: Candidate Match The two properties P1 and P2 are a candidate match iff S1 𝐸𝐶𝑅∗ S2 and O1 𝐸𝐶𝑅∗ O2. We say two instances are connected by an ECR* link if there is a link path between the instances using ECR links (* is the Kleene star notation). ECR links are Entity Co-reference Relationships such as those formalized using owl:sameAs and skos:exactMatch.       09/06/2013 iSemantics 2013 10
  • 18. Definition 1: Candidate Match The two properties P1 and P2 are a candidate match iff S1 𝐸𝐶𝑅∗ S2 and O1 𝐸𝐶𝑅∗ O2. We say two instances are connected by an ECR* link if there is a link path between the instances using ECR links (* is the Kleene star notation). ECR links are Entity Co-reference Relationships such as those formalized using owl:sameAs and skos:exactMatch. Example The two datasets DBpedia(d) and Freebase(f) d:Arthur Purdy Stout d:place of birth d:New York City f:Arthur Purdy Stout f:place of death f:New York City   09/06/2013 iSemantics 2013 10
  • 19. Definition 1: Candidate Match. The two properties P1 and P2 are a candidate match iff S1 ECR* S2 and O1 ECR* O2. Two instances are connected by an ECR* link if there is a path between them made of ECR links (* is the Kleene star). ECR links are Entity Co-reference Relationships such as those formalized using owl:sameAs and skos:exactMatch. Example, for the two datasets DBpedia (d) and Freebase (f): d:Arthur Purdy Stout d:place of birth d:New York City; f:Arthur Purdy Stout f:place of death f:New York City. The above is a candidate match but not an equivalence, because the intensions differ (a coincidental match). Further analysis is needed to decide on equivalence.
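Definition 1 can be sketched as a reachability test over ECR links. The code below is an illustrative sketch only: it assumes ECR links are given as pairs and may be followed in either direction, it treats an instance as trivially connected to itself (the zero-length case of the Kleene star), and the identifier spellings are hypothetical renderings of the slide's example.

```python
from collections import defaultdict, deque


def ecr_connected(links, a, b):
    """True if a and b are joined by a path of ECR links (ECR*)."""
    if a == b:
        return True
    graph = defaultdict(set)
    for x, y in links:  # assumption: links are symmetric
        graph[x].add(y)
        graph[y].add(x)
    seen, queue = {a}, deque([a])
    while queue:  # breadth-first search from a
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def is_candidate_match(t1, t2, links):
    """Definition 1: subjects and objects of the two triples are ECR*-linked."""
    s1, _, o1 = t1
    s2, _, o2 = t2
    return ecr_connected(links, s1, s2) and ecr_connected(links, o1, o2)


# Hypothetical identifiers for the Arthur Purdy Stout example.
links = [("d:Arthur_Purdy_Stout", "f:arthur_purdy_stout"),
         ("d:New_York_City", "f:new_york_city")]
t1 = ("d:Arthur_Purdy_Stout", "d:placeOfBirth", "d:New_York_City")
t2 = ("f:arthur_purdy_stout", "f:place_of_death", "f:new_york_city")
candidate = is_candidate_match(t1, t2, links)
```

The check succeeds for place of birth versus place of death, which is exactly why a candidate match alone is not enough to conclude equivalence.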
  • 20. Match Count μ(P1,P2): the number of triple pairs for P1 and P2 that participate in candidate matches, $\mu(P_1,P_2) = |\{\, S_1 P_1 O_1 \in D_1 \mid \exists\, S_2 P_2 O_2 \in D_2 \wedge S_1\,\mathrm{ECR}^{*}\,S_2 \wedge O_1\,\mathrm{ECR}^{*}\,O_2 \,\}|$. Co-appearance Count λ(P1,P2): the number of triple pairs for P1 and P2 that have matching subjects, $\lambda(P_1,P_2) = |\{\, S_1 P_1 O_1 \in D_1 \mid \exists\, S_2 P_2 O_2 \in D_2 \wedge S_1\,\mathrm{ECR}^{*}\,S_2 \,\}|$. Definition 2: Statistically Equivalent Properties. The pair of properties P1 and P2 are statistically equivalent to degree (α, k) iff $F = \frac{\mu(P_1,P_2)}{\lambda(P_1,P_2)} \ge \alpha$, where $\mu(P_1,P_2) \ge k$, $0 < \alpha \le 1$, and $k > 1$.
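A small sketch of how μ, λ, and the Definition 2 test might be computed. It assumes that ECR* has already been resolved into a dictionary `canon` mapping each identifier to a canonical one (a simplification introduced here, not the talk's implementation); the toy data and function names are hypothetical.

```python
def match_counts(d1, d2, p1, p2, canon):
    """Return (mu, lambda) for properties p1 in d1 and p2 in d2."""
    def c(x):
        # Canonical identifier; unmapped identifiers stand for themselves.
        return canon.get(x, x)

    p2_pairs = {(c(s), c(o)) for s, p, o in d2 if p == p2}
    p2_subjects = {s for s, _ in p2_pairs}

    mu = lam = 0
    for s, p, o in set(d1):  # the definitions count distinct triples
        if p != p1:
            continue
        if c(s) in p2_subjects:            # matching subject: co-appearance
            lam += 1
            if (c(s), c(o)) in p2_pairs:   # matching subject and object
                mu += 1
    return mu, lam


def statistically_equivalent(mu, lam, alpha=0.5, k=2):
    """Definition 2: F = mu / lambda >= alpha and mu >= k."""
    return lam > 0 and mu >= k and mu / lam >= alpha


# Hypothetical toy data: three subjects co-appear, two objects also match.
d1 = [("d:a", "d:P", "d:b"), ("d:c", "d:P", "d:d"),
      ("d:e", "d:P", "d:f"), ("d:g", "d:P", "d:h")]
d2 = [("f:a", "f:Q", "f:b"), ("f:c", "f:Q", "f:d"), ("f:e", "f:Q", "f:x")]
canon = {"f:a": "d:a", "f:b": "d:b", "f:c": "d:c", "f:d": "d:d", "f:e": "d:e"}

mu, lam = match_counts(d1, d2, "d:P", "f:Q", canon)
```

With these toy values F = 2/3, so the pair passes at (α = 0.5, k = 2) but would fail if k were raised to 3.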
  • 25. Candidate Matching Algorithm Process (figure). Step 1: instance I1 = d1:Willis_Lamb in Dataset 1 is linked by owl:sameAs to I2 = d2:willis_lamb in Dataset 2. Step 2: the triples of I1 and I2 are retrieved (triples 1 to 5 in the figure), among them I1 with P1 = d1:doctoralStudent and object d1:Theodore_Maiman, and I2 with P2 = d2:education.academic.advisees and object d2:theodore_harold_maiman. Step 3: the two objects are matching resources, so property P1 and property P2 are a candidate match.
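The three steps in the figure could be sketched per instance as follows. This is an assumption-laden illustration: `same_as` is a hypothetical one-hop dictionary standing in for ECR*, the function name only loosely mirrors the talk's procedure, and the second Dataset 2 triple is invented purely to show a non-matching pair.

```python
from collections import defaultdict


def generate_candidate_matches(i1, d1, d2, same_as):
    """Return {(p1, p2): (mu, lambda)} contributed by the single instance i1."""
    i2 = same_as.get(i1)  # Step 1: follow the co-reference link
    if i2 is None:
        return {}
    # Step 2: collect (property, object) pairs for both instances.
    t1 = {(p, o) for s, p, o in d1 if s == i1}
    t2 = {(p, o) for s, p, o in d2 if s == i2}
    p2_names = {p for p, _ in t2}

    counts = defaultdict(lambda: [0, 0])  # [mu, lambda]
    for p1, o1 in t1:
        for p2 in p2_names:
            counts[(p1, p2)][1] += 1            # subjects match: co-appearance
            if (p2, same_as.get(o1)) in t2:     # Step 3: objects match too
                counts[(p1, p2)][0] += 1
    return {pair: tuple(c) for pair, c in counts.items()}


d1 = [("d1:Willis_Lamb", "d1:doctoralStudent", "d1:Theodore_Maiman")]
d2 = [("d2:willis_lamb", "d2:education.academic.advisees",
       "d2:theodore_harold_maiman"),
      # Illustrative triple, not from the slides.
      ("d2:willis_lamb", "d2:hypothetical.other_property",
       "d2:hypothetical_object")]
same_as = {"d1:Willis_Lamb": "d2:willis_lamb",
           "d1:Theodore_Maiman": "d2:theodore_harold_maiman"}

result = generate_candidate_matches("d1:Willis_Lamb", d1, d2, same_as)
```

Only the doctoralStudent and advisees pair gains a match count; the other pair contributes to co-appearance alone, which lowers its ratio F.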
  • 26. Complexity: let the average number of properties per entity be x and the average number of objects per property be j. For n subjects, the algorithm requires n·j²·x² + 2n comparisons. Since n > j, n > x, and x and j are independent of n, the complexity is O(n).
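The step from the comparison count to O(n) can be written out as below. One reading, assumed here, is that the j²x² factor comes from comparing the roughly x·j triples of an instance against the x·j triples of its linked counterpart; the slide does not say what the 2n term counts. The values x = 10 and j = 2 are hypothetical and serve only to show the arithmetic.

```latex
\begin{align*}
T(n) &= n\,j^{2}x^{2} + 2n = n\,\bigl((xj)^{2} + 2\bigr) \\
     &= c\,n \quad\text{with } c = (xj)^{2} + 2 \text{ independent of } n
        \;\Longrightarrow\; T(n) = O(n). \\
\text{E.g. } x &= 10,\ j = 2:\quad T(n) = n\,(20^{2} + 2) = 402\,n .
\end{align*}
```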
  • 28. Parallel computation (Map-Reduce implementation). Candidate matches can be generated for each instance independently, so we implemented the algorithm on the Hadoop 1.0.3 framework. Candidate match generation is distributed among mappers, and each mapper outputs μ and λ for property pairs to the reducer. Map phase: let X be the set of subject instances in dataset D1 and ns the namespace of dataset D2; for each subject i ∈ X, start a mapper job for GenerateCandidateMatches(i, ns). Each mapper outputs (key, value) pairs of the form (p:q, μ(p,q):λ(p,q)), where p ∈ D1 and q ∈ D2. Reduce phase: the reducer collects the mapper outputs and aggregates μ(p,q) and λ(p,q) for each key p:q for the final analysis. On a 14-node cluster, the Map-Reduce version achieved a speedup of 833% over the desktop version.
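The aggregation in the reduce phase can be illustrated in plain Python. This sketch only simulates the data flow and does not use the Hadoop API; the mapper outputs are hypothetical sample data in the (p:q, μ:λ) form described on the slide, and `reduce_counts` is an illustrative name.

```python
from collections import defaultdict


def reduce_counts(mapper_outputs):
    """Sum mu and lambda per key "p:q" across all mapper outputs."""
    totals = defaultdict(lambda: [0, 0])
    for output in mapper_outputs:
        for key, value in output:
            mu, lam = (int(v) for v in value.split(":"))
            totals[key][0] += mu
            totals[key][1] += lam
    return {key: tuple(t) for key, t in totals.items()}


# Hypothetical output of three mappers, one per subject instance.
mapper_outputs = [
    [("p1:q1", "1:1"), ("p1:q2", "0:1")],
    [("p1:q1", "1:2")],
    [("p1:q1", "0:1"), ("p1:q2", "0:1")],
]

totals = reduce_counts(mapper_outputs)
# Ratio F = mu / lambda per property pair, ready for the (alpha, k) test.
ratios = {key: mu / lam for key, (mu, lam) in totals.items() if lam > 0}
```

Because each mapper's contribution is a pair of additive counts, the reducer needs no knowledge of the instances themselves, which is what makes the per-instance split straightforward.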
  • 32. Evaluation. Objectives of the evaluation: show the effectiveness of the approach on linked datasets, and compare it with existing alignment techniques. We selected 5000 instance samples from the DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB Explorer datasets. These datasets have complete data for instances from different viewpoints, many inter-links, and complex properties.
  • 37. Experiment details. α = 0.5 for all experiments (this value works for LOD), except the DBpedia and Freebase movie alignment, where it was 0.7. k was set to 14, 6, 2, 2, and 2 respectively for Person, Film, and Software between DBpedia and Freebase, Film between LinkedMDB and DBpedia, and Article between the DBLP datasets. k can be estimated from the data as follows: (1) set α = 0.5 and k = 2 (lowest positive values); (2) get the property pairs with exactly matching property names that the algorithm did not identify, along with their μ values; (3) take the average of those μ values. Baseline thresholds: 0.92 for the string similarity algorithms and 0.8 for WordNet similarity.
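One possible reading of the k estimation steps is sketched below. It assumes property names are compared by the part after the prefix, case-insensitively, and that per-pair (μ, λ) counts are already available; the counts shown are hypothetical, and the slide does not say whether the average is rounded.

```python
def local_name(prop):
    """Part of a prefixed name after the last colon (assumed naming scheme)."""
    return prop.rsplit(":", 1)[-1].lower()


def estimate_k(counts, alpha=0.5, base_k=2):
    """Average mu over exact-name pairs missed at (alpha, base_k)."""
    missed = []
    for (p, q), (mu, lam) in counts.items():
        identified = lam > 0 and mu >= base_k and mu / lam >= alpha
        if local_name(p) == local_name(q) and not identified:
            missed.append(mu)
    # Fall back to the starting k when no exact-name pair was missed.
    return sum(missed) / len(missed) if missed else base_k


# Hypothetical (mu, lambda) counts per property pair.
counts = {
    ("db:nationality", "fb:nationality"): (10, 30),  # F = 0.33, missed
    ("db:religion", "fb:religion"): (4, 20),         # F = 0.20, missed
    ("db:occupation", "fb:profession"): (50, 60),    # names differ, ignored
}

k_estimate = estimate_k(counts)
```

Here the two exact-name pairs that fall below α have μ values of 10 and 4, giving an estimate of 7.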
  • 38. Alignment results

| Algorithm | Measure | DBpedia – Freebase (Person) | DBpedia – Freebase (Film) | DBpedia – Freebase (Software) | DBpedia – LinkedMDB (Film) | DBLP_RKB – DBLP_L3S (Article) | Average |
|---|---|---|---|---|---|---|---|
| Extension Based Algorithm | Precision | 0.8758 | 0.9737 | 0.6478 | 0.7560 | 1.0000 | 0.8427 |
| Extension Based Algorithm | Recall | 0.8089* | 0.5138 | 0.4339 | 0.8157 | 1.0000 | 0.7145 |
| Extension Based Algorithm | F measure | 0.8410* | 0.6727 | 0.5197 | 0.7848 | 1.0000 | 0.7656 |
| WordNet Similarity | Precision | 0.5200 | 0.8620 | 0.7619 | 0.8823 | 1.0000 | 0.8052 |
| WordNet Similarity | Recall | 0.4140* | 0.3472 | 0.3018 | 0.3947 | 0.3333 | 0.3582 |
| WordNet Similarity | F measure | 0.4609* | 0.4950 | 0.4324 | 0.5454 | 0.5000 | 0.4867 |
| Dice Similarity | Precision | 0.8064 | 0.9666 | 0.7659 | 1.0000 | 0.0000 | 0.7078 |
| Dice Similarity | Recall | 0.4777* | 0.4027 | 0.3396 | 0.3421 | 0.0000 | 0.3124 |
| Dice Similarity | F measure | 0.6000* | 0.5686 | 0.4705 | 0.5098 | 0.0000 | 0.4298 |
| Jaro Similarity | Precision | 0.6774 | 0.8809 | 0.7755 | 0.9411 | 0.0000 | 0.6550 |
| Jaro Similarity | Recall | 0.5350* | 0.5138 | 0.3584 | 0.4210 | 0.0000 | 0.3656 |
| Jaro Similarity | F measure | 0.5978* | 0.6491 | 0.4903 | 0.5818 | 0.0000 | 0.4638 |

    * marks estimated values for experiment 1, where the comparisons were too numerous to check manually. Boldface in the original slide marks the highest result for each experiment.
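The F measure rows can be reproduced from the precision and recall rows, assuming the usual harmonic-mean definition F = 2PR / (P + R), which the slides do not state explicitly. The sketch below checks the Extension Based Algorithm rows; small differences in the last digit come from rounding in the reported inputs.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) for the Extension Based Algorithm rows.
extension_rows = {
    "DBpedia-Freebase (Person)": (0.8758, 0.8089, 0.8410),
    "DBpedia-Freebase (Film)": (0.9737, 0.5138, 0.6727),
    "DBpedia-Freebase (Software)": (0.6478, 0.4339, 0.5197),
    "DBpedia-LinkedMDB (Film)": (0.7560, 0.8157, 0.7848),
    "DBLP_RKB-DBLP_L3S (Article)": (1.0000, 1.0000, 1.0000),
}

computed = {name: f_measure(p, r) for name, (p, r, _) in extension_rows.items()}
```

The zero guard matters for the Dice and Jaro rows on the DBLP experiment, where precision and recall are both 0.0000.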
  • 40. Example identifications

| Property pair types | Dataset 1 (DBpedia) | Dataset 2 (Freebase) |
|---|---|---|
| Simple string similarity matches | db:nationality | fb:nationality |
| Simple string similarity matches | db:religion | fb:religion |
| Synonymous matches | db:occupation | fb:profession |
| Synonymous matches | db:battles | fb:participated_in_conflicts |
| Complex matches | db:screenplay | fb:written_by |
| Complex matches | db:doctoralStudent | fb:advisees |

    WordNet similarity failed to identify any of these.
  • 46. Discussion, interesting facts, and future directions. Our experiments covered multi-domain to multi-domain, multi-domain to specific-domain, and specific-domain to specific-domain dataset property alignment. In every experiment, the extension based algorithm outperformed the others on F measure; the F measure gain is in the range of 57% to 78%. Some identified properties are intensionally different, e.g., db:distributor vs fb:production_companies, because many companies both produce and distribute their films. Some identified pairs are incorrect due to errors in data modeling, for example db:issue and fb:children. LOD also has owl:sameAs linking issues (links between things that are not exactly the same), e.g., linking London and Greater London. We believe a few misused links won't affect the algorithm, as it decides on a match only after analyzing many matches for a pair.
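The 57% to 78% range appears consistent with the relative gain of the extension based algorithm's average F measure over each baseline's average F measure in the results table. That derivation is an inference made here, not something the slides state; the sketch below just carries out the arithmetic on the reported averages.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Average F measure per algorithm, taken from the alignment results table.
average_f = {
    "Extension Based Algorithm": 0.7656,
    "WordNet Similarity": 0.4867,
    "Dice Similarity": 0.4298,
    "Jaro Similarity": 0.4638,
}

ours = average_f["Extension Based Algorithm"]
gains = {
    name: relative_gain(ours, f)
    for name, f in average_f.items()
    if name != "Extension Based Algorithm"
}
```

Under this reading, the smallest gain is over WordNet similarity and the largest is over Dice similarity, with Jaro similarity in between.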
  • 49. Few interlinks: these evolve over time, and other possible types of ECR links (e.g., rdfs:seeAlso) can be explored. Properties are not uniformly distributed in a dataset: some properties do not have enough matches or appearances because they belong to rare classes and domains; we can run the algorithm iteratively on instances in which these less frequent properties appear. Current limitations: requires ECR links; requires overlapping datasets; handles only object-type properties; cannot identify property – sub-property relationships.
  • 54. Conclusion. We approximate owl:equivalentProperty using Statistical Equivalence of properties by analyzing property extensions, which is schema independent. This novel extension based approach works well with interlinked datasets. The extension based approach outperforms syntax or dictionary based approaches, with an F measure gain in the range of 57% to 78%. It requires many comparisons, but can be easily parallelized, as evidenced by our Map-Reduce implementation.
  • 55. Thank You. Questions? http://knoesis.wright.edu/researchers/kalpa, kalpa@knoesis.org. Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, Ohio, USA.

Editor's Notes

  • #34–#38: k is low for some datasets because they contain rarely appearing property pairs as well as popular ones. Because of this, a value of k as low as 2 sometimes produces better results, though possibly with low confidence.