A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data
Kalpa Gunaratna†, Krishnaprasad Thirunarayan†, Prateek Jain‡, Amit Sheth†, and Sanjaya Wijeratne†
†Kno.e.sis Center, Wright State University, Dayton OH, USA
‡IBM T J Watson Research Center, Yorktown Heights, New York NY, USA
{kalpa,tkprasad,amit,sanjaya}@knoesis.org, jainpr@us.ibm.com
iSemantics 2013, Graz, Austria
09/06/2013 2
Motivation
Why do we need property alignment, and why is it so important?
Many datasets. We can query!
Same information under different names.
Therefore, data integration is required for better presentation.
Roadmap
– Background
– Statistical Equivalence of properties
– Evaluation
– Discussion, interesting facts, and future directions
– Conclusion
Background
Existing techniques for property alignment fall into three categories.
I. Syntactic/dictionary based
– Use string manipulation techniques, external dictionaries, and lexical databases such as WordNet.
II. Schema dependent
– Use schema information such as domain, range, and definitions.
III. Schema independent
– Use instance-level information for the alignment.
Our approach is schema independent.
Properties capture the meaning of triples and are therefore complex in nature.
Syntactic or dictionary-based approaches analyze property names for equivalence, but property names are heterogeneous in LOD.
Therefore, syntactic or dictionary-based approaches have limited coverage in property alignment.
Schema-dependent approaches, which process domain, range, and class-level tags, do not capture the semantics of properties well.
Statistical Equivalence of properties
• Statistical Equivalence is based on analyzing owl:equivalentProperty.
• owl:equivalentProperty: properties that have the same property extension.
Example 1:
Property P is defined by the triples { a P b, c P d, e P f }
Property Q is defined by the triples { a Q b, c Q d, e Q f }
P and Q are owl:equivalentProperty because they have the same extension, { (a,b), (c,d), (e,f) }
Example 2:
Property P is defined by the triples { a P b, c P d, e P f }
Property Q is defined by the triples { a Q b, c Q d, e Q h }
P and Q are not owl:equivalentProperty because their extensions differ. However, two of the three subject-object pairs match, which is statistical evidence in support of equivalence.
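Examples 1 and 2 can be reproduced with a short sketch. It treats a property extension as a set of (subject, object) pairs and reports how much two extensions share. The helper name `extension` and the use of a Jaccard-style ratio are assumptions made for illustration; the slides do not state that this ratio is the measure used by the authors.

```python
def extension(triples, prop):
    """Return the set of (subject, object) pairs asserted with `prop`."""
    return {(s, o) for s, p, o in triples if p == prop}

# Triples from Example 2 on the slide.
triples = [
    ("a", "P", "b"), ("c", "P", "d"), ("e", "P", "f"),
    ("a", "Q", "b"), ("c", "Q", "d"), ("e", "Q", "h"),
]

ext_p = extension(triples, "P")
ext_q = extension(triples, "Q")

# Strict equivalence in the owl:equivalentProperty sense needs identical extensions.
strictly_equivalent = ext_p == ext_q

# Shared pairs are the statistical evidence; the ratio is an illustrative
# Jaccard overlap, not a measure taken from the slides.
shared = ext_p & ext_q
overlap = len(shared) / len(ext_p | ext_q)

print(strictly_equivalent, sorted(shared), overlap)
```

Replacing `("e", "Q", "h")` with `("e", "Q", "f")` gives the situation of Example 1, where the two extensions are identical.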
Intuition
• A higher rate of subject-object matches between extensions is stronger evidence that two properties are equivalent. In practice, matching properties rarely have exactly the same extension, because:
– Datasets are incomplete.
– The same instance may be modelled differently in different datasets.
• Therefore, we analyze the property extensions to identify equivalent properties between datasets.
• We define the following notions. The statement below holds for all the definitions.
Let S1 P1 O1 and S2 P2 O2 be two triples in datasets D1 and D2, respectively.
Candidate Matching Algorithm Process
[Figure: candidate matching between Dataset 1 and Dataset 2, shown as triples 1 to 5]
Step 1: I1 = d1:Willis_Lamb in Dataset 1 is linked by owl:sameAs to I2 = d2:willis_lamb in Dataset 2.
Step 2: The triples of I1 and I2 are retrieved, including I1 P1 d1:Theodore_Maiman with P1 = d1:doctoralStudent and I2 P2 d2:theodore_harold_maiman with P2 = d2:education.academic.advisees.
Step 3: d1:Theodore_Maiman and d2:theodore_harold_maiman are matching resources, so property P1 and property P2 are a candidate match.
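The three steps of the candidate matching process can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `candidate_matches` is hypothetical, the owl:sameAs links are represented as a plain dictionary from Dataset 1 resources to Dataset 2 resources, and two objects are taken to be matching resources only when such a link exists between them.

```python
from collections import Counter

def candidate_matches(triples1, triples2, same_as):
    """Count, per property pair, how often linked subjects have linked objects.

    `same_as` maps a Dataset 1 resource to its Dataset 2 counterpart and
    stands in for owl:sameAs links (an assumption of this sketch).
    """
    # Index Dataset 2 triples by subject for quick lookup.
    by_subject2 = {}
    for s, p, o in triples2:
        by_subject2.setdefault(s, []).append((p, o))

    counts = Counter()
    for s1, p1, o1 in triples1:
        s2 = same_as.get(s1)                      # Step 1: follow the subject link
        if s2 is None:
            continue
        for p2, o2 in by_subject2.get(s2, []):    # Step 2: triples of the linked subject
            if same_as.get(o1) == o2:             # Step 3: objects are matching resources
                counts[(p1, p2)] += 1
    return counts

# Resources taken from the slide's example.
d1 = [("d1:Willis_Lamb", "d1:doctoralStudent", "d1:Theodore_Maiman")]
d2 = [("d2:willis_lamb", "d2:education.academic.advisees", "d2:theodore_harold_maiman")]
links = {
    "d1:Willis_Lamb": "d2:willis_lamb",
    "d1:Theodore_Maiman": "d2:theodore_harold_maiman",
}

counts = candidate_matches(d1, d2, links)
print(counts)
```

Accumulating the counts per property pair over many subjects is what allows a later decision step to rely on many matches rather than a single one.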
Complexity:
Let x be the average number of properties per entity and j the average number of objects per property. For n subjects, the algorithm requires n·j²·x² + 2n comparisons. Since n > j, n > x, and x and j are independent of n, the complexity is O(n).
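A small worked calculation illustrates the linear growth in n stated in the complexity formula. The function name `comparisons` and the values x = 10 and j = 2 are hypothetical, chosen only to make the arithmetic concrete.

```python
def comparisons(n, x, j):
    """Number of comparisons from the slide's formula: n * j^2 * x^2 + 2n."""
    return n * j**2 * x**2 + 2 * n

# Hypothetical averages: 10 properties per entity, 2 objects per property.
small = comparisons(5000, x=10, j=2)
large = comparisons(10000, x=10, j=2)

print(small, large, large / small)
```

With x and j held fixed, doubling the number of subjects doubles the number of comparisons, which is the behaviour the O(n) claim describes.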
Evaluation
Objectives of the evaluation
– Show the effectiveness of the approach on linked datasets
– Compare with existing alignment techniques
We selected 5000 instance samples from the DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB Explorer datasets.
These datasets have:
– Complete data for instances from different viewpoints
– Many inter-links
– Complex properties
Experiment details
– α = 0.5 for all experiments (this value works for LOD), except the DBpedia and Freebase movie alignment, where it was 0.7.
– k was set to 14, 6, and 2 for Person, Film, and Software between DBpedia and Freebase, to 2 for Film between LinkedMDB and DBpedia, and to 2 for Article between the DBLP datasets.
– k can be estimated from the data as follows:
  – Set α = 0.5 and k = 2 (lowest positive values).
  – Collect the property pairs with exactly matching property names that the algorithm did not identify, together with their μ values.
  – Take the average of those μ values.
– Similarity thresholds for the baselines: 0.92 for the string similarity algorithms and 0.8 for WordNet similarity.
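The procedure for estimating k can be sketched as below. Several things here are assumptions, because the slides that define μ did not survive extraction: μ is treated as a numeric score that the algorithm reports per property pair, "exact matching property names" is read as equal local names after the prefix, and the function names and all μ values are hypothetical.

```python
from statistics import mean

def local_name(prop):
    """Strip a 'prefix:' from a property identifier (assumed naming scheme)."""
    return prop.split(":", 1)[-1]

def estimate_k(mu, identified):
    """Average mu over name-identical pairs that the algorithm missed.

    `mu` maps (property1, property2) to its mu value from a run with
    alpha = 0.5 and k = 2; `identified` is the set of pairs that run returned.
    """
    missed = [
        value
        for (p1, p2), value in mu.items()
        if local_name(p1) == local_name(p2) and (p1, p2) not in identified
    ]
    return mean(missed) if missed else None

# Hypothetical mu values, for illustration only.
mu = {
    ("db:nationality", "fb:nationality"): 6,
    ("db:religion", "fb:religion"): 4,
    ("db:occupation", "fb:profession"): 20,
}
identified = {("db:occupation", "fb:profession")}

k_estimate = estimate_k(mu, identified)
print(k_estimate)
```

Only the two name-identical pairs contribute to the average here; the synonymous pair is ignored because its names differ and it was already identified.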
Alignment results

Algorithm            Measure     DBpedia–Freebase  DBpedia–Freebase  DBpedia–Freebase  DBpedia–LinkedMDB  DBLP_RKB–DBLP_L3S  Average
                                 (Person)          (Film)            (Software)        (Film)             (Article)
Extension Based      Precision   0.8758            0.9737            0.6478            0.7560             1.0000             0.8427
                     Recall      0.8089*           0.5138            0.4339            0.8157             1.0000             0.7145
                     F measure   0.8410*           0.6727            0.5197            0.7848             1.0000             0.7656
WordNet Similarity   Precision   0.5200            0.8620            0.7619            0.8823             1.0000             0.8052
                     Recall      0.4140*           0.3472            0.3018            0.3947             0.3333             0.3582
                     F measure   0.4609*           0.4950            0.4324            0.5454             0.5000             0.4867
Dice Similarity      Precision   0.8064            0.9666            0.7659            1.0000             0.0000             0.7078
                     Recall      0.4777*           0.4027            0.3396            0.3421             0.0000             0.3124
                     F measure   0.6000*           0.5686            0.4705            0.5098             0.0000             0.4298
Jaro Similarity      Precision   0.6774            0.8809            0.7755            0.9411             0.0000             0.6550
                     Recall      0.5350*           0.5138            0.3584            0.4210             0.0000             0.3656
                     F measure   0.5978*           0.6491            0.4903            0.5818             0.0000             0.4638

* Marks estimated values for experiment 1 because of very large comparisons to check manually.
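The F measure rows in the alignment results are consistent with the standard harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes two cells of the extension-based algorithm from the precision and recall values in the table; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extension-based algorithm, values taken from the results table.
person = f_measure(0.8758, 0.8089)   # DBpedia-Freebase (Person)
film = f_measure(0.9737, 0.5138)     # DBpedia-Freebase (Film)

print(round(person, 4), round(film, 4))
```

The zero guard covers the Dice and Jaro rows for the DBLP article experiment, where precision and recall are both 0.0000.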
Example identifications

Property pair type          Dataset 1 (DBpedia)   Dataset 2 (Freebase)
Simple string similarity    db:nationality        fb:nationality
                            db:religion           fb:religion
Synonymous                  db:occupation         fb:profession
                            db:battles            fb:participated_in_conflicts
Complex                     db:screenplay         fb:written_by
                            db:doctoralStudent    fb:advisees

WordNet similarity failed to identify any of these.
Discussion, interesting facts, and future directions
• Our experiments covered multi-domain to multi-domain, multi-domain to specific-domain, and specific-domain to specific-domain dataset property alignment.
• In every experiment, the extension-based algorithm outperformed the others on F measure. The F measure gain is in the range of 57% to 78%.
• Some identified property pairs differ in intended meaning, e.g., db:distributor vs fb:production_companies.
– They are matched because many companies both produce and distribute their films.
• Some identified pairs are incorrect due to errors in data modeling.
– For example, db:issue and fb:children.
• owl:sameAs linking issues in LOD (links between things that are not exactly the same), e.g., linking London and Greater London.
– We believe a few misused links won't affect the algorithm, because it decides on a match only after analyzing many matches for a pair.
• Few interlinks.
– Interlinks evolve over time.
– Look for other possible types of ECR links (e.g., rdfs:seeAlso).
• Properties are not uniformly distributed in a dataset.
– Hence, some properties do not have enough matches or appearances.
– This is due to the rare classes and domains they belong to.
– We can run the algorithm iteratively on instances in which these less frequent properties appear.
Current limitations:
– Requires ECR links
– Requires overlapping datasets
– Handles only object-type properties
– Cannot identify property – sub-property relationships
Conclusion
We approximate owl:equivalentProperty with Statistical Equivalence of properties, obtained by analyzing property extensions; the approach is schema independent.
This novel extension-based approach works well with interlinked datasets.
The extension-based approach outperforms syntax- or dictionary-based approaches, with an F measure gain in the range of 57% to 78%.
It requires many comparisons, but can be easily parallelized, as evidenced by our Map-Reduce implementation.
Thank You
Questions?
http://knoesis.wright.edu/researchers/kalpa
kalpa@knoesis.org
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA

More Related Content

PDF
Top-K Dominating Queries on Incomplete Data with Priorities
PDF
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA
PDF
R Programming For Beginners | R Language Tutorial | R Tutorial For Beginners ...
PDF
Implementing a data science project (R Version) Part1
PPTX
Implementing a data_science_project (Python Version)_part1
PPTX
Deductive databases
PDF
Database Programming using SQL
PPTX
Recommendation Engine Powered by Hadoop
Top-K Dominating Queries on Incomplete Data with Priorities
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA
R Programming For Beginners | R Language Tutorial | R Tutorial For Beginners ...
Implementing a data science project (R Version) Part1
Implementing a data_science_project (Python Version)_part1
Deductive databases
Database Programming using SQL
Recommendation Engine Powered by Hadoop

What's hot (20)

PDF
The Status of ML Algorithms for Structure-property Relationships Using Matb...
PDF
Z04506138145
PDF
Analysis of different similarity measures: Simrank
PDF
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
PPTX
Document ranking using qprp with concept of multi dimensional subspace
PPS
Data Structure
PDF
Ginix Generalized Inverted Index for Keyword Search
PDF
geekgap.io webinar #1
PDF
A Primer on Entity Resolution
PDF
Rethinking Data-Intensive Science Using Scalable Analytics Systems
PDF
Mapping Domain Names to Categories
PPTX
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
PPT
Presentation dual inversion-index
PPTX
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
PPTX
3. Stack - Data Structures using C++ by Varsha Patil
PDF
Extracting and Making Use of Materials Data from Millions of Journal Articles...
PPTX
5. Queue - Data Structures using C++ by Varsha Patil
PDF
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
PDF
Automated building of taxonomies for search engines
PDF
Ijetcas14 347
The Status of ML Algorithms for Structure-property Relationships Using Matb...
Z04506138145
Analysis of different similarity measures: Simrank
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Document ranking using qprp with concept of multi dimensional subspace
Data Structure
Ginix Generalized Inverted Index for Keyword Search
geekgap.io webinar #1
A Primer on Entity Resolution
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Mapping Domain Names to Categories
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
Presentation dual inversion-index
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
Extracting and Making Use of Materials Data from Millions of Journal Articles...
5. Queue - Data Structures using C++ by Varsha Patil
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Automated building of taxonomies for search engines
Ijetcas14 347
Ad

Viewers also liked (17)

PDF
Trust networks tutorial-iicai-12-15-2011
PPTX
Ieee metadata-conf-1999-keynote-amit sheth
PPTX
Semantic Computing in Real-World: Vertical and Horizontal application
PPT
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
PPTX
Semantic Web vision and its relevance to Open Digital Data for MGI
PPTX
kHealth: Proactive Personalized Actionable Information for Better Healthcare
PPTX
Domain Identification for Linked Open Data
PDF
Trust networks infotech2010
PPTX
Inside the Mind of Watson: Cognitive Computing
PPTX
Real Time Semantic Analysis of Streaming Sensor Data
PPS
2011 national geographic_photos
PPTX
Cursing in English on Twitter at CSCW 2014
PPTX
User Interests Identification From Twitter using Hierarchical Knowledge Base
PPTX
Semantic (Web) Technologies for Translational Research in Life Sciences
PDF
NCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
PPTX
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Trust networks tutorial-iicai-12-15-2011
Ieee metadata-conf-1999-keynote-amit sheth
Semantic Computing in Real-World: Vertical and Horizontal application
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
Semantic Web vision and its relevance to Open Digital Data for MGI
kHealth: Proactive Personalized Actionable Information for Better Healthcare
Domain Identification for Linked Open Data
Trust networks infotech2010
Inside the Mind of Watson: Cognitive Computing
Real Time Semantic Analysis of Streaming Sensor Data
2011 national geographic_photos
Cursing in English on Twitter at CSCW 2014
User Interests Identification From Twitter using Hierarchical Knowledge Base
Semantic (Web) Technologies for Translational Research in Life Sciences
NCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Ad

Similar to Property Alignment on Linked Open Data (20)

PPTX
A Statistical and Schema Independent Approach to Identify Equivalent Properti...
PDF
Mapping Keywords to
PDF
Recommender Systems in the Linked Data era
PDF
Data Science as a Career and Intro to R
PDF
Group13 kdd cup_report_submitted
PDF
IEEE Datamining 2016 Title and Abstract
PDF
Gooey data sets
PDF
IRJET - Student Future Prediction System under Filtering Mechanism
PPTX
Graph_Databases__And_Its_Usage_Presentation.pptx
PPTX
Graph_Database_Prepared_by_Ali_Rajab.pptx
PDF
Partial Object Detection in Inclined Weather Conditions
PDF
Neural Semi-supervised Learning under Domain Shift
PDF
MapReduce and Its Discontents
PDF
[SAC2014]Splitting Approaches for Context-Aware Recommendation: An Empirical ...
PDF
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
PDF
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
PDF
Logistics Data Analyst Internship RRD
PDF
Detecting Gender-bias from Energy Modeling Jobscape
PDF
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
PDF
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
A Statistical and Schema Independent Approach to Identify Equivalent Properti...
Mapping Keywords to
Recommender Systems in the Linked Data era
Data Science as a Career and Intro to R
Group13 kdd cup_report_submitted
IEEE Datamining 2016 Title and Abstract
Gooey data sets
IRJET - Student Future Prediction System under Filtering Mechanism
Graph_Databases__And_Its_Usage_Presentation.pptx
Graph_Database_Prepared_by_Ali_Rajab.pptx
Partial Object Detection in Inclined Weather Conditions
Neural Semi-supervised Learning under Domain Shift
MapReduce and Its Discontents
[SAC2014]Splitting Approaches for Context-Aware Recommendation: An Empirical ...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
Logistics Data Analyst Internship RRD
Detecting Gender-bias from Energy Modeling Jobscape
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY
MR – RANDOM FOREST ALGORITHM FOR DISTRIBUTED ACTION RULES DISCOVERY

Recently uploaded (20)

PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Cell Types and Its function , kingdom of life
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Classroom Observation Tools for Teachers
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
Pre independence Education in Inndia.pdf
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
Institutional Correction lecture only . . .
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Renaissance Architecture: A Journey from Faith to Humanism
Complications of Minimal Access Surgery at WLH
Cell Types and Its function , kingdom of life
O5-L3 Freight Transport Ops (International) V1.pdf
Microbial disease of the cardiovascular and lymphatic systems
Classroom Observation Tools for Teachers
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PPH.pptx obstetrics and gynecology in nursing
Pre independence Education in Inndia.pdf
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
human mycosis Human fungal infections are called human mycosis..pptx
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Abdominal Access Techniques with Prof. Dr. R K Mishra
Institutional Correction lecture only . . .

Property Alignment on Linked Open Data

  • 1. A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data †Kno.e.sis Center Wright State University Dayton OH, USA ‡IBM T J Watson Research Center Yorktown Heights New York NY, USA Kalpa Gunaratna†, Krishnaprasad Thirunarayan†, Prateek Jain‡, Amit Sheth†, and Sanjaya Wijeratne† {kalpa,tkprasad,amit,sanjaya}@knoesis.org, jainpr@us.ibm.com iSemantics 2013, Graz, Austria
  • 2. 09/06/2013 2 Motivation Why we need property alignment and it is so important? iSemantics 2013
  • 3. 09/06/2013 3 Many datasets. We can query! iSemantics 2013
  • 5. 09/06/2013 3 Same information in different names. Therefore, data integration for better presentation is required. iSemantics 2013
  • 6. Background Statistical Equivalence of properties Evaluation Discussion, interesting facts, and future directions Conclusion 09/06/2013 4 Roadmap iSemantics 2013
  • 7. Existing techniques for property alignment fall into three categories. I. Syntactic/dictionary based – Uses string manipulation techniques, external dictionaries and lexical databases like WordNet. II. Schema dependent – Uses schema information such as, domain and range, definitions. III. Schema independent – Uses instance level information for the alignment.  Our approach falls under schema independent. 09/06/2013 5 Background iSemantics 2013
  • 8. Properties capture meaning of triples and hence they are complex in nature.    09/06/2013 6iSemantics 2013
  • 9. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist.   09/06/2013 6iSemantics 2013
  • 10. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist. Therefore, syntactic or dictionary based approaches have limited coverage in property alignment.  09/06/2013 6iSemantics 2013
  • 11. Properties capture meaning of triples and hence they are complex in nature. Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist. Therefore, syntactic or dictionary based approaches have limited coverage in property alignment. Schema dependent approaches including processing domain and range, class level tags do not capture semantics of properties well. 09/06/2013 6iSemantics 2013
  • 12. Background Statistical Equivalence of properties Evaluation Discussion, interesting facts, and future directions Conclusion 09/06/2013 7 Roadmap iSemantics 2013
  • 13.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions.          09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 14.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions. Example 1: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q f } P and Q are owl:equivalentProperty, because they have the same extension, { {a,b}, {c,d}, {e,f} }     09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 15.  Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions. Example 1: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q f } P and Q are owl:equivalentProperty, because they have the same extension, { {a,b}, {c,d}, {e,f} } Example 2: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q h } Then, P and Q are not owl:equivalentProperty, because their extensions are not the same. But they provide statistical evidence in support of equivalence. 09/06/2013 8 Statistical Equivalence of properties iSemantics 2013
  • 16. Intuition  Higher rate of subject-object matches in extensions leads to equivalent properties. In practice, it is hard to have exact same extensions for matching properties. Because, – Datasets are incomplete. – Same instance may be modelled differently in different datasets.  Therefore, we analyze the property extensions to identify equivalent properties between datasets.  We define the following notions. Let the statement below be true for all the definitions. S1P1O1 and S2P2O2 be two triples in Dataset D1 and D2 respectively. 09/06/2013 iSemantics 2013 9
  • 21. 09/06/2013 iSemantics 2013 12 Dataset 2Dataset 1 Candidate Matching Algorithm Process
  • 22. 09/06/2013 iSemantics 2013 12 I1 Dataset 2Dataset 1 I1=d1:Willis_Lamb Candidate Matching Algorithm Process
  • 23. 09/06/2013 iSemantics 2013 12 I1 I2 owl:sameAs Dataset 2Dataset 1 I1=d1:Willis_Lamb I2 =d2:willis_lamb Step 1 Candidate Matching Algorithm Process
  • 24. 09/06/2013 iSemantics 2013 12 I1 I2 I2 owl:sameAs P1=d1:doctoralStudent P2=d2:education. academic.advisees Dataset 2Dataset 1 d2:theodore_harold_maiman I1=d1:Willis_Lamb I2 =d2:willis_lamb I1 I1 d1:Theodore_Maiman triple 1 triple 2 triple 3 triple 4 triple 5 Step 1 Step2 Step2 Candidate Matching Algorithm Process
  • 25. Candidate Matching Algorithm Process (figure)
    Step 1: Instance I1 = d1:Willis_Lamb in Dataset 1 is linked by owl:sameAs to instance I2 = d2:willis_lamb in Dataset 2.
    Step 2: The triples with I1 and I2 as subject are retrieved (triple 1 to triple 5 in the figure). They include I1 P1 d1:Theodore_Maiman with P1 = d1:doctoralStudent, and I2 P2 d2:theodore_harold_maiman with P2 = d2:education.academic.advisees.
    Step 3: The objects d1:Theodore_Maiman and d2:theodore_harold_maiman are matching resources, so property P1 and property P2 are a candidate match.
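The three steps in the figure can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' code: the function name `candidate_matches` is hypothetical, and the sketch assumes that "matching resources" means the two objects are themselves connected by an owl:sameAs link.

```python
from collections import Counter


def candidate_matches(d1, d2, same_as):
    """Count subject-object matches for every property pair.

    d1, d2  : lists of (subject, property, object) triples
    same_as : set of (d1_resource, d2_resource) owl:sameAs links
    """
    by_subject_1, by_subject_2 = {}, {}
    for s, p, o in d1:
        by_subject_1.setdefault(s, []).append((p, o))
    for s, p, o in d2:
        by_subject_2.setdefault(s, []).append((p, o))

    counts = Counter()
    for i1, i2 in same_as:                           # Step 1: linked subjects
        for p1, o1 in by_subject_1.get(i1, []):      # Step 2: triples of I1
            for p2, o2 in by_subject_2.get(i2, []):  # Step 2: triples of I2
                if (o1, o2) in same_as:              # Step 3: objects match
                    counts[(p1, p2)] += 1
    return counts


d1 = [("d1:Willis_Lamb", "d1:doctoralStudent", "d1:Theodore_Maiman")]
d2 = [("d2:willis_lamb", "d2:education.academic.advisees",
       "d2:theodore_harold_maiman")]
links = {
    ("d1:Willis_Lamb", "d2:willis_lamb"),
    ("d1:Theodore_Maiman", "d2:theodore_harold_maiman"),
}
matches = candidate_matches(d1, d2, links)
```

On the Willis Lamb example, the only property pair that receives a match is (d1:doctoralStudent, d2:education.academic.advisees). Whether a candidate pair is finally accepted depends on the thresholds α and k discussed under the experiment details.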
  • 26. Complexity: Let x be the average number of properties per entity and j the average number of objects per property. For n subjects, the algorithm requires n·j²·x² + 2n comparisons. Since n > j and n > x, and x and j are independent of n, the complexity is O(n).
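To make the linear growth concrete, the comparison count from this slide can be evaluated for a few sizes. The function name `comparisons` and the values chosen for j and x below are made up for illustration only.

```python
def comparisons(n, j, x):
    """Comparison count n*j^2*x^2 + 2n, as stated on the slide."""
    return n * j**2 * x**2 + 2 * n


# With j and x held fixed, doubling n doubles the number of comparisons.
small = comparisons(1000, j=3, x=5)
large = comparisons(2000, j=3, x=5)
```

Because j and x act as constants, the factor j²·x² + 2 only scales the count and does not change the O(n) growth.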
  • 32. Evaluation
    Objectives of the evaluation:
      – Show the effectiveness of the approach on linked datasets.
      – Compare it with existing alignment techniques.
    We selected samples of 5000 instances from the DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB Explorer datasets. These datasets have:
      – Complete data for instances, described from different viewpoints
      – Many inter-links
      – Complex properties
  • 37. Experiment details
      – α = 0.5 for all experiments (this value works for LOD), except the DBpedia and Freebase movie alignment, where it was 0.7.
      – k was set to 14, 6, and 2 for Person, Film, and Software between DBpedia and Freebase, to 2 for Film between LinkedMDB and DBpedia, and to 2 for Article between the DBLP datasets.
      – k can be estimated from the data as follows:
          – Set α = 0.5 and k = 2 (the lowest positive values).
          – Collect the property pairs with exactly matching property names that the algorithm did not identify, together with their μ values.
          – Take the average of those μ values.
      – The threshold was 0.92 for the string similarity algorithms.
      – The threshold was 0.8 for WordNet similarity.
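The estimation procedure for k can be sketched as follows. The definition of μ is on slides not included in this transcript, so the sketch simply treats μ as a number attached to each property pair; the function names and all of the counts in the example are hypothetical.

```python
def local_name(prop):
    """Strip the dataset prefix, e.g. 'db:nationality' -> 'nationality'."""
    return prop.split(":", 1)[-1].lower()


def estimate_k(mu, identified):
    """Average mu over exact-name pairs that the algorithm missed.

    mu         : dict mapping (p1, p2) to its mu value from a run with
                 alpha = 0.5 and k = 2
    identified : set of (p1, p2) pairs the algorithm reported as matches
    """
    missed = [
        value
        for (p1, p2), value in mu.items()
        if local_name(p1) == local_name(p2) and (p1, p2) not in identified
    ]
    return sum(missed) / len(missed) if missed else None


# Made-up mu values, for illustration only.
mu = {
    ("db:nationality", "fb:nationality"): 10,
    ("db:religion", "fb:religion"): 6,
    ("db:occupation", "fb:profession"): 20,
}
identified = {("db:occupation", "fb:profession")}
k_estimate = estimate_k(mu, identified)
```

Returning `None` when no exact-name pair was missed is a design choice of this sketch; the slide does not say what to do in that case.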
  • 38. Alignment results

    | Algorithm          | Measure   | DBpedia – Freebase (Person) | DBpedia – Freebase (Film) | DBpedia – Freebase (Software) | DBpedia – LinkedMDB (Film) | DBLP_RKB – DBLP_L3S (Article) | Average |
    | Extension Based    | Precision | 0.8758  | 0.9737 | 0.6478 | 0.7560 | 1.0000 | 0.8427 |
    | Extension Based    | Recall    | 0.8089* | 0.5138 | 0.4339 | 0.8157 | 1.0000 | 0.7145 |
    | Extension Based    | F measure | 0.8410* | 0.6727 | 0.5197 | 0.7848 | 1.0000 | 0.7656 |
    | WordNet Similarity | Precision | 0.5200  | 0.8620 | 0.7619 | 0.8823 | 1.0000 | 0.8052 |
    | WordNet Similarity | Recall    | 0.4140* | 0.3472 | 0.3018 | 0.3947 | 0.3333 | 0.3582 |
    | WordNet Similarity | F measure | 0.4609* | 0.4950 | 0.4324 | 0.5454 | 0.5000 | 0.4867 |
    | Dice Similarity    | Precision | 0.8064  | 0.9666 | 0.7659 | 1.0000 | 0.0000 | 0.7078 |
    | Dice Similarity    | Recall    | 0.4777* | 0.4027 | 0.3396 | 0.3421 | 0.0000 | 0.3124 |
    | Dice Similarity    | F measure | 0.6000* | 0.5686 | 0.4705 | 0.5098 | 0.0000 | 0.4298 |
    | Jaro Similarity    | Precision | 0.6774  | 0.8809 | 0.7755 | 0.9411 | 0.0000 | 0.6550 |
    | Jaro Similarity    | Recall    | 0.5350* | 0.5138 | 0.3584 | 0.4210 | 0.0000 | 0.3656 |
    | Jaro Similarity    | F measure | 0.5978* | 0.6491 | 0.4903 | 0.5818 | 0.0000 | 0.4638 |

    * Marks estimated values for experiment 1, where the comparisons were too numerous to check manually. The original slide sets the highest result for each experiment in boldface.
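The F measure rows in this table are consistent with the usual harmonic mean of precision and recall. The short sketch below recomputes two cells from the Extension Based rows; the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Extension Based algorithm, DBpedia - Freebase (Person) and (Film).
person_f = f_measure(0.8758, 0.8089)
film_f = f_measure(0.9737, 0.5138)
```

The guard for the zero case matters here because the Dice and Jaro baselines score 0.0000 for both precision and recall on the DBLP experiment.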
  • 40. Example identifications

    | Property pair type             | Dataset 1 (DBpedia) | Dataset 2 (Freebase)         |
    | Simple string similarity match | db:nationality      | fb:nationality               |
    | Simple string similarity match | db:religion         | fb:religion                  |
    | Synonymous match               | db:occupation       | fb:profession                |
    | Synonymous match               | db:battles          | fb:participated_in_conflicts |
    | Complex match                  | db:screenplay       | fb:written_by                |
    | Complex match                  | db:doctoralStudent  | fb:advisees                  |

    WordNet similarity failed to identify any of these.
  • 46. Discussion, interesting facts, and future directions
      – Our experiments covered property alignment between multi-domain and multi-domain, multi-domain and specific-domain, and specific-domain and specific-domain datasets.
      – In every experiment, the extension based algorithm outperformed the others on F measure. The gain in average F measure over the baselines ranges from 57% to 78%.
      – Some identified property pairs differ in meaning, e.g., db:distributor vs fb:production_companies. They are matched because many companies both produce and distribute their films.
      – Some identified pairs are incorrect because of errors in data modeling, for example db:issue and fb:children.
      – LOD has owl:sameAs linking issues, where a link does not connect exactly the same thing, e.g., linking London and Greater London. We believe a few misused links won't affect the algorithm, because it decides on a match only after analyzing many matches for a pair.
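The 57% to 78% range appears to correspond to the relative gain in average F measure of the extension based algorithm over each baseline, using the Average column of the alignment results table. The sketch below recomputes it; the variable and function names are hypothetical.

```python
# Average F measure per algorithm, copied from the alignment results table.
EXTENSION_F = 0.7656
BASELINE_F = {
    "WordNet Similarity": 0.4867,
    "Dice Similarity": 0.4298,
    "Jaro Similarity": 0.4638,
}


def relative_gain_percent(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100


gains = {
    name: round(relative_gain_percent(EXTENSION_F, f))
    for name, f in BASELINE_F.items()
}
```

The smallest gain is over WordNet similarity and the largest over Dice similarity, which gives the two ends of the quoted range.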
  • 49. Few interlinks between datasets.
      – Links evolve over time.
      – Look for other possible types of ECR (entity co-reference) links, e.g., rdfs:seeAlso.
    Properties are not uniformly distributed in a dataset.
      – Hence, some properties do not have enough matches or appearances.
      – This is because the classes and domains they belong to are rare.
      – We can iteratively run the algorithm on the instances in which these less frequent properties appear.
    Current limitations:
      – Requires ECR links
      – Requires overlapping datasets
      – Limited to object-type properties
      – Cannot identify property and sub-property relationships
  • 54. Conclusion
      – We approximate owl:equivalentProperty with Statistical Equivalence of properties by analyzing property extensions, which makes the approach schema independent.
      – This novel extension based approach works well with interlinked datasets.
      – The extension based approach outperforms syntax or dictionary based approaches, with a gain in average F measure in the range of 57% to 78%.
      – It requires many comparisons, but it can be easily parallelized, as evidenced by our Map-Reduce implementation.
  • 55. Thank You. Questions?
    http://knoesis.wright.edu/researchers/kalpa
    kalpa@knoesis.org
    Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, Ohio, USA

Editor's Notes

  • #34: k is low for some datasets because they contain rarely appearing property pairs as well as popular ones. Because of this, a k value as low as 2 sometimes produces better results, though possibly with lower confidence.