Property Alignment on Linked Open Data

A Statistical and Schema Independent
Approach to Identify Equivalent
Properties on Linked Data
Kalpa Gunaratna†, Krishnaprasad Thirunarayan†, Prateek Jain‡, Amit Sheth†, and Sanjaya Wijeratne†
{kalpa,tkprasad,amit,sanjaya}@knoesis.org, jainpr@us.ibm.com
†Kno.e.sis Center, Wright State University, Dayton OH, USA
‡IBM T J Watson Research Center, Yorktown Heights, New York NY, USA
iSemantics 2013, Graz, Austria

09/06/2013 2
Motivation
Why do we need property alignment, and why is it so important?

Many datasets. We can query!
The same information appears under different names.
Therefore, data integration is required for better presentation.

Roadmap
Background
Statistical Equivalence of properties
Evaluation
Discussion, interesting facts, and future directions
Conclusion

Background
Existing techniques for property alignment fall into three categories.
I. Syntactic/dictionary based
– Uses string manipulation techniques, external dictionaries, and lexical databases like WordNet.
II. Schema dependent
– Uses schema information such as domain and range, and definitions.
III. Schema independent
– Uses instance level information for the alignment.
Our approach is schema independent.

Properties capture the meaning of triples and hence are complex in nature.
Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist.
Therefore, syntactic or dictionary based approaches have limited coverage in property alignment.
Schema dependent approaches, which process domain and range or class level tags, do not capture the semantics of properties well.

Statistical Equivalence of properties
Statistical Equivalence is based on analyzing owl:equivalentProperty.
owl:equivalentProperty relates properties that have the same property extension.
Example 1:
Property P is defined by the triples { a P b, c P d, e P f }
Property Q is defined by the triples { a Q b, c Q d, e Q f }
P and Q are owl:equivalentProperty because they have the same extension,
{ (a,b), (c,d), (e,f) }
Example 2:
With P as in Example 1, property Q is defined by the triples { a Q b, c Q d, e Q h }
Then P and Q are not owl:equivalentProperty, because their extensions are not the
same. But the pairs they share provide statistical evidence in support of equivalence.
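The two examples can be checked mechanically. The following is a minimal sketch, assuming triples are plain (subject, property, object) tuples and that both properties use the same resource names, as in the examples; the function name `extension` and the labels `Q1` and `Q2` (the Q of Example 1 and of Example 2) are illustrative, not taken from the slides.

```python
def extension(triples, prop):
    """Return the extension of prop: the set of (subject, object) pairs it relates."""
    return {(s, o) for s, p, o in triples if p == prop}

triples = [
    ("a", "P", "b"), ("c", "P", "d"), ("e", "P", "f"),
    ("a", "Q1", "b"), ("c", "Q1", "d"), ("e", "Q1", "f"),  # Q from Example 1
    ("a", "Q2", "b"), ("c", "Q2", "d"), ("e", "Q2", "h"),  # Q from Example 2
]

ext_p = extension(triples, "P")
ext_q1 = extension(triples, "Q1")
ext_q2 = extension(triples, "Q2")

same_extension_1 = ext_p == ext_q1   # identical extensions (Example 1)
same_extension_2 = ext_p == ext_q2   # extensions differ in one pair (Example 2)
shared_pairs_2 = ext_p & ext_q2      # the pairs that still agree in Example 2
```

In Example 2 the extensions are unequal, yet two of the three pairs coincide; that overlap is the kind of evidence the statistical approach counts.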

Intuition
A higher rate of subject-object matches between extensions suggests that two properties are
equivalent. In practice, matching properties rarely have exactly the same
extension, because
– Datasets are incomplete.
– The same instance may be modelled differently in different datasets.
Therefore, we analyze the property extensions to identify equivalent
properties between datasets.
We define the following notions. Let the statement below hold
for all the definitions.
Let S1 P1 O1 and S2 P2 O2 be two triples in datasets D1 and D2 respectively.



Candidate Matching Algorithm Process
[Figure: instances of Dataset 1 and Dataset 2, their triples (triple 1 to triple 5), and the owl:sameAs links between them]
Step 1: Instance I1 = d1:Willis_Lamb in Dataset 1 is linked by owl:sameAs to I2 = d2:willis_lamb in Dataset 2.
Step 2: The triples of I1 and of I2 are retrieved, including I1 P1 d1:Theodore_Maiman with P1 = d1:doctoralStudent and I2 P2 d2:theodore_harold_maiman with P2 = d2:education.academic.advisees.
Step 3: The objects d1:Theodore_Maiman and d2:theodore_harold_maiman are matching resources (owl:sameAs), so property P1 and property P2 are a candidate match.
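The three steps can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: triples are (subject, property, object) tuples, owl:sameAs links are given as (Dataset 1 resource, Dataset 2 resource) pairs, and the function name `candidate_matches` is hypothetical. The `d1:birthPlace` triple is an illustrative extra that has no linked object and so yields no candidate.

```python
from collections import Counter, defaultdict

def candidate_matches(d1_triples, d2_triples, same_as_links):
    """Count matching triple pairs for each (P1, P2) property pair."""
    links = defaultdict(set)           # Dataset 1 resource -> linked Dataset 2 resources
    for r1, r2 in same_as_links:
        links[r1].add(r2)

    d2_by_subject = defaultdict(list)  # Dataset 2 subject -> its (property, object) pairs
    for s2, p2, o2 in d2_triples:
        d2_by_subject[s2].append((p2, o2))

    counts = Counter()
    for s1, p1, o1 in d1_triples:
        for s2 in links.get(s1, ()):                  # Step 1: follow owl:sameAs from the subject
            for p2, o2 in d2_by_subject.get(s2, ()):  # Step 2: triples of the linked subject
                if o2 in links.get(o1, ()):           # Step 3: objects are matching resources
                    counts[(p1, p2)] += 1
    return counts

d1 = [
    ("d1:Willis_Lamb", "d1:doctoralStudent", "d1:Theodore_Maiman"),
    ("d1:Willis_Lamb", "d1:birthPlace", "d1:Los_Angeles"),  # illustrative, object not linked
]
d2 = [
    ("d2:willis_lamb", "d2:education.academic.advisees", "d2:theodore_harold_maiman"),
]
same_as = [
    ("d1:Willis_Lamb", "d2:willis_lamb"),
    ("d1:Theodore_Maiman", "d2:theodore_harold_maiman"),
]

matches = candidate_matches(d1, d2, same_as)
```

Indexing Dataset 2 by subject keeps the inner loops bounded by the number of properties and objects per entity rather than by the dataset size.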

Complexity:
If the average number of properties per entity is x and the average number of objects
per property is j, then n subjects require n*j^2*x^2 + 2n comparisons. Since n > j,
n > x, and x and j are independent of n, the complexity is O(n).
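The linear growth can be checked numerically. A minimal sketch, assuming the comparison count is n*j^2*x^2 + 2n as stated; the values x = 10 and j = 2 are hypothetical and chosen only for illustration.

```python
def comparisons(n, x, j):
    """Comparison count from the complexity formula: n * j^2 * x^2 + 2n."""
    return n * j**2 * x**2 + 2 * n

# Hypothetical averages: 10 properties per entity, 2 objects per property.
small = comparisons(1_000, x=10, j=2)
large = comparisons(2_000, x=10, j=2)
```

With x and j held fixed, doubling the number of subjects doubles the number of comparisons, which is what O(n) predicts.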



Evaluation
Objectives of the evaluation
– Show the effectiveness of the approach on linked datasets
– Compare with existing alignment techniques

We selected samples of 5000 instances from the
DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB
Explorer datasets.
These datasets have
– Complete data for instances from different viewpoints
– Many inter-links
– Complex properties

Experiment details
– α = 0.5 for all experiments (works for LOD), except the DBpedia and
Freebase movie alignment, where it was 0.7.
– k was set to 14, 6, 2, 2, and 2 respectively for Person, Film, and
Software between DBpedia and Freebase, Film between
LinkedMDB and DBpedia, and Article between the DBLP datasets.
– k can be estimated from the data as follows:
– Set α = 0.5 and k = 2 (lowest positive values).
– Get the property pairs with exactly matching names that the algorithm
did not identify, and their μ values.
– Take the average of those μ values.
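The last two steps of the estimation amount to a simple average. A minimal sketch, assuming μ is a numeric value the algorithm reports per property pair; the function name `estimate_k`, the property names, and the μ values below are all hypothetical placeholders, not results from the experiments.

```python
from statistics import mean

def estimate_k(missed_pairs):
    """Average the mu values of same-name property pairs the algorithm missed.

    missed_pairs maps a (property1, property2) pair to its mu value.
    """
    return mean(missed_pairs.values())

# Hypothetical mu values for same-name pairs missed with alpha = 0.5 and k = 2.
missed = {
    ("d1:propA", "d2:propA"): 12,
    ("d1:propB", "d2:propB"): 16,
    ("d1:propC", "d2:propC"): 14,
}

k_estimate = estimate_k(missed)
```

Whether the average should be rounded before being used as k is not stated in the slides, so the sketch returns it unrounded.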
– Similarity thresholds for the compared techniques: 0.92 for string similarity algorithms and 0.8 for WordNet similarity.

Alignment results
Algorithm         Measure    DBpedia –  DBpedia –  DBpedia –   DBpedia –  DBLP_RKB –  Average
                             Freebase   Freebase   Freebase    LinkedMDB  DBLP_L3S
                             (Person)   (Film)     (Software)  (Film)     (Article)
Extension Based   Precision  0.8758     0.9737     0.6478      0.7560     1.0000      0.8427
Algorithm         Recall     0.8089*    0.5138     0.4339      0.8157     1.0000      0.7145
                  F measure  0.8410*    0.6727     0.5197      0.7848     1.0000      0.7656
WordNet           Precision  0.5200     0.8620     0.7619      0.8823     1.0000      0.8052
Similarity        Recall     0.4140*    0.3472     0.3018      0.3947     0.3333      0.3582
                  F measure  0.4609*    0.4950     0.4324      0.5454     0.5000      0.4867
Dice              Precision  0.8064     0.9666     0.7659      1.0000     0.0000      0.7078
Similarity        Recall     0.4777*    0.4027     0.3396      0.3421     0.0000      0.3124
                  F measure  0.6000*    0.5686     0.4705      0.5098     0.0000      0.4298
Jaro              Precision  0.6774     0.8809     0.7755      0.9411     0.0000      0.6550
Similarity        Recall     0.5350*    0.5138     0.3584      0.4210     0.0000      0.3656
                  F measure  0.5978*    0.6491     0.4903      0.5818     0.0000      0.4638
* marks estimated values for experiment 1 (DBpedia – Freebase, Person), because there were too many comparisons to check manually. Boldface
marks the highest result for each experiment.
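The F measure rows are consistent with the usual harmonic mean of precision and recall, and the 57% to 78% gain quoted in the discussion is consistent with the relative gain of the average F measure over the WordNet and Dice baselines. A minimal sketch, assuming those two standard definitions; the function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_gain(ours, baseline):
    """Relative improvement of one score over a baseline score."""
    return (ours - baseline) / baseline

# Extension based algorithm, DBpedia and Freebase (Person) column.
f_person = f_measure(0.8758, 0.8089)

# Average F measure of the extension based algorithm against two baselines.
gain_over_wordnet = relative_gain(0.7656, 0.4867)
gain_over_dice = relative_gain(0.7656, 0.4298)
```

The zero check covers the DBLP column, where the Dice and Jaro baselines have both precision and recall equal to 0.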

Example identifications
Property pair type                Dataset 1 (DBpedia)  Dataset 2 (Freebase)
Simple string similarity matches  db:nationality       fb:nationality
                                  db:religion          fb:religion
Synonymous matches                db:occupation        fb:profession
                                  db:battles           fb:participated_in_conflicts
Complex matches                   db:screenplay        fb:written_by
                                  db:doctoralStudent   fb:advisees
WordNet similarity failed to identify any of these.

Discussion, interesting facts, and future directions
Our experiments covered multi-domain to multi-domain, multi-domain
to specific-domain, and specific-domain to specific-domain
dataset property alignment.
In every experiment, the extension based algorithm outperformed
the others on F measure. The F measure gain is in the range of 57% to 78%.
Some identified property pairs are intentionally
different, e.g., db:distributor vs fb:production_companies.
– This is because many companies both produce and distribute their
films.
Some identified pairs are incorrect due to errors in data modeling.
– For example, db:issue and fb:children.
owl:sameAs linking issues in LOD (links between things that are not exactly
the same), e.g., linking London and Greater London.
– We believe a few misused links won't affect the algorithm, as it decides
on a match only after analyzing many matches for a pair.
Few interlinks.
– Interlinks evolve over time.
– Look for other possible types of ECR links (e.g., rdfs:seeAlso).
Properties are not uniformly distributed in a dataset.
– Hence, some properties do not have enough matches or appearances.
– This is due to the rare classes and domains they belong to.
– We can run the algorithm iteratively on the instances in which these
less frequent properties appear.
Current limitations
– Requires ECR links
– Requires overlapping datasets
– Object-type properties only
– Inability to identify property – sub property relationships

Conclusion
We approximate owl:equivalentProperty using Statistical
Equivalence of properties by analyzing property
extensions, which is schema independent.
This novel extension based approach works well with
interlinked datasets.
The extension based approach outperforms syntax or
dictionary based approaches. The F measure gain is in the range of
57% to 78%.
It requires many comparisons, but can be easily parallelized,
as evidenced by our Map-Reduce implementation.

Thank You
Questions?
http://knoesis.wright.edu/researchers/kalpa
kalpa@knoesis.org
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
