Data-driven Joint Debugging of the DBpedia Mappings and Ontology

06/01/17 Heiko Paulheim 1
Data-driven Joint Debugging
of the DBpedia Mappings and Ontology
Towards Addressing the Causes
instead of the Symptoms of Data Quality in DBpedia
Heiko Paulheim

Motivation
• Various works on finding errors in Knowledge Graphs
– 2017 survey: 17 approaches
– 15/17 are evaluated on DBpedia
• Question:
– How does DBpedia benefit
from those works?
H. Paulheim: Knowledge Graph Refinement – A Survey
of Approaches and Evaluation Methods. SWJ 8(3), 2017

Motivation
• What comes out of those research works
– A list of (possibly) wrong statements
– Source code for finding erroneous statements
– ...

Motivation
• Possible option 1: Remove erroneous triples from DBpedia
• Challenges
– May remove correct axioms, may need thresholding
– Needs to be repeated for each release
– Needs to be materialized on all of DBpedia
[Figure: Wikipedia and the DBpedia Mappings Wiki feed the DBpedia Extraction Framework; a post filter is applied to the resulting DBpedia]

Motivation
• Materialized on full DBpedia: 8/15 approaches

Motivation
• Possible option 2: Integrate into DBpedia Extraction Framework
• Challenges
– Development workload
– Some approaches are not fully automated (technically or conceptually)
– Scalability
[Figure: Wikipedia feeds the DBpedia Extraction Framework plus filter module]

Motivation
• Scalability analyzed: 6/15
Disclaimer: does not imply
that it is actually scalable!

Motivation
• Do we have a third option?
– Paulheim & Gangemi (2015): >95% of all inconsistencies in DBpedia
boil down to 40 common root causes
[Figure: pipeline with the components Wikipedia, DBpedia Extraction Framework, Inconsistency Detection, and Identification of suspicious mappings and ontology constructs]
H. Paulheim, A. Gangemi: Serving DBpedia with DOLCE – More than Just Adding a Cherry on Top (ISWC 2015)
Disclaimer: not equivalent to
“wrong statements”

Approach
[Figure: example graph. Nodes: dbr:Agua_Caliente_Airport, dbr:San_Diego_County,_California, “Agua Caliente Airport”, dbo:Airport, dbo:Infrastructure, dbo:ArchitecturalStructure, dbo:Place, dbo:Settlement, dbo:PopulatedPlace, dbo:Organisation, dbo:Agent. Edge labels: dbo:operator, foaf:name, rdf:type, rdfs:range, owl:disjointWith. Callout: “Obama-free example!”]

Approach
• Find inconsistencies in extracted statements
– Using the DBpedia ontology with DOLCE as top-level ontology
• Trace them back to mappings
– In the example, there are three candidates
• Property mapping to the predicate dbo:operator
• Class mapping (subject) to dbo:Airport
• Class mapping (object) to dbo:Settlement
• Unfortunately, provenance information for DBpedia
is not that fine-grained
– i.e., we do not know which mapping was responsible for which
statement in the end
– first step: heuristic reconstruction
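The tracing step above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: `MappingElement` and `candidate_elements` are hypothetical names, and the sketch only enumerates the three kinds of candidates listed above for one inconsistent statement.

```python
from typing import NamedTuple


class MappingElement(NamedTuple):
    """Hypothetical identifier for one element of the mappings."""
    kind: str    # "property", "subject_class" or "object_class"
    target: str  # the ontology term the mapping points to


def candidate_elements(predicate, subject_types, object_types):
    """List every mapping element that may have caused an inconsistency.

    Without fine-grained provenance, each asserted type of the subject
    and of the object is treated as a possible culprit.
    """
    candidates = [MappingElement("property", predicate)]
    candidates += [MappingElement("subject_class", t) for t in subject_types]
    candidates += [MappingElement("object_class", t) for t in object_types]
    return candidates


# The introductory example: one property and two class mappings.
example = candidate_elements(
    "dbo:operator", ["dbo:Airport"], ["dbo:Settlement"]
)
for element in example:
    print(element.kind, element.target)
```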

Approach: Identifying Mapping Elements
• We use the RML representation of the Mapping Wiki contents [1]
https://www.w3.org/TR/r2rml/
[Figure: mapping example with the labels Wikipedia Page, DBpedia Resource, DBpedia Ontology Class, and DBpedia Ontology Property]
[1] Dimou et al.: DBpedia Mappings Quality Assessment (ISWC Poster 2016)

Approach (ctd.)
• After we heuristically reconstructed the mappings, we can determine
– How often is a mapping element involved in an inconsistency?
– How often is a mapping element used, but not involved in an
inconsistency?
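The two questions above amount to keeping two counters per mapping element. The sketch below assumes a hypothetical input format (one record per statement with its reconstructed mapping elements and an inconsistency flag); the element identifiers and the sample data are made up for illustration.

```python
from collections import Counter

# Hypothetical input: one record per extracted statement, giving the
# mapping elements assumed to have produced it and whether the statement
# took part in a detected inconsistency.
statements = [
    ({"prop:dbo:operator", "class:dbo:Airport"}, True),
    ({"prop:dbo:operator", "class:dbo:Company"}, False),
    ({"class:dbo:Airport"}, False),
]

inconsistent = Counter()  # uses of an element involved in an inconsistency
consistent = Counter()    # uses of an element not involved in any

for elements, is_inconsistent in statements:
    target = inconsistent if is_inconsistent else consistent
    target.update(elements)  # adds 1 for every element of the statement

for element in sorted(set(inconsistent) | set(consistent)):
    print(element, inconsistent[element], consistent[element])
```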

Approach (ctd.)
• Using the two counters c_m and i_m, we can compute two scores
for the hypothesis that m is problematic
• Borrowed from Association Rule Mining (support and confidence):
• N is the total number of statements in DBpedia
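Assuming i_m counts the statements in which m is involved in an inconsistency and c_m counts those in which m is used without one, the standard association-rule definitions would read as follows. This is a reconstruction under that assumption, not a formula quoted from the slide.

```latex
\[
  \mathrm{support}(m) = \frac{i_m}{N},
  \qquad
  \mathrm{confidence}(m) = \frac{i_m}{i_m + c_m}
\]
```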

Identifying Interesting Problems
• Hypothesis: high support and high confidence mapping elements
hint at problems worth investigating
– High support: fixing the issue would fix a lot of individual statements
– High confidence: this mapping element actually hints at the root cause
• i.e., fixing this does not break many other things
• Unfortunately, both come at different scales
– Difficult to use average, harmonic mean or the like
– Support: μ = 0.0002, σ = 0.003
– Confidence: μ = 0.114, σ = 0.260
• Fix: use logarithmic support instead
– LogSupport: μ = 0.179, σ = 0.139

Identifying Interesting Problems (ctd.)
• Inspect mappings that have a high harmonic mean of
confidence and log support
[Figure: harmonic mean of confidence and log support, marked at 0.25, 0.5, and 0.75; higher values are “more interesting”]
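A minimal sketch of such a score follows. The slides do not spell out how log support is normalised, so the form log(i + 1) / log(N + 1) used here is an assumption, as are the function names and the confidence definition i / (i + c), where i and c are the inconsistent and consistent use counts of a mapping element.

```python
import math


def confidence(i_m, c_m):
    """Share of an element's uses that are involved in an inconsistency."""
    used = i_m + c_m
    return i_m / used if used else 0.0


def log_support(i_m, n):
    """Assumed normalisation: maps 0..n onto 0..1 on a log scale (n >= 1)."""
    return math.log(i_m + 1) / math.log(n + 1)


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) else 0.0


def score(i_m, c_m, n):
    """Harmonic mean of confidence and (assumed) log support."""
    return harmonic_mean(confidence(i_m, c_m), log_support(i_m, n))


# Two made-up elements with equal confidence but different frequency.
n_statements = 10**6
frequent = score(900, 100, n_statements)
rare = score(9, 1, n_statements)
print(frequent > rare)
```

With equal confidence, the more frequently inconsistent element ranks higher, which matches the intent that fixing it would repair more statements.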

Example Findings
• Case 1: Mapping to wrong property
• Example:
– branch in infobox military unit
is mapped to dbo:militaryBranch
• but dbo:militaryBranch
has dbo:Person as its domain
– correction: dbo:commandStructure
– Overall score: 0.721
– Affects 12,172 statements
(31% of all dbo:militaryBranch)

Example Findings
• Case 2: Mappings that should be removed
• Example:
– dbo:picture
– Most of the statements are inconsistent (64.5% places, 23.0% persons)
– Reason: statements are extracted from the picture caption
dbr:Brixton_Academy dbo:picture dbr:Brixton .
dbr:Justify_My_Love dbo:picture dbr:Madonna_(entertainer) .

Example Findings
• Case 3: Ontology problems (domain/range)
• Example 1:
– Populated places (e.g., cities) are used both as place and organization
– For some properties, the range is either one of the two
• e.g., dbo:operator (see introductory example)
– Polysemy should be reflected in the ontology
• Example 2:
– dbo:architect, dbo:designer, dbo:engineer etc.
have dbo:Person as their range
– Significant fractions (8.6%, 7.6%, 58.4%, resp.)
have a dbo:Organisation as object
– Range should be broadened

Example Findings
• Case 4: Missing properties
• Example 1:
– dbo:president links an organization to its president
– Majority use (8,354, or 76.2%):
link a person to the president s/he served for
• Example 2:
– dbo:instrument links an artist
to the instrument s/he plays
– Prominent alternative use (3,828, or 7.2%):
links a genre to its characteristic instrument
Obama example alert!

Future Work
• Classify ontology, mapping, and other errors automatically
– Currently ongoing: using different language editions of DBpedia
• Heuristic:
– Problem present in many languages → ontology problem
– Problem present only in one language → mapping problem
• From post-processing to live processing
– e.g., on-the-fly validation in DBpedia Mappings Wiki
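The language-edition heuristic above could be sketched like this. The function name, the threshold for "many" languages, and the "undecided" outcome are all assumptions made for illustration; the slides only state the two ends of the heuristic.

```python
def classify_problem(editions_with_problem, min_editions_for_ontology=3):
    """Guess the error source from the language editions showing a problem.

    min_editions_for_ontology is a hypothetical cut-off for "many".
    """
    count = len(set(editions_with_problem))
    if count == 0:
        return "no problem observed"
    if count == 1:
        return "mapping problem"
    if count >= min_editions_for_ontology:
        return "ontology problem"
    return "undecided"


print(classify_problem(["en"]))
print(classify_problem(["en", "de", "fr", "it"]))
```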

Take Aways
• Fixing bugs in knowledge graphs is nice
– But often a one-time solution
– Preserving the efforts is hard
• Proposed solution
– Identify and address the root problem
– Scoring mechanism helps
identify interesting problems
– Preserving the efforts by eliminating
the root causes
• Provenance matters!
– The more we know about how a statement
gets into a knowledge graph,
the better we can automate the error analysis
