Towards Knowledge Graph Profiling

Heiko Paulheim, 10/22/17

Introduction
• You’ve seen this, haven’t you?
Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae,
Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Introduction
• Knowledge Graphs on the Web
• Everybody talks about them, but what is a Knowledge Graph?
– I don’t have a definition either...

Introduction
• Knowledge Graph definitions
• Many people talk about KGs, few give definitions
• Working definition: a Knowledge Graph
– mainly describes instances and their relations in a graph
• Unlike an ontology
• Unlike, e.g., WordNet
– Defines possible classes and relations in a schema or ontology
• Unlike schema-free output of some IE tools
– Allows for interlinking arbitrary entities with each other
• Unlike a relational database
– Covers various domains
• Unlike, e.g., Geonames

Introduction
• Knowledge Graphs out there (not guaranteed to be complete)
[Figure: overview of public and private knowledge graphs]
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic Web 8:3 (2017), pp. 489-508

Motivation
• In the coming days, you’ll see quite a few works
– that use DBpedia for doing X
– that use Wikidata for doing Y
– ...
• If you see them, do you ever ask yourselves:
– Why DBpedia and not Wikidata?
(or the other way round?)

Motivation
• Questions:
– which knowledge graph should I use for which purpose?
– are there significant differences?
– would it help to combine them?
• For answering those questions, we need knowledge graph profiling
– making quantitative statements about knowledge graphs
– defining measures
– defining setups in which to measure them

Outline
• How are Knowledge Graphs created?
• Key objectives for profiling Knowledge Graphs
– Size
– Timeliness
– Level of detail
– Overlap
– ...
• Key figures for public Knowledge Graphs
• Common Shortcomings of Knowledge Graphs
– ...and how to address them
• New Kids on the Block

Knowledge Graph Creation: Cyc
• The beginning
– Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present
– >900 person years
– Far from completion
– Public version (OpenCyc) was available until 2017

Knowledge Graph Creation
• Lesson learned no. 1:
– Trading effort against accuracy
(spectrum: min. effort ↔ max. accuracy)

Knowledge Graph Creation: Freebase
• The 2000s
– Freebase: collaborative editing
– Schema not fixed
• Present
– Acquired by Google in 2010
– Powered first version of Google’s Knowledge Graph
– Shut down in 2016
– Partly lives on in Wikidata (see in a minute)

• Lesson learned:
– Trading formality against number of users
(spectrum: max. user involvement ↔ max. degree of formality)

Knowledge Graph Creation: Wikidata
• The 2010s
– Wikidata: launched 2012
– Goal: centralize data from Wikipedia languages
– Collaborative
– Imports other datasets
• Present
– One of the largest public knowledge graphs
(see later)
– Includes rich provenance

• Lesson learned:
– There is not one truth (but allowing for plurality adds complexity)
(spectrum: max. simplicity ↔ max. support for plurality)

Knowledge Graph Creation: DBpedia & YAGO
• The 2000s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia
using mappings & heuristics
• Present
– Two of the most used knowledge graphs

• Lesson learned:
– Heuristics help increase coverage (at the cost of accuracy)
(spectrum: max. accuracy ↔ max. coverage)

Knowledge Graph Creation: NELL
• The 2010s
– NELL: Never-Ending Language Learner
– Input: ontology, seed examples, text corpus
– Output: facts, text patterns
– Large degree of automation, occasional human feedback
• Today
– Still running
– New release every few days

• Lesson learned:
– Quality cannot be maximized without human intervention
(spectrum: min. human intervention ↔ max. accuracy)

Summary of Trade-Offs
• (Manual) effort vs. accuracy and completeness
• User involvement (or usability) vs. degree of formality
• Simplicity vs. support for plurality and provenance
→ all those decisions influence the profile of a knowledge graph!

Non-Public Knowledge Graphs
• Many companies have their own private knowledge graphs
– Google: Knowledge Graph, Knowledge Vault
– Yahoo!: Knowledge Graph
– Microsoft: Satori
– Facebook: Entities Graph
– Thomson Reuters: permid.org (partly public)
• However, we usually know very little about them

Comparison of Knowledge Graphs
• Release cycles
– Instant updates: DBpedia Live, Freebase, Wikidata
– Days: NELL
– Months: DBpedia
– Years: YAGO, Cyc
• Size and density
[Table not reproduced: size and density figures per knowledge graph, marked "Caution!" on the slide]
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017

• What do they actually contain?
• Experiment: pick 25 classes of interest
– And find them in respective ontologies
• Count instances (coverage)
• Determine in- and out-degree (level of detail)
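
The two measures above can be sketched on a toy graph. This is a minimal illustration, not the setup used in the cited study: the triple format, the `ex:` names, and the decision to leave type statements out of the degree counts are all assumptions made for the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def profile_class(triples, cls):
    """Return (instance count, avg. in-degree, avg. out-degree) for one class."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # assumption: type statements do not count as detail
        if s in instances:
            out_deg[s] += 1  # statement with the instance as subject
        if o in instances:
            in_deg[o] += 1   # statement with the instance as object
    n = len(instances)
    if n == 0:
        return 0, 0.0, 0.0
    return n, sum(in_deg.values()) / n, sum(out_deg.values()) / n

# Hypothetical toy data, not taken from any real knowledge graph
toy = [
    ("ex:Asimov", "rdf:type", "ex:Person"),
    ("ex:Clarke", "rdf:type", "ex:Person"),
    ("ex:Asimov", "ex:birthPlace", "ex:Petrovichi"),
    ("ex:Asimov", "ex:genre", "ex:SciFi"),
    ("ex:Foundation", "ex:author", "ex:Asimov"),
    ("ex:Clarke", "ex:genre", "ex:SciFi"),
]

print(profile_class(toy, "ex:Person"))  # (2, 0.5, 1.5)
```

The instance count stands for coverage, the two averages for level of detail. Comparing these numbers across graphs only makes sense once the class has been located in each graph's own ontology.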

• Summary findings:
– Persons: more in Wikidata
(twice as many persons as DBpedia and YAGO)
– Countries: more details in Wikidata
– Places: most in DBpedia
– Organizations: most in YAGO
– Events: most in YAGO
– Artistic works:
• Wikidata contains more movies and albums
• YAGO contains more songs

Caveats
• Reading the diagrams right…
• So, Wikidata contains more persons
– but fewer instances of all the interesting subclasses?
• There are classes like Actor in Wikidata
– but they are hardly used
– rather: modeled using profession relation

Caveats
• Reading the diagrams right… (ctd.)
• So, Wikidata contains more data on countries, but fewer countries?
• First: Wikidata only counts current, actual countries
– DBpedia and YAGO also count historical countries
• “KG1 contains fewer X than KG2” can mean
– it actually contains fewer instances of X
– it contains equally many or more instances,
but they are not typed with X (see later)
• Second: we count single facts about countries
– Wikidata records some time-indexed information, e.g., population
– Each point in time contributes a fact

Overlap of Knowledge Graphs
• To what extent do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
[Figure: interlinks between DBpedia, YAGO, Wikidata, NELL, and OpenCyc]

• Links between Knowledge Graphs are incomplete
– The Open World Assumption also holds for interlinks
• But we can estimate their number
• Approach:
– find link set automatically with different heuristics
– determine precision and recall on existing interlinks
– estimate actual number of links

• Idea:
– Let F be the link set found by a heuristic
– Let C be the (unknown) actual link set
• Precision P: Fraction of F which is actually correct
– i.e., measures how much |F| is over-estimating |C|
• Recall R: Fraction of C which is contained in F
– i.e., measures how much |F| is under-estimating |C|
• From that, we estimate $|C| = |F| \cdot P \cdot \frac{1}{R}$

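• Mathematical derivation:
– Definition of recall: $R = \frac{|F_{\text{correct}}|}{|C|}$
– Definition of precision: $P = \frac{|F_{\text{correct}}|}{|F|}$
• Resolve both to $|F_{\text{correct}}|$, substitute, and resolve to $|C|$:
$|F_{\text{correct}}| = R \cdot |C| = P \cdot |F| \;\Rightarrow\; |C| = |F| \cdot P \cdot \frac{1}{R}$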
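
A worked example may help here. The sketch below measures precision and recall of a found link set against existing interlinks and plugs them into the estimate. The function names and the toy link sets are hypothetical, and one detail is an assumption not stated on the slides: precision is judged only on found links whose source entity also has an existing interlink, since other links cannot be checked.

```python
def precision_recall(found, gold):
    """Precision and recall of a found link set, measured on existing interlinks.

    Assumption: a found link is only checkable if its source entity
    also occurs as a source in the existing (gold) interlinks.
    """
    gold_sources = {source for source, _ in gold}
    checkable = {link for link in found if link[0] in gold_sources}
    correct = checkable & gold
    precision = len(correct) / len(checkable) if checkable else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

def estimate_true_links(num_found, precision, recall):
    """Estimate |C| = |F| * P * 1/R."""
    if recall == 0:
        raise ValueError("recall must be positive to estimate |C|")
    return num_found * precision / recall

# Hypothetical toy data: six existing interlinks, eight links found by a heuristic
gold = {(f"a{i}", f"b{i}") for i in range(1, 7)}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "bX"),
         ("a7", "b7"), ("a8", "b8"), ("a9", "b9"), ("a10", "b10")}

p, r = precision_recall(found, gold)
print(p, r, estimate_true_links(len(found), p, r))  # 0.75 0.5 12.0
```

With P = 0.75 and R = 0.5, the eight found links translate into an estimated twelve actual links: a quarter of the found links are discounted as wrong, and the remainder is doubled because only half of the known links were recovered.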

• Experiment:
– We use the same 25 classes as before
– Measure 1: overlap relative to smaller KG (i.e., potential gain)
– Measure 2: overlap relative to explicit links
(i.e., importance of improving links)
• Link generation with 16 different metrics and thresholds
– Intra-class correlation coefficient for |C|: 0.969
– Intra-class correlation coefficient for |F|: 0.646
• Bottom line:
– Despite variety in link sets generated, the overlap is estimated reliably
– The link generation mechanisms do not need to be overly accurate

• Summary findings:
– DBpedia and YAGO cover roughly the same instances
(not very surprising)
– NELL is the most complementary to the others
– Existing interlinks are insufficient for out-of-the-box parallel usage

Common Shortcomings of Knowledge Graphs
• Knowledge Graph Profiling can reveal certain shortcomings
– ...and once we know them, we can address them
• What is the impact of that?
– Example: answering a SPARQL query

Finding Information in Knowledge Graphs
• Find list of science fiction writers in DBpedia
prefix dbo: <http://dbpedia.org/ontology/>
prefix dbr: <http://dbpedia.org/resource/>
select ?x where
{?x a dbo:Writer .
?x dbo:genre dbr:Science_Fiction}
order by ?x

Finding Information in Knowledge Graphs
• Results from DBpedia
Arthur C. Clarke?
H.G. Wells?
Isaac Asimov?

Common Shortcomings of Knowledge Graphs
• What reasons can cause incomplete results?
• Two possible problems:
– The resource at hand is not of type dbo:Writer
– The genre relation to dbr:Science_Fiction is missing
select ?x where
{?x a dbo:Writer .
?x dbo:genre dbr:Science_Fiction}
order by ?x
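
The two failure modes can be reproduced on a small in-memory graph. The sketch below mimics the query's two triple patterns with plain set operations. The `ex:` authors are hypothetical placeholders and say nothing about what DBpedia actually contains.

```python
def science_fiction_writers(triples):
    """Mimic: ?x a dbo:Writer . ?x dbo:genre dbr:Science_Fiction"""
    writers = {s for s, p, o in triples
               if p == "rdf:type" and o == "dbo:Writer"}
    scifi = {s for s, p, o in triples
             if p == "dbo:genre" and o == "dbr:Science_Fiction"}
    # Both patterns must match, so a gap in either one drops the resource
    return sorted(writers & scifi)

# Hypothetical toy graph
toy_graph = {
    # complete description: both patterns match
    ("ex:AuthorA", "rdf:type", "dbo:Writer"),
    ("ex:AuthorA", "dbo:genre", "dbr:Science_Fiction"),
    # problem 1: typed as dbo:Person only, dbo:Writer is missing
    ("ex:AuthorB", "rdf:type", "dbo:Person"),
    ("ex:AuthorB", "dbo:genre", "dbr:Science_Fiction"),
    # problem 2: the genre relation is missing
    ("ex:AuthorC", "rdf:type", "dbo:Writer"),
}

print(science_fiction_writers(toy_graph))  # ['ex:AuthorA']
```

Only one of the three authors is returned, although all three would belong in the answer. Completing the type information recovers the second author; completing the relation recovers the third.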

Common Shortcomings of Knowledge Graphs
• Various works on Knowledge Graph Refinement
– Knowledge Graph completion
– Error detection
• See, e.g., the 2017 survey in the Semantic Web Journal
Paulheim: Knowledge Graph Refinement – A Survey of Approaches and Evaluation Methods. SWJ 8(3), 2017
Tuesday, 4:30 pm: journal track paper presentation

New Kids on the Block
Subjective age: measured by the fraction of the audience that understands a reference to your young days’ pop culture...

New Kids on the Block
• Wikipedia-based Knowledge Graphs will remain
an essential building block of Semantic Web applications
• But they suffer from...
– ...a coverage bias
– ...limitations of the creating heuristics

Wikipedia’s Coverage Bias
• One (but not the only!) possible source of coverage bias
– Articles about long-tail entities get deleted

Work in Progress: DBkWik
• Why stop at Wikipedia?
• Wikipedia is based on the MediaWiki software
– ...and so are thousands of Wikis
– Fandom by Wikia: >385,000 Wikis on special topics
– WikiApiary: reports >20,000 installations of MediaWiki on the Web

• Back to our original example...

• The DBpedia Extraction Framework consumes MediaWiki dumps
• Experiment
– Can we process dumps from arbitrary Wikis with it?
– Are the results somewhat meaningful?

• Example from Harry Potter Wiki
http://dbkwik.webdatacommons.org/

• Differences to DBpedia
– DBpedia has manually created mappings to an ontology
– Wikipedia has one page per subject
– Wikipedia has global infobox conventions (more or less)
• Challenges
– On-the-fly ontology creation
– Instance matching
– Schema matching

[Figure: DBkWik pipeline with five numbered steps. Components: Dump Downloader (MediaWiki dumps), Extraction Framework (extracted RDF), Interlinking (instance matcher, schema matcher), Internal Linking (instance matcher, schema matcher), consolidated knowledge graph, DBkWik Linked Data endpoint]
• Avoiding O(n²) internal linking:
– Match to DBpedia first
– Use common links to DBpedia as blocking keys for internal matching
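
The blocking idea can be sketched as follows. This is an illustrative reading of the two bullets above, not the DBkWik implementation: the function name, the input format (a mapping from wiki instance to its matched DBpedia resource), and the example identifiers are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(dbpedia_links):
    """Group instances by their DBpedia link and pair them only within a group.

    dbpedia_links: dict mapping a wiki instance to the DBpedia resource
    it was matched to in the first step.
    """
    blocks = defaultdict(list)
    for instance, dbpedia_uri in dbpedia_links.items():
        blocks[dbpedia_uri].append(instance)  # the DBpedia link is the blocking key
    pairs = set()
    for members in blocks.values():
        # comparisons stay inside a block instead of spanning all n instances
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical instances from three different wikis
links = {
    "hp:Harry_Potter": "dbr:Harry_Potter_(character)",
    "lego:Harry_Potter": "dbr:Harry_Potter_(character)",
    "hp:Hogwarts": "dbr:Hogwarts",
    "sw:Yoda": "dbr:Yoda",
}

print(candidate_pairs(links))  # {('hp:Harry_Potter', 'lego:Harry_Potter')}
```

Four instances would give six pairs under exhaustive comparison, while blocking leaves a single candidate pair. The price is that instances without a DBpedia link never become candidates, which matters for long-tail entities.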

• Downloaded ~15k Wiki dumps from Fandom
– 52.4GB of data, roughly the size of the English Wikipedia
• Prototype: extracted data for ~250 Wikis
– 4.3M instances, ~750k linked to DBpedia
– 7k classes, ~1k linked to DBpedia
– 43k properties, ~20k linked to DBpedia
– ...including duplicates!
• Link quality
– Good for classes, OK for properties (F1 of .957 and .852)
– Needs improvement for instances (F1 of .641)
Monday, 6:30 pm: poster presentation

Work in Progress: WebIsALOD
• Background: Web table interpretation
• Most approaches need typing information
– DBpedia etc. have too little coverage on the long tail
– Wanted: extensive type database

• Extraction of type information using Hearst-like patterns, e.g.,
– T, such as X
– X, Y, and other T
• Text corpus: Common Crawl
– ~2 TB crawled web pages
– Fast implementation: regex over text
– “Expensive” operations only applied once regex has fired
• Resulting database
– 400M hypernymy relations
Seitner et al.: A large DataBase of hypernymy relations extracted from the Web.
LREC 2016
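
The two example patterns and the "cheap regex first" idea can be sketched as below. This is a heavily simplified illustration, not the extraction code behind the cited database: the regular expressions only handle single-word terms, and all names are made up for the example.

```python
import re

# Cheap trigger: fires on the lexical cue only
TRIGGER = re.compile(r"\bsuch as\b|\band other\b")

# Simplified patterns (assumption: single-word terms only)
SUCH_AS = re.compile(r"\b(\w+),? such as (\w+)")               # T, such as X
AND_OTHER = re.compile(r"\b(\w+), (\w+),? and other (\w+)")    # X, Y, and other T

def extract_hypernyms(sentence):
    """Return (hyponym, hypernym) pairs found in one sentence."""
    if not TRIGGER.search(sentence):
        return []  # no cue: skip the more expensive matching entirely
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        pairs.append((m.group(2), m.group(1)))
    for m in AND_OTHER.finditer(sentence):
        pairs.append((m.group(1), m.group(3)))
        pairs.append((m.group(2), m.group(3)))
    return pairs

print(extract_hypernyms("He likes bands, such as Bauhaus."))
# [('Bauhaus', 'bands')]
print(extract_hypernyms("Dogs, cats, and other animals are pets."))
# [('Dogs', 'animals'), ('cats', 'animals')]
```

A realistic extractor would need noun phrase detection and normalization after the trigger fires. Those are the kind of "expensive" operations that only pay off on the small share of sentences containing a cue.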

http://webisa.webdatacommons.org/

• Initial effort: transformation to a LOD dataset
– including rich provenance information
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted
from the Web as Linked Open Data. ISWC 2017

• Estimated contents breakdown

• Main challenge
– Original dataset is quite noisy (<10% correct statements)
– Recap: coverage vs. accuracy
– Simple thresholding removes too much knowledge
• Approach
– Train RandomForest model for predicting correct vs. wrong statements
– Using all the provenance information we have
– Use model to compute confidence scores

• Current challenges and works in progress
– Distinguishing instances and classes
• i.e., subclass-of vs. instance-of relations
– Splitting instances
• Bauhaus is a goth band
• Bauhaus is a German school
– Knowledge extraction from pre- and post-modifiers
• Bauhaus is a goth band → genre(Bauhaus, Goth)
• Bauhaus is a German school → location(Bauhaus, Germany)
Tuesday, 2:30 pm: resource track paper presentation

Takeaways
• Knowledge Graphs contain a massive amount of information
– Various trade-offs in their creation
– That lead to different profiles
– ...and different shortcomings
• Knowledge Graph Profiling
– What is in a knowledge graph?
– At which level of detail is it described?
– How different are knowledge graphs?
• Various methods exist for
– ...addressing the various shortcomings
• New kids on the block
– DBkWik and WebIsALOD
– Focus on long tail entities
