5/23/19 Heiko Paulheim 1
From Wikipedia to Thousands of Wikis
–
The DBkWik Knowledge Graph
Heiko Paulheim
A Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A Wikipedia Dump
(+ mappings)
• Output:
– DBpedia
An Even Higher Bird’s Eye View on DBpedia EF
• DBpedia Extraction Framework
• Input:
– A MediaWiki Dump
(+ mappings)
• Output:
– A Knowledge Graph
What if…?
• What if we applied the DBpedia EF to every MediaWiki?
• According to WikiApiary, there are thousands...
Why?
• More is better (maybe)
• Overcoming Wikipedia’s coverage bias
A Brief History of DBkWik
• Started as a student project in 2017
• Task: run DBpedia EF on a large Wiki Farm
– ...and see what happens
DBkWik vs. DBpedia
• Challenges
– Getting dumps: only a fraction of Fandom Wikis has dumps
– Downloadable from Fandom: 12,840 dumps
– Tried: auto-requesting dumps
Obtaining Dumps
• We had to change our strategy: WikiTeam software
– Produces dumps by crawling Wikis
– Fandom has not blocked us so far :-)
– Current collection: 307,466 Wikis
→ will go into DBkWik 1.2 release
DBkWik vs. DBpedia
• Mappings do not exist
– no central ontology
– i.e., only raw extraction possible
• Duplicates exist
– origin: pages about the same entity
in different Wikis
– unlike Wikipedia: often not explicitly linked
• Different configurations of MediaWiki
Absence of Mappings and Ontology
• Every infobox becomes a class:
{{infobox actor
→ mywiki:actor a owl:Class
• Every infobox key becomes a property
|role = Harry’s mother
→ mywiki:role a rdf:Property
• The resulting ontology is very shallow
– No class hierarchy
– No distinction of object and data properties
– No domains and ranges
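The raw extraction described above can be sketched in a few lines. This is a minimal illustration, not the DBpedia Extraction Framework itself: the function name `infobox_to_triples`, the page name "Example Actor", and the regular expression are assumptions made for this example, and the sketch only handles a single, non-nested infobox whose values contain no "|" characters.

```python
import re

def infobox_to_triples(wikitext, page, prefix="mywiki:"):
    """Turn one infobox into (subject, predicate, object) triples.

    Hypothetical helper: the infobox name becomes a class, every key
    becomes a property, and values are kept as plain strings.
    """
    match = re.search(r"\{\{\s*infobox\s+([^|}\n]+)(.*?)\}\}", wikitext,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []
    cls = prefix + match.group(1).strip().replace(" ", "_")
    subject = prefix + page.replace(" ", "_")
    triples = [(cls, "rdf:type", "owl:Class"),
               (subject, "rdf:type", cls)]
    # Everything after the infobox name is a "|key = value" list.
    for part in match.group(2).split("|")[1:]:
        if "=" not in part:
            continue
        key, value = (s.strip() for s in part.split("=", 1))
        if not key or not value:
            continue  # empty keys or values carry no information
        prop = prefix + key.replace(" ", "_")
        triples.append((prop, "rdf:type", "rdf:Property"))
        triples.append((subject, prop, value))
    return triples

# Made-up page; the role value is the one shown on the slide.
sample = "{{infobox actor\n|name = Example Actor\n|role = Harry's mother\n}}"
triples = infobox_to_triples(sample, "Example Actor")
```

Because every value stays a string and no hierarchy is derived, the output has exactly the shallow shape listed above: classes and properties, but no subclass axioms, domains, or ranges.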
Duplicates
• Collecting Data from a Multitude of Wikis
Representational Variety
• No conventions across Wikis (besides using MediaWiki syntax)
{{Person
|name = Trent Reznor
|image = TrentReznor.jpg
|caption = Reznor at the [[83rd Academy Awards]]
|nominations = 1
|wins = 1
|role = Composer
|birthdate = May 17, 1965
|birthloc = Mercer, Pennsylvania, USA}}
{{Infobox musician
| Name = Trent Reznor
| Birth_name = Michael Trent Reznor
| Born = May 17, [[1965]] (age 53)
| Origin = [[Mercer]],
[[Pennsylvania]], [[United States]]
...
}}
{{Infobox cast
|Name=Trent Reznor
|Image=
|ImageCaption=
|character=
|crew=
|Born={{d|May|17|1965}}{{-}}New Castle,
Pennsylvania, United States
...
}}
Data Fusion
Naive Data Fusion and Linking to DBpedia
• String similarity for schema matching (classes/properties)
• doc2vec similarity on original pages for instance matching
• Results
– Classes and properties work OK
– Instances are trickier
– Internal linking seems easier (maybe...)

F1 score     Internal linking   Linking to DBpedia
Classes      .979               .898
Properties   .836               .865
Instances    .879               .657
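A string-similarity schema matcher of the kind named above can be sketched as follows. The slides do not say which similarity measure or threshold was used, so the `difflib` ratio, the normalization step, and the 0.9 cut-off are assumptions chosen for illustration; the property names are taken from the infobox examples shown earlier. The doc2vec-based instance matching is not covered here.

```python
from difflib import SequenceMatcher

def normalize(label):
    # Ignore case, underscores and spaces before comparing labels.
    return label.lower().replace("_", "").replace(" ", "")

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized labels are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_labels(left, right, threshold=0.9):
    """Pair each left label with its most similar right label,
    keeping the pair only if it reaches the (assumed) threshold."""
    matches = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            matches.append((a, best))
    return matches

wiki_a = ["name", "image", "birthdate", "birthloc"]
wiki_b = ["Name", "Birth_name", "Born", "Origin"]
matches = match_labels(wiki_a, wiki_b)
```

With these labels only name/Name is matched; birthdate and Born describe the same fact but share too few characters, which illustrates why purely string-based matching misses non-trivial correspondences.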
Gold Standard DBkWik 1.1
• Schema alignment: manual
• Instance alignment: crowd-sourced
– Using 3x3 Wikis from 3 different topics
– Asking crowdworkers to identify similar pages
– Search was allowed and encouraged
• Crowdsourcing results
– High inter-rater agreement (Fleiss’ Kappa: 0.8762)
– Most mappings are trivial, though
• Possible bias in gold standard
– We pre-selected matching Wikis!
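For readers unfamiliar with the agreement measure quoted above, here is a minimal sketch of how Fleiss' kappa is computed. The rating matrix is toy data invented for this example (four candidate page pairs, three raters, two categories "match" and "no match"); it is not the crowdsourcing data behind the 0.8762 figure.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i into category j.

    Assumes every item was rated by the same number of raters and
    that the ratings are not all in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Agreement among the raters on each single item.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data: columns are (match, no match), one row per candidate pair.
toy_ratings = [[3, 0], [0, 3], [3, 0], [2, 1]]
kappa = fleiss_kappa(toy_ratings)  # 0.625 for this toy matrix
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.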
Results Data Fusion
• Uneven distribution
– e.g., character appears 5k times
• Currently: no multi-linguality
– e.g., Main Page, Hauptseite
• Probably overloaded fusion (false positives)
– e.g., next, location
Light-weight Schema Induction
• Class hierarchy and domain/range induction
– Using association rule mining
  ● e.g., Artist(x) → Person(x)
– 5k class subsumption axioms
– 59k domain restrictions
– 114k range restrictions
• Instance typing
– With a light-weight version of SDType
– Using the learned ranges as approximations
of actual distributions
• Result: ~100k new instance types
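The subsumption part of the induction step can be illustrated with a small association-rule sketch. It assumes that each instance carries a set of class labels, and it uses support and confidence thresholds picked for this example; the slides do not state the actual thresholds, and the function name and toy data are hypothetical. Domain/range induction and SDType are not shown.

```python
from collections import Counter
from itertools import permutations

def mine_subclass_rules(instance_types, min_support=2, min_confidence=0.9):
    """Return (sub, sup, confidence) rules meaning sub(x) -> sup(x)."""
    single = Counter()  # how many instances carry each class
    pair = Counter()    # how many instances carry both classes of a pair
    for types in instance_types.values():
        types = set(types)
        single.update(types)
        pair.update(permutations(types, 2))  # ordered pairs (sub, sup)
    rules = []
    for (sub, sup), both in pair.items():
        confidence = both / single[sub]
        if both >= min_support and confidence >= min_confidence:
            rules.append((sub, sup, confidence))
    return sorted(rules)

# Toy data: instance -> set of class labels.
toy_types = {
    "e1": {"Artist", "Person"},
    "e2": {"Artist", "Person"},
    "e3": {"Person"},
    "e4": {"Person"},
    "e5": {"Song"},
}
rules = mine_subclass_rules(toy_types)
```

On the toy data the only rule that survives is Artist(x) → Person(x): every Artist is also a Person, whereas only half of the Persons are Artists, so the reverse rule falls below the confidence threshold.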
Big Picture
[Figure: pipeline diagram. Components: Dump Downloader; DBpedia Extraction Framework; Interlinking (Instance Matcher, Schema Matcher); Internal Linking (Instance Matcher, Schema Matcher); Ontology and Knowledge Graph Fusion (Instance Matcher, Domain/Range, Type via SDType Light, Subclass, Materialization); DBkWik Linked Data Endpoint. Intermediate artifacts: MediaWiki Dumps, Extracted RDF, Consolidated Knowledge Graph.]
DBkWik 1.1
• Source: ~15k Wiki dumps from Fandom
– 52.4GB of data (roughly the size of the English Wikipedia)
Metric            Raw            Final
Instances         14,212,535     11,163,719
Typed instances   1,880,189      1,372,971
Triples           107,833,322    91,526,001
Avg. indegree     0.624          0.703
Avg. outdegree    7.506          8.169
Classes           71,580         12,029
Properties        506,487        128,566
• Fused graphs from 15k Wikis
http://dbkwik.webdatacommons.org/
DBkWik 1.1 vs. other Knowledge Graphs
• Caveat:
– Minus non-recognized duplicates!
DBkWik 1.1 vs. DBpedia
• How complementary are DBkWik and DBpedia?
• Challenge:
– We only have an incomplete and partly correct mapping M
– But: we know its precision P and recall R
• Trick (see KI paper 2017):
– O is the actual overlap (unknown), T ⊆ M is the true part of M (unknown)
• By definition:
– P = |T| / |M| → |T| = P * |M|
– R = |T| / |O| → |T| = R * |O|
– Equating both expressions for |T|: R * |O| = P * |M| → |O| = |M| * P / R
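The estimate |O| = |M| * P / R is easy to check numerically. The sketch below uses invented values for the mapping size, precision, and recall (the slides do not give them); only the formula itself comes from the derivation above, and the function name is hypothetical.

```python
def estimate_overlap(mapping_size, precision, recall):
    """Estimate the true overlap |O| from an imperfect mapping M.

    |T| = P * |M| and |T| = R * |O|, hence |O| = |M| * P / R.
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    return mapping_size * precision / recall

# Hypothetical numbers: 400,000 mapped pairs, 75% of them correct,
# covering half of the true overlap.
estimated = estimate_overlap(mapping_size=400_000, precision=0.75, recall=0.5)
# 300,000 correct pairs are half of the overlap, so |O| = 600,000.
```

Low precision shrinks the estimate and low recall inflates it, so the result is only as reliable as the measured P and R.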
• How complementary are DBkWik and DBpedia?
– |O| = |M| * P / R
– Overlap: ~500k instances
• In other words:
– 95% of all entities in DBkWik are not in DBpedia
– 90% of all entities in DBpedia are not in DBkWik
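The DBkWik side of these percentages can be reproduced from numbers given in the deck: the estimated overlap of roughly 500k instances and the final DBkWik 1.1 instance count of 11,163,719. The sketch below is plain arithmetic; since the overlap is itself an approximation, the result is indicative only.

```python
overlap = 500_000              # "~500k instances" (approximate)
dbkwik_instances = 11_163_719  # final instance count of DBkWik 1.1

# Share of DBkWik entities with no counterpart in DBpedia.
share_not_in_dbpedia = 1 - overlap / dbkwik_instances
print(f"{share_not_in_dbpedia:.1%} of DBkWik entities are not in DBpedia")
```

The result is a little above 95%, in line with the figure on the slide.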
Towards Improving Interlinking
• Strategy: ask the experts
– new Knowledge Graph track at OAEI 2018
– seven systems provided results
• Results:
– it is hard to beat the string baseline
– many matching systems rely on explicit, deep ontologies
  ● but we have just shallow schemas
• Possible reasons:
– the problem is too difficult?
– the gold standard is too trivial?
– the ontology lacks formality?
• Currently, embedding-based methods are on the rise
– e.g., Azmy et al.: “Matching Entities Across Different Knowledge Graphs with Graph Embeddings”, 2019
– these require large-scale training data
• Overcoming issues of the first gold standard
– include non-trivial matches
– include non-matches
• The new gold standard includes trivial and non-trivial matches
– i.e., task gets more demanding
• Low inter-rater agreement: Fleiss’ Kappa 0.02
• Exploiting Wiki Interlinks
== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]
[Figure labels: wiki 1, wiki 2, Kai, Meressa, Star Trek]
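Links such as the ones in the wikitext snippet above can be harvested with two regular expressions. This is a sketch under assumptions: the reading of `{{Wikipedia|target|label}}` as "target page, display label" is inferred from the snippet, template names differ between wikis, and the function and pattern names are hypothetical.

```python
import re

# {{Wikipedia|<target>|<label>}}: assumed parameter order, label optional.
WIKIPEDIA_TEMPLATE = re.compile(
    r"\{\{\s*Wikipedia\s*\|([^|}]*)(?:\|([^}]*))?\}\}")
# [[de:Kai]]: a two- or three-letter language prefix followed by a title.
LANGUAGE_LINK = re.compile(r"\[\[([a-z]{2,3}):([^\]|]+)\]\]")

def extract_interlinks(wikitext):
    """Collect Wikipedia template links and interlanguage links."""
    wikipedia = [(m.group(1).strip(), (m.group(2) or "").strip())
                 for m in WIKIPEDIA_TEMPLATE.finditer(wikitext)]
    languages = [(m.group(1), m.group(2).strip())
                 for m in LANGUAGE_LINK.finditer(wikitext)]
    return {"wikipedia": wikipedia, "languages": languages}

sample = """== External links ==
* {{mbeta}}
* {{Wikipedia|Bajoran#Kai|Kai}}
[[de:Kai]]
[[nl:Kai]]
[[pl:Kai]]"""
links = extract_interlinks(sample)
```

Two pages in different wikis that point to the same Wikipedia target, or to each other through such links, become match candidates without any string comparison.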
Towards DBkWik 1.2
• Current crawl: 307,466 Wikis
• Extraction: more robust for non-infobox templates
– e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists
• Robust abstract extraction
– using SWEBLE parser
– no local MediaWiki instance
• Better matching
• New gold standard
[Figure: NewNif Extractor. Components: Source, SimpleWikiParser, LinkExtractor, NifExtractor, Destination; intermediate representations: Page, AST, Graph, HTML.]
• What to expect?
– data from 307,466 wikis
– 38,985,266 articles
Further Open Challenges
• More detailed profiling
– e.g., do we reduce or increase bias?
• Task-based evaluation
– Does it improve, e.g., recommender systems?
• Fusion policies
– Identify outdated Wikis
Contributors
• DBkWik contributors (past, present, and future)
Sven Hertling, Alexandra Hofmann, Samresh Perchani, Jan Portisch
