Knowledge Graphs
Public Knowledge Graphs
Heiko Paulheim
10/17/22 Heiko Paulheim 2
Previously on “Knowledge Graphs”
• Principles:
– RDF, RDF-S, SPARQL & co
– Linked Open Data
• Today
– A closer look at existing knowledge graphs
– Some useful, large-scale resources
10/17/22 Heiko Paulheim 3
Introduction
• Knowledge Graphs out there (not guaranteed to be complete)
(Figure: overview of public and private knowledge graphs)
Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
10/17/22 Heiko Paulheim 4
Knowledge Graph Creation: Cyc
• The beginning
– Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Initial estimate: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
– Public version (OpenCyc) was available until 2017
10/17/22 Heiko Paulheim 5
Knowledge Graph Creation: Cyc
10/17/22 Heiko Paulheim 6
Knowledge Graph Creation
• Lesson learned no. 1:
– Trading effort against accuracy
Min. effort ↔ Max. accuracy
10/17/22 Heiko Paulheim 7
Knowledge Graph Creation: Freebase
• The 2000s
– Freebase: collaborative editing
– Schema not fixed
• Present
– Acquired by Google in 2010
– Powered first version of Google’s Knowledge Graph
– Shut down in 2016
– Partly lives on in Wikidata (see in a minute)
(coming up soon: was it a good deal or not?)
10/17/22 Heiko Paulheim 8
Knowledge Graph Creation: Freebase
• Community based
• Like Wikipedia, but more structured
10/17/22 Heiko Paulheim 9
Knowledge Graph Creation
• Lesson learned no. 2:
– Trading formality against number of users
Max. user involvement ↔ Max. degree of formality
10/17/22 Heiko Paulheim 10
Knowledge Graph Creation: Wikidata
• The 2010s
– Wikidata: launched 2012
– Goal: centralize data from the Wikipedia language editions
– Collaborative
– Imports other datasets
• Present
– One of the largest public knowledge graphs (see later)
– Includes rich provenance
10/17/22 Heiko Paulheim 11
Knowledge Graph Creation: Wikidata
• Collaborative editing
10/17/22 Heiko Paulheim 12
Knowledge Graph Creation: Wikidata
• Provenance
10/17/22 Heiko Paulheim 13
Wikidata
10/17/22 Heiko Paulheim 14
Knowledge Graph Creation
• Lesson learned no. 3:
– There is not one truth (but allowing for plurality adds complexity)
Max. simplicity ↔ Max. support for plurality
10/17/22 Heiko Paulheim 15
Knowledge Graph Creation: DBpedia & YAGO
• The late 2000s and 2010s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up
10/17/22 Heiko Paulheim 16
DBpedia
10/17/22 Heiko Paulheim 17
DBpedia
Lehmann et al.: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. 2014
10/17/22 Heiko Paulheim 18
DBpedia
10/17/22 Heiko Paulheim 19
DBpedia
10/17/22 Heiko Paulheim 20
YAGO
• Wikipedia categories for types
– Plus WordNet as upper structure
• Manual mappings for properties
https://www.cs.princeton.edu/courses/archive/spring07/cos226/assignments/wordnet.html
10/17/22 Heiko Paulheim 21
YAGO
10/17/22 Heiko Paulheim 22
YAGO
10/17/22 Heiko Paulheim 23
Knowledge Graph Creation
• Lesson learned no. 4:
– Heuristics help increase coverage (at the cost of accuracy)
Max. accuracy ↔ Max. coverage
10/17/22 Heiko Paulheim 24
Knowledge Graph Creation: NELL
• The 2010s
– NELL: Never-Ending Language Learner
– Input: ontology, seed examples, text corpus
– Output: facts, text patterns
– Large degree of automation, occasional human feedback
• Until 2018
– Continuously ran for ~8 years
– New release every few days
http://rtw.ml.cmu.edu/rtw/overview
10/17/22 Heiko Paulheim 25
Knowledge Graph Creation: NELL
• Extraction of a Knowledge Graph from a Text Corpus
– Text corpus:
• “Nine Inch Nails singer Trent Reznor, born 1965”
• “...as stated by Filter singer Richard Patrick”
• “...says Slipknot singer Corey Taylor, 44, in the interview.”
– Patterns:
• “X singer Y” → band_member(X,Y)
– Facts:
• band_member(Nine_Inch_Nails, Trent_Reznor)
• band_member(Filter,Richard_Patrick)
• band_member(Slipknot,Corey_Taylor)
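The step from text patterns to facts can be sketched in a few lines. This is only an illustration, not NELL's actual implementation: the regular expression, the helper name `extract_band_members`, and the simplification that entity names are runs of capitalized words are all assumptions made for this example.

```python
import re

# Assumption for this sketch: an entity name is a run of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
# The textual pattern "X singer Y" from the slide, read as band_member(X, Y).
PATTERN = re.compile(rf"({NAME}) singer ({NAME})")

def extract_band_members(text):
    """Return (band, member) tuples for every 'X singer Y' match in text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

corpus = [
    "Nine Inch Nails singer Trent Reznor, born 1965",
    "...as stated by Filter singer Richard Patrick",
    "...says Slipknot singer Corey Taylor, 44, in the interview.",
]

facts = [fact for sentence in corpus for fact in extract_band_members(sentence)]
for band, member in facts:
    print(f"band_member({band.replace(' ', '_')}, {member.replace(' ', '_')})")
```

A real system would also have to learn new patterns from known facts and filter noisy matches, which this sketch leaves out.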
10/17/22 Heiko Paulheim 26
Knowledge Graph Creation: NELL
10/17/22 Heiko Paulheim 27
Knowledge Graph Creation
• Lesson learned no. 5:
– Quality cannot be maximized without human intervention
Min. human intervention ↔ Max. accuracy
10/17/22 Heiko Paulheim 28
Summary of Trade Offs
• (Manual) effort vs. accuracy and completeness
• User involvement (or usability) vs. degree of formality
• Simplicity vs. support for plurality and provenance
→ all those decisions influence the shape of a knowledge graph!
10/17/22 Heiko Paulheim 29
Non-Public Knowledge Graphs
• Many companies have their own private knowledge graphs
– Google: Knowledge Graph, Knowledge Vault
– Yahoo!: Knowledge Graph
– Microsoft: Satori
– Facebook: Entities Graph
– Thomson Reuters: permid.org (partly public)
• However, we usually know very little about them
See: Noy et al. (2019): Industry-scale Knowledge Graphs: Lessons and Challenges: Five diverse technology companies show how it’s done
10/17/22 Heiko Paulheim 30
Comparison of Knowledge Graphs
• Release cycles
– Instant updates: DBpedia Live, Freebase, Wikidata
– Days: NELL
– Months: DBpedia
– Years: YAGO, Cyc
• Size and density
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
Caution!
10/17/22 Heiko Paulheim 31
Comparison of Knowledge Graphs
• What do they actually contain?
• Experiment: pick 25 classes of interest
– And find them in respective ontologies
• Count instances (coverage)
• Determine in- and out-degree (level of detail)
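The two measurements can be sketched on a toy triple list as follows. The function name `class_profile`, the toy data, and the choice to leave type statements out of the degree counts are assumptions of this sketch; the study may have counted differently.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def class_profile(triples, cls):
    """Return (instance count, average in-degree, average out-degree) for cls.

    triples: list of (subject, predicate, object) strings.
    Assumption of this sketch: rdf:type statements do not count towards degrees.
    """
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        if s in instances:
            out_deg[s] += 1
        if o in instances:
            in_deg[o] += 1
    n = len(instances)
    if n == 0:
        return 0, 0.0, 0.0
    return n, sum(in_deg.values()) / n, sum(out_deg.values()) / n

# Toy data, for illustration only.
toy = [
    (":Nine_Inch_Nails", RDF_TYPE, ":Band"),
    (":Filter", RDF_TYPE, ":Band"),
    (":Nine_Inch_Nails", ":genre", ":Industrial"),
    (":Nine_Inch_Nails", ":hometown", ":United_States"),
    (":Trent_Reznor", ":memberOf", ":Nine_Inch_Nails"),
    (":Richard_Patrick", ":memberOf", ":Filter"),
]
profile = class_profile(toy, ":Band")
print(profile)
```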
10/17/22 Heiko Paulheim 32
Comparison of Knowledge Graphs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 33
Comparison of Knowledge Graphs
• Summary findings:
– Persons: more in Wikidata (twice as many persons as DBpedia and YAGO)
– Countries: more details in Wikidata
– Places: most in DBpedia
– Organizations: most in YAGO
– Events: most in YAGO
– Artistic works:
• Wikidata contains more movies and albums
• YAGO contains more songs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 34
Caveats
• Reading the diagrams right…
• So, Wikidata contains more persons
– but fewer instances of all the interesting subclasses?
• There are classes like Actor in Wikidata
– but they are hardly used
– rather: modeled using the profession relation
10/17/22 Heiko Paulheim 35
Caveats
• Reading the diagrams right… (ctd.)
• So, Wikidata contains more data on countries, but fewer countries?
• First: Wikidata only counts current, actual countries
– DBpedia and YAGO also count historical countries
• “KG1 contains fewer X than KG2” can mean
– it actually contains fewer instances of X
– it contains equally many or more instances, but they are not typed with X (see later)
• Second: we count single facts about countries
– Wikidata records some time-indexed information, e.g., population
– Each point in time contributes a fact
10/17/22 Heiko Paulheim 36
Overlap of Knowledge Graphs
• How much do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
(Figure: interlinks between DBpedia, YAGO, Wikidata, NELL, and OpenCyc)
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 37
Overlap of Knowledge Graphs
• How much do knowledge graphs overlap?
• They are interlinked, so we can simply count links
– For NELL, we use links to Wikipedia as a proxy
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 38
Overlap of Knowledge Graphs
• Links between Knowledge Graphs are incomplete
– The Open World Assumption also holds for interlinks
• But we can estimate their number
• Approach:
– find link set automatically with different heuristics
– determine precision and recall on existing interlinks
– estimate actual number of links
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 39
Overlap of Knowledge Graphs
• Idea:
– Given that the link set F is found
– And the (unknown) actual link set would be C
• Precision P: Fraction of F which is actually correct
– i.e., measures how much |F| is over-estimating |C|
• Recall R: Fraction of C which is contained in F
– i.e., measures how much |F| is under-estimating |C|
• From that, we estimate $|C| = |F| \cdot P \cdot \frac{1}{R}$
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
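As a small worked example of this estimate, the following sketch computes |C| from |F|, P, and R. The function name and the input numbers are hypothetical and are not taken from the paper.

```python
def estimate_actual_links(found, precision, recall):
    """Estimate |C| as |F| * P / R.

    found: size of the automatically generated link set, |F|
    precision, recall: measured against the existing interlinks
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    return found * precision / recall

# Hypothetical numbers: 1,000 links found, 80% of them correct,
# covering half of the true links.
estimate = estimate_actual_links(found=1000, precision=0.8, recall=0.5)
print(estimate)
```

Intuitively, multiplying by P removes the wrong links from |F|, and dividing by R scales the remaining correct links up to the full set.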
10/17/22 Heiko Paulheim 40
Overlap of Knowledge Graphs
• Mathematical derivation:
– Definition of recall: $R = \frac{|F_{\text{correct}}|}{|C|}$ (with $|C|$ unknown)
– Definition of precision: $P = \frac{|F_{\text{correct}}|}{|F|}$
• Resolve both to $|F_{\text{correct}}|$, substitute, and resolve to $|C|$:
$|C| = |F| \cdot P \cdot \frac{1}{R}$
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 41
Overlap of Knowledge Graphs
• Experiment:
– We use the same 25 classes as before
– Measure 1: overlap relative to smaller KG (i.e., potential gain)
– Measure 2: overlap relative to explicit links (i.e., importance of improving links)
• Link generation with 16 different metrics and thresholds
– Intra-class correlation coefficient for |C|: 0.969
– Intra-class correlation coefficient for |F|: 0.646
• Bottom line:
– Despite variety in link sets generated, the overlap is estimated reliably
– The link generation mechanisms do not need to be overly accurate
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 42
Overlap of Knowledge Graphs
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 43
Overlap of Knowledge Graphs
• Summary findings:
– DBpedia and YAGO cover roughly the same instances (not very surprising)
– NELL is the most complementary to the others
– Existing interlinks are insufficient for out-of-the-box parallel usage
Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
10/17/22 Heiko Paulheim 44
Intermezzo: Knowledge Graph Creation Cost
• There are quite a few metrics for evaluating KGs
– size, degree, interlinking, quality, licensing, ...
Färber et al.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. SWJ 9(1), 2018
Zaveri et al.: Quality Assessment for Linked Open Data: A Survey. SWJ 7(1), 2016
10/17/22 Heiko Paulheim 45
Intermezzo: Knowledge Graph Creation Cost
• ...but what is the cost of a single statement?
Some back-of-the-envelope calculations...
Paulheim: How much is a triple? Estimating the Cost of Knowledge Graph Creation, 2018
10/17/22 Heiko Paulheim 46
Intermezzo: Knowledge Graph Creation Cost
• Case 1: manual curation
– Cyc: created by experts
• Total development cost: $120M
• Total #statements: 21M
→ $5.71 per statement
– Freebase: created by laymen
• Assumption: adding a statement to Freebase equals adding a sentence to Wikipedia
• English Wikipedia up to April 2011: 41M working hours (Geiger and Halfaker, 2013); size in April 2011: 3.6M pages, avg. 36.4 sentences each
• Using US minimum wage: $2.25 per sentence
→ $2.25 per statement
(Footnote: total cost of creating Freebase would be $6.75B)
(Acquisition by Google estimated at $60-300M)
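The arithmetic behind the two manual-curation figures can be retraced as follows. The hourly wage is an assumption of this sketch: the slide only says "US minimum wage", and the federal rate of $7.25 is used here. With that rate the Freebase figure lands close to, but not exactly on, the $2.25 stated above, so the source may have used slightly different inputs or rounding.

```python
# Figures taken from the slide.
CYC_COST_USD = 120e6
CYC_STATEMENTS = 21e6
WIKIPEDIA_HOURS = 41e6        # working hours up to April 2011
WIKIPEDIA_PAGES = 3.6e6       # size in April 2011
SENTENCES_PER_PAGE = 36.4

# Assumption: US federal minimum wage; the slide does not state the rate used.
ASSUMED_WAGE_USD = 7.25

def cost_per_unit(total_cost, units):
    """Cost of one statement (or sentence) given a total cost."""
    return total_cost / units

cyc_cost = cost_per_unit(CYC_COST_USD, CYC_STATEMENTS)
wikipedia_sentences = WIKIPEDIA_PAGES * SENTENCES_PER_PAGE
freebase_cost = cost_per_unit(WIKIPEDIA_HOURS * ASSUMED_WAGE_USD, wikipedia_sentences)

print(f"Cyc: ${cyc_cost:.2f} per statement")
print(f"Freebase (Wikipedia sentence as proxy): ${freebase_cost:.2f} per statement")
```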
10/17/22 Heiko Paulheim 47
Intermezzo: Knowledge Graph Creation Cost
• Case 2: automatic/heuristic creation
– DBpedia: 4.9M LOC, 2.2M LOC for mappings
• Software project development: ~37 LOC per hour (Devanbu et al., 1996)
• We use German PhD salaries as a cost estimate
→ 1.85c per statement
– YAGO: made from 1.6M LOC
• Uses WordNet: 117k synsets; we treat each synset like a Wiki page
→ 0.83c per statement
– NELL: 103k LOC
→ 14.25c per statement
• Compared to manual curation: saving factor 16-250
10/17/22 Heiko Paulheim 48
Intermezzo: Knowledge Graph Creation Cost
• Graph error rate against cost
– we can pay for accuracy
– NELL is a bit of an outlier
10/17/22 Heiko Paulheim 49
New Kids on the Block
Subjective age: measured by the fraction of the audience that understands a reference to the pop culture of your youth...
10/17/22 Heiko Paulheim 50
Enhancing the Coverage of Knowledge Graphs
• Study for KG-based Recommender Systems*
– DBpedia (likewise: YAGO) has a coverage of
• 85% for movies
• 63% for music artists
• 31% for books
*) Di Noia et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016
https://grouplens.org/datasets/
10/17/22 Heiko Paulheim 51
Enhancing the Coverage of Knowledge Graphs
• Only existing pages have categories
– Lists may also link to non-existing pages
10/17/22 Heiko Paulheim 52
Entity Extraction from List Pages
• Lists form (shallow) hierarchies
10/17/22 Heiko Paulheim 53
Entity Extraction from List Pages
• Idea: align with category graph
• Equivalence:
– “List of Japanese Writers” ↔ Category:Japanese Writers
• Subsumption:
– “List of Japanese Speculative Fiction Writers” → Category:Japanese Writers
10/17/22 Heiko Paulheim 54
Classifying Red Links
• Not all entities on a list page belong to the same category
• Idea:
– Learn a classifier to tell subject entities from non-subject entities
• Distant supervision approach
– Positive examples:
• Entities that are in the corresponding category
– Negative examples:
• Entities that are in a category which is disjoint
• e.g., Book <> Writer
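The labeling scheme for training data can be sketched like this. The function `label_list_entities`, its parameters, and the toy list page are hypothetical; the sketch covers only how positive and negative examples are derived, not the classifier itself.

```python
def label_list_entities(entities, entity_categories, list_category, disjoint_with):
    """Derive training labels for the entities mentioned on one list page.

    entities: names of entities mentioned on the list page
    entity_categories: dict mapping entities with an existing page to their categories
    list_category: category the list page is aligned with
    disjoint_with: set of categories assumed to be disjoint with list_category
    Returns a dict: True (positive), False (negative), or None (no label, e.g. red links).
    """
    labels = {}
    for entity in entities:
        categories = entity_categories.get(entity, set())
        if list_category in categories:
            labels[entity] = True
        elif categories & disjoint_with:
            labels[entity] = False
        else:
            labels[entity] = None  # left for the trained classifier to decide
    return labels

# Toy list page for the category "Writer"; Book is assumed disjoint with Writer.
labels = label_list_entities(
    entities=["Writer A", "Novel B", "Red Link C"],
    entity_categories={"Writer A": {"Writer"}, "Novel B": {"Book"}},
    list_category="Writer",
    disjoint_with={"Book"},
)
print(labels)
```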
10/17/22 Heiko Paulheim 55
Increasing Level of Detail
• YAGO uses categories for types
– e.g., Category:American Industrial Groups
– but does not analyze them further
• :NineInchNails a :AmericanIndustrialGroup
– “Things, not Strings”?
• :NineInchNails a :MusicalGroup ;
:hometown :United_States ;
:genre :Industrial .
10/17/22 Heiko Paulheim 56
Cat2Ax: Axiomatizing Wikipedia Categories
→ dbo:Album
→ dbo:artist.{dbr:Nine_Inch_Nails}
→ dbo:genre.{dbr:Rock_Music}
Heist & Paulheim (2019): Uncovering the Semantics of Wikipedia Categories
10/17/22 Heiko Paulheim 57
Cat2Ax: Axiomatizing Wikipedia Categories
→ dbo:genre.{dbr:Rock_Music} ?
→ dbo:artist.{dbr:Rock_(Rapper)} ?
10/17/22 Heiko Paulheim 58
Cat2Ax: Axiomatizing Wikipedia Categories
– Frequency: how often does the pattern occur in a category?
• i.e.: share of instances that have dbo:genre.{dbr:Rock_Music}?
– Lexical score: likelihood of term as a surface form of object
• i.e.: how often is Rock used to refer to dbr:Rock_Music?
– Sibling score: how likely are sibling categories sharing similar patterns?
• i.e., are there sibling categories with a high score for dbo:genre?
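Of the three scores, the frequency score is the easiest to illustrate. The sketch below assumes facts are stored as a set of triples and uses made-up identifiers (`ex:Album_A` and so on); it does not cover the lexical or sibling score, and it is not the Cat2Ax implementation.

```python
def pattern_frequency(category_members, facts, predicate, obj):
    """Share of a category's members that have the fact (member, predicate, obj)."""
    if not category_members:
        return 0.0
    hits = sum(1 for member in category_members if (member, predicate, obj) in facts)
    return hits / len(category_members)

# Toy category with three hypothetical album entities.
members = ["ex:Album_A", "ex:Album_B", "ex:Album_C"]
facts = {
    ("ex:Album_A", "dbo:artist", "dbr:Nine_Inch_Nails"),
    ("ex:Album_B", "dbo:artist", "dbr:Nine_Inch_Nails"),
    ("ex:Album_C", "dbo:artist", "dbr:Nine_Inch_Nails"),
    ("ex:Album_A", "dbo:genre", "dbr:Rock_Music"),
    ("ex:Album_B", "dbo:genre", "dbr:Rock_Music"),
}

artist_freq = pattern_frequency(members, facts, "dbo:artist", "dbr:Nine_Inch_Nails")
genre_freq = pattern_frequency(members, facts, "dbo:genre", "dbr:Rock_Music")
print(artist_freq, genre_freq)
```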
10/17/22 Heiko Paulheim 59
CaLiGraph Example
Category: Musical Groups established in 1987
List of symphonic metal bands
Category: Swedish death metal bands
List of Swedes in Music
10/17/22 Heiko Paulheim 60
Pushing Entity Coverage Further
• Beyond red links (2020)
• Beyond explicit lists (2021)
10/17/22 Heiko Paulheim 61
Entity Extraction from List Pages
• Red and grey links
– Red links point to entities that do not exist
– “Grey links”
• are actually not links
• i.e., entities to be discovered
10/17/22 Heiko Paulheim 62
Beyond List Pages
• Many pages contain list-like constructs
• Usually
– small
– same type
– same relation to page entity
– more grey links
…
10/17/22 Heiko Paulheim 63
Beyond List Pages
10/17/22 Heiko Paulheim 64
Beyond List Pages
• Learning descriptive rules for listings, e.g.
– topSection(“Discography”) → artist.{>PageEntity<}
– Learning across pages to mitigate small data problems
• Metrics:
– Support: no. of listings covered by rule antecedent
– Confidence: frequency of rule consequent over all covered listings
– Consistency: mean absolute deviation of overall confidence and listing confidence
• i.e., does the rule work equally well across all covered listings
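The three metrics can be sketched on toy data as follows. The listing representation (a dict with `top_section`, `n_entities`, and `n_consequent`) and the function name are assumptions of this sketch, and the consistency value is computed literally as the mean absolute deviation named on the slide. Every listing is assumed to contain at least one entity.

```python
def rule_metrics(listings, antecedent):
    """Return (support, confidence, deviation) for one rule.

    listings: list of dicts with 'n_entities' (> 0) and 'n_consequent',
              the number of entities for which the rule consequent holds
    antecedent: function deciding whether the rule covers a listing
    """
    covered = [lst for lst in listings if antecedent(lst)]
    support = len(covered)
    if support == 0:
        return 0, 0.0, 0.0
    total_entities = sum(lst["n_entities"] for lst in covered)
    confidence = sum(lst["n_consequent"] for lst in covered) / total_entities
    # Mean absolute deviation between overall and per-listing confidence.
    deviation = sum(
        abs(confidence - lst["n_consequent"] / lst["n_entities"]) for lst in covered
    ) / support
    return support, confidence, deviation

# Toy listings; the rule antecedent is topSection("Discography").
listings = [
    {"top_section": "Discography", "n_entities": 10, "n_consequent": 8},
    {"top_section": "Discography", "n_entities": 10, "n_consequent": 6},
    {"top_section": "Members", "n_entities": 5, "n_consequent": 0},
]
support, confidence, deviation = rule_metrics(
    listings, lambda lst: lst["top_section"] == "Discography"
)
print(support, confidence, deviation)
```

A low deviation means the rule behaves similarly on every listing it covers, which is what the consistency criterion asks for.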
10/17/22 Heiko Paulheim 65
CaLiGraph at a Glance
• Latest version 2.1
– 15M entities
• incl. 8M from listings
– Caveat:
• disambiguation!
10/17/22 Heiko Paulheim 66
Entity Disambiguation
• Examples: Wikipedia pages of Die Krupps and Eisbrecher
?
10/17/22 Heiko Paulheim 67
CaLiGraph Glitches
10/17/22 Heiko Paulheim 68
From DBpedia to DBkWik
• Wikipedia-based Knowledge Graphs will remain an essential building block of Semantic Web applications
• But they suffer from...
– ...a coverage bias
– ...limitations of the creating heuristics
10/17/22 Heiko Paulheim 69
From DBpedia to DBkWik
• One (but not the only!) possible source of coverage bias
– Articles about long-tail entities get deleted
10/17/22 Heiko Paulheim 70
From DBpedia to DBkWik
• Why stop at Wikipedia?
• Wikipedia is based on the MediaWiki software
– ...and so are thousands of Wikis
– Fandom by Wikia: >385,000 Wikis on special topics
– WikiApiary: reports >20,000 installations of MediaWiki on the Web
10/17/22 Heiko Paulheim 71
From DBpedia to DBkWik
• Collecting Data from a Multitude of Wikis
10/17/22 Heiko Paulheim 72
From DBpedia to DBkWik
• The DBpedia Extraction Framework consumes MediaWiki dumps
• Experiment (started as team project 2017)
– Can we process dumps from arbitrary Wikis with it?
– Are the results somewhat meaningful?
10/17/22 Heiko Paulheim 73
From DBpedia to DBkWik
• Example from Harry Potter Wiki
http://dbkwik.org/
10/17/22 Heiko Paulheim 74
From DBpedia to DBkWik
• Differences to DBpedia
– DBpedia has manually created mappings to an ontology
– Wikipedia has one page per subject
– Wikipedia has global infobox conventions (more or less)
• Challenges
– On-the-fly ontology creation
– Instance matching
– Schema matching
Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
10/17/22 Heiko Paulheim 75
From DBpedia to DBkWik
(Figure: DBkWik pipeline. Components shown: Dump Downloader, MediaWiki Dumps, DBpedia Extraction Framework, Extracted RDF, Interlinking (Instance Matcher, Schema Matcher), Internal Linking (Instance Matcher, Schema Matcher), Ontology and Knowledge Graph Fusion (Instance Matcher, Domain/Range, Type SDType Light, Subclass, Materialization), Consolidated Knowledge Graph, DBkWik Linked Data Endpoint)
• Heuristics
– Ontology induction
– Instance/Schema Matching
Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
10/17/22 Heiko Paulheim 76
From DBpedia to DBkWik
• Downloaded ~15k Wiki dumps from Fandom
– 52.4GB of data, roughly the size of the English Wikipedia
• Prototype: extracted data for ~250 Wikis
– 4.3M instances, ~750k linked to DBpedia
– 7k classes, ~1k linked to DBpedia
– 43k properties, ~20k linked to DBpedia
– ...including duplicates!
• Link quality
– Good for classes, OK for properties (F1 of .957 and .852)
– Needs improvement for instances (F1 of .641)
10/17/22 Heiko Paulheim 77
From DBpedia to DBkWik
• Scalability of matching:
– Pairwise matching does not scale
– 300k Wikis, 1 minute for each pair → 171k years
• Iteratively match and merge
– 300k Wikis, 1 minute for each match&merge run → 200 days
• Tree-shaped execution plan
– Parallelizable
– Hierarchical clustering by topic
– Whole run under a week
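The scalability figures can be retraced with a short calculation. Note that the 171k years on the slide corresponds to n² matching runs; counting every unordered pair only once would give roughly half of that. Which convention was intended is an inference from the numbers, and the function names are chosen for this sketch.

```python
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365

def pairwise_years(n_wikis, minutes_per_pair=1, ordered=True):
    """Years needed to match every pair of wikis, one after another."""
    if ordered:
        pairs = n_wikis * n_wikis
    else:
        pairs = n_wikis * (n_wikis - 1) // 2
    return pairs * minutes_per_pair / MINUTES_PER_YEAR

def iterative_days(n_wikis, minutes_per_run=1):
    """Days needed when each wiki is matched and merged into the result once."""
    return n_wikis * minutes_per_run / MINUTES_PER_DAY

n = 300_000
print(f"pairwise, n^2 runs:        {pairwise_years(n):,.0f} years")
print(f"pairwise, unordered pairs: {pairwise_years(n, ordered=False):,.0f} years")
print(f"iterative match and merge: {iterative_days(n):,.0f} days")
```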
10/17/22 Heiko Paulheim 78
WebIsALOD
• Background: Web table interpretation
• Most approaches need typing information
– DBpedia etc. have too little coverage on the long tail
– Wanted: extensive type database
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
10/17/22 Heiko Paulheim 79
WebIsALOD
• Extraction of type information using Hearst-like patterns, e.g.,
– T, such as X
– X, Y, and other T
• Text corpus: Common Crawl
– ~2 TB crawled web pages
– Fast implementation: regex over text
– “Expensive” operations only applied once regex has fired
• Resulting database
– 400M hypernymy relations
Seitner et al.: A large DataBase of hypernymy relations extracted from the Web. LREC 2016
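The two Hearst-like patterns can be sketched as regular expressions over plain text. This is a simplification and not the extraction code behind the dataset: terms are restricted to single words, and the pattern and function names are chosen for this example.

```python
import re

# Assumption for this sketch: a term is a single word.
TERM = r"[A-Za-z]+"
# "T, such as X": T is the type (hypernym), X the instance.
SUCH_AS = re.compile(rf"({TERM}),? such as ({TERM})")
# "X, Y, and other T"
AND_OTHER = re.compile(rf"({TERM}), ({TERM}),? and other ({TERM})")

def extract_hypernyms(sentence):
    """Return (instance, type) pairs found by the two patterns."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        pairs.append((m.group(2), m.group(1)))
    for m in AND_OTHER.finditer(sentence):
        pairs.append((m.group(1), m.group(3)))
        pairs.append((m.group(2), m.group(3)))
    return pairs

print(extract_hypernyms("bands such as Slipknot"))
print(extract_hypernyms("Filter, Slipknot, and other bands"))
```

In line with the slide, a cheap regex like this would only act as a first filter, with more expensive processing applied to the sentences where it fires.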
10/17/22 Heiko Paulheim 80
WebIsALOD
• Example:
http://webisa.webdatacommons.org/
10/17/22 Heiko Paulheim 81
WebIsALOD
• Initial effort: transformation to a LOD dataset
– including rich provenance information
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
10/17/22 Heiko Paulheim 82
WebIsALOD
• Estimated contents breakdown
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
10/17/22 Heiko Paulheim 83
WebIsALOD
• Main challenge
– Original dataset is quite noisy (<10% correct statements)
– Recap: coverage vs. accuracy
– Simple thresholding removes too much knowledge
• Approach
– Train RandomForest model for predicting correct vs. wrong statements
– Using all the provenance information we have
– Use model to compute confidence scores
• Ongoing research
– Using transformers and a larger training set
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
10/17/22 Heiko Paulheim 84
WebIsALOD
• Current challenges and works in progress
– Distinguishing instances and classes
• i.e.: subclass-of vs. instance-of relations
– Splitting instances
• Bauhaus is a goth band
• Bauhaus is a German school
– Knowledge extraction from pre- and post-modifiers
• Bauhaus is a goth band → genre(Bauhaus, Goth)
• Bauhaus is a German school → location(Bauhaus, Germany)
Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
10/17/22 Heiko Paulheim 85
Summary
• We have seen a couple of Knowledge Graphs
– How they are built
– What they contain
• For your project
– Have a look at the fit for your domain
– Try different options
• For a master’s thesis later
– Work on recent developments in our group
10/17/22 Heiko Paulheim 86
Questions?

The discovery of knowledge graphs and their utility in biotech

  • 1. Knowledge Graphs Public Knowledge Graphs Heiko Paulheim
  • 2. 10/17/22 Heiko Paulheim 2 Previously on “Knowledge Graphs” • Principles: – RDF, RDF-S, SPARQL & co – Linked Open Data • Today – A closer look on actually existing knowledge graphs – Some useful, large-scale resources
  • 3. 10/17/22 Heiko Paulheim 3 Introduction • Knowledge Graphs out there (not guaranteed to be complete) public private Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
  • 4. 10/17/22 Heiko Paulheim 4 Knowledge Graph Creation: CyC • The beginning – Encyclopedic collection of knowledge – Started by Douglas Lenat in 1984 – Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge • The present (as of June 2017) – ~1,000 person years, $120M total development cost – 21M axioms and rules – Used to exist until 2017
  • 5. 10/17/22 Heiko Paulheim 5 Knowledge Graph Creation: CyC
  • 6. 10/17/22 Heiko Paulheim 6 Knowledge Graph Creation • Lesson learned no. 1: – Trading efforts against accuracy Min. efforts Max. accuracy
  • 7. 10/17/22 Heiko Paulheim 7 Knowledge Graph Creation: Freebase • The 2000s – Freebase: collaborative editing – Schema not fixed • Present – Acquired by Google in 2010 – Powered first version of Google’s Knowledge Graph – Shut down in 2016 – Partly lives on in Wikidata (see in a minute) coming up soon: was it a good deal or not?
  • 8. 10/17/22 Heiko Paulheim 8 Knowledge Graph Creation: Freebase • Community based • Like Wikipedia, but more structured
  • 9. 10/17/22 Heiko Paulheim 9 Knowledge Graph Creation • Lesson learned no. 2: – Trading formality against number of users Max. user involvement Max. degree of formality
  • 10. 10/17/22 Heiko Paulheim 10 Knowledge Graph Creation: Wikidata • The 2010s – Wikidata: launched 2012 – Goal: centralize data from Wikipedia languages – Collaborative – Imports other datasets • Present – One of the largest public knowledge graphs (see later) – Includes rich provenance
  • 11. 10/17/22 Heiko Paulheim 11 Knowledge Graph Creation: Wikidata • Collaborative editing
  • 12. 10/17/22 Heiko Paulheim 12 Knowledge Graph Creation: Wikidata • Provenance
  • 13. 10/17/22 Heiko Paulheim 13 Wikidata
  • 14. 10/17/22 Heiko Paulheim 14 Knowledge Graph Creation • Lesson learned no. 3: – There is not one truth (but allowing for plurality adds complexity) Max. simplicity Max. support for plurality
  • 15. 10/17/22 Heiko Paulheim 15 Knowledge Graph Creation: DBpedia & YAGO • The 2010s – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  • 17. 10/17/22 Heiko Paulheim 17 DBpedia Lehmann et al.: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. 2014
  • 20. 10/17/22 Heiko Paulheim 20 YAGO • Wikipedia categories for types – Plus WordNet as upper structure • Manual mappings for properties https://www.cs.princeton.edu/courses/archive/spring07/cos226/assignments/wordnet.html
  • 23. 10/17/22 Heiko Paulheim 23 Knowledge Graph Creation • Lesson learned no. 4: – Heuristics help increasing coverage (at the cost of accuracy) Max. accuracy Max. coverage
  • 24. 10/17/22 Heiko Paulheim 24 Knowledge Graph Creation: NELL • The 2010s – NELL: Never ending language learner – Input: ontology, seed examples, text corpus – Output: facts, text patterns – Large degree of automation, occasional human feedback • Until 2018 – Continuously ran for ~8 years – New release every few days http://rtw.ml.cmu.edu/rtw/overview
  • 25. 10/17/22 Heiko Paulheim 25 Knowledge Graph Creation: NELL • Extraction of a Knowledge Graph from a Text Corpus Nine Inch Nails singer Trent Reznor, born 1965 ...as stated by Filter singer Richard Patrick ...says Slipknot singer Corey Taylor, 44, in the interview. “X singer Y” → band_member(X,Y) band_member(Nine_Inch_Nails, Trent_Reznor) band_member(Filter,Richard_Patrick) band_member(Slipknot,Corey_Taylor) patterns facts
  • 26. 10/17/22 Heiko Paulheim 26 Knowledge Graph Creation: NELL
  • 27. 10/17/22 Heiko Paulheim 27 Knowledge Graph Creation • Lesson learned no. 5: – Quality cannot be maximized without human intervention Min. human intervention Max. accuracy
  • 28. 10/17/22 Heiko Paulheim 28 Summary of Trade Offs • (Manual) effort vs. accuracy and completeness • User involvement (or usability) vs. degree of formality • Simplicity vs. support for plurality and provenance → all those decisions influence the shape of a knowledge graph!
  • 29. 10/17/22 Heiko Paulheim 29 Non-Public Knowledge Graphs • Many companies have their own private knowledge graphs – Google: Knowledge Graph, Knowledge Vault – Yahoo!: Knowledge Graph – Microsoft: Satori – Facebook: Entities Graph – Thomson Reuters: permid.org (partly public) • However, we usually know only little about them See: Noy et al. (2019): Industry-scale Knowledge Graphs: Lessons and Challenges: Five diverse technology companies show how it’s done
  • 30. 10/17/22 Heiko Paulheim 30 Comparison of Knowledge Graphs • Release cycles Instant updates: DBpedia live, Freebase Wikidata Days: NELL Months: DBpedia Years: YAGO Cyc • Size and density Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017 Caution!
  • 31. 10/17/22 Heiko Paulheim 31 Comparison of Knowledge Graphs • What do they actually contain? • Experiment: pick 25 classes of interest – And find them in respective ontologies • Count instances (coverage) • Determine in and out degree (level of detail)
  • 32. 10/17/22 Heiko Paulheim 32 Comparison of Knowledge Graphs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 33. 10/17/22 Heiko Paulheim 33 Comparison of Knowledge Graphs • Summary findings: – Persons: more in Wikidata (twice as many persons as DBpedia and YAGO) – Countries: more details in Wikidata – Places: most in DBpedia – Organizations: most in YAGO – Events: most in YAGO – Artistic works: • Wikidata contains more movies and albums • YAGO contains more songs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 34. 10/17/22 Heiko Paulheim 34 Caveats • Reading the diagrams right… • So, Wikidata contains more persons – but fewer instances of all the interesting subclasses? • There are classes like Actor in Wikidata – but they are hardly used – rather, actors are modeled using the profession relation
  • 35. 10/17/22 Heiko Paulheim 35 Caveats • Reading the diagrams right… (ctd.) • So, Wikidata contains more data on countries, but fewer countries? • First: Wikidata only counts current, actual countries – DBpedia and YAGO also count historical countries • Second: we count single facts about countries – Wikidata records some time-indexed information, e.g., population – Each point in time contributes a fact • “KG1 contains fewer X than KG2” can mean – it actually contains fewer instances of X – it contains equally many or more instances, but they are not typed with X (see later)
  • 36. 10/17/22 Heiko Paulheim 36 Overlap of Knowledge Graphs • To what extent do knowledge graphs overlap? • They are interlinked, so we can simply count links – For NELL, we use links to Wikipedia as a proxy • (Diagram: interlinks between DBpedia, YAGO, Wikidata, NELL, and OpenCyc) Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 38. 10/17/22 Heiko Paulheim 38 Overlap of Knowledge Graphs • Links between Knowledge Graphs are incomplete – The Open World Assumption also holds for interlinks • But we can estimate their number • Approach: – find link set automatically with different heuristics – determine precision and recall on existing interlinks – estimate actual number of links Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 39. 10/17/22 Heiko Paulheim 39 Overlap of Knowledge Graphs • Idea: – Given that the link set F is found – And the (unknown) actual link set would be C • Precision P: Fraction of F which is actually correct – i.e., measures how much |F| is over-estimating |C| • Recall R: Fraction of C which is contained in F – i.e., measures how much |F| is under-estimating |C| • From that, we estimate $|C| = |F| \cdot P \cdot \frac{1}{R}$ Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
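  • 40. 10/17/22 Heiko Paulheim 40 Overlap of Knowledge Graphs • Mathematical derivation: – Definition of recall: $R = \frac{|F_{\text{correct}}|}{|C|}$ – Definition of precision: $P = \frac{|F_{\text{correct}}|}{|F|}$ • Both $|F_{\text{correct}}|$ and $|C|$ are unknown: resolve both definitions to $|F_{\text{correct}}|$, substitute, and resolve to $|C|$: $|F_{\text{correct}}| = R \cdot |C| = P \cdot |F|$, hence $|C| = |F| \cdot P \cdot \frac{1}{R}$ Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017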
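As a quick worked example of the estimate |C| = |F| · P / R, here is a minimal Python sketch. The function name `estimate_true_links` and the numbers in the example call are made up for illustration; they are not results from the cited study.

```python
def estimate_true_links(found: int, precision: float, recall: float) -> float:
    """Estimate |C| from |F|, P and R via |C| = |F| * P / R.

    P and R are assumed to have been measured on the existing interlinks.
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be in [0, 1]")
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    # |F| * P is the estimated number of correct links found;
    # dividing by R scales that up to the full (unknown) link set.
    return found * precision / recall


# Illustrative values only: a heuristic finds 1,000 links,
# 80% of them are correct, and it recovers half of the known interlinks.
estimated = estimate_true_links(found=1000, precision=0.8, recall=0.5)
```

With these illustrative inputs the estimate is about 1,600 links: 800 of the found links are presumably correct, and they represent roughly half of all true links.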
  • 41. 10/17/22 Heiko Paulheim 41 Overlap of Knowledge Graphs • Experiment: – We use the same 25 classes as before – Measure 1: overlap relative to smaller KG (i.e., potential gain) – Measure 2: overlap relative to explicit links (i.e., importance of improving links) • Link generation with 16 different metrics and thresholds – Intra-class correlation coefficient for |C|: 0.969 – Intra-class correlation coefficient for |F|: 0.646 • Bottom line: – Despite variety in link sets generated, the overlap is estimated reliably – The link generation mechanisms do not need to be overly accurate Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 42. 10/17/22 Heiko Paulheim 42 Overlap of Knowledge Graphs Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 43. 10/17/22 Heiko Paulheim 43 Overlap of Knowledge Graphs • Summary findings: – DBpedia and YAGO cover roughly the same instances (not very surprising) – NELL is the most complementary to the others – Existing interlinks are insufficient for out-of-the-box parallel usage Ringler & Paulheim: One Knowledge Graph to Rule them All? KI 2017
  • 44. 10/17/22 Heiko Paulheim 44 Intermezzo: Knowledge Graph Creation Cost • There are quite a few metrics for evaluating KGs – size, degree, interlinking, quality, licensing, ... Färber et al.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO SWJ 9(1), 2018 Zaveri et al.: Quality Assessment for Linked Open Data: A Survey. SWJ 7(1), 2016
  • 45. 10/17/22 Heiko Paulheim 45 Intermezzo: Knowledge Graph Creation Cost • ...but what is the cost of a single statement? Some back of the envelope calculations... Paulheim: How much is a triple? Estimating the Cost of Knowledge Graph Creation, 2018
  • 46. 10/17/22 Heiko Paulheim 46 Intermezzo: Knowledge Graph Creation Cost • Case 1: manual curation – Cyc: created by experts • Total development cost: $120M • Total #statements: 21M • → $5.71 per statement – Freebase: created by laymen • Assumption: adding a statement to Freebase equals adding a sentence to Wikipedia • English Wikipedia up to April 2011: 41M working hours (Geiger and Halfaker, 2013); size in April 2011: 3.6M pages, avg. 36.4 sentences each • Using US minimum wage: $2.25 per sentence → $2.25 per statement • Footnote: total cost of creating Freebase would be $6.75B; acquisition by Google estimated as $60-300M
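The back-of-the-envelope arithmetic for Case 1 can be retraced in Python. This is a sketch, not the original calculation: the hourly wage of $7.25 (the US federal minimum wage) is an assumption, since the slide does not state the wage it used, and all variable names are hypothetical.

```python
def cost_per_statement(total_cost_usd: float, n_statements: float) -> float:
    """Average cost of one statement in US dollars."""
    return total_cost_usd / n_statements


# Cyc: $120M development cost, 21M statements.
cyc_cost = cost_per_statement(120e6, 21e6)

# Freebase proxy: one Wikipedia sentence is treated like one statement.
wikipedia_sentences = 3.6e6 * 36.4          # pages * avg. sentences per page
ASSUMED_HOURLY_WAGE_USD = 7.25              # assumption, not given on the slide
sentence_cost = 41e6 * ASSUMED_HOURLY_WAGE_USD / wikipedia_sentences

# The footnote's $6.75B total at $2.25 per statement implies this many statements.
implied_freebase_statements = 6.75e9 / 2.25

print(f"Cyc: ${cyc_cost:.2f} per statement")
print(f"Wikipedia proxy: ${sentence_cost:.2f} per sentence")
print(f"Implied Freebase size: {implied_freebase_statements:.2e} statements")
```

Under the assumed wage, the per-sentence cost lands close to the slide's $2.25; the small difference presumably comes from rounding in the inputs or a slightly different wage figure.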
  • 47. 10/17/22 Heiko Paulheim 47 Intermezzo: Knowledge Graph Creation Cost • Case 2: automatic/heuristic creation – DBpedia: 4.9M LOC, 2.2M LOC for mappings • Software project development: ~37 LOC per hour (Devanbu et al., 1996) • We use German PhD salaries as a cost estimate • → 1.85c per statement – YAGO: made from 1.6M LOC • Uses WordNet: 117k synsets; we treat each synset like a Wiki page • → 0.83c per statement – NELL: 103k LOC • → 14.25c per statement • Compared to manual curation: saving factor 16-250
  • 48. 10/17/22 Heiko Paulheim 48 Intermezzo: Knowledge Graph Creation Cost • Graph error rate against cost – we can pay for accuracy – NELL is a bit of an outlier
  • 49. 10/17/22 Heiko Paulheim 49 New Kids on the Block Subjective age: Measured by the fraction of the audience that understands a reference to your young days’ pop culture...
  • 50. 10/17/22 Heiko Paulheim 50 Enhancing the Coverage of Knowledge Graphs • Study for KG-based Recommender Systems* – DBpedia (likewise: YAGO) has a coverage of • 85% for movies • 63% for music artists • 31% for books *) Di Noia et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016 https://grouplens.org/datasets/
  • 51. 10/17/22 Heiko Paulheim 51 Enhancing the Coverage of Knowledge Graphs • Only existing pages have categories – Lists may also link to non-existing pages
  • 52. 10/17/22 Heiko Paulheim 52 Entity Extraction from List Pages • Lists form (shallow) hierarchies
  • 53. 10/17/22 Heiko Paulheim 53 Entity Extraction from List Pages • Idea: align with category graph • Equivalence: – “List of Japanese Writers” ↔ Category:Japanese Writers • Subsumption: – “List of Japanese Speculative Fiction Writers” → Category:Japanese Writers
  • 54. 10/17/22 Heiko Paulheim 54 Classifying Red Links • Not all entities on a list page belong to the same category • Idea: – Learn classifier to tell subject entities from non-subject entities • Distant supervision approach – Positive examples: • Entities that are in the corresponding category – Negative examples: • Entities that are in a category which is disjoint • e.g., Book <> Writer
  • 55. 10/17/22 Heiko Paulheim 55 Increasing Level of Detail • YAGO uses categories for types – e.g., Category:American Industrial Groups – but does not analyze them further • :NineInchNails a :AmericanIndustrialGroup – “Things, not Strings”? • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  • 56. 10/17/22 Heiko Paulheim 56 Cat2Ax: Axiomatizing Wikipedia Categories  dbo:Album  dbo:artist.{dbr:Nine_Inch_Nails}  dbo:genre.{dbr:Rock_Music} Heist & Paulheim (2019): Uncovering the Semantics of Wikipedia Categories
  • 57. 10/17/22 Heiko Paulheim 57 Cat2Ax: Axiomatizing Wikipedia Categories  dbo:genre.{dbr:Rock_Music} ?  dbo:artist.{dbr:Rock_(Rapper)} ?
  • 58. 10/17/22 Heiko Paulheim 58 Cat2Ax: Axiomatizing Wikipedia Categories – Frequency: how often does the pattern occur in a category? • i.e.: what share of instances have dbo:genre.{dbr:Rock_Music}? – Lexical score: likelihood of term as a surface form of object • i.e.: how often is Rock used to refer to dbr:Rock_Music? – Sibling score: how likely is it that sibling categories share similar patterns? • i.e., are there sibling categories with a high score for dbo:genre?
  • 59. 10/17/22 Heiko Paulheim 59 CaLiGraph Example Category: Musical Groups established in 1987 List of symphonic metal bands Category: Swedish death metal bands List of Swedes in Music
  • 60. 10/17/22 Heiko Paulheim 60 Pushing Entity Coverage Further • Beyond red links (2020) • Beyond explicit lists (2021)
  • 61. 10/17/22 Heiko Paulheim 61 Entity Extraction from List Pages • Red and grey links – Red links point to entities that do not exist – “Grey links” • are actually not links • i.e., entities to be discovered
  • 62. 10/17/22 Heiko Paulheim 62 Beyond List Pages • Many pages contain list-like constructs • Usually – small – same type – same relation to page entity – more grey links …
  • 63. 10/17/22 Heiko Paulheim 63 Beyond List Pages
  • 64. 10/17/22 Heiko Paulheim 64 Beyond List Pages • Learning descriptive rules for listings, e.g. – topSection(“Discography”) → artist.{>PageEntity<} – Learning across pages to mitigate small data problems • Metrics: – Support: no. of listings covered by rule antecedent – Confidence: frequency of rule consequent over all covered listings – Consistency: mean absolute deviation of overall confidence and listing confidence • i.e., does the rule work equally well across all covered listings
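The three rule metrics can be illustrated with a small Python sketch. It follows the wording on the slide under several assumptions: each listing is represented by its top section name and a list of booleans saying whether each already-known entity satisfies the rule consequent, and the consistency value is reported as the raw mean absolute deviation. The data layout and the name `rule_metrics` are hypothetical and not taken from the CaLiGraph implementation.

```python
def rule_metrics(listings, section):
    """Compute (support, confidence, deviation) for a rule
    topSection(section) -> consequent.

    listings: iterable of (top_section, flags), where flags[i] is True if the
    i-th known entity of that listing satisfies the consequent.
    """
    # Listings covered by the rule antecedent (ignoring empty ones).
    covered = [flags for sec, flags in listings if sec == section and flags]
    support = len(covered)
    if support == 0:
        return 0, 0.0, 0.0

    # Overall confidence: share of entities satisfying the consequent
    # across all covered listings.
    total_entities = sum(len(flags) for flags in covered)
    confidence = sum(sum(flags) for flags in covered) / total_entities

    # Mean absolute deviation between overall and per-listing confidence;
    # a small value means the rule works similarly well on every listing.
    per_listing = [sum(flags) / len(flags) for flags in covered]
    deviation = sum(abs(confidence - c) for c in per_listing) / support
    return support, confidence, deviation


example_listings = [
    ("Discography", [True, True, True, False]),
    ("Discography", [True, True]),
    ("Members", [False]),
]
support, confidence, deviation = rule_metrics(example_listings, "Discography")
```

In this toy example the rule covers two listings, five of the six known entities satisfy the consequent, and the per-listing confidences (0.75 and 1.0) deviate from the overall confidence by 0.125 on average.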
  • 65. 10/17/22 Heiko Paulheim 65 CaLiGraph at a Glance • Latest version 2.1 – 15M entities • incl. 8M from listings – Caveat: • disambiguation!
  • 66. 10/17/22 Heiko Paulheim 66 Entity Disambiguation • Examples: Wikipedia pages of Die Krupps and Eisbrecher ?
  • 67. 10/17/22 Heiko Paulheim 67 CaLiGraph Glitches
  • 68. 10/17/22 Heiko Paulheim 68 From DBpedia to DBkWik • Wikipedia-based Knowledge Graphs will remain an essential building block of Semantic Web applications • But they suffer from... – ...a coverage bias – ...limitations of the creating heuristics
  • 69. 10/17/22 Heiko Paulheim 69 From DBpedia to DBkWik • One (but not the only!) possible source of coverage bias – Articles about long-tail entities get deleted
  • 70. 10/17/22 Heiko Paulheim 70 From DBpedia to DBkWik • Why stop at Wikipedia? • Wikipedia is based on the MediaWiki software – ...and so are thousands of Wikis – Fandom by Wikia: >385,000 Wikis on special topics – WikiApiary: reports >20,000 installations of MediaWiki on the Web
  • 71. 10/17/22 Heiko Paulheim 71 From DBpedia to DBkWik • Collecting Data from a Multitude of Wikis
  • 72. 10/17/22 Heiko Paulheim 72 From DBpedia to DBkWik • The DBpedia Extraction Framework consumes MediaWiki dumps • Experiment (started as team project 2017) – Can we process dumps from arbitrary Wikis with it? – Are the results somewhat meaningful?
  • 73. 10/17/22 Heiko Paulheim 73 From DBpedia to DBkWik • Example from Harry Potter Wiki http://dbkwik.org/
  • 74. 10/17/22 Heiko Paulheim 74 From DBpedia to DBkWik • Differences to DBpedia – DBpedia has manually created mappings to an ontology – Wikipedia has one page per subject – Wikipedia has global infobox conventions (more or less) • Challenges – On-the-fly ontology creation – Instance matching – Schema matching Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
  • 75. 10/17/22 Heiko Paulheim 75 From DBpedia to DBkWik • Pipeline components (from the diagram): – Dump Downloader, MediaWiki Dumps – DBpedia Extraction Framework, Extracted RDF – Interlinking: Instance Matcher, Schema Matcher – Internal Linking: Instance Matcher, Schema Matcher – Ontology and Knowledge Graph Fusion: Instance Matcher, Domain/Range, Type (SDType Light), Subclass, Materialization – Consolidated Knowledge Graph, DBkWik Linked Data Endpoint • Heuristics – Ontology induction – Instance/Schema Matching Hertling & Paulheim: DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis. ICBK 2018
  • 76. 10/17/22 Heiko Paulheim 76 From DBpedia to DBkWik • Downloaded ~15k Wiki dumps from Fandom – 52.4GB of data, roughly the size of the English Wikipedia • Prototype: extracted data for ~250 Wikis – 4.3M instances, ~750k linked to DBpedia – 7k classes, ~1k linked to DBpedia – 43k properties, ~20k linked to DBpedia – ...including duplicates! • Link quality – Good for classes, OK for properties (F1 of .957 and .852) – Needs improvement for instances (F1 of .641)
  • 77. 10/17/22 Heiko Paulheim 77 From DBpedia to DBkWik • Scalability of matching: – Pairwise matching does not scale – 300k Wikis, 1 minute for each pair → 171k years • Iteratively match and merge – 300k Wikis, 1 minute for each match&merge run → 200 days • Tree-shaped execution plan – Parallelizable – Hierarchical clustering by topic – Whole run under a week
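The scalability figures can be checked with a few lines of Python. Note the assumption made explicit in the code: the slide's 171k years matches counting n² ordered pairs; counting each unordered pair once would halve it. Function names are hypothetical, and the tree-shaped plan is not modeled.

```python
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def pairwise_years(n_wikis: int, minutes_per_pair: float = 1.0,
                   ordered: bool = True) -> float:
    """Runtime in years when every pair of wikis is matched separately."""
    if ordered:
        pairs = n_wikis * n_wikis               # n^2, as the slide's figure suggests
    else:
        pairs = n_wikis * (n_wikis - 1) // 2    # each unordered pair once
    return pairs * minutes_per_pair / MINUTES_PER_YEAR


def iterative_days(n_wikis: int, minutes_per_run: float = 1.0) -> float:
    """Runtime in days when wikis are matched and merged one after another."""
    return n_wikis * minutes_per_run / MINUTES_PER_DAY


n = 300_000
print(f"pairwise (n^2 pairs):       {pairwise_years(n):,.0f} years")
print(f"pairwise (unordered pairs): {pairwise_years(n, ordered=False):,.0f} years")
print(f"iterative match & merge:    {iterative_days(n):,.0f} days")
```

The quadratic growth is what makes pairwise matching infeasible; the iterative variant is linear in the number of wikis but still sequential, which is the motivation for the parallelizable tree-shaped plan.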
  • 78. 10/17/22 Heiko Paulheim 78 WebIsALOD • Background: Web table interpretation • Most approaches need typing information – DBpedia etc. have too little coverage on the long tail – Wanted: extensive type database Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 79. 10/17/22 Heiko Paulheim 79 WebIsALOD • Extraction of type information using Hearst-like patterns, e.g., – T, such as X – X, Y, and other T • Text corpus: Common Crawl – ~2 TB of crawled web pages – Fast implementation: regex over text – “Expensive” operations only applied once a regex has fired • Resulting database – 400M hypernymy relations Seitner et al.: A large DataBase of hypernymy relations extracted from the Web. LREC 2016
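The two Hearst-like patterns named on this slide can be sketched as regular expressions in Python. This is a deliberately simplified illustration: terms are restricted to single words, the regexes and the function name `extract_hypernyms` are hypothetical, and the actual extraction pipeline uses many more patterns plus further linguistic processing.

```python
import re

# "T, such as X" (comma optional); single-word terms only, as a simplification.
SUCH_AS = re.compile(r"(\w+),? such as (\w+)")

# "X, Y, and other T": a comma-separated list followed by "and other T".
AND_OTHER = re.compile(r"((?:\w+, )*\w+),? and other (\w+)")


def extract_hypernyms(text):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        pairs.append((match.group(2), match.group(1)))
    for match in AND_OTHER.finditer(text):
        hypernym = match.group(2)
        for hyponym in match.group(1).split(", "):
            pairs.append((hyponym, hypernym))
    return pairs


pairs_such_as = extract_hypernyms("goth bands, such as Bauhaus")
pairs_and_other = extract_hypernyms("He likes Bauhaus, Filter, and other bands.")
```

Cheap regex matching like this acts as a filter, so that more expensive steps only need to run on the text spans where a pattern has fired.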
  • 80. 10/17/22 Heiko Paulheim 80 WebIsALOD • Example: http://webisa.webdatacommons.org/
  • 81. 10/17/22 Heiko Paulheim 81 WebIsALOD • Initial effort: transformation to a LOD dataset – including rich provenance information Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 82. 10/17/22 Heiko Paulheim 82 WebIsALOD • Estimated contents breakdown Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 83. 10/17/22 Heiko Paulheim 83 WebIsALOD • Main challenge – Original dataset is quite noisy (<10% correct statements) – Recap: coverage vs. accuracy – Simple thresholding removes too much knowledge • Approach – Train RandomForest model for predicting correct vs. wrong statements – Using all the provenance information we have – Use model to compute confidence scores • Current ongoing research – Using transformers and a larger training set Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 84. 10/17/22 Heiko Paulheim 84 WebIsALOD • Current challenges and works in progress – Distinguishing instances and classes • i.e.: subclass vs. instance of relations – Splitting instances • Bauhaus is a goth band • Bauhaus is a German school – Knowledge extraction from pre and post modifiers • Bauhaus is a goth band → genre(Bauhaus, Goth) • Bauhaus is a German school → location(Bauhaus, Germany) Hertling & Paulheim: WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data. ISWC 2017
  • 85. 10/17/22 Heiko Paulheim 85 Summary • We have seen a couple of Knowledge Graphs – How they are built – What they contain • For your project – Have a look at the fit for your domain – Try different options • For a master’s thesis later – Work on recent developments in our group
  • 86. 10/17/22 Heiko Paulheim 86 Questions?