Creative Commons CC BY 3.0:
allowed to share & remix
(also commercial)
but must attribute
Frank van Harmelen
The empirical turn
in
Knowledge Representation
Contributions from many people
in the KR&R group over many years.
And thanks to NWO for
a 750k€ TOP grant for this
KR in the pre-empirical era
Handbook of Knowledge Representation
(1000 pages, ToC alone is 14 pages)
• propositional logic &
satisfiability solvers
• first-order logic &
resolution
• description logic
• constraint (logic)
programming
• nonmonotonic reasoning
• belief revision
• qualitative reasoning
• model-based diagnosis
• Bayesian networks
• temporal logic
• spatial reasoning
• epistemic logic
• deontic logic
• situation calculus
• default logic
• event calculus
• ……
KR metrics
in the pre-empirical era
KR = logic
• Show small examples
• Prove properties
(expressivity, complexity)
• Give algorithms
(sound, complete)
KR = engineering
• Build applications
• Show high performance
• Show low engineering
costs
BUT AN EXPERIMENT
IN THE PAST 10 YEARS
MADE IT POSSIBLE
TO DO SOMETHING VERY DIFFERENT:
OBSERVE HOW
KNOWLEDGE REPRESENTATIONS BEHAVE
AT VERY LARGE SCALE
Rest of the talk
• Which KR’s were part of the experiment?
• How much of it was there to observe?
• How did we manage to observe it?
• What did we learn from observing it?
Which KR’s ?
RDF (for non-logicians)
RDF (for logicians)
• Ground binary predicates: $P(O_1, O_2)$
• Limited existential variables: $\exists x:\ P(C_1, x) \land P(C_2, x)$
• Type is a unary predicate: $T_i(x)$
• Subtypes: $\forall x:\ T_1(x) \rightarrow T_2(x)$
• Type restrictions: $\forall x, y:\ P(x, y) \rightarrow T_1(x) \land T_2(y)$
• Equality: $O_1 = O_2$
• Extensions to DL:
– Disjointness of types
– Cardinality restrictions (0,1)
– always decidable: sub-FOL.
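The subtype and type-restriction rules above can be applied mechanically. Below is a minimal sketch of such a deduction step, assuming facts are stored as (subject, predicate, object) tuples. The predicate names `type`, `subClassOf`, `domain` and `range` are shorthand chosen for this illustration, and the classes `Drug`, `Symptom` and `Substance` are made up; it is not the reasoner used in the talk.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def rdfs_closure(facts: Set[Triple]) -> Set[Triple]:
    """Forward-chain the subtype and type-restriction rules to a fixpoint."""
    closure = set(facts)
    while True:
        new: Set[Triple] = set()
        for (s, p, o) in closure:
            if p == "type":
                # Subtype rule: T1(x) and T1 subClassOf T2 give T2(x).
                for (t1, q, t2) in closure:
                    if q == "subClassOf" and t1 == o:
                        new.add((s, "type", t2))
            # Type restrictions: P(x, y) gives T1(x) (domain) and T2(y) (range).
            for (p2, q, t) in closure:
                if p2 != p:
                    continue
                if q == "domain":
                    new.add((s, "type", t))
                elif q == "range":
                    new.add((o, "type", t))
        if new <= closure:
            return closure  # nothing new was derived: fixpoint reached
        closure |= new


# Illustrative facts only; the class names are assumptions.
facts = {
    ("aspirin", "treats", "headache"),
    ("treats", "domain", "Drug"),
    ("treats", "range", "Symptom"),
    ("Drug", "subClassOf", "Substance"),
}
derived = rdfs_closure(facts)
```

Here `derived` additionally contains that aspirin is a `Drug` and a `Substance`, and that headache is a `Symptom`. The nested loops are quadratic per round, which is fine for a toy example but not for billions of facts.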
RDF deduction
OWL Semantics
How much is there
to observe?
± 45-100 billion facts
1 fact
How big is 100 billion?
Denny Vrandečić – AIFB, Universität Karlsruhe ≈ 1 fact per web-page
100 billion golfballs ≈ Jupiter
[Figure: a fact [<x> IsOfType <T>], where x and T have different owners & locations; example type: <analgesic>]
BTW: How did it get so big?
On the Web,
anybody can say anything about anything
How did you
manage to
observe it?
LOD Laundromat
Beek & Rietveld et al. 2014,
LOD laundromat: a uniform way of
publishing other people's dirty data
http://lodlaundromat.org/pdf/lodlaundry.pdf
HDT
Fernández & Martínez-Prieto &
Gutiérrez, 2013, Binary RDF
representation for publication and
exchange (HDT)
LDF
Verborgh & Vander Sande et al.
2014, Web-Scale Querying through
Linked Data Fragments
LOD-a-lot
http://lod-a-lot.lod.labs.vu.nl/
Surprisingly efficient
1 file
28,362,198,927 unique triples
>650K data documents
524 GB of disk space
16 GB of RAM
Only €305,- hardware cost
Meta-Data for a lot of LOD
http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
Statistics (boring)
triples 28,362,198,927
subjects 3,214,347,198
predicates 1,168,932
objects 3,178,409,386
literals 5.3B
Re-use is fairly high… or not…
Analysing
Logical identity
Joe Raad Wouter Beek
ESWC2018, under submission
Identity clusters
LOD-a-lot File
http://lod-a-lot.lod.labs.vu.nl
[Fernández 2017]
558 million owl:sameAs statements (309 million distinct terms)
≈ 4 hours
1. Extracting all owl:sameAs statements on the LOD
HDT File
(4.5 GB)
Identity Closure 1
Identity Closure 2
Identity Closure 89 387 082…
- The largest Identity Closure contains 177 794 terms
(including all the countries in the world, Albert Einstein, « empty string », etc.)
- The smallest Identity Closure contains 2 terms
x owl:sameAs y
z owl:sameAs y
Identity Closure x y z
2. Generating the Identity Closure
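Grouping terms that are connected by owl:sameAs links amounts to computing connected components. The slides do not say which algorithm was used; a minimal sketch with a union-find structure, assuming the owl:sameAs statements have already been extracted as pairs of terms, could look like this (the function name `identity_closures` is hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def identity_closures(same_as: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group terms into identity closures (connected components of sameAs)."""
    parent: Dict[str, str] = {}

    def find(term: str) -> str:
        parent.setdefault(term, term)
        root = term
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every term on the path directly at the root.
        while parent[term] != root:
            parent[term], term = root, parent[term]
        return root

    for left, right in same_as:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            parent[root_l] = root_r  # merge the two closures

    groups: Dict[str, Set[str]] = defaultdict(set)
    for term in parent:
        groups[find(term)].add(term)
    return list(groups.values())


# The example from the slide, plus one unrelated pair.
closures = identity_closures([("x", "y"), ("z", "y"), ("a", "b")])
```

For the slide's example this yields the closure {x, y, z}, with {a, b} kept separate. Since every closure comes from at least one statement between two terms, the smallest possible closure has 2 terms unless a statement links a term to itself.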
Identity Closure « Cities »
3. Detecting Communities (using the Louvain Algorithm)
This network (i.e. the identity closure) has a community structure: it can be grouped into
sets of nodes, with each set being densely connected internally.
Goal: find (and later evaluate) the most “suspicious” identity links (i.e. the links
between different communities)
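The sketch below does not run the Louvain algorithm itself. It assumes a community assignment is already available (for example from an existing Louvain implementation) and shows two things: the modularity score that Louvain greedily optimises, and how the links between different communities are picked out as “suspicious”. The graph, node names and function names are made up for illustration; links are treated as undirected, without duplicates or self-loops.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def modularity(edges: List[Edge], community: Dict[str, int]) -> float:
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2."""
    m = len(edges)
    degree: Counter = Counter()
    internal: Counter = Counter()  # L_c: links with both ends in community c
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum: Counter = Counter()  # d_c: total degree of community c
    for node, d in degree.items():
        degree_sum[community[node]] += d
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def suspicious_links(edges: List[Edge], community: Dict[str, int]) -> List[Edge]:
    """Return the links whose two ends lie in different communities."""
    return [(u, v) for u, v in edges if community[u] != community[v]]


# Toy identity closure: two triangles joined by a single link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
bridge = suspicious_links(edges, split)
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, and the single link between the triangles is the one flagged for inspection.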
4. Application: debugging identity statements
Identity closure
containing the term
“dbpedia.org/page/Barack_Obama”
This Identity Closure contains 388 terms
(i.e. 387 distinct terms are owl:sameAs this term)
95 communities detected
largest community = 99 terms
4. Application: debugging identity statements
[Figure: community 0 and community 3, connected by 2 links]
Community 0
1. dbpedia.org/resource/B_hussein_obama
2. dbpedia.org/resource/Barack_H_Obama,_Jr
3. dbpedia.org/resource/Barak_hussein_obama
4. dbpedia.org/resource/President_Barack
5. dbpedia.org/resource/Senator_Barack_Obama
6. dbpedia.org/resource/Obama
…
99. dbpedia.org/resource/Hussein_Obama
Community 3
1. dbpedia.org/resource/Presidency_of_Barack_Obama
2. dbpedia.org/resource/Barack_Obama_Administration
3. dbpedia.org/resource/Barack_Obama_Cabinet
4. dbpedia.org/resource/Obama_White_House
5. dbpedia.org/resource/Obama_regime
6. dbpedia.org/resource/America_under_Obama
…
52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama
Symbols or words?
Steven de Rooij Peter Bloem Wouter Beek (ISWC 2016)
http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf
Symbols or words?
Symbol names are supposed to be meaningless
[Figure: graph with nodes Aspirin, headache, analgesic, pain, drug and symptom, and two “treats” edges]
Measure mutual information content
between the name (string) and the semantics of a symbol
$E(x)$ = length of an efficient encoding of $x$
Mutual information content:
$M(x, y) = E(x) + E(y) - E(x, y)$
Take $x$ = the name of a symbol, as a string
Take $y_1$ = {types of the symbol} ≈ semantics of the symbol
Take $y_2$ = {properties of the symbol} ≈ semantics of the symbol
Calculate $M(x, y_1)$ and $M(x, y_2)$ for all symbols
in 600k datasets
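The formula can be tried out with any general-purpose compressor standing in for the efficient encoding E. The sketch below uses zlib purely as such a stand-in; the slides do not specify which encoding was used, and the function names are hypothetical. Random bytes are used as input because zlib's fixed overhead dominates on very short strings.

```python
import random
import zlib


def encoded_length(data: bytes) -> int:
    """E(x): length in bytes of a zlib encoding of x (a stand-in encoder)."""
    return len(zlib.compress(data, 9))


def mutual_information(x: bytes, y: bytes) -> int:
    """M(x, y) = E(x) + E(y) - E(x, y), with E(x, y) encoding x followed by y."""
    return encoded_length(x) + encoded_length(y) - encoded_length(x + y)


rng = random.Random(0)  # fixed seed so the example is reproducible
x = bytes(rng.getrandbits(8) for _ in range(2000))
y = bytes(rng.getrandbits(8) for _ in range(2000))

shared = mutual_information(x, x)     # y repeats x: the joint encoding is short
unrelated = mutual_information(x, y)  # independent data: almost nothing is saved
```

When the second argument repeats the first, the joint encoding is much shorter than the two separate encodings, so `shared` is far larger than `unrelated`. A name that helps predict the types or properties of its symbol shows up in the same way.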
But variables do encode meaning!
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600,000 datapoints (RDF docs)
Very different
network structures
for different predicates
Tobias Kuhn Wouter Beek
http://ceur-ws.org/Vol-1946/paper-05.pdf
skos:exactMatch
foaf:knows
osspr:contains
Geopolitics:hasborderWith
Summary
&
So what…
• We now have larger KB’s than ever before
• We now have the instruments
to observe and analyse these very large KB’s
• We can use these insights for better tools:
– query & inference
– publish & maintain
– visualise & explain
– …
But my secret hope is that this will help us
to understand the patterns of knowledge:
AI as a computational theory of knowledge
