The Empirical Turn in Knowledge Representation

Creative Commons CC BY 3.0: allowed to share & remix (also commercial), but must attribute
Frank van Harmelen
The empirical turn in Knowledge Representation
Contributions from many people in the KR&R group over many years.
And thanks to NWO for a 750k€ TOP grant for this.

Handbook of Knowledge Representation
(1000 pages, ToC alone is 14 pages)
• propositional logic & satisfiability solvers
• first order logic & resolution
• description logic
• constraint (logic) programming
• nonmonotonic reasoning
• belief revision
• qualitative reasoning
• model-based diagnosis
• Bayesian networks
• temporal logic
• spatial reasoning
• epistemic logic
• deontic logic
• situation calculus
• default logic
• event calculus
• ……

KR metrics in the pre-empirical era
KR = logic
• Show small examples
• Prove properties (expressivity, complexity)
• Give algorithms (sound, complete)
KR = engineering
• Build applications
• Show high performance
• Show low engineering costs

But an experiment in the past 10 years made it possible to do something very different:
observe how knowledge representations behave at very large scale

Rest of the talk
• Which KRs were part of the experiment?
• How much of it was there to observe?
• How did we manage to observe it?
• What did we learn from observing it?

RDF (for logicians)
• Ground binary predicate: $P(O_1, O_2)$
• Limited existential variables: $\exists x: P(C_1, x) \wedge P(C_2, x)$
• Type is unary predicate: $T_i(x)$
• Subtypes: $\forall x: T_1(x) \rightarrow T_2(x)$
• Type restrictions: $\forall x, y: P(x, y) \rightarrow T_1(x) \wedge T_2(y)$
• Equality: $O_1 = O_2$
• Extensions to DL:
– Disjointness of types
– Cardinality restrictions (0,1)
– always decidable: sub-FOL.
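
A minimal sketch of how the subtype and type-restriction rules above act on ground facts. The in-memory encoding (tuples and dicts), the function name `infer_types` and the example terms are assumptions made for illustration; they are not RDF vocabulary or code from the talk.

```python
# Ground facts P(O1, O2), stored as (subject, predicate, object) tuples.
facts = {("aspirin", "treats", "headache")}

# Subtype axioms: forall x: T1(x) -> T2(x), stored as (T1, T2) pairs.
subtypes = {("Analgesic", "Drug")}

# Type restrictions: forall x, y: P(x, y) -> T1(x) and T2(y).
restrictions = {"treats": ("Analgesic", "Symptom")}


def infer_types(facts, subtypes, restrictions):
    """Return the set of (term, type) pairs derivable from the rules."""
    types = set()
    # Apply the type restrictions to every ground fact.
    for s, p, o in facts:
        if p in restrictions:
            subject_type, object_type = restrictions[p]
            types.add((s, subject_type))
            types.add((o, object_type))
    # Apply the subtype axioms until nothing new is derived (fixpoint).
    changed = True
    while changed:
        changed = False
        for term, t in list(types):
            for t1, t2 in subtypes:
                if t == t1 and (term, t2) not in types:
                    types.add((term, t2))
                    changed = True
    return types


derived = infer_types(facts, subtypes, restrictions)
```

Here `derived` holds the types that follow from the single fact: the subject and object types from the restriction on `treats`, plus the supertype obtained through the subtype axiom.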

Denny Vrandečić – AIFB, Universität Karlsruhe ≈ 1 fact per web-page
100 billion golfballs ≈ Jupiter

[Figure: triple x → T, i.e. [<x> IsOfType <T>], e.g. < analgesic >; different owners & locations]
BTW: How did it get so big?
On the Web,
anybody can say anything about anything


How did you manage to observe it?

LOD Laundromat
Beek & Rietveld et al., 2014, LOD laundromat: a uniform way of publishing other people's dirty data
http://lodlaundromat.org/pdf/lodlaundry.pdf
HDT
Fernández & Martínez-Prieto & Gutiérrez, 2013, Binary RDF representation for publication and exchange (HDT)
LDF
Verborgh & Vander Sande et al., 2014, Web-Scale Querying through Linked Data Fragments

LOD-a-lot
http://lod-a-lot.lod.labs.vu.nl/

Surprisingly efficient
1 file
28,362,198,927 unique triples
>650K data documents
524 GB of disk space
16 GB of RAM
Only €305,- hardware cost
Meta-Data for a lot of LOD
http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
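
A quick worked calculation on the "Surprisingly efficient" figures above. The numbers are taken from the slide; reading "GB" as 10^9 bytes is an assumption, and the variable names are only illustrative.

```python
# Figures taken from the slide.
TRIPLES = 28_362_198_927
DISK_BYTES = 524 * 10**9   # assumption: "GB" means 10^9 bytes
HARDWARE_EUR = 305

# Average storage cost of one triple in the single file.
bytes_per_triple = DISK_BYTES / TRIPLES

# Hardware cost normalised to one billion triples.
eur_per_billion_triples = HARDWARE_EUR / (TRIPLES / 10**9)

print(f"{bytes_per_triple:.1f} bytes per triple")
print(f"{eur_per_billion_triples:.2f} EUR of hardware per billion triples")
```

Under these assumptions the file needs somewhere between 18 and 19 bytes per triple on average.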

Statistics (boring)
triples: 28,362,198,927
subjects: 3,214,347,198
predicates: 1,168,932
objects: 3,178,409,386
literals: 5.3B
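
To make explicit what each row counts, here is a small sketch over a toy list of triples. The function name `term_statistics` and the `ex:` terms are hypothetical, and an in-memory set like this is for illustration only; it is not how the figures above were obtained.

```python
def term_statistics(triples):
    """Count unique triples and distinct subjects, predicates and objects."""
    unique = set(triples)  # duplicate triples are counted once
    return {
        "triples": len(unique),
        "subjects": len({s for s, _, _ in unique}),
        "predicates": len({p for _, p, _ in unique}),
        "objects": len({o for _, _, o in unique}),
    }


sample = [
    ("ex:aspirin", "rdf:type", "ex:Analgesic"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:ibuprofen", "rdf:type", "ex:Analgesic"),
    ("ex:aspirin", "rdf:type", "ex:Analgesic"),  # duplicate of the first triple
]
stats = term_statistics(sample)
```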

Re-use is fairly high… or not…

Analysing logical identity
Joe Raad, Wouter Beek
ESWC2018, under submission

Identity clusters
1. Extracting all owl:sameAs statements on the LOD
LOD-a-lot file (http://lod-a-lot.lod.labs.vu.nl) [Fernández 2017]
→ 558 million owl:sameAs statements (309 million distinct terms), ≈ 4 hours
→ HDT file (4.5 GB)

2. Generating the Identity Closure
HDT file (4.5 GB) → Identity Closure 1, Identity Closure 2, … Identity Closure 89 387 082…
Example: x owl:sameAs y and z owl:sameAs y give the Identity Closure {x, y, z}
- The largest Identity Closure contains 177 794 terms
(contains all the countries in the world, Albert Einstein, « empty string », etc.)
- The smallest Identity Closure contains 2 terms
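
The closure step (x owl:sameAs y and z owl:sameAs y ending up in one closure with x, y and z) can be sketched with a union-find structure. Union-find is a standard technique swapped in here for illustration; the slides do not say which algorithm was actually used, and the function name `identity_closures` is hypothetical.

```python
from collections import defaultdict


def identity_closures(same_as_pairs):
    """Group terms into equivalence classes under owl:sameAs."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        root = term
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited term directly at the root.
        while parent[term] != root:
            parent[term], term = root, parent[term]
        return root

    # Merge the classes of the two terms in every owl:sameAs statement.
    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    # Collect the members of each class under its root.
    closures = defaultdict(set)
    for term in parent:
        closures[find(term)].add(term)
    return list(closures.values())


closures = identity_closures([("x", "y"), ("z", "y"), ("a", "b")])
```

With this toy input, x, y and z land in one closure and a and b in another; every closure has at least 2 terms because terms only enter through a statement.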

3. Detecting Communities (using the Louvain Algorithm)
Identity Closure « Cities »
This network (i.e. identity closure) has a community structure: it can be grouped into different sets of nodes, with each set of nodes being densely connected internally.
Goal: find (and later evaluate) the most “suspicious” identity links (i.e. the links between different communities)
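
A minimal sketch of the goal stated above: once every term has a community label, the "suspicious" links are those whose endpoints carry different labels. The sketch does not implement Louvain itself; it assumes the labels come from some Louvain implementation, and the names `suspicious_links`, `links` and `community_of` as well as the toy terms are hypothetical.

```python
def suspicious_links(links, community_of):
    """Return the identity links whose endpoints lie in different communities."""
    return [(a, b) for a, b in links if community_of[a] != community_of[b]]


# Toy identity closure: two densely connected groups joined by one link.
links = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"),  # inside community 0
    ("q1", "q2"),                              # inside community 1
    ("p1", "q1"),                              # crosses communities
]
# Assumed output of a community detection step (term -> community label).
community_of = {"p1": 0, "p2": 0, "p3": 0, "q1": 1, "q2": 1}

flagged = suspicious_links(links, community_of)
```

Only the link between the two groups is flagged, which makes it the candidate to evaluate by hand.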

4. Application: debugging identity statements
Identity closure containing the term “dbpedia.org/page/Barack_Obama”
This Identity Closure contains 388 terms
(i.e. 387 distinct terms are owl:sameAs this term)
95 communities detected
largest community = 99 terms

4. Application: debugging identity statements
[Figure: community 0 and community 3, connected by 2 links]
Community 0
1. dbpedia.org/resource/B_hussein_obama
2. dbpedia.org/resource/Barack_H_Obama,_Jr
3. dbpedia.org/resource/Barak_hussein_obama
4. dbpedia.org/resource/President_Barack
5. dbpedia.org/resource/Senator_Barack_Obama
6. dbpedia.org/resource/Obama
…
99. dbpedia.org/resource/Hussein_Obama
Community 3
1. dbpedia.org/resource/Presidency_of_Barack_Obama
2. dbpedia.org/resource/Barack_Obama_Administration
3. dbpedia.org/resource/Barack_Obama_Cabinet
4. dbpedia.org/resource/Obama_White_House
5. dbpedia.org/resource/Obama_regime
6. dbpedia.org/resource/America_under_Obama
…
52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama

Symbols or words?
Steven de Rooij, Peter Bloem, Wouter Beek (ISWC 2016)
http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf

Symbols or words?
Symbol names are supposed to be meaningless
[Figure: graph with nodes Aspirin, headache, analgesic, pain, drug and symptom, and two "treats" edges]

Measure mutual information content between string and semantics of a symbol
$E(x)$ = length of an efficient encoding of $x$
Mutual information content:
$M(x, y) = E(x) + E(y) - E(x, y)$
Take $x$ = the name of a symbol, as a string
Take $y_1$ = {types of x} ≈ semantics of x
Take $y_2$ = {properties of x} ≈ semantics of x
Calculate $M(x, y_1)$ and $M(x, y_2)$ for all symbols in 600k datasets
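
A rough sketch of the formula M(x,y) = E(x) + E(y) – E(x,y), using zlib compression as a stand-in for the efficient encoding E. The choice of zlib, taking E(x,y) on the concatenation of x and y, and the helper names are all assumptions made for illustration; the slides do not specify the encoder that was used.

```python
import random
import zlib


def encoded_length(data: bytes) -> int:
    """E(x): length in bytes of a zlib encoding of x (stand-in encoder)."""
    return len(zlib.compress(data, 9))


def mutual_information(x: bytes, y: bytes) -> int:
    """M(x, y) = E(x) + E(y) - E(x, y), with E(x, y) taken on x followed by y."""
    return encoded_length(x) + encoded_length(y) - encoded_length(x + y)


def random_text(seed: int, length: int = 2000) -> bytes:
    """Deterministic pseudo-random lowercase text, used only as toy input."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"abcdefghijklmnopqrstuvwxyz") for _ in range(length))


x = random_text(0)
unrelated = random_text(1)

shared = mutual_information(x, x)               # y fully determined by x
independent = mutual_information(x, unrelated)  # y shares nothing with x
```

The estimate is large when one input helps to encode the other and small when the two are unrelated, which is the behaviour the measure relies on when x is a symbol name and y its types or properties.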

But variables do encode meaning!
Fraction of datasets with redundancy for types/predicates at significance level > 0.99
BTW, this is 600,000 data points (RDF docs)

Very different network structures for different predicates
Tobias Kuhn, Wouter Beek
http://ceur-ws.org/Vol-1946/paper-05.pdf

• We now have larger KBs than ever before
• We now have the instruments to observe and analyse these very large KBs
• We can use these insights for better tools:
– query & inference
– publish & maintain
– visualise & explain
– …

But my secret hope is that this will help us
to understand the patterns of knowledge:
AI as a computational theory of knowledge
