R for the semantic web, Quesada useR 2009

From relational databases to linked
data: R for the semantic web
Jose Quesada,
Max Planck Institute, Berlin

Who this talk targets
• You have big data; you use a database

• You have an evolving schema definition.
Sometimes at runtime

• You are interested in alternative ways to present
your data

• You would thrive by using data out there, if only
they were more accessible

Credit: Jim Hendler

THE TWO TOWERS

The Semantic web
• Ontology as Barad-dur (Sauron’s tower)
– Extremely powerful: decidable logic basis
– Patrolled by Orcs
• Let one little hobbit in it (an inconsistency), and the whole thing could come crashing down
– This is OWL

The semantic web
• The tower of Babel
– We will build a tower to reach the sky
– We only need a little ontological agreement
• Who cares if we all speak different languages?
– This is RDFS
– Statistics matter here
– Web-scale
– Lots of data; finding anything in the mess can be a win

Approaches to data representation

• Objects
• Tables (relational databases)
• Non-relational databases
• Tables (data.frame)
• Graphs

What one can do with semantic web data,
now:
People who died in Nazi Germany and, if available, any notable works they created:
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT *
WHERE {
  ?subject dbpprop:deathPlace <http://dbpedia.org/resource/Nazi_Germany> .
  OPTIONAL {
    ?subject dbpedia-owl:notableworks ?works
  }
}

subject                                              works
:Anne_Frank                                          :The_Diary_of_a_Young_Girl
:Martin_Bormann                                      -
:Ir%C3%A8ne_N%C3%A9mirovsky                          -
:Erich_Fellgiebel                                    -
:Friedrich_Ferdinand%2C_Duke_of_Schleswig-Holstein   -
:Friedrich_Olbricht                                  -
:Ludwig_Beck                                         -
:Erwin_Rommel                                        -
:Maurice_Bavaud                                      -
:Early_Years_of_Adolf_Hitler                         -
:Emil_Zegad%C5%82owicz                               -
:Friedrich_Fromm                                     -
:Helmuth_James_Graf_von_Moltke
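
A minimal sketch of how a query like this one could be sent from R using only base functions. The endpoint URL (https://dbpedia.org/sparql) and the `query` and `format` request parameters are assumptions about the public DBpedia service, not something stated in the talk; the network call itself is left commented out.

```r
# SPARQL query as a single string (prefixes spelled out explicitly).
query <- paste(
  "PREFIX dbpprop: <http://dbpedia.org/property/>",
  "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>",
  "SELECT * WHERE {",
  "  ?subject dbpprop:deathPlace <http://dbpedia.org/resource/Nazi_Germany> .",
  "  OPTIONAL { ?subject dbpedia-owl:notableworks ?works }",
  "}",
  sep = "\n"
)

# Assumed endpoint; percent-encode the query so it is safe inside a URL.
endpoint <- "https://dbpedia.org/sparql"
request_url <- paste0(
  endpoint,
  "?query=", URLencode(query, reserved = TRUE),
  "&format=", URLencode("text/csv", reserved = TRUE)
)

# Requires network access, so it is not run here:
# results <- read.csv(request_url, stringsAsFactors = FALSE)
```

If the endpoint answers with CSV, the result arrives as a data.frame with one column per query variable (here `subject` and `works`), which is the table view of what is really a graph query.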

• Scale to the entire web
• Do reasoning with the open world assumption
• Retrieval in real time
• Go beyond logics

• Use cases:
– Real-time city
– Cancer monographs for WHO
– Gene expression finding

RDF is a graph
• We have lots of interesting statistics that run on graphs

• In many Semantic Web (SW) domains a tremendous
amount of statements (expressed as triples) might be
true but, in a given domain, only a small number of
statements is known to be true or can be inferred to be
true. It thus makes sense to attempt to estimate the
truth values of statements by exploring regularities in
the SW data with machine learning
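
To make "RDF is a graph" concrete, here is a small sketch in base R: each triple becomes a directed edge from subject to object, and simple graph statistics follow from the adjacency matrix. The triples and column names are hypothetical illustrations, not data from the talk.

```r
# Hypothetical triples: one row per (subject, predicate, object) statement.
triples <- data.frame(
  subject   = c("Anne_Frank", "Anne_Frank", "Erwin_Rommel"),
  predicate = c("deathPlace", "notableWork", "deathPlace"),
  object    = c("Nazi_Germany", "The_Diary_of_a_Young_Girl", "Nazi_Germany"),
  stringsAsFactors = FALSE
)

# Nodes are all resources that appear as subject or object.
nodes <- unique(c(triples$subject, triples$object))

# Directed adjacency matrix: adj[s, o] = 1 if some triple links s to o.
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[cbind(triples$subject, triples$object)] <- 1

out_degree <- rowSums(adj)  # statements made about each resource
in_degree  <- colSums(adj)  # statements pointing at each resource
```

The predicate is dropped here for brevity; a fuller model would keep it as an edge label.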

Scale
• You cannot use the entire thing at once:
subsetting

• Are there patterns in knowledge structures
that we can use for subsetting?

Idea
• Graph theory applied to subsetting large graphs

• Developing Semantic Web applications requires
handling the RDF data model in a programming
language

• Problem: current software is developed in the
object-oriented paradigm, programming in RDF is
currently triple-based.
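
One possible way to subset a triple store with graph ideas, sketched in base R: keep only the triples whose endpoints lie within k hops of a seed resource. This neighbourhood rule, the function name `subset_k_hops`, and the example triples are all illustrative assumptions, not the method used in the talk.

```r
# Hypothetical triples (s = subject, p = predicate, o = object).
triples <- data.frame(
  s = c("Anne_Frank", "Anne_Frank", "Erwin_Rommel", "Berlin"),
  p = c("deathPlace", "notableWork", "deathPlace", "country"),
  o = c("Nazi_Germany", "The_Diary_of_a_Young_Girl", "Nazi_Germany", "Germany"),
  stringsAsFactors = FALSE
)

# Return the triples among nodes reachable from `seed` in at most `k` hops,
# ignoring edge direction.
subset_k_hops <- function(triples, seed, k = 1) {
  seen <- seed
  for (hop in seq_len(k)) {
    # Triples that touch any node found so far extend the neighbourhood.
    touching <- triples$s %in% seen | triples$o %in% seen
    seen <- union(seen, c(triples$s[touching], triples$o[touching]))
  }
  triples[triples$s %in% seen & triples$o %in% seen, , drop = FALSE]
}

sub1 <- subset_k_hops(triples, "Anne_Frank", k = 1)
sub2 <- subset_k_hops(triples, "Anne_Frank", k = 2)
```

With k = 1 only the two statements about the seed survive; with k = 2 the neighbourhood also picks up the other resource that shares the same death place.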

Data
IMDB is a big graph:
– 1.4 M movies
– 1.7 M actors
– 11 M connections
• Movies have votes
– Bipartite network

Packages
• igraph:
– Nice functions that you cannot find anywhere else
– Uses sparse matrices
– Implemented in C
– Some support for bipartite networks
• RMySQL, Matrix (sparse matrices)
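
A small sketch of how an actor-movie bipartite network can be held as a sparse matrix with the Matrix package. The edge list is a hypothetical toy example; in the setting described here the rows would presumably come from a database query instead.

```r
library(Matrix)  # sparse matrix classes

# Hypothetical edge list: which actor appears in which movie.
edges <- data.frame(
  actor = c("a1", "a1", "a2", "a3", "a3"),
  movie = c("m1", "m2", "m1", "m2", "m3"),
  stringsAsFactors = FALSE
)
actors <- sort(unique(edges$actor))
movies <- sort(unique(edges$movie))

# Sparse actor-by-movie incidence matrix of the bipartite network;
# only the five non-zero entries are stored.
B <- sparseMatrix(
  i = match(edges$actor, actors),
  j = match(edges$movie, movies),
  x = 1,
  dims = c(length(actors), length(movies)),
  dimnames = list(actors, movies)
)

movie_degree <- colSums(B)            # number of actors per movie
co_star <- as.matrix(tcrossprod(B))   # actor-by-actor count of shared movies
```

At IMDB scale a dense matrix of 1.7 M by 1.4 M entries would not fit in memory, which is why the sparse representation matters.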

PageRank
• The PageRank vector is the stationary distribution of a Markov chain on the link matrix
• Some assumptions are needed to warrant convergence
• The typical value of the damping factor d is 0.85
[Figure: example graph with four nodes, 1 to 4]
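norm <- function(x) x/sum(x)  # rescale so the entries sum to 1
# Assumes A is the row-normalised link matrix and nVertices its dimension
norm(Re(eigen(0.15/nVertices + 0.85 * t(A))$vectors[,1]))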
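
A worked sketch of the eigenvector formulation on a hypothetical four-node graph, checked against plain power iteration. The edges are made up for illustration, and every node has at least one outgoing link so that the row normalisation is well defined.

```r
# Hypothetical directed graph: adj[i, j] = 1 means an edge from i to j.
adj <- matrix(c(0, 1, 1, 0,
                0, 0, 1, 0,
                1, 0, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

A <- adj / rowSums(adj)      # row-normalise: each row sums to 1
n <- nrow(A)
d <- 0.85                    # damping factor

# Column-stochastic matrix of the damped random walk.
G <- (1 - d) / n + d * t(A)

norm <- function(x) x / sum(x)

# Leading eigenvector (eigenvalue 1), rescaled to a probability vector.
pr_eigen <- norm(Re(eigen(G)$vectors[, 1]))

# Power iteration: repeatedly apply G to a uniform starting vector.
pr <- rep(1 / n, n)
for (i in 1:200) pr <- as.vector(G %*% pr)
```

Both routes give the same vector; on large sparse graphs power iteration is the practical choice, since a full eigendecomposition of a dense n-by-n matrix is infeasible.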

Top movies by PageRank
in the actor->movie network

degree  pagerank               cluster  imdbID   title                                    rank   votes
1298    0.000243688252192870   0        822609   Around the World in Eighty Days (1956)   40031  6134
313     0.000103540862390464   0        76352    "Beyond Our Control" (1968)              0      0
291     0.0000916690099912811  0        993780   Gone to Earth (1950)                     7.0    291
285     0.0000890255923652847  0        915626   Deadlands 2: Trapped (2008)              39971  15
424     0.000083882328163772   0        1282574  Stuck on You (2003)                      6.0    19709
629     0.0000808241101098043  0        622100   "Shortland Street" (1992)                39850  225

Problems
• Graphs have advantages over RDBMS/tables[1], but we are used to thinking in tables
• There is no direct way to handle RDF in R. Worth an R package?

Linked data are out there, up for grabs

We need to start thinking in terms of graphs,
and slowly move away from tables

Thanks for your attention
Jose Quesada, quesada@workingcogs.com, http://josequesada.name
Twitter: @Quesada
