From relational databases to linked
   data: R for the semantic web
            Jose Quesada,
      Max Planck Institute, Berlin
Who this talk targets
• You have big data; you use a database

• You have an evolving schema definition.
  Sometimes at runtime

• You are interested in alternative ways to present
  your data

• You would thrive by using data out there, if only
  they were more accessible
Semantic web
R for the semantic web, Quesada useR 2009
Credit: Jim Hendler

THE TWO TOWERS
The Semantic web
        • Ontology as Barad-dur
          (Sauron’s tower)
          – Extremely powerful (decidable logic basis)
          – Patrolled by Orcs
             • Let one little hobbit (an inconsistency) in it,
               and the whole thing could
               come crashing down
          – OWL
Inconsistency
The semantic web
        • The tower of Babel
           – We will build a tower to
             reach the sky
           – We only need a little
             ontological agreement
              • Who cares if we all speak
                different languages?
        • This is RDFS
        • Statistics matter here
        • Web-scale
        • Lots of data; finding anything in the mess can be a win
Approaches to data representation

•   Objects
•   Tables (relational databases)
•   Non-relational databases
•   Tables (data.frame)
•   Graphs
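What one can do with semantic web data,
now:
 People who died in Nazi Germany and, if possible, any
 notable works that they created
# Namespace prefixes (predefined on the DBpedia endpoint, spelled out here)
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT *
WHERE {
  ?subject dbpprop:deathPlace <http://dbpedia.org/resource/Nazi_Germany> .
  # OPTIONAL keeps subjects that have no notable works on record
  OPTIONAL {
    ?subject dbpedia-owl:notableworks ?works
  }
}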
subject                                              works
:Anne_Frank                                          :The_Diary_of_a_Young_Girl
:Martin_Bormann                                      -
:Ir%C3%A8ne_N%C3%A9mirovsky                          -
:Erich_Fellgiebel                                    -
:Friedrich_Ferdinand%2C_Duke_of_Schleswig-Holstein   -
:Friedrich_Olbricht                                  -
:Ludwig_Beck                                         -
:Erwin_Rommel                                        -
:Maurice_Bavaud                                      -
:Early_Years_of_Adolf_Hitler                         -
:Emil_Zegad%C5%82owicz                               -
:Friedrich_Fromm                                     -
:Helmuth_James_Graf_von_Moltke
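
A query like the one above is usually sent to a SPARQL endpoint over HTTP. The following is a minimal sketch in base R that only builds the request URL and performs no network access. It assumes the public DBpedia endpoint at http://dbpedia.org/sparql, the usual DBpedia namespace prefixes, and http://dbpedia.org/resource/Nazi_Germany as the resource URI; the variable names are illustrative.

```r
# Sketch: build the HTTP GET URL for a SPARQL query (base R only, no network access).
# Assumption: the public DBpedia endpoint lives at http://dbpedia.org/sparql.
endpoint <- "http://dbpedia.org/sparql"

query <- paste(
  "PREFIX dbpprop: <http://dbpedia.org/property/>",
  "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>",
  "SELECT * WHERE {",
  "  ?subject dbpprop:deathPlace <http://dbpedia.org/resource/Nazi_Germany> .",
  "  OPTIONAL { ?subject dbpedia-owl:notableworks ?works }",
  "}",
  sep = "\n"
)

# Percent-encode the query so it can travel as the `query` URL parameter.
request_url <- paste0(endpoint, "?query=", utils::URLencode(query, reserved = TRUE))
```

Any HTTP client can then fetch `request_url`; how the results come back (XML, JSON, CSV) depends on the endpoint and is not shown here.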
• Scale to the entire web
• Do reasoning with the open-world assumption
• Retrieval in real-time
• Go beyond logics
• Use cases:
   – Real time city
   – Cancer monographs for WHO
   – Gene expression finding
RDF is a graph
• We have lots of interesting statistics that run on graphs

• In many Semantic Web (SW) domains a tremendous
  number of statements (expressed as triples) might be
  true but, in a given domain, only a small number of
  statements are known to be true or can be inferred to be
  true. It thus makes sense to attempt to estimate the
  truth values of statements by exploring regularities in
  the SW data with machine learning
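
To make "RDF is a graph" concrete, here is a small sketch in base R that treats each triple as a directed edge from subject to object and computes simple degree statistics. The triples are a toy set loosely based on the query result shown earlier; the data frame layout and variable names are assumptions made for illustration.

```r
# Toy triples (subject, predicate, object), chosen for illustration only.
triples <- data.frame(
  subject   = c("Anne_Frank", "Anne_Frank", "Erwin_Rommel", "Ludwig_Beck"),
  predicate = c("deathPlace", "notableworks", "deathPlace", "deathPlace"),
  object    = c("Nazi_Germany", "The_Diary_of_a_Young_Girl",
                "Nazi_Germany", "Nazi_Germany"),
  stringsAsFactors = FALSE
)

# Each triple is a directed, labeled edge: subject -> object.
nodes <- sort(unique(c(triples$subject, triples$object)))
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
A[cbind(triples$subject, triples$object)] <- 1  # index cells by node name

out_degree <- rowSums(A)  # number of statements with the node as subject
in_degree  <- colSums(A)  # number of statements with the node as object
```

A dense matrix is fine for a handful of nodes; at web scale a sparse representation would be needed.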
Scale
• You cannot use the entire thing at once:
  subsetting

• Are there patterns in knowledge structures
  that we can use for subsetting?
Idea
• Graph theory applied to subsetting large graphs

• Developing Semantic Web applications requires
  handling the RDF data model in a programming
  language

• Problem: current software is developed in the
  object-oriented paradigm, whereas programming with RDF is
  currently triple-based.
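
One simple way to subset a large graph is to keep only the neighborhood of a seed node. The sketch below does this in base R on a triple table; it is not the method used in the talk. `subset_triples` is a hypothetical helper, and the toy triples and the choice to follow links in both directions are assumptions.

```r
# Hypothetical helper: keep the triples whose subject and object both lie
# within k hops of a seed node (links are followed in both directions).
subset_triples <- function(triples, seed, k = 1) {
  reached <- seed
  for (hop in seq_len(k)) {
    touching <- triples$subject %in% reached | triples$object %in% reached
    reached <- unique(c(reached, triples$subject[touching], triples$object[touching]))
  }
  triples[triples$subject %in% reached & triples$object %in% reached, , drop = FALSE]
}

# Toy chain a -> b -> c -> d plus a disconnected edge e -> f.
toy <- data.frame(
  subject   = c("a", "b", "c", "e"),
  predicate = "linksTo",
  object    = c("b", "c", "d", "f"),
  stringsAsFactors = FALSE
)

one_hop <- subset_triples(toy, "a", k = 1)  # only a -> b
two_hop <- subset_triples(toy, "a", k = 2)  # a -> b and b -> c
```

The disconnected edge never enters the subset, which is the point: work proceeds on the part of the graph that is relevant to the seed.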
Data
   IMDB is a big graph:
   – 1.4 M movies
   – 1.7 M actors
   – 11 M connections
        • Movies have votes
   – Bipartite network

Packages: igraph:
   –   Nice functions that you cannot find anywhere else
   –   Uses sparse matrices
   –   Implemented in C
   –   Some support for bipartite networks

RMySQL, Matrix (sparse matrices)
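
A bipartite actor-movie network fits naturally into a sparse incidence matrix. The following sketch uses the Matrix package with a made-up edge list standing in for rows pulled from the database; the ids and variable names are assumptions, and this is not the talk's actual pipeline.

```r
library(Matrix)  # sparse matrix classes

# Toy actor-movie edge list (made-up ids).
edges <- data.frame(actor = c(1, 1, 2, 3, 3), movie = c(1, 2, 2, 2, 3))

# Incidence matrix B: rows are actors, columns are movies,
# with a 1 wherever the actor appears in the movie.
B <- sparseMatrix(i = edges$actor, j = edges$movie, x = 1, dims = c(3, 3))

movies_per_actor <- rowSums(B)
actors_per_movie <- colSums(B)

# One-mode projection: entry [a, b] counts the movies actors a and b share;
# the diagonal holds each actor's own movie count.
co_appearances <- tcrossprod(B)
```

Only the nonzero cells are stored, which is what makes a matrix with millions of rows and columns feasible.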
Centrality
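PageRank
  • The PageRank vector is the stationary distribution of a
    Markov chain on the link matrix
  • Some assumptions are needed to warrant convergence
  • The typical value of the damping factor d is 0.85
  [Figure: example graph with four nodes, labeled 1 to 4]
# Assumes A is the row-normalized link matrix and nVertices <- nrow(A)
norm <- function(x) x / sum(x)  # rescale so the entries sum to 1
norm(Re(eigen(0.15 / nVertices + 0.85 * t(A))$vectors[, 1]))  # Re() drops zero imaginary parts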
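
The eigenvector formulation can be checked on a small graph. The sketch below, in base R, computes PageRank two ways (dominant eigenvector and power iteration) and the results should agree. The four-node edge list is made up for illustration, and the code assumes every node has at least one out-link.

```r
# Toy directed graph: adj[i, j] = 1 means node i links to node j (made-up edges).
adj <- matrix(0, 4, 4)
adj[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 1), c(4, 3))] <- 1

# Row-normalize so each row is a probability distribution over out-links.
# Assumes no dangling nodes (every row sum is positive).
A <- adj / rowSums(adj)
n <- nrow(A)
d <- 0.85  # damping factor

# Transposed transition matrix: teleport to a random node with probability
# 1 - d, follow an out-link with probability d.
G_t <- (1 - d) / n + d * t(A)

# Route 1: dominant eigenvector, rescaled to sum to 1.
v <- Re(eigen(G_t)$vectors[, 1])
pr_eigen <- v / sum(v)

# Route 2: power iteration starting from the uniform distribution.
pr_power <- rep(1 / n, n)
for (iter in 1:200) pr_power <- as.vector(G_t %*% pr_power)
```

For graphs with millions of nodes the dense `eigen` call is not practical; power iteration on a sparse matrix scales much better.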
Top movies by PageRank in the actor->movie network

degree  pagerank               cluster  imdbID   title                                   rank   votes
1298    0.000243688252192870   0        822609   Around the World in Eighty Days (1956)  40031  6134
313     0.000103540862390464   0        76352    "Beyond Our Control" (1968)             0      0
291     0.0000916690099912811  0        993780   Gone to Earth (1950)                    7.0    291
285     0.0000890255923652847  0        915626   Deadlands 2: Trapped (2008)             39971  15
424     0.000083882328163772   0        1282574  Stuck on You (2003)                     6.0    19709
629     0.0000808241101098043  0        622100   "Shortland Street" (1992)               39850  225
Problems
• Graphs have advantages over
  RDBMS/tables[1]. But we are used to thinking in
  tables
• There is no direct way to handle RDF in R.
  Worth an R package?
Linked data are out there, up for grabs

We need to start thinking in terms of graphs,
and slowly move away from tables




    Thanks for your attention
    Jose Quesada, quesada@workingcogs.com, http://josequesada.name
    Twitter: @Quesada
