How to Juggle with more
than a Billion Triples?

Ansgar Scherp
Research Group on Data and
Web Science

Universität Mannheim
October 2012
Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/
Ansgar Scherp – ansgar@informatik.uni-mannheim.de                                                    Slide 1
My thanks go to …
•    Marianna                                       •   Daniel Eißing
•    Simon Schenk                                   •   Mathias Konrath
•    Carsten Saathoff                               •   Daniel Schmeiß
•    Thomas Franz                                   •   Anton Baumesberger
•    Thomas Gottron                                 •   Frederik Jochum
•    Steffen Staab                                  •   Alexander Kleinen
•    Arne Peters
•    Bastian Krayer                                      And many more …


Scenario

• Tim plans to travel
  – from London
  – to a customer in Cologne




Website of the German Railway




It works, why bother…?

Let's Try Different Queries

• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
• …

→ None of these queries can be answered,
  because the data …

… locked in Silos!


 – High Integration Effort
 – Lack of Data Reuse

Image: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY

Linked Data
• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

          World Wide Web                                Linked Data
        Documents                                   Data
        Hyperlinks                                  Typed Links
        HTML                                        RDF
        Addresses (URIs)                            Addresses (URIs)

Example: http://www.uni-mannheim.de/

Relevance of Linked Data?




Linked Data: May '07 to Sept. '11

[Figure: the Linked Open Data cloud in May 2007 and in September 2011. Domains shown: Media, Web 2.0, Publications, eGovernment, Cross-Domain, Geographic, Life Sciences. Size: < 31 billion triples. Source: http://lod-cloud.net]

Linked Data Principles


1.        Identification
2.        Interlinkage
3.        Dereferencing
4.        Description




Example: Big Lynx

[Figure: Matt Briggs, Scott Miller, and the Big Lynx company; the connection between them is marked with "?". Source: http://lod-cloud.net]

1. Use URIs for Identification

Matt Briggs:  http://biglynx.co.uk/people/matt-briggs
Scott Miller: http://biglynx.co.uk/people/scott-miller

Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

Example: Big Lynx

[Figure: Matt Briggs, Scott Miller, and the Big Lynx company]

→ How to model relationships like "knows"?

Resource Description Framework (RDF)
• Description of resources with RDF triples

      Matt Briggs      is a        Person
      Subject          Predicate   Object

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/matt-briggs>
    rdf:type foaf:Person .

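The subject/predicate/object structure can be sketched in a few lines of Python. This is only an illustration: the `Triple` type and the `to_ntriples` helper are hypothetical names, and the serializer assumes that all three terms are URIs (no literals or blank nodes).

```python
from typing import NamedTuple

# Full URIs behind the rdf: and foaf: prefixes used on the slide.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type for this sketch)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# "Matt Briggs is a Person"
matt_is_person = Triple(
    "http://biglynx.co.uk/people/matt-briggs", RDF_TYPE, FOAF_PERSON
)
print(to_ntriples(matt_is_person))
```

The printed line carries the same statement as the Turtle snippet, with the prefixes expanded.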
1. Use URIs also for Relations

[Figure: the relation between http://biglynx.co.uk/people/matt-briggs and http://biglynx.co.uk/people/scott-miller]

Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

Example: Big Lynx

[Figure: Dave Smith, Matt Briggs, and Scott Miller of the Big Lynx company. A "lives here" link points to London in DBpedia; a "same person" link connects Matt Briggs to the Matt Briggs described on Matt's private website]

2. Establishing Interlinkage
• Relation links between resources
       <http://biglynx.co.uk/people/dave-smith>
           foaf:based_near
           <http://dbpedia.org/resource/London> .

• Identity links between resources
    <http://biglynx.co.uk/people/matt-briggs>
        owl:sameAs
         <http://www.matt-briggs.eg.uk#me> .

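One way a consumer might use identity links is to group all URIs that denote the same thing. The following is a minimal sketch, not something taken from the talk: `identity_groups` is a hypothetical helper, triples are plain tuples with prefixed predicate names, and owl:sameAs is treated as symmetric and transitive via a small union-find structure.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def identity_groups(triples):
    """Group URIs that are connected by owl:sameAs links."""
    parent = {}

    def find(x):
        # Find the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)  # merge the two groups

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return [g for g in groups.values() if len(g) > 1]


triples = [
    ("http://biglynx.co.uk/people/dave-smith", "foaf:based_near",
     "http://dbpedia.org/resource/London"),
    ("http://biglynx.co.uk/people/matt-briggs", SAME_AS,
     "http://www.matt-briggs.eg.uk#me"),
]
print(identity_groups(triples))
```

Relation links such as foaf:based_near are ignored here; only identity links merge URIs.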
Example: Big Lynx

[Figure: the same graph with properties attached. "lives here" is expressed as foaf:based_near (Dave Smith to London in DBpedia); "same person" is expressed as owl:sameAs (Matt Briggs at Big Lynx to the Matt Briggs on Matt's private website)]

3. Dereferencing of URIs

• Looking up web documents

• How can we "look up" things of the real world?

      http://biglynx.co.uk/people/matt-briggs

Two Approaches
1. Hash URIs
   – URI contains a part separated by #, e.g.,
    http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via a "303 See Other" response
      Request: http://biglynx.co.uk/people/matt-briggs
      Response: "Look here:"
      http://biglynx.co.uk/people/matt-briggs.rdf

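Both approaches can be mimicked without network access. The sketch below is a toy model under stated assumptions: `REDIRECTS_303` is a hypothetical table standing in for a server's redirect configuration, and `resolve` is an illustrative name rather than part of any library. Only `urllib.parse.urldefrag` is a real standard-library call.

```python
from typing import Tuple
from urllib.parse import urldefrag


def document_for_hash_uri(uri: str) -> str:
    """A client strips the fragment (the part after '#') before the HTTP request."""
    return urldefrag(uri).url


# Hypothetical redirect table standing in for a server's 303 configuration.
REDIRECTS_303 = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def resolve(uri: str) -> Tuple[int, str]:
    """Return (HTTP status, document URL) as a toy server might answer."""
    if "#" in uri:
        return 200, document_for_hash_uri(uri)
    if uri in REDIRECTS_303:
        return 303, REDIRECTS_303[uri]
    return 404, uri


print(resolve("http://biglynx.co.uk/vocab/sme#Team"))
print(resolve("http://biglynx.co.uk/people/matt-briggs"))
```

With a hash URI the document is fetched directly and the fragment is interpreted by the client; with a 303 response the client needs a second request to the URL it was pointed to.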
Example: Big Lynx

[Figure: the same graph with foaf:based_near and owl:sameAs links. Question raised: what is the description of Matt?]

4. Description of URIs

[Figure: RDF graph around biglynx:matt-briggs. Nodes: foaf:Person, dp:Birmingham, biglynx:dave-smith, dp:London, the blank node _:point, and the literals "51.509" and "-0.118". Edge labels: rdf:type, foaf:based_near, foaf:knows, ex:loc, wgs84:lat, wgs84:long. Further parts of the graph are elided with "…"]

Formalization of Description
Given an RDF graph $G = (V, P, E)$ with
$V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$:

\[
\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I_j \quad \text{with}
\]
\[
I_0 = \{\, (s, p, o) \mid (s, p, o) \in E \wedge s = n \,\}
\]
\[
I_{j+1} = \{\, (o, p', o') \in E \mid \exists\, (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \,\}
\]

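The definition translates almost directly into an iterative procedure. The following sketch makes two assumptions that are not on the slide: triples are plain Python tuples, and blank nodes (the set B) are recognized by the usual `_:` prefix. The example graph uses made-up `ex:` identifiers.

```python
def is_blank(node: str) -> bool:
    """Assumption: blank nodes are written with the '_:' prefix."""
    return node.startswith("_:")


def simple_cbd(edges, n):
    """Triples with subject n, plus triples reachable through blank-node objects."""
    result = set()
    frontier = {t for t in edges if t[0] == n}  # I_0
    while frontier:
        result |= frontier  # result now equals the union I_0 ... I_j
        blank_objects = {o for (_, _, o) in frontier if is_blank(o)}
        # I_{j+1}: triples starting at a blank object of I_j, not collected yet
        frontier = {
            t for t in edges if t[0] in blank_objects and t not in result
        }
    return result


edges = {
    ("ex:a", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:b", "foaf:based_near", "ex:London"),
}
description = simple_cbd(edges, "ex:a")
```

The triple about `ex:b` is not part of the description of `ex:a`, because `ex:b` is a URI and can be dereferenced on its own; the blank node `_:point` cannot, so its triples are pulled in.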
W3C RDF / RDF Schema Vocabulary
•    Set of URIs defined in rdf:/rdfs: namespace
•    rdf:type               • rdfs:domain
•    rdf:Property           • rdfs:range
•    rdf:XMLLiteral         • rdfs:Resource
•    rdf:List               • rdfs:Literal
•    rdf:first              • rdfs:Datatype
•    rdf:rest               • rdfs:Class
•    rdf:Seq                • rdfs:subClassOf
•    rdf:Bag                • rdfs:subPropertyOf
•    rdf:Alt                • rdfs:comment
•    ...                    • …
•    rdf:value              • rdfs:label
Semantic Web Layer Cake (Simplified)




Exploration of Linked Data

[Figure: LOD cloud with WordNet, Swoogle, and GeoNames highlighted; < 31 billion triples. Source: http://lod-cloud.net]

Naive Approach
• Download all data
• Store in really big database
• Programming of queries
• Design of user interface

[Figure: one database holding WordNet, Swoogle, and GeoNames together with RDFS, rules, and geo components]

→ Inflexible, monolithic, not scalable

SemaPlorer Approach
• Flexible
• Extensible
• Scalable

[Figure: combination of RDFS, rules, fulltext, and geo queries over WordNet, Swoogle, and GeoNames; the property names "birthplace" and "placeOfBirth" appear as examples]

> 1 billion triples
12 months in 2005/06, 700 million triples

SemaPlorer – Semantic Social Media




Watch video online: http://vimeo.com/2057249

Billion Triple Challenge 2008




                                                    [JWS 2009]
Searching for Linked Data Sources

Persons that are
- Politicians and
- Actors
?

[Figure: LOD cloud with a question mark: which data sources hold such persons? < 31 billion triples. Source: http://lod-cloud.net]

Idea: Index of Data Sources
SELECT ?x
FROM …
WHERE {
 ?x rdf:type ex:Actor .
 ?x rdf:type ex:Politician .
}


[Figure: the query "Politician and Actor" is sent to an index, which points to the matching data sources]

The Naive Approach
1.     Download the entire LOD cloud
2.     Put it into a (really) large triple store
3.     Process the data and extract schema
4.     Provide lookup

- Big machinery
- Late in processing the data
- High effort to scale with LOD cloud



Idea
• Schema-level index
   – Define families of graph patterns
   – Assign instances to graph patterns
   – Map graph patterns to context (source URI)
• Construction
   – Stream-based for scalability
   – Little loss of accuracy
• Note
   – Index defined over instances
   – But stores the context

Input Data
• n-Quads
         <subject> <predicate> <object> <context>
• Example:
            <http://www.w3.org/People/Connolly/#me>
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#
            <http://xmlns.com/foaf/0.1/Person>
            <http://dig.csail.mit.edu/2008/webdav/timbl/

[Figure: w3p:#me linked to foaf:Person, within the context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf]

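Reading such input can be sketched as follows. This is a deliberately simplified parser, not the NxParser used in the talk: the `Quad` type and `parse_quad` function are hypothetical names, and only lines whose four terms are all IRIs in angle brackets are accepted (literals and blank nodes are out of scope). The example line uses placeholder `example.org` IRIs.

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    """One n-quad: a triple plus the context (source) it was found in."""
    subject: str
    predicate: str
    obj: str
    context: str


_IRI = re.compile(r"<([^<>]*)>")


def parse_quad(line: str) -> Quad:
    """Parse one n-quad line whose four terms are all IRIs.

    Literals and blank nodes are deliberately not handled in this sketch.
    """
    terms = _IRI.findall(line)
    if len(terms) != 4:
        raise ValueError(f"expected 4 IRIs, got {len(terms)}: {line!r}")
    return Quad(*terms)


line = ("<http://example.org/s> <http://example.org/p> "
        "<http://example.org/o> <http://example.org/doc> .")
quad = parse_quad(line)
print(quad.context)
```

The fourth term is what a schema-level index stores: the source in which the statement was found.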
SchemEX Approach
• Stream-based schema extraction
• While crawling the data

[Figure: processing pipeline. Inputs: LOD crawler, RDF dump, triple store. Stages: n-quad stream, parser (NxParser), FIFO instance cache, schema extractor, schema-level index (RDF, RDBMS)]

Building the Index from a Stream
• Stream of n-quads (coming from a LD crawler)
      … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: FIFO cache with slots 1 to 4 holding instances assigned to the classes C1, C2, C2, C3]

• Linear runtime complexity wrt # of input triples

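A bounded FIFO cache over the quad stream could look roughly like this. It is a sketch under assumptions, not the SchemEX implementation: the class name, the `on_evict` callback, and the per-instance record layout are all hypothetical. Each quad is handled once with constant-time cache operations, which is consistent with the linear runtime stated above.

```python
from collections import OrderedDict

RDF_TYPE = "rdf:type"


class FifoInstanceCache:
    """Bounded cache: instance URI -> types, properties, and contexts seen so far."""

    def __init__(self, capacity, on_evict):
        self.capacity = capacity
        self.on_evict = on_evict  # hypothetical callback that updates the index
        self._items = OrderedDict()

    def add(self, subject, predicate, obj, context):
        if subject not in self._items:
            if len(self._items) >= self.capacity:
                # Evict the instance that entered the cache first.
                old_subject, info = self._items.popitem(last=False)
                self.on_evict(old_subject, info)
            self._items[subject] = {
                "types": set(), "properties": set(), "contexts": set()
            }
        info = self._items[subject]
        if predicate == RDF_TYPE:
            info["types"].add(obj)
        else:
            info["properties"].add(predicate)
        info["contexts"].add(context)

    def flush(self):
        """Evict everything that is left once the stream ends."""
        while self._items:
            self.on_evict(*self._items.popitem(last=False))


evicted = []
cache = FifoInstanceCache(2, lambda s, info: evicted.append((s, info)))
cache.add("ex:a", RDF_TYPE, "foaf:Person", "ex:doc1")
cache.add("ex:b", RDF_TYPE, "foaf:Person", "ex:doc1")
cache.add("ex:a", RDF_TYPE, "pim:Male", "ex:doc2")
cache.add("ex:c", "foaf:knows", "ex:a", "ex:doc3")  # cache full: evicts ex:a
cache.flush()
```

If further quads about an instance arrive after it has been evicted, the instance is seen only partially. That is one plausible reading of the "little loss of accuracy" mentioned in the talk, and it would explain why larger caches give better results.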
Building the Schema and Index

RDF classes:          C1, C2, C3, …, Ck
    (consistsOf)
Type clusters:        TC1, TC2, …, TCm
    (hasEQClass; properties p1, p2)
Equivalence classes:  EQC1, EQC2, …, EQCn
    (hasDataSource)
Data sources:         DS1, DS2, DS3, DS4, DS5, …, DSx

Layer 1: RDF Classes
• All instances of a particular type

 SELECT ?x
 FROM …
 WHERE {
    ?x rdf:type foaf:Person .
 }

[Figure: class C1 (foaf:Person) maps to the data sources DS 1, DS 2, DS 3. Example: timbl:card#i is a foaf:Person; sources shown are http://www.w3.org/People/Berners-Lee/card and http://dig.csail.mit.edu/2008/...]

Layer 2: Type Clusters
• All instances belonging to exactly the same set of types

 SELECT ?x
 FROM …
 WHERE {
    ?x rdf:type foaf:Person .
    ?x rdf:type pim:Male .
 }

[Figure: classes C1 and C2 form type cluster TC1, which maps to the data sources DS 1, DS 2, DS 3. Example: timbl:card#i has the types foaf:Person and pim:Male and falls into type cluster tc4711; source shown is http://www.w3.org/People/Berners-Lee/card]

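Type clusters can be sketched as a mapping from a set of types to the contexts in which matching instances occur. The function below is an illustration only: `type_clusters` is a hypothetical name, quads are plain tuples with prefixed names, and only the contexts of the type statements themselves are recorded.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def type_clusters(quads):
    """Map each exact set of types to the contexts that contain such instances."""
    types = defaultdict(set)     # instance -> its types
    contexts = defaultdict(set)  # instance -> contexts of its type statements
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            types[s].add(o)
            contexts[s].add(c)

    index = defaultdict(set)
    for instance, type_set in types.items():
        # frozenset makes the type set hashable, so it can serve as the key
        index[frozenset(type_set)] |= contexts[instance]
    return dict(index)


CARD = "http://www.w3.org/People/Berners-Lee/card"
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", CARD),
    ("timbl:card#i", RDF_TYPE, "pim:Male", CARD),
    ("ex:someone", RDF_TYPE, "foaf:Person", "ex:doc"),
]
index = type_clusters(quads)
print(index[frozenset({"foaf:Person", "pim:Male"})])
```

Because a cluster is keyed by the exact type set, `ex:someone` (only a foaf:Person) lands in a different cluster than `timbl:card#i`.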
Layer 3: Equivalence Classes
• Two instances are equivalent iff:
    – They are in the same TC
    – They have the same properties
    – The property targets are in the same TC
• Similar to 1-bisimulation

[Figure: classes C1, C2, C3 and type clusters TC1, TC2; equivalence class EQC1 with a property p; data sources DS 1, DS 2, DS 3]

Layer 3: Equivalence Classes
SELECT ?x
WHERE {
   ?x rdf:type foaf:Person .
   ?x rdf:type pim:Male .
   ?x foaf:maker ?y .
   ?y rdf:type
      foaf:PersonalProfileDocument .
}

[Figure: type cluster tc4711 (foaf:Person, pim:Male) and type cluster tc1234 (foaf:PPD); equivalence class eqc0815 links to tc1234 via foaf:maker. Example: timbl:card#i and timbl:card; source shown is http://www.w3.org/People/Berners-Lee/card]

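The three conditions for equivalence can be folded into one signature per instance. This is a minimal sketch under assumptions, not the authors' code: `equivalence_classes` is a hypothetical name, triples are plain tuples, and an instance without type statements gets the empty type cluster. The `ex:` instances are made up for the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def equivalence_classes(triples):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)  # instance -> types
    links = defaultdict(set)  # instance -> {(property, target)}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            links[s].add((p, o))

    def type_cluster(node):
        return frozenset(types.get(node, ()))

    classes = defaultdict(set)
    for instance in set(types) | set(links):
        signature = (
            type_cluster(instance),
            frozenset((p, type_cluster(o)) for p, o in links.get(instance, ())),
        )
        classes[signature].add(instance)
    return dict(classes)


triples = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person"),
    ("timbl:card#i", RDF_TYPE, "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:bobdoc"),
    ("ex:bobdoc", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:eve", RDF_TYPE, "foaf:Person"),
    ("ex:eve", RDF_TYPE, "pim:Male"),
]
groups = sorted(sorted(g) for g in equivalence_classes(triples).values())
```

Here `timbl:card#i` and `ex:bob` end up in the same equivalence class, while `ex:eve` shares their type cluster but has no foaf:maker link and therefore forms a class of its own.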
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL's FOAF profile
• LDspider with ~ 2k triples / sec


•   Different cache sizes: 100, 1k, 10k, 50k, 100k
•   Compared SchemEX with reference schema
•   Index queries on all Types, TCs, EQCs
•   Good precision/recall ratio at 50k+
• Commodity hardware (4GB RAM, single CPU)
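Comparing the stream-based index against a reference schema comes down to precision and recall per index query. The helper below is a generic sketch of that computation; the function name and the example data source names are assumptions, and the talk does not state how empty result sets were treated.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one index query against the reference result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption: an empty set counts as perfect, to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical example: data sources returned for one type cluster.
from_stream_index = {"ds1", "ds2", "ds3"}
from_reference = {"ds1", "ds2", "ds4", "ds5"}
p, r = precision_recall(from_stream_index, from_reference)
```

In this made-up example two of the three returned sources are correct, and two of the four expected sources are found.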
Quality of Stream-based Index
Construction




+ Runtime hardly increases with window size
+ Memory consumption scales with window size

Computing SchemEX: Full BTC 2011 Data




Cache size: 50 k
Billion Triple Challenge 2011




  …




                                                    [JWS 2012]
And 2012? Get the Google Feeling!




Semantic Data Management Chain
• Research topics in a greater context

       SchemEX*                                OntoMDE       SemaPlorer*

      Publish                  Collect              Aggregate     Use

      Kreuzverweis.com                              Core Ontologies

                                                            Mobile Facets
* Winner of Billion Triple Challenge 2011/2008
    See: dws.informatik.uni-mannheim.de

Recommended Readings
• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web:
  Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483
  (2011) URL: http://dx.doi.org/10.1007/s00287-011-0535-x
• Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp:
  SemaPlorer - Interactive semantic exploration of data and media based on
  a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009)
  URL: http://dx.doi.org/10.1016/j.websem.2009.09.006
• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp:
  SchemEX — Efficient construction of a data catalogue by stream-based
  indexing of linked data, J. of Web Semantics: Science, Services and
  Agents on the World Wide Web, Available online 23 June 2012
  URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
  Data Space, Morgan & Claypool Publishers, 2011
  URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001

More Related Content

PPTX
SchemEX -- Building an Index for Linked Open Data
PPTX
The web is rotting and what to do about it
PPTX
Signposting Overview (Version November 2017)
PDF
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?
PPT
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
PPT
Evolving the Web into a Global Dataspace – Advances and Applications
PPTX
FAIR Signposting: A KISS Approach to a Burning Issue
PDF
How links can make your open data even greater
SchemEX -- Building an Index for Linked Open Data
The web is rotting and what to do about it
Signposting Overview (Version November 2017)
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?
Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block
Evolving the Web into a Global Dataspace – Advances and Applications
FAIR Signposting: A KISS Approach to a Burning Issue
How links can make your open data even greater

Viewers also liked (20)

PDF
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
PPTX
A Model of Events for Integrating Event-based Information in Complex Socio-te...
PPTX
A Comparison of Different Strategies for Automated Semantic Document Annotation
PPTX
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
PPTX
ESWC 2013: A Systematic Investigation of Explicit and Implicit Schema Informa...
PDF
Focused Exploration of Geospatial Context on Linked Open Data
PPTX
Finding Good URLs: Aligning Entities in Knowledge Bases with Public Web Docum...
PPTX
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
PDF
Making Use of the Linked Data Cloud: The Role of Index Structures
PDF
Smart photo selection: interpret gaze as personal interest
PPTX
Perplexity of Index Models over Evolving Linked Data
PPTX
Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)
PPTX
 Challenges in Managing Online Business Communities
PPTX
Can you see it? Annotating Image Regions based on Users' Gaze Information
PPTX
Challenging Retrieval Scenarios: Social Media and Linked Open Data
PPTX
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
PPTX
Events in Multimedia - Theory, Model, Application
PPTX
Mining and Managing Large-scale Linked Open Data
PPTX
Identifying Objects in Images from Analyzing the User‘s Gaze Movements for Pr...
PPT
Establishing a Strategy for Data Quality
Leveraging the Web of Data: Managing, Analysing and Making Use of Linked Open...
A Model of Events for Integrating Event-based Information in Complex Socio-te...
A Comparison of Different Strategies for Automated Semantic Document Annotation
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
ESWC 2013: A Systematic Investigation of Explicit and Implicit Schema Informa...
Focused Exploration of Geospatial Context on Linked Open Data
Finding Good URLs: Aligning Entities in Knowledge Bases with Public Web Docum...
From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources
Making Use of the Linked Data Cloud: The Role of Index Structures
Smart photo selection: interpret gaze as personal interest
Perplexity of Index Models over Evolving Linked Data
Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)
 Challenges in Managing Online Business Communities
Can you see it? Annotating Image Regions based on Users' Gaze Information
Challenging Retrieval Scenarios: Social Media and Linked Open Data
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
Events in Multimedia - Theory, Model, Application
Mining and Managing Large-scale Linked Open Data
Identifying Objects in Images from Analyzing the User‘s Gaze Movements for Pr...
Establishing a Strategy for Data Quality
Ad

Similar to Linked open data - how to juggle with more than a billion triples (20)

PPTX
SchemEX -- Building an Index for Linked Open Data
PPTX
Linked Open Data & E-Commerce von Jun.-Prof. Dr. habil. Ansgar Scherp
PDF
Datos enlazados BNE and MARiMbA
PDF
Big Data and NoSQL in Microsoft-Land
PPTX
Big Linked Data - Creating Training Curricula
PPTX
EDF2013: Data Science Curriculum: Barry Norton: Big Linked Data
PDF
Pal gov.tutorial2.session1.xml basics and namespaces
PPTX
CSC 8101 Non Relational Databases
PDF
Pal gov.tutorial2.session5 2.rdfs_jarrar
PPTX
Dublinked workshop dec15-2011
PDF
Pal gov.tutorial2.session5 1.rdf_jarrar
KEY
Developing With Django
PDF
Visualizations of Spatial and Social Data
PPT
Socializing Big Data: Collaborative Opportunities in Computer Science, the So...
PDF
Brief for W3C Government Linked Data Working Group 29-June 2011
PDF
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
PDF
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
PPTX
Chris Gutteridge: RDF Crash Course
PPTX
Anti-social Databases
PDF
Doug Belshaw - Open badges and learning
SchemEX -- Building an Index for Linked Open Data
Linked Open Data & E-Commerce von Jun.-Prof. Dr. habil. Ansgar Scherp
Datos enlazados BNE and MARiMbA
Big Data and NoSQL in Microsoft-Land
Big Linked Data - Creating Training Curricula
EDF2013: Data Science Curriculum: Barry Norton: Big Linked Data
Pal gov.tutorial2.session1.xml basics and namespaces
CSC 8101 Non Relational Databases
Pal gov.tutorial2.session5 2.rdfs_jarrar
Dublinked workshop dec15-2011
Pal gov.tutorial2.session5 1.rdf_jarrar
Developing With Django
Visualizations of Spatial and Social Data
Socializing Big Data: Collaborative Opportunities in Computer Science, the So...
Brief for W3C Government Linked Data Working Group 29-June 2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Chris Gutteridge: RDF Crash Course
Anti-social Databases
Doug Belshaw - Open badges and learning
Ad

Linked open data - how to juggle with more than a billion triples

  • 1. How to Juggle with more than a Billion Triples? Ansgar Scherp Research Group on Data and Web Science Universität Mannheim October 2012 Image source: http://www.flickr.com/photos/pedromourapinheiro/2122754745/ 1 Ansgar Scherp – ansgar@informatik.uni-mannheim.de Slide
  • 2. My thanks go to … Marianna, Simon Schenk, Carsten Saathoff, Thomas Franz, Thomas Gottron, Steffen Staab, Arne Peters, Bastian Krayer, Daniel Eißing, Mathias Konrath, Daniel Schmeiß, Anton Baumesberger, Frederik Jochum, Alexander Kleinen, and many more …
  • 3. Scenario • Tim plans to travel – from London – to a customer in Cologne
  • 4. Website of the German Railway. It works, why bother…?
  • 5. Let's Try Different Queries • Bottlenecks in public transportation? • Compare the connections with flights? • Visualize on a map? • … • All these queries cannot be answered, because the data …
  • 6. … locked in Silos! – High integration effort – Lack of reuse of data (Image: B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY)
  • 7. Linked Data • Publishing and interlinking of data • Different quality and purpose • From different sources in the Web • World Wide Web vs. Linked Data: Documents vs. Data; Hyperlinks vs. Typed Links; HTML vs. RDF; Addresses (URIs) in both • Example: http://www.uni-mannheim.de/
  • 8. Relevance of Linked Data?
  • 9. Linked Data: May '07 to Sept. '11 • Domains: Web 2.0, Media, Publications, eGovernment, Cross-Domain, Life Sciences, Geographic • < 31 billion triples • Source: http://lod-cloud.net
  • 10. Linked Data Principles 1. Identification 2. Interlinkage 3. Dereferencing 4. Description
  • 11. Example: Big Lynx • Matt Briggs, Scott Miller, Big Lynx Company ? (< 31 billion triples, source: http://lod-cloud.net)
  • 12. 1. Use URIs for Identification • Matt Briggs: http://biglynx.co.uk/people/matt-briggs • Scott Miller: http://biglynx.co.uk/people/scott-miller (Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY)
  • 13. Example: Big Lynx • Matt Briggs, Scott Miller, Big Lynx Company • How to model relationships like knows?
  • 14. Resource Description Framework (RDF) • Description of resources with RDF triples: Matt Briggs (subject) is a (predicate) Person (object) • @prefix rdf: <http://w3.org/1999/02/22-rdf-syntax-ns#> . @prefix foaf: <http://xmlns.com/foaf/0.1/> . <http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
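The subject, predicate, object structure of the triple on this slide can be sketched in a few lines of Python. This is only an illustration using plain tuples, not an RDF library: the `PREFIXES` table, the `biglynx` prefix binding, and the helpers `expand`, `triple`, and `to_ntriples` are names assumed for this example.

```python
# Prefix table: rdf and foaf are the standard namespaces; the "biglynx"
# binding is an assumption made for this illustration.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "biglynx": "http://biglynx.co.uk/people/",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


def triple(subject, predicate, obj):
    """Represent one RDF statement as a (subject, predicate, object) tuple."""
    return (expand(subject), expand(predicate), expand(obj))


def to_ntriples(statement):
    """Serialize a URI-only triple as one N-Triples line."""
    return "<{}> <{}> <{}> .".format(*statement)


# "Matt Briggs is a Person"
matt_is_person = triple("biglynx:matt-briggs", "rdf:type", "foaf:Person")
print(to_ntriples(matt_is_person))
```

Literals and blank nodes are left out on purpose; the sketch covers only the URI-to-URI case shown on the slide.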
  • 15. 1. Use URIs also for Relations • http://biglynx.co.uk/people/matt-briggs • http://biglynx.co.uk/people/scott-miller (Image: B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY)
  • 16. Example: Big Lynx • Dave Smith "lives here": London (DBpedia) • Matt Briggs "same person": Matt Briggs (Matt's private website) • Scott Miller • Big Lynx Company • …
  • 17. 2. Establishing Interlinkage • Relation links between resources: <http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> . • Identity links between resources: <http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .
  • 18. Example: Big Lynx • Dave Smith foaf:based_near ("lives here") London (DBpedia) • Matt Briggs owl:sameAs ("same person") Matt Briggs (Matt's private website) • Big Lynx Company
  • 19. 3. Dereferencing of URIs • Looking up web documents • How can we "look up" things of the real world? http://biglynx.co.uk/people/matt-briggs
  • 20. Two Approaches 1. Hash URIs – URI contains a part separated by #, e.g., http://biglynx.co.uk/vocab/sme#Team 2. Negotiation via "303 See Other" – Request: http://biglynx.co.uk/people/matt-briggs – Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
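The two dereferencing approaches can be contrasted with a toy resolver. This is a sketch under assumptions: `SEE_OTHER` is a hypothetical redirect table standing in for a server's 303 configuration, `document_uri` and `dereference` are illustrative names, and no real HTTP request is made.

```python
from urllib.parse import urldefrag


def document_uri(thing_uri):
    """Hash URI approach: a client drops the '#fragment' before requesting,
    so the document to fetch is the URI without its fragment."""
    url, _fragment = urldefrag(thing_uri)
    return url


# Hypothetical redirect table standing in for a server's 303 setup.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def dereference(thing_uri):
    """Return (status, location) as a toy server might answer."""
    doc = document_uri(thing_uri)
    if doc != thing_uri:
        # Hash URI: the describing document is served directly.
        return 200, doc
    if thing_uri in SEE_OTHER:
        # 303 See Other: the server points to a document about the thing.
        return 303, SEE_OTHER[thing_uri]
    return 404, None


print(dereference("http://biglynx.co.uk/vocab/sme#Team"))
print(dereference("http://biglynx.co.uk/people/matt-briggs"))
```

The design difference the sketch tries to show: hash URIs need no extra round trip, while the 303 approach costs a redirect but lets each thing have its own describing document.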
  • 21. Example: Big Lynx • Dave Smith foaf:based_near London (DBpedia) • Matt Briggs owl:sameAs Matt Briggs (Matt's private website) • Big Lynx Company • Description of Matt?
  • 22. 4. Description of URIs • biglynx:matt-briggs rdf:type foaf:Person • biglynx:matt-briggs foaf:based_near dp:Birmingham • biglynx:matt-briggs ex:loc _:point • _:point wgs84:lat "51.509" • _:point wgs84:long "-0.118" • biglynx:matt-briggs foaf:knows biglynx:dave-smith • biglynx:dave-smith foaf:based_near dp:London • …
  • 23. Formalization of Description • Given an RDF graph $G = (V, P, E)$ with $V = R \cup B \cup L$ and $E \subseteq (R \cup B) \times P \times V$ • $\mathrm{SimpleCBD}(n) = \bigcup_{j=0}^{\infty} I_j$ with $I_0 = \{ (s, p, o) \mid (s, p, o) \in E \wedge s = n \}$ and $I_{j+1} = \{ (o, p', o') \in E \mid \exists (s, p, o) \in I_j : o \in B \wedge (o, p', o') \notin \bigcup_{k=0}^{j} I_k \}$
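The layered definition of SimpleCBD can be read as a breadth-first expansion that only follows blank nodes. The sketch below is one possible reading, assuming that B is the set of blank nodes and that blank nodes are recognizable by a `_:` prefix; the sample graph is illustrative and only loosely follows the Big Lynx example.

```python
def simple_cbd(edges, n, is_blank):
    """Collect SimpleCBD(n) over a set of (s, p, o) triples.

    Layer I_0 holds all triples with subject n; each further layer adds
    the not-yet-collected triples whose subject is a blank node reached
    as an object in the previous layer.
    """
    result = {t for t in edges if t[0] == n}  # I_0
    frontier = set(result)
    while frontier:
        blank_objects = {o for (_s, _p, o) in frontier if is_blank(o)}
        next_layer = {
            t for t in edges
            if t[0] in blank_objects and t not in result
        }  # I_{j+1}
        result |= next_layer
        frontier = next_layer
    return result


def is_blank(term):
    """Assumption for this sketch: blank node labels start with '_:'."""
    return term.startswith("_:")


GRAPH = {
    ("biglynx:matt-briggs", "rdf:type", "foaf:Person"),
    ("biglynx:matt-briggs", "ex:loc", "_:point"),
    ("_:point", "wgs84:lat", "51.509"),
    ("_:point", "wgs84:long", "-0.118"),
    ("biglynx:matt-briggs", "foaf:knows", "biglynx:dave-smith"),
    ("biglynx:dave-smith", "foaf:based_near", "dp:London"),
}

cbd = simple_cbd(GRAPH, "biglynx:matt-briggs", is_blank)
for t in sorted(cbd):
    print(t)
```

The description stops at named resources: the triples about `_:point` are included, while the triple about `biglynx:dave-smith` is not, because that node has its own URI and can be dereferenced separately.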
  • 24. W3C RDF / RDF Schema Vocabulary • Set of URIs defined in the rdf: and rdfs: namespaces • rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, rdf:value, ... • rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, rdfs:label, …
  • 25. Semantic Web Layer Cake (Simplified)
  • 26. Exploration of Linked Data • WordNet, Swoogle, GeoNames (< 31 billion triples, source: http://lod-cloud.net)
  • 27. Naive Approach • Download all data • Store in really big database • Programming of queries • Design of user interface • (Diagram: RDFS Rules, Geo, WordNet, Swoogle, GeoNames) • Inflexible, monolithic, not scalable
  • 28. SemaPlorer Approach • Flexible, extensible, scalable • (Diagram: birthplace, placeOfBirth, birthplace; Geo, RDFS Rules, Fulltext Queries; WordNet + Swoogle + GeoNames + …) • > 1 billion triples • 12 months in 2005/06, 700 million triples
  • 29. SemaPlorer – Semantic Social Media • Watch video online: http://vimeo.com/2057249
  • 30. Billion Triple Challenge 2008 [JWS 2009]
  • 31. Searching for Linked Data Sources • Persons that are politicians and actors? (< 31 billion triples, source: http://lod-cloud.net)
  • 32. Idea: Index of Data Sources • Query "Politician and Actor": SELECT ?x FROM … WHERE { ?x rdf:type ex:Actor . ?x rdf:type ex:Politician . } • Index ?
  • 33. The Naive Approach 1. Download the entire LOD cloud 2. Put it into a (really) large triple store 3. Process the data and extract schema 4. Provide lookup – Big machinery – Late in processing the data – High effort to scale with LOD cloud
  • 34. Idea • Schema-level index – Define families of graph patterns – Assign instances to graph patterns – Map graph patterns to context (source URI) • Construction – Stream-based for scalability – Little loss of accuracy • Note – Index defined over instances – But stores the context
  • 35. Input Data • N-Quads: <subject> <predicate> <object> <context> • Example: <http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns# <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/ • (Diagram: w3p:#me is a foaf:Person in context http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf)
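To make the four-part input format concrete, here is a minimal sketch of a line parser. It is deliberately simplified and assumes that all four terms are URIs in angle brackets; literals, blank nodes, and escapes from the full N-Quads grammar are not handled. The `Quad` type, `parse_quad`, and the example.org URIs are hypothetical names chosen for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    context: str


# Four <...> terms followed by a terminating dot (URI-only simplification).
_URI_QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$"
)


def parse_quad(line: str) -> Optional[Quad]:
    """Parse one URI-only N-Quads line; return None if it does not match."""
    match = _URI_QUAD.match(line)
    if match is None:
        return None
    return Quad(*match.groups())


line = (
    "<http://example.org/people/alice> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://example.org/docs/alice.rdf> ."
)
quad = parse_quad(line)
print(quad.subject, "found in", quad.context)
```

The fourth term is what distinguishes a quad from a triple: it records the data source in which the statement was found.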
  • 36. SchemEX Approach • Stream-based schema extraction • While crawling the data • (Diagram components: LOD-Crawler, RDF-Dump, NxParser, N-Quad stream, FIFO instance cache, schema extractor, schema-level index, RDF triple store, RDBMS)
  • 37. Building the Index from a Stream • Stream of N-Quads (coming from an LD crawler): … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1 • FIFO cache (diagram: instances 1 to 6 assigned to classes C1, C2, C3) • Linear runtime complexity w.r.t. the number of input triples
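One way to picture the FIFO instance cache is as a bounded, insertion-ordered map from subject to the statements seen for it so far. The sketch below is an interpretation, not the SchemEX implementation: `stream_instances` and its record layout are assumed names, and `cache_size` must be at least 1.

```python
from collections import OrderedDict


def stream_instances(quads, cache_size):
    """Group a stream of (s, p, o, c) quads by subject with a bounded FIFO.

    When a new subject arrives and the cache is full, the oldest subject is
    evicted and emitted as (subject, [(p, o), ...], {contexts}). Each quad
    is touched once, so the work grows linearly with the input size.
    """
    cache = OrderedDict()  # subject -> ([(p, o), ...], {contexts})
    for s, p, o, c in quads:
        if s not in cache:
            if len(cache) >= cache_size:
                old_s, (props, ctxs) = cache.popitem(last=False)
                yield old_s, props, ctxs
            cache[s] = ([], set())
        props, ctxs = cache[s]
        props.append((p, o))
        ctxs.add(c)
    # Flush whatever is still cached at the end of the stream.
    while cache:
        old_s, (props, ctxs) = cache.popitem(last=False)
        yield old_s, props, ctxs


QUADS = [
    ("a", "p", "x", "d1"),
    ("b", "p", "y", "d1"),
    ("a", "q", "z", "d2"),
]

print(list(stream_instances(QUADS, cache_size=2)))
print(list(stream_instances(QUADS, cache_size=1)))
```

With a cache of size 1 the instance `a` is emitted twice with partial information, which illustrates why a small window trades accuracy for memory.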
  • 38. Building the Schema and Index • RDF classes: C1, C2, C3, …, Ck • Type clusters: TC1, TC2, …, TCm (consistsOf RDF classes) • Equivalence classes: EQC1, EQC2, …, EQCn (hasEQClass from type clusters; linked by properties p1, p2) • Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx (hasDataSource from equivalence classes)
  • 39. Layer 1: RDF Classes • All instances of a particular type (C1 with data sources DS 1, DS 2, DS 3) • SELECT ?x FROM … WHERE { ?x rdf:type foaf:Person . } • Example: timbl:card#i is a foaf:Person; data sources http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card
  • 40. Layer 2: Type Clusters • All instances belonging to exactly the same set of types (TC1 over C1, C2 with data sources DS 1, DS 2, DS 3) • SELECT ?x FROM … WHERE { ?x rdf:type foaf:Person . ?x rdf:type pim:Male . } • Example: type cluster tc4711 = {foaf:Person, pim:Male} contains timbl:card#i; data source http://www.w3.org/People/Berners-Lee/card
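The first two index layers can be sketched as two dictionaries that map schema elements to data sources. This is a simplified illustration rather than the SchemEX data structures: `build_type_indexes` is an assumed name, prefixed names stand in for full URIs, the sources `ds1` and `ds2` and the instance `ex:alice` are made up, and an instance's data sources are taken to be the contexts of its rdf:type statements.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_type_indexes(quads):
    """Build layer 1 (class -> sources) and layer 2 (type cluster -> sources)."""
    types_of = defaultdict(set)    # instance -> set of types
    sources_of = defaultdict(set)  # instance -> set of contexts
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            types_of[s].add(o)
            sources_of[s].add(c)

    class_index = defaultdict(set)    # layer 1: RDF classes
    cluster_index = defaultdict(set)  # layer 2: type clusters
    for instance, types in types_of.items():
        for rdf_class in types:
            class_index[rdf_class] |= sources_of[instance]
        # A type cluster is the exact set of types of an instance.
        cluster_index[frozenset(types)] |= sources_of[instance]
    return class_index, cluster_index


QUADS = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds1"),
    ("timbl:card#i", RDF_TYPE, "pim:Male", "ds1"),
    ("ex:alice", RDF_TYPE, "foaf:Person", "ds2"),
]

class_index, cluster_index = build_type_indexes(QUADS)
print(sorted(class_index["foaf:Person"]))
print(sorted(cluster_index[frozenset({"foaf:Person", "pim:Male"})]))
```

A query for instances that are both foaf:Person and pim:Male then becomes a single lookup of one type cluster key instead of a join over two class lookups.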
  • 41. Layer 3: Equivalence Classes • Two instances are equivalent iff: – They are in the same TC – They have the same properties – The property targets are in the same TC • Similar to 1-bisimulation • (Diagram: C1, C2, C3; TC1 linked by property p to TC2; EQC1; DS 1, DS 2, DS 3)
  • 42. Layer 3: Equivalence Classes • SELECT ?x WHERE { ?x rdf:type foaf:Person . ?x rdf:type pim:Male . ?x foaf:maker ?y . ?y rdf:type foaf:PersonalProfileDocument . } • Example: timbl:card#i in tc4711 = {foaf:Person, pim:Male} has foaf:maker timbl:card in tc1234 = {foaf:PPD}; equivalence class eqc0815 = tc4711 -maker- tc1234; data source http://www.w3.org/People/Berners-Lee/card
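The equivalence condition can be expressed as a signature per instance: its own type cluster plus the set of (property, type cluster of the target) pairs. The sketch below groups instances by that signature. It is an illustration under assumptions, not the SchemEX code: `equivalence_classes` is an assumed name, and the instances `ex:bob`, `ex:bob-profile`, and `ex:carol` are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def equivalence_classes(triples):
    """Group instances by (own type cluster, {(property, target type cluster)})."""
    types_of = defaultdict(set)  # instance -> set of types
    links = defaultdict(set)     # instance -> set of (property, target)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of[s].add(o)
        else:
            links[s].add((p, o))

    def cluster(node):
        """Type cluster of a node; empty for untyped nodes."""
        return frozenset(types_of.get(node, ()))

    classes = defaultdict(set)
    for instance in set(types_of) | set(links):
        signature = frozenset(
            (p, cluster(target)) for p, target in links.get(instance, ())
        )
        classes[(cluster(instance), signature)].add(instance)
    return classes


PPD = "foaf:PersonalProfileDocument"
TRIPLES = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person"),
    ("timbl:card#i", RDF_TYPE, "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", RDF_TYPE, PPD),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:bob-profile"),
    ("ex:bob-profile", RDF_TYPE, PPD),
    ("ex:carol", RDF_TYPE, "foaf:Person"),
    ("ex:carol", RDF_TYPE, "pim:Male"),
]

eqc = equivalence_classes(TRIPLES)
for members in eqc.values():
    print(sorted(members))
```

Here `timbl:card#i` and `ex:bob` end up in one class because both link via foaf:maker to a profile document, while `ex:carol` shares their type cluster but has no such property and therefore forms a class of her own.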
  • 43. Computing SchemEX: TimBL Data Set • Analysis of a smaller data set • 11 M triples, TimBL's FOAF profile • LDspider with ~2k triples/sec • Different cache sizes: 100, 1k, 10k, 50k, 100k • Compared SchemEX with reference schema • Index queries on all types, TCs, EQCs • Good precision/recall ratio at 50k+ • Commodity hardware (4 GB RAM, single CPU)
  • 44. Quality of Stream-based Index Construction + Runtime hardly increases with window size + Memory consumption scales with window size
  • 45. Computing SchemEX: Full BTC 2011 Data • Cache size: 50k
  • 46. Billion Triple Challenge 2011 … [JWS 2012]
  • 47. And 2012? Get the Google Feeling!
  • 48. Semantic Data Management Chain • Research topics in a greater context: Publish, Collect, Aggregate, Use • SchemEX*, OntoMDE, SemaPlorer*, Kreuzverweis.com, Core Ontologies, Mobile Facets • * Winner of Billion Triple Challenge 2011/2008 • See: dws.informatik.uni-mannheim.de
  • 50. Recommended Readings • Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x • Simon Schenk, Carsten Saathoff, Steffen Staab, Ansgar Scherp: SemaPlorer - Interactive semantic exploration of data and media based on a federated cloud infrastructure. J. Web Sem. 7(4): 298-304 (2009). URL: http://dx.doi.org/10.1016/j.websem.2009.09.006 • Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716 • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001