A Main Memory Index Structure to Query Linked Data

Olaf Hartig (http://olafhartig.de/foaf.rdf#olaf)
Frank Huber
Database and Information Systems Research Group
Humboldt-Universität zu Berlin
The Issue
 [Figure: bar charts for five queries, ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40), comparing "no reuse" with "given order". Panels: hit rate (0 to 1), number of query results (0 to 30), query execution time in seconds (0 to 80).]
Olaf Hartig - A Main Memory Index Structure to Query Linked Data                                             2
The Issue
 [Figure: same charts as on the previous slide, annotated with the number of descriptor objects in the query-local dataset after query execution: 172 and 533.]
 [Diagram: the logical representation of Linked Data from the Web is the query-local dataset; its physical representation is marked with a question mark.]
 What data structure do we use to physically represent the query-local dataset?
Outline

        1. Requirements + Existing Work


        2. Data Structures


        3. Evaluation


Requirements
 ●   (Consecutively) build and use ad hoc collections of many small sets of RDF triples
 ●   Four main operations:
     ●   Find … matching triples for a triple pattern in all descriptor objects
     ●   Add, Remove, Replace … descriptor objects
 ●   Support of concurrent access (i.e. isolation)
 ●   Non-relevant properties:
     ●   Querying descriptor objects individually is not necessary
     ●   No need to write data back to the Web
     ●   ACID properties not required for complete queries
Existing Work
 ●   Disk based storage solutions for RDF data
     ●   Unsuitable due to very costly I/O operations
 ●   Main memory based data structures in the literature
     ●   Focus on a large, single set of RDF triples
     ●   Optimized for complete graph pattern queries or path queries
 ●   Main memory based data structures in RDF frameworks
     ●   Focus on Jena, ARQ and NG4J
     ●   Inefficient (see evaluation)




Hash-Based Index for RDF Data
 [Diagram: logical representation on top, physical representation below, consisting of a dictionary (Dict) and six hash tables: S, P, O, SP, PO, SO. A Find for the triple pattern (?acq, knows, http://bob.name) is shown as an example.]
 ●   Dictionary:
     ●   Two-way mapping between RDF terms and numerical identifiers
 ●   6 hash tables:
     ●   Each hash table contains all ID-encoded triples t_id = (id_s, id_p, id_o)
     ●   Efficient support for all types of triple patterns
 * Similar to Harth and Decker, 2005
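The dictionary and the six hash tables can be sketched in Python as follows. This is only an illustration under assumptions: the class and method names (Dictionary, HashIndex, encode, lookup, find) are hypothetical, and Python dicts stand in for the bucket arrays of the actual implementation.

```python
from collections import defaultdict


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._term_to_id = {}
        self._id_to_term = []

    def encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self._term_to_id:
            self._term_to_id[term] = len(self._id_to_term)
            self._id_to_term.append(term)
        return self._term_to_id[term]

    def lookup(self, term):
        # Return the ID of a known term, or None (no ID is created).
        return self._term_to_id.get(term)

    def decode(self, ident):
        return self._id_to_term[ident]


class HashIndex:
    """Six tables (S, P, O, SP, PO, SO); each holds every ID-encoded triple."""

    # Positions within (s, p, o) that form the key of each table.
    KEYS = {"S": (0,), "P": (1,), "O": (2,),
            "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}

    def __init__(self):
        self.dict = Dictionary()
        self.tables = {name: defaultdict(set) for name in self.KEYS}

    def add(self, s, p, o):
        t = tuple(self.dict.encode(x) for x in (s, p, o))
        for name, pos in self.KEYS.items():
            self.tables[name][tuple(t[i] for i in pos)].add(t)

    def find(self, s=None, p=None, o=None):
        """Match a triple pattern; None stands for a variable."""
        pattern = (s, p, o)
        bound = tuple(i for i, x in enumerate(pattern) if x is not None)
        ids = {i: self.dict.lookup(pattern[i]) for i in bound}
        if any(v is None for v in ids.values()):
            return []  # a term unknown to the dictionary cannot match
        if not bound:
            found = set().union(*self.tables["S"].values())
        elif len(bound) == 3:
            # No SPO table: look up SP, then filter on the object.
            hits = self.tables["SP"].get((ids[0], ids[1]), ())
            found = {t for t in hits if t[2] == ids[2]}
        else:
            name = next(n for n, pos in self.KEYS.items() if pos == bound)
            found = self.tables[name].get(tuple(ids[i] for i in bound), ())
        return sorted(tuple(self.dict.decode(i) for i in t) for t in found)


index = HashIndex()
index.add("http://alice.name", "knows", "http://bob.name")
index.add("http://alice.name", "name", "Alice")
# Pattern (?acq, knows, http://bob.name) is answered from the PO table.
acquaintances = index.find(p="knows", o="http://bob.name")
```

Every triple pattern with one or two bound positions maps to exactly one table lookup, which is what makes six tables sufficient for all pattern types.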
Individual Indexing
 [Diagram: the query-local dataset (logical representation) and its physical representation: a dictionary (Dict) plus a separate set of hash tables (S, P, O, SP, PO, SO) for each descriptor object. A Find for the triple pattern (?acq, knows, http://bob.name) is shown as an example.]
 ●   Idea: Index each descriptor object separately
 ●   Implementation of the four operations:
     ●   Add, Remove, and Replace are straightforward
     ●   Find requires iterating over all indexes
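A minimal Python sketch of individual indexing, assuming hypothetical names (TripleIndex, IndividualIndexing) and a simplified per-descriptor index with only S, P, and O tables over plain string triples. It is meant to show why Add, Remove, and Replace are cheap while Find has to visit every index.

```python
from collections import defaultdict


class TripleIndex:
    """Simplified stand-in for the hash index of one descriptor object."""

    def __init__(self, triples):
        self.triples = set(triples)
        self.by_pos = [defaultdict(set) for _ in range(3)]  # S, P, O tables
        for t in self.triples:
            for i in range(3):
                self.by_pos[i][t[i]].add(t)

    def find(self, pattern):
        bound = [i for i, x in enumerate(pattern) if x is not None]
        if not bound:
            return set(self.triples)
        # One table lookup, then filter on the remaining bound positions.
        hits = self.by_pos[bound[0]].get(pattern[bound[0]], set())
        return {t for t in hits if all(t[i] == pattern[i] for i in bound)}


class IndividualIndexing:
    """One index per descriptor object, keyed by a descriptor object ID."""

    def __init__(self):
        self.indexes = {}

    def add(self, desc_id, triples):
        self.indexes[desc_id] = TripleIndex(triples)

    def remove(self, desc_id):
        self.indexes.pop(desc_id, None)

    def replace(self, desc_id, triples):
        # Build the new index first, then swap it in with one assignment.
        self.indexes[desc_id] = TripleIndex(triples)

    def find(self, pattern):
        # Cost grows with the number of descriptor objects.
        result = set()
        for index in list(self.indexes.values()):
            result |= index.find(pattern)
        return result


dataset = IndividualIndexing()
dataset.add("d1", [("alice", "knows", "bob")])
dataset.add("d2", [("carol", "knows", "bob"), ("alice", "knows", "bob")])
matches = dataset.find((None, "knows", "bob"))
```

The triple that occurs in both descriptor objects is stored twice but reported once, because Find collects results in a set.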
Combined Indexing
 [Diagram: the query-local dataset (logical representation) and its physical representation: a dictionary (Dict) and a single set of hash tables (S, P, O, SP, PO, SO).]
 ●   Idea: Use a single index for all descriptor objects
     ●   src: maps each triple to a set of descriptor object IDs
     ●   Each ID-encoded triple t_id = (id_s, id_p, id_o) is stored together with src(t_id); in the example, src(t_id) contains two descriptor object IDs (shown as icons on the slide)
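The role of src can be sketched as follows. The class name CombinedIndex, the simplified S, P, O tables, and the helper map triples_of (used to locate the triples of a descriptor object on removal) are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict


class CombinedIndex:
    """A single index for all descriptor objects, plus the src mapping."""

    def __init__(self):
        self.by_pos = [defaultdict(set) for _ in range(3)]  # S, P, O tables
        self.src = {}         # triple -> set of descriptor object IDs
        self.triples_of = {}  # descriptor object ID -> its triples (assumed helper)

    def add(self, desc_id, triples):
        self.remove(desc_id)  # makes add double as replace
        self.triples_of[desc_id] = set(triples)
        for t in self.triples_of[desc_id]:
            if t not in self.src:
                # First occurrence: the triple enters the shared tables once.
                self.src[t] = set()
                for i in range(3):
                    self.by_pos[i][t[i]].add(t)
            self.src[t].add(desc_id)

    def remove(self, desc_id):
        for t in self.triples_of.pop(desc_id, ()):
            self.src[t].discard(desc_id)
            if not self.src[t]:
                # No descriptor object contains the triple any more.
                del self.src[t]
                for i in range(3):
                    self.by_pos[i][t[i]].discard(t)

    def find(self, pattern):
        bound = [i for i, x in enumerate(pattern) if x is not None]
        if not bound:
            return set(self.src)
        hits = self.by_pos[bound[0]].get(pattern[bound[0]], set())
        return {t for t in hits if all(t[i] == pattern[i] for i in bound)}


combined = CombinedIndex()
combined.add("d1", [("alice", "knows", "bob")])
combined.add("d2", [("alice", "knows", "bob"), ("carol", "knows", "bob")])
sources = combined.src[("alice", "knows", "bob")]
```

Find needs a single lookup regardless of the number of descriptor objects; the price is the bookkeeping in src when descriptor objects are added or removed.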
Quad Indexing
 [Diagram: the query-local dataset (logical representation) and its physical representation: a dictionary (Dict) and a single set of hash tables (S, P, O, SP, PO, SO).]
 ●   Idea: Use a single quad index for all descriptor objects
     ●   quad = ID-encoded triple + descriptor object ID
     ●   q = ( (id_s, id_p, id_o), d ), where d is the descriptor object ID (shown as an icon on the slide)
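A quad index can be sketched as below, again with simplified S, P, O tables and hypothetical names. How the actual implementation locates the quads of a descriptor object on removal is not stated on the slide; the scan in remove is an assumption.

```python
from collections import defaultdict


class QuadIndex:
    """Single index whose entries are quads: (triple, descriptor object ID)."""

    def __init__(self):
        self.by_pos = [defaultdict(set) for _ in range(3)]  # S, P, O tables

    def add(self, desc_id, triples):
        for t in triples:
            q = (tuple(t), desc_id)
            for i in range(3):
                self.by_pos[i][t[i]].add(q)

    def remove(self, desc_id):
        # Assumed strategy: scan the tables and drop quads tagged with desc_id.
        for table in self.by_pos:
            for key in list(table):
                table[key] = {q for q in table[key] if q[1] != desc_id}
                if not table[key]:
                    del table[key]

    def find(self, pattern):
        bound = [i for i, x in enumerate(pattern) if x is not None]
        if bound:
            quads = self.by_pos[bound[0]].get(pattern[bound[0]], set())
        else:
            quads = set().union(*self.by_pos[0].values())
        # Several quads may carry the same triple; report each triple once.
        return {t for t, _ in quads if all(t[i] == pattern[i] for i in bound)}


quads = QuadIndex()
quads.add("d1", [("alice", "knows", "bob")])
quads.add("d2", [("alice", "knows", "bob"), ("carol", "knows", "bob")])
found = quads.find((None, "knows", "bob"))
```

Compared with the src mapping, adding a descriptor object never has to check whether a triple is already present, which fits the smaller load times reported for quads.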
Experiment Setup
 Does this affect the overall execution time for link traversal based query executions?
 ●   Simulation of the Web of Data
     ●   Linked Data server publishes BSBM dataset (scaling factor: 50)
     ●   Adjusted BSBM queries link to the simulation server
 ●   Experiment:
     ●   Sequence of 200 query mixes
     ●   Reuse of the query-local dataset for the whole sequence
     ●   IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
     ●   NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
Execution Time
 [Figure: two charts over the sequence of 200 query mixes. Left: overall number of descriptor objects in the queried dataset (0 to 2500) per query mix. Right: execution time in seconds (0 to 80) per query mix for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12).]
Summary
 ●   Three hash index based data structures:
     ●   Individual indexing
     ●   Combined indexing
     ●   Quad indexing
 ●   Findings:
     ●   A single index improves query performance significantly
     ●   Smaller load times with quads
 ●   Also suitable for other use cases of ad hoc storage of Linked Data that is
     ●   Consecutively retrieved from remote sources
     ●   Used for immediate local processing
Backup Slides




Combined Indexing
 [Diagram: the query-local dataset (logical representation) and its physical representation: a dictionary (Dict) and a single set of hash tables (S, P, O, SP, PO, SO).]
 ●   Idea: Use a single index for all descriptor objects
     ●   src: maps each triple to a set of descriptor object IDs
 ●   Implementation of the four operations requires:
     ●   status: maps each descriptor object ID to a status (BeingIndexed, Indexed, ToBeRemoved, BeingRemoved)
     ●   Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
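The visibility condition for Find can be written down directly. In this sketch the names Status and is_reported are hypothetical, and src and status are plain Python dicts standing in for the two mappings.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = 1
    INDEXED = 2
    TO_BE_REMOVED = 3
    BEING_REMOVED = 4


def is_reported(t, src, status):
    """True iff there is a d in src(t) with status(d) = Indexed."""
    return any(status.get(d) is Status.INDEXED for d in src.get(t, ()))


src = {
    ("alice", "knows", "bob"): {"d1", "d2"},
    ("carol", "knows", "bob"): {"d2"},
}
status = {"d1": Status.INDEXED, "d2": Status.BEING_INDEXED}

visible = {t for t in src if is_reported(t, src, status)}
```

A triple that so far occurs only in a descriptor object that is still being indexed, or that is about to be removed, stays hidden from concurrent Find calls, which gives the isolation named in the requirements.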
Implementation
 ●   Available as Free Software (http://squin.org)
 ●   Hash tables with n = 2^m buckets
 ●   Hash functions:
     ●   h_S( i_s, i_p, i_o ) = i_s & bitmask[m]
     ●   h_SP( i_s, i_p, i_o ) = ( i_s • i_p ) & bitmask[m]
     ●   etc.
 ●   Which m? (see paper)
     ●   m = 4 for the individual indexes
     ●   m = 12 for the combined index
     ●   m = 12 for the quad index
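A small worked example of these hash functions in Python. Two points are assumptions rather than facts from the slide: bitmask[m] is taken to be the mask with the m lowest bits set (2^m - 1), and the "•" operator is read as integer multiplication.

```python
def bitmask(m):
    """Mask with the m lowest bits set, so results fall in 0 .. 2**m - 1."""
    return (1 << m) - 1


def h_s(i_s, i_p, i_o, m):
    # Bucket number derived from the subject ID only.
    return i_s & bitmask(m)


def h_sp(i_s, i_p, i_o, m):
    # Assumption: the slide's "•" is read as multiplication.
    return (i_s * i_p) & bitmask(m)


# m = 4 gives n = 16 buckets (the value used for the individual indexes).
bucket_s = h_s(37, 5, 9, m=4)    # 37 & 15
bucket_sp = h_sp(37, 5, 9, m=4)  # 185 & 15
```

Masking with bitmask[m] replaces a modulo operation, which only works because the number of buckets is a power of two.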
Experiment Setup
 ●   Comparison without link traversal based query execution
 ●   Compared data structures:
     ●   Our implementation of IndIR, CombIR, CombQuadIR
     ●   NamedGraphSetImpl in NG4J (Jena)
     ●   DatasetGraphMap in ARQ (Jena)
 ●   Berlin SPARQL Benchmark (BSBM)
     ●   BSBM datasets partitioned into query-local datasets
     ●   BSBM (v2.0) query mixes executed over these datasets
Required Memory
 [Figure: estimated required memory in MB (0 to 120) over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12).]

 BSBM scaling factor   number of descriptor objects   overall number of triples
          50                      2,599                        22,616
         100                      4,178                        40,133
         150                      5,756                        57,524
         200                      7,329                        75,062
         250                      9,873                        97,613
         300                     11,455                       115,217
         350                     13,954                       137,567
         500                     18,687                       190,502
Execution Time
 [Figure: average execution time per query mix in seconds (0 to 2500) over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12). Dataset sizes per scaling factor are as listed on the Required Memory slide.]
Load Time
 [Figure: average creation time in seconds (0 to 10) over pc (BSBM scaling factor, 0 to 500) for CombIR (m=12), IndIR (m=4), and CombQuad (m=12). Dataset sizes per scaling factor are as listed on the Required Memory slide.]
Execution Time
 [Figure: execution time in seconds (0 to 80) per query mix (0 to 200) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12), with an inset showing query mixes 0 to 25 (execution time 0 to 60 seconds).]
These slides have been created by Olaf Hartig
http://olafhartig.de

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)


A Main Memory Index Structure to Query Linked Data

  • 1. A Main Memory Index Structure to Query Linked Data
      Olaf Hartig (http://olafhartig.de/foaf.rdf#olaf), Frank Huber
      Database and Information Systems Research Group, Humboldt-Universität zu Berlin
  • 2. The Issue
      [Figure: hit rate, number of query results, and query execution time (in seconds) for the queries ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40); series: "no reuse" and "given order".]
  • 3. The Issue
      [Figure: same plots as on slide 2, annotated with the number of descriptor objects in the query-local dataset after query execution: 172, 533.]
  • 4. [Diagram: the query-local dataset as the logical representation of Linked Data from the Web; the physical representation of Linked Data from the Web is marked "?".]
      What data structure do we use to physically represent the query-local dataset?
  • 5. Outline
      1. Requirements + Existing Work
      2. Data Structures
      3. Evaluation
  • 6. Requirements
      ● (Consecutively) build and use ad hoc collections of many small sets of RDF triples
      ● Four main operations:
          ● Find … matching triples for a triple pattern in all descriptor objects
          ● Add, Remove, Replace … descriptor objects
      ● Support of concurrent access (i.e. isolation)
      ● Non-relevant properties:
          ● Querying descriptor objects individually is not necessary
          ● No need to write data back to the Web
          ● ACID properties not required for complete queries
  • 8. Existing Work
      ● Disk based storage solutions for RDF data
          ● Unsuitable due to very costly I/O operations
      ● Main memory based data structures in the literature
          ● Focus on a large, single set of RDF triples
          ● Optimized for complete graph pattern queries or path queries
      ● Main memory based data structures in RDF frameworks
          ● Focus on Jena, ARQ and NG4J
          ● Inefficient (see evaluation)
  • 9. Outline
      1. Requirements + Existing Work
      2. Data Structures
      3. Evaluation
  • 10. Hash-Based Index for RDF Data
      [Diagram: logical representation mapped to a physical representation consisting of a dictionary (Dict) and the hash tables S, P, O, SP, PO, SO.]
      ● Dictionary:
          ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
          ● Each hash table contains all ID-encoded triples
          ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005
  • 11. Hash-Based Index for RDF Data
      [Diagram: as on slide 10, with an ID-encoded triple $t_{id} = (id_s, id_p, id_o)$.]
      ● Dictionary:
          ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
          ● Each hash table contains all ID-encoded triples
          ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005
  • 12. Hash-Based Index for RDF Data
      [Diagram: as on slide 11, with a Find for the triple pattern ( ?acq , knows , http://bob.name ) and an ID-encoded triple $t_{id} = (id_s, id_p, id_o)$.]
      ● Dictionary:
          ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
          ● Each hash table contains all ID-encoded triples
          ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005
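The dictionary and the six hash tables described on the slide above can be sketched roughly as follows. This is a minimal illustration, not the SQUIN implementation: the class and method names (`Dictionary`, `TripleIndex`, `encode`, `find`) and the `ex:` terms are made up for this example, and Python dicts stand in for the fixed-size hash tables.

```python
from collections import defaultdict


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._ids = {}     # term -> ID
        self._terms = []   # ID -> term

    def encode(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._terms[term_id]


class TripleIndex:
    """Six hash tables (S, P, O, SP, PO, SO); each holds every ID-encoded triple."""

    KEYS = {"S": (0,), "P": (1,), "O": (2,), "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}

    def __init__(self):
        self.tables = {name: defaultdict(set) for name in self.KEYS}

    def add(self, triple):
        for name, positions in self.KEYS.items():
            self.tables[name][tuple(triple[i] for i in positions)].add(triple)

    def find(self, pattern):
        """pattern is (s, p, o) in ID form; None marks a variable position."""
        bound = [i for i, v in enumerate(pattern) if v is not None]
        if not bound:
            # No bound position: every triple is in some bucket of table S.
            return {t for bucket in self.tables["S"].values() for t in bucket}
        use = bound[:2]  # at most two positions form the lookup key
        name = "".join("SPO"[i] for i in use)
        key = tuple(pattern[i] for i in use)
        candidates = self.tables[name].get(key, ())
        # Filter covers the fully bound case, where the key uses only S and P.
        return {t for t in candidates if all(t[i] == pattern[i] for i in bound)}


d = Dictionary()
index = TripleIndex()
triple = tuple(d.encode(term) for term in ("ex:alice", "ex:knows", "ex:bob"))
index.add(triple)
# Triple pattern (?acq, ex:knows, ex:bob) is answered from the PO table.
matches = index.find((None, d.encode("ex:knows"), d.encode("ex:bob")))
print([tuple(d.decode(i) for i in t) for t in matches])
```

Every pattern with one or two bound positions maps directly to one table, which is what makes six tables sufficient for all pattern types in this sketch.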
  • 13. Individual Indexing
      [Diagram: query-local dataset (logical representation) mapped to a physical representation with one dictionary (Dict) and a separate set of hash tables (S, P, O, SP, PO, SO) per descriptor object.]
      ● Idea: Index each descriptor object separately
      ● Implementation of the four operations:
          ● Add, Remove, and Replace are straightforward
          ● Find requires iterating over all indexes
  • 14. Individual Indexing
      [Diagram: as on slide 13, with a Find for the triple pattern ( ?acq , knows , http://bob.name ).]
      ● Idea: Index each descriptor object separately
      ● Implementation of the four operations:
          ● Add, Remove, and Replace are straightforward
          ● Find requires iterating over all indexes
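A rough sketch of why Find is the expensive operation under individual indexing: each descriptor object has its own index, so a lookup has to visit all of them. The names (`IndividualIndexes`, `matches`) are hypothetical, and a plain set stands in for each per-object group of hash tables.

```python
def matches(triple, pattern):
    """True if an ID-encoded triple fits a pattern; None is a wildcard."""
    return all(p is None or p == v for v, p in zip(triple, pattern))


class IndividualIndexes:
    def __init__(self):
        # descriptor object ID -> index over that object's triples.
        # Assumption: a plain set replaces the six hash tables per object.
        self.indexes = {}

    def add(self, doc_id, triples):
        self.indexes[doc_id] = set(triples)

    def remove(self, doc_id):
        self.indexes.pop(doc_id, None)

    def replace(self, doc_id, triples):
        self.indexes[doc_id] = set(triples)

    def find(self, pattern):
        found = set()
        for index in self.indexes.values():  # one probe per descriptor object
            found.update(t for t in index if matches(t, pattern))
        return found


store = IndividualIndexes()
store.add("d1", [(1, 2, 3), (4, 2, 3)])
store.add("d2", [(1, 2, 3), (5, 6, 7)])
print(store.find((None, 2, 3)))
```

Add, Remove, and Replace touch a single entry, while the cost of `find` grows with the number of descriptor objects.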
  • 15. Combined Indexing
      [Diagram: query-local dataset (logical representation) mapped to a physical representation with one dictionary (Dict) and a single set of hash tables S, P, O, SP, PO, SO.]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs
  • 16. Combined Indexing
      [Diagram: as on slide 15, with an ID-encoded triple $t_{id} = (id_s, id_p, id_o)$.]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs
  • 17. Combined Indexing
      [Diagram: as on slide 16, with $t_{id} = (id_s, id_p, id_o)$ plus $src(t_{id})$, a set of two descriptor objects shown as icons.]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs
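The role of `src` can be illustrated with a small sketch, assuming a single shared index in which only one of the six hash tables (PO) is shown. The class `CombinedIndex` and its methods are illustrative names; the linear scan in `remove` is a simplification made for brevity, not a claim about the actual implementation.

```python
from collections import defaultdict


class CombinedIndex:
    def __init__(self):
        self.po = defaultdict(set)   # (predicate ID, object ID) -> triples
        self.src = defaultdict(set)  # triple -> IDs of descriptor objects containing it

    def add(self, doc_id, triples):
        for t in triples:
            self.po[(t[1], t[2])].add(t)
            self.src[t].add(doc_id)

    def remove(self, doc_id):
        # Simplification: scan src to find the triples of this descriptor object.
        for t in [t for t, docs in self.src.items() if doc_id in docs]:
            self.src[t].discard(doc_id)
            if not self.src[t]:  # no descriptor object holds t any more
                del self.src[t]
                self.po[(t[1], t[2])].discard(t)

    def find_po(self, p, o):
        return set(self.po.get((p, o), ()))


idx = CombinedIndex()
idx.add("d1", [(1, 2, 3)])
idx.add("d2", [(1, 2, 3), (4, 2, 3)])
print(idx.find_po(2, 3))
```

A triple shared by several descriptor objects is stored once; it leaves the index only when the last descriptor object in its `src` set is removed.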
  • 18. Quad Indexing
      [Diagram: query-local dataset (logical representation) mapped to a physical representation with one dictionary (Dict) and a single set of hash tables S, P, O, SP, PO, SO.]
      ● Idea: Use a single quad index for all descriptor objects
      ● quad = ID-encoded triple + descriptor object ID: $q = ((id_s, id_p, id_o), \text{descriptor object ID})$
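As a sketch of the quad variant, the index below stores pairs of an ID-encoded triple and a descriptor object ID instead of keeping a separate `src` map. Only a subject-keyed table is shown, and `QuadIndex` with its methods is a hypothetical name chosen for this example.

```python
from collections import defaultdict


class QuadIndex:
    def __init__(self):
        self.by_s = defaultdict(set)  # subject ID -> quads (triple, descriptor object ID)

    def add(self, doc_id, triples):
        for t in triples:
            self.by_s[t[0]].add((t, doc_id))

    def remove(self, doc_id):
        for key in list(self.by_s):
            self.by_s[key] = {q for q in self.by_s[key] if q[1] != doc_id}
            if not self.by_s[key]:
                del self.by_s[key]

    def find_by_subject(self, s):
        # The same triple may occur in several quads; report it once.
        return {triple for triple, _doc in self.by_s.get(s, ())}


quads = QuadIndex()
quads.add("d1", [(1, 2, 3)])
quads.add("d2", [(1, 2, 3), (1, 5, 6)])
print(quads.find_by_subject(1))
```

Adding a descriptor object only appends quads, with no shared per-triple set to update; the price is that a triple occurring in several descriptor objects is stored several times.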
  • 19. Outline
      1. Requirements + Existing Work
      2. Data Structures
      3. Evaluation
  • 20. Experiment Setup
      Does this affect the overall execution time for link traversal based query executions?
  • 21. Experiment Setup
      Does this affect the overall execution time for link traversal based query executions?
      ● Simulation of the Web of Data
          ● Linked Data server publishes BSBM dataset (scaling factor: 50)
          ● Adjusted BSBM queries link to the simulation server
      ● Experiment:
          ● Sequence of 200 query mixes
          ● Reuse of the query-local dataset for the whole sequence
          ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
          ● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
  • 22. Execution Time
      [Figure: two plots over the sequence of query mixes (0 to 200): the overall number of descriptor objects in the queried dataset, and the execution time in seconds for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12).]
  • 24. Summary
      ● Three hash index based data structures:
          ● Individual indexing
          ● Combined indexing
          ● Quad indexing
      ● Findings:
          ● A single index improves query performance significantly
          ● Smaller load times with quads
      ● Also applicable to other use cases of ad hoc storage of Linked Data that is
          ● Consecutively retrieved from remote sources
          ● Used for immediate local processing
  • 25. Backup Slides
  • 26. Combined Indexing
      [Diagram: query-local dataset (logical representation) mapped to a physical representation with one dictionary (Dict) and a single set of hash tables S, P, O, SP, PO, SO.]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs
      ● Implementation of the four operations requires:
          ● status – maps each descriptor object ID to a status (BeingIndexed, Indexed, ToBeRemoved, BeingRemoved)
          ● Find reports a triple t only if: $\exists d \in src(t) : status(d) = \text{Indexed}$
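The visibility condition on the slide above can be written down directly. In this sketch, `src` and `status` are plain dicts and the function names `is_reported` and `find` are illustrative; how the real implementation transitions between statuses is not covered here.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = "BeingIndexed"
    INDEXED = "Indexed"
    TO_BE_REMOVED = "ToBeRemoved"
    BEING_REMOVED = "BeingRemoved"


def is_reported(triple, src, status):
    """A triple is reported only if some descriptor object in src(t) is Indexed."""
    return any(status.get(d) is Status.INDEXED for d in src.get(triple, ()))


def find(candidates, src, status):
    """Filter the triples matched by the hash tables down to the visible ones."""
    return {t for t in candidates if is_reported(t, src, status)}


src = {(1, 2, 3): {"d1", "d2"}, (4, 5, 6): {"d3"}}
status = {
    "d1": Status.BEING_INDEXED,
    "d2": Status.INDEXED,
    "d3": Status.TO_BE_REMOVED,
}
print(find({(1, 2, 3), (4, 5, 6)}, src, status))
```

Because a descriptor object that is still being added or already being removed never satisfies the condition on its own, concurrent readers see either all of its triples or none of them in this model.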
  • 27. Implementation
      ● Available as Free Software (http://squin.org)
      ● Hash tables with $n = 2^m$ buckets
      ● Hash functions:
          ● $h_S(i_s, i_p, i_o) = i_s \,\&\, \mathrm{bitmask}[m]$
          ● $h_{SP}(i_s, i_p, i_o) = (i_s \bullet i_p) \,\&\, \mathrm{bitmask}[m]$
          ● etc.
      ● Which m?*
          ● m = 4 for the individual indexes
          ● m = 12 for the combined index
          ● m = 12 for the quad index
      * see paper
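A small sketch of the bucket computation, under two assumptions that the slide does not spell out: `bitmask[m]` is taken to be the value with the m lowest bits set (2^m − 1), and the "•" operator is read as integer multiplication. The function names are made up for this example.

```python
def bitmask(m):
    """Assumed meaning of bitmask[m]: the m lowest bits set, i.e. 2**m - 1."""
    return (1 << m) - 1


def h_s(i_s, i_p, i_o, m):
    """Bucket of a triple in hash table S, which has 2**m buckets."""
    return i_s & bitmask(m)


def h_sp(i_s, i_p, i_o, m):
    """Bucket in hash table SP; '•' is assumed to be multiplication."""
    return (i_s * i_p) & bitmask(m)


# m = 4 gives 16 buckets (individual indexes); m = 12 gives 4096 buckets.
print(h_s(37, 5, 9, 4), h_sp(37, 5, 9, 4), bitmask(12) + 1)
```

Masking with a power-of-two bucket count replaces a modulo operation by a single bitwise AND, which is the usual reason for choosing n as a power of two.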
  • 28. Experiment Setup
      ● Comparison without link traversal based query execution
      ● Compared data structures:
          ● Our implementation of IndIR, CombIR, CombQuadIR
          ● NamedGraphSetImpl in NG4J (Jena)
          ● DatasetGraphMap in ARQ (Jena)
      ● Berlin SPARQL Benchmark (BSBM)
          ● BSBM datasets partitioned into query-local datasets
          ● BSBM (v2.0) query mixes executed over these datasets
  • 29. Required Memory
      [Figure: estimated required memory in MB over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12).]
      BSBM scaling factor | number of descriptor objects | overall number of triples
      50                  | 2,599                        | 22,616
      100                 | 4,178                        | 40,133
      150                 | 5,756                        | 57,524
      200                 | 7,329                        | 75,062
      250                 | 9,873                        | 97,613
      300                 | 11,455                       | 115,217
      350                 | 13,954                       | 137,567
      500                 | 18,687                       | 190,502
  • 30. Execution Time
      [Figure: average execution time per query mix in seconds over pc (BSBM scaling factor, 0 to 500) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12). The slide repeats the dataset table of slide 29: BSBM scaling factor, number of descriptor objects, overall number of triples.]
  • 31. Load Time
      [Figure: average creation time in seconds over pc (BSBM scaling factor, 0 to 500) for CombIR (m=12), IndIR (m=4), and CombQuad (m=12). The slide repeats the dataset table of slide 29: BSBM scaling factor, number of descriptor objects, overall number of triples.]
  • 32. Execution Time
      [Figure: execution time in seconds per query mix (0 to 200) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12); an inset shows CombIR (m=12) and CombQuadIR (m=12) for query mixes 0 to 25.]
  • 33. These slides have been created by Olaf Hartig (http://olafhartig.de).
      This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/).