Distributed Query Processing for
Federated RDF Data Management
Olaf Görlitz
07.11.2014
Slide 2
The Linked Open Data Cloud
Use as one large database!
Slide 3
Life Science Scenario
Find drugs for
nutritional supplementation
SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
Slide 4
Linked Data Querying Paradigms
Data Warehouse
Link Traversal
Federation
Slide 5
Linked Data Querying Paradigms
Requirements (compared for: Data Warehouse, Link Traversal, Federation)
Query Expressiveness
Schema Mapping
Data Freshness
Result Completeness
Scalability
Flexibility
Availability
Performance
Slide 6
Contributions
Large Scale Information Retrieval
● PINTS: Peer-to-Peer Statistics Management
● Görlitz, Sizov, Staab: PINTS: Peer-to-Peer Infrastructure for Tagging Systems. IPTPS'08
RDF Federation & Query Optimization
● SPLENDID: Distributed SPARQL Query Processing
● Görlitz, Staab: SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD'11
Benchmarking RDF Federation Systems
● SPLODGE: Linked Data Query Generation
● Görlitz, Thimm, Staab: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. ISWC'12
Slide 7
SPLENDID Federation
Federated Databases
● Relational Schema
● Specific Data Wrappers
● Rich Data Statistics
Federated RDF
● Implicit Schema, Ontologies
● SPARQL endpoints
● Limited Statistics (voiD)
Execute complex SPARQL queries
over federated RDF data sources
Slide 8
SPLENDID Federation
SPARQL Query → Source Selection → Query Optimization → Query Execution
SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug purl:title ?title .
}
⋈?drug
⋈?id
⋈?keggDrug
⋈?keggDrug
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title
Slide 9
Source Selection Objectives
SPARQL Query → Source Selection → Query Optimization → Query Execution
Determine all relevant data sources
DARQ
● Explicit 'capabilities'
● Query restrictions (bound predicates)
FedX
● ASK queries + caching (many initial requests)
● Sub query aggregation
SPLENDID
● VoiD descriptions + ASK queries
● Sub query aggregation
Slide 10
Source Selection Example
[Figure: data sources, each annotated with a voiD description]
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
→ KEGG, DBpedia, ChEBI
→ KEGG
→ DrugBank
SPARQL ASK
→ DrugBank, ChEBI
→ KEGG
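The two-step idea on this slide (look up candidate sources in the voiD statistics, then confirm with SPARQL ASK where the index cannot decide) can be sketched as follows. This is a minimal illustration, not the SPLENDID implementation: the function `select_sources`, the index contents, and the `fake_ask` stub are all hypothetical, and a real system would send the ASK query to the endpoint over HTTP.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# A triple pattern; None marks a variable position.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

RDF_TYPE = "rdf:type"


def select_sources(
    pattern: Pattern,
    predicate_index: Dict[str, Set[str]],
    type_index: Dict[str, Set[str]],
    all_sources: Set[str],
    ask: Callable[[str, Pattern], bool],
) -> Set[str]:
    """Return the sources that may contribute results for one triple pattern."""
    s, p, o = pattern
    used_type_index = p == RDF_TYPE and o is not None
    # Step 1: candidate sources from the voiD-based index.
    if p is None:
        candidates = set(all_sources)  # unbound predicate: the index cannot help
    elif used_type_index:
        candidates = set(type_index.get(o, set()))
    else:
        candidates = set(predicate_index.get(p, set()))
    # Step 2: a bound subject/object not covered by the index is verified with ASK.
    if s is not None or (o is not None and not used_type_index):
        candidates = {src for src in candidates if ask(src, pattern)}
    return candidates


# Hypothetical index contents and ASK answers, for illustration only.
PREDICATE_INDEX = {
    "drugbank:drugCategory": {"DrugBank", "ChEBI"},
    "purl:title": {"KEGG", "DBpedia", "ChEBI"},
}
TYPE_INDEX = {"kegg:Drug": {"KEGG"}}
ALL_SOURCES = {"DrugBank", "KEGG", "DBpedia", "ChEBI"}


def fake_ask(source: str, pattern: Pattern) -> bool:
    # Stand-in for sending "ASK { pattern }" to the endpoint of `source`.
    return source == "DrugBank"


category_sources = select_sources(
    (None, "drugbank:drugCategory", "category:micronutrient"),
    PREDICATE_INDEX, TYPE_INDEX, ALL_SOURCES, fake_ask)
print(category_sources)
```

Patterns with only a bound predicate never trigger an ASK request in this sketch, which is where the index saves requests compared with asking every endpoint about every pattern.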
Slide 11
Source Selection Result
⋈?drug
⋈?id
⋈?keggDrug
⋈?keggDrug
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title
Slide 12
Query Optimization
SPARQL Query → Source Selection → Query Optimization → Query Execution
Find best (fastest) query execution plan
DARQ
● Dynamic Programming
● Custom Statistics
● Only bound predicates
● Bind Join
FedX
● Join Order Heuristics
● No Statistics
● Join Chains
● Bind Join
SPLENDID
● Dynamic Programming
● Extended voiD statistics
● Bind + Hash Join
Slide 13
Dynamic Programming
● iterate over all possible execution plans
● compare cost (execution time)
Join implementations: BindJoin, HashJoin
⋈?drug
⋈?id
⋈?keggDrug
⋈?keggDrug
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title
Cost Model
$\mathit{cost}_{\text{send-query}}$
$\mathit{cost}_{\text{receive-tuple}}$
$\mathit{card}(R(q_i))$
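A rough sketch of how dynamic programming can enumerate join plans and keep the cheapest one per set of triple patterns. Everything here is an assumption made for illustration: the constants `COST_QUERY` and `COST_TUPLE`, the single uniform join selectivity, the rule that a bind join is only tried when the right input is a single pattern, and the fact that every plan is charged one query for fetching its leftmost input. It is not the optimizer used in SPLENDID.

```python
from itertools import combinations
from typing import Dict, FrozenSet, NamedTuple, Union

COST_QUERY = 100.0  # assumed cost of sending one (sub-)query to an endpoint
COST_TUPLE = 1.0    # assumed cost of receiving one result tuple


class Plan(NamedTuple):
    tree: Union[str, tuple]  # pattern name, or (join_type, left_tree, right_tree)
    card: float              # estimated number of result tuples
    cost: float              # estimated total cost


def best_plan(cards: Dict[str, float], join_sel: float) -> Plan:
    """Bottom-up dynamic programming over all subsets of triple patterns."""
    names = sorted(cards)
    # Base case: one query per pattern, plus one unit per received tuple.
    best: Dict[FrozenSet[str], Plan] = {
        frozenset([n]): Plan(n, cards[n], COST_QUERY + cards[n] * COST_TUPLE)
        for n in names
    }
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            key = frozenset(subset)
            for k in range(1, size):
                for left_names in combinations(subset, k):
                    right_key = key - frozenset(left_names)
                    left, right = best[frozenset(left_names)], best[right_key]
                    out_card = left.card * right.card * join_sel
                    # Hash join: both inputs are fetched independently, joined locally.
                    candidates = [Plan(("hash", left.tree, right.tree),
                                       out_card, left.cost + right.cost)]
                    # Bind join: one query per left tuple, only matching tuples return.
                    if len(right_key) == 1:
                        bind_cost = (left.cost + left.card * COST_QUERY
                                     + out_card * COST_TUPLE)
                        candidates.append(Plan(("bind", left.tree, right.tree),
                                               out_card, bind_cost))
                    for cand in candidates:
                        if key not in best or cand.cost < best[key].cost:
                            best[key] = cand
    return best[frozenset(names)]


# Hypothetical cardinalities for three triple patterns.
plan = best_plan({"p1": 10.0, "p2": 1000.0, "p3": 5000.0}, join_sel=0.001)
print(plan.tree, round(plan.cost, 1))
```

With these made-up numbers the small result of `p1` makes bind joins attractive, because few bound queries are sent and the large results of `p2` and `p3` are never transferred in full.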
Slide 14
Cardinality Estimation
⋈?drug
⋈?id
⋈?keggDrug
⋈?keggDrug
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title
Slide 15
Cardinality Estimation (Triple Pattern)
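$\mathit{card}_d(s,p,o) = |d| \cdot \mathit{sel}_d(s) \cdot \mathit{sel}_d(p) \cdot \mathit{sel}_d(o), \quad d \in D$
Assuming independence of s, p, o
$\mathit{card}_d(?,?,?) = \mathit{voiD}_d \to |d|$
$\mathit{card}_d(s,p,o) = 1$
$\mathit{card}_d(?,p,?) = \mathit{voiD}_d \to p$
$\mathit{card}_d(s,?,?) = \frac{\mathit{voiD}_d \to |d|}{\mathit{voiD}_d \to |s|}$
$\mathit{card}_d(?,?,o) = \frac{\mathit{voiD}_d \to |d|}{\mathit{voiD}_d \to |o|}$
$\mathit{card}_d(s,?,o) = 1$
$\mathit{card}_d(s,p,?) = \frac{\mathit{voiD}_d \to p}{\mathit{voiD}_d \to |sp|}$
$\mathit{card}_d(?,p,o) = \frac{\mathit{voiD}_d \to p}{\mathit{voiD}_d \to |op|}$
$\mathit{card}_d(?,\text{rdf:type},T) = \mathit{voiD}_d \to T$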
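The case distinction above can be written down directly as a lookup over voiD counts. The sketch below is illustrative only: the `VoidStats` container and its field names are hypothetical, the triple, type, and predicate counts are the ones shown on the voiD statistics backup slide, and the distinct subject/object counts are made-up values because the slides do not give them.

```python
from dataclasses import dataclass
from typing import Dict, Optional

RDF_TYPE = "rdf:type"


@dataclass
class VoidStats:
    triples: int                            # |d|
    distinct_subjects: int                  # |s|
    distinct_objects: int                   # |o|
    predicate_count: Dict[str, int]         # p -> number of triples with predicate p
    type_count: Dict[str, int]              # T -> number of instances of type T
    subjects_per_predicate: Dict[str, int]  # p -> |sp|, distinct subjects using p
    objects_per_predicate: Dict[str, int]   # p -> |op|, distinct objects using p


def card(stats: VoidStats, s: Optional[str], p: Optional[str],
         o: Optional[str]) -> float:
    """Estimated result size of one triple pattern; None marks a variable."""
    if s is not None and o is not None:
        return 1.0                          # card(s,p,o) and card(s,?,o)
    if p == RDF_TYPE and o is not None:
        return float(stats.type_count.get(o, 0))
    if p is None:
        if s is not None:
            return stats.triples / stats.distinct_subjects
        if o is not None:
            return stats.triples / stats.distinct_objects
        return float(stats.triples)
    n = stats.predicate_count.get(p, 0)
    if s is not None:
        return n / max(stats.subjects_per_predicate.get(p, 1), 1)
    if o is not None:
        return n / max(stats.objects_per_predicate.get(p, 1), 1)
    return float(n)


stats = VoidStats(
    triples=732744,
    predicate_count={"bio:formula": 39555},
    type_count={"chebi:Compound": 50477},
    distinct_subjects=50477,                        # hypothetical
    distinct_objects=100000,                        # hypothetical
    subjects_per_predicate={"bio:formula": 39555},  # hypothetical
    objects_per_predicate={"bio:formula": 20000},   # hypothetical
)
print(card(stats, None, "bio:formula", None))
```

Patterns with a bound subject and a bound object are estimated as a single result regardless of the predicate, mirroring the two "= 1" cases on the slide.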
Slide 16
Cardinality Estimation (Basic Graph Pattern)
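Star Pattern
[Figure: ?keggDrug with edges rdf:type → kegg:Drug, bio2rdf:xRef → rn:R01786, purl:title → ?title]
$\mathit{card}^{*}_d(P_1 \bowtie P_2 \bowtie P_3) = \min(\mathit{card}_d(P_1), \mathit{card}_d(P_2)) \cdot \frac{\mathit{voiD}_d \to p_3}{\mathit{voiD}_d \to |sp_3|}$
Path Pattern
[Figure: ?drug with edges rdf:type → drugbank:Drug and owl:sameAs → ?keggDrug; ?keggDrug with edge rdf:type → kegg:Drug]
$\mathit{card}^{\sim}_{d,d'}(P_1 \bowtie P_2) = \mathit{card}_d(P_1) \cdot \mathit{card}_{d'}(P_2) \cdot \mathit{sel}_{d,d'}(P_1 \bowtie P_2)$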
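A small numeric sketch of the two estimates may help when reading the formulas. The function names and all input numbers are hypothetical. The star estimate takes the smaller of the two bound patterns and multiplies it by the average number of triples per subject for the third predicate; the path estimate multiplies both input cardinalities by a join selectivity.

```python
def star_card(card_p1: float, card_p2: float,
              p3_triples: int, p3_distinct_subjects: int) -> float:
    """Star join on a shared subject: min(card(P1), card(P2)) * p3 / |sp3|."""
    return min(card_p1, card_p2) * p3_triples / p3_distinct_subjects


def path_card(card_p1: float, card_p2: float, join_selectivity: float) -> float:
    """Path join across two sources: card(P1) * card(P2) * sel(P1 join P2)."""
    return card_p1 * card_p2 * join_selectivity


# Hypothetical inputs: P1 matches 100 subjects, P2 matches 20, and the third
# predicate has 3000 triples spread over 1500 distinct subjects.
star_estimate = star_card(100, 20, 3000, 1500)
path_estimate = path_card(100, 50, 0.01)
print(star_estimate, path_estimate)
```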
Slide 17
Query Optimization
SPARQL Query → Source Selection → Query Optimization → Query Execution
⋈?drug
⋈B(?id)
⋈?keggDrug
⋈H(?keggDrug)
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title
Slide 18
Evaluation Methodology
Compare with state-of-the-art federation systems
– Use multiple linked datasets
– With representative characteristics
– Execute 'typical' SPARQL queries
– In a reproducible benchmark setup
FedBench
Slide 19
Evaluation Results
Slide 20
Conclusion
● Federation for Linked Open Data
– Database + Semantic Web technology
– Efficient Distributed Query Processing
– Extension of voiD statistics
● Query generation for Federation Benchmarks
● Efficient statistics management in P2P networks
Slide 21
Thank You
Slide 22
VoiD Descriptions/Statistics
General Information
Basic statistics: triples = 732744
Type statistics: chebi:Compound = 50477
Predicate statistics: bio:formula = 39555
Slide 23
VoiD statistics extension
Slide 24
State of the Art
                     DARQ                      AliBaba        FedX                            SPLENDID
Statistics           ServiceDesc               –              –                               VoiD
Source Selection     Statistics (predicates)   All sources    ASK queries                     Statistics + ASK queries
Query Optimization   DynProg                   Heuristics     Heuristics                      DynProg
Query Execution      Bind join                 Bind join      Bound Join + parallelization    Bind Join + Hash Join
Slide 25
SPARQL limitations
● Query protocol
● Only SPARQL endpoints
● Endpoint limitations
– SPARQL version
– Result size
– Data rate
– Availability
Slide 26
Join Implementation
Bind Join: R1 ⋈B R2
Hash Join: R1 ⋈H R2
?id ?y
1 42
2 13
3 20
4 50
5 3
?id ?x
1 'A'
1 'G'
4 'A'
7 'A'
7 'C'
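The two join strategies can be tried out on the example relations above. In this sketch it is an assumption that the (?id, ?y) table is R1 and the (?id, ?x) table is R2; `query_r2` is a hypothetical stand-in for a remote sub-query sent with ?id already bound, and a real bind join would issue it against a SPARQL endpoint.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]

# Example relations from the slide (assignment to R1/R2 is assumed).
R1: List[Row] = [{"id": 1, "y": 42}, {"id": 2, "y": 13}, {"id": 3, "y": 20},
                 {"id": 4, "y": 50}, {"id": 5, "y": 3}]
R2: List[Row] = [{"id": 1, "x": "A"}, {"id": 1, "x": "G"}, {"id": 4, "x": "A"},
                 {"id": 7, "x": "A"}, {"id": 7, "x": "C"}]


def hash_join(left: List[Row], right: List[Row], var: str) -> List[Row]:
    """Fetch both inputs completely, then join locally via a hash table."""
    table = defaultdict(list)
    for row in left:
        table[row[var]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[var], [])]


def bind_join(left: List[Row], fetch_right: Callable[[object], List[Row]],
              var: str) -> List[Row]:
    """For every left tuple, ask the right source only for matching tuples."""
    result = []
    for l in left:
        for r in fetch_right(l[var]):
            result.append({**l, **r})
    return result


def query_r2(id_value: object) -> List[Row]:
    # Stand-in for a remote sub-query with ?id bound to id_value.
    return [r for r in R2 if r["id"] == id_value]


hash_result = hash_join(R1, R2, "id")
bind_result = bind_join(R1, query_r2, "id")
print(hash_result)
```

Both strategies return the same three joined rows; they differ in how many requests are sent and how many tuples cross the network.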
Slide 27
Join Cost Model
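Bind Join: R(q1) ⋈B R(q2')
$\mathit{cost}_{\bowtie_B}(q_1, q_2) = |R(q_1)| \cdot \mathit{cost}_{tuple} + |R(q_1)| \cdot \mathit{cost}_{query} + |R(q_2')| \cdot \mathit{cost}_{tuple}$
Hash Join: R(q1) ⋈H R(q2)
$\mathit{cost}_{\bowtie_H}(q_1, q_2) = |R(q_1)| \cdot \mathit{cost}_{tuple} + |R(q_2)| \cdot \mathit{cost}_{tuple} + 2 \cdot \mathit{cost}_{query}$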
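The two cost formulas translate directly into code. The sketch below only restates them; the function names, the `cheaper_join` helper, and all numbers in the example are hypothetical.

```python
def bind_join_cost(card_q1: float, card_q2_bound: float,
                   cost_query: float, cost_tuple: float) -> float:
    """|R(q1)|*cost_tuple + |R(q1)|*cost_query + |R(q2')|*cost_tuple."""
    return card_q1 * cost_tuple + card_q1 * cost_query + card_q2_bound * cost_tuple


def hash_join_cost(card_q1: float, card_q2: float,
                   cost_query: float, cost_tuple: float) -> float:
    """|R(q1)|*cost_tuple + |R(q2)|*cost_tuple + 2*cost_query."""
    return card_q1 * cost_tuple + card_q2 * cost_tuple + 2 * cost_query


def cheaper_join(card_q1: float, card_q2: float, card_q2_bound: float,
                 cost_query: float, cost_tuple: float) -> str:
    """Pick the join implementation with the lower estimated cost."""
    bind = bind_join_cost(card_q1, card_q2_bound, cost_query, cost_tuple)
    hashed = hash_join_cost(card_q1, card_q2, cost_query, cost_tuple)
    return "bind" if bind < hashed else "hash"


# Hypothetical numbers: a small left input favours the bind join,
# a large left input makes one query per tuple too expensive.
small_left = cheaper_join(10, 5000, 30, cost_query=100, cost_tuple=1)
large_left = cheaper_join(1000, 2000, 500, cost_query=100, cost_tuple=1)
print(small_left, large_left)
```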
Slide 28
SPARQL Semi Join
Slide 29
SPLENDID Architecture
Slide 30
FedBench Datasets
● Cross Domain
● Life Science
● Linked Data
Slide 31
Data Source Selection: Requests
Slide 32
Conclusion
Linked Open Data voiD
Web-scale Query Processing
SPLENDID

