Extending LargeRDFBench for Multi-Source
Data at Scale for SPARQL Endpoint Federation
Ali Hasnain, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo,
Dietrich Rebholz-Schuhmann
Agenda
• SPARQL Endpoints Query Federation
• Federated Queries Benchmark
– Datasets
– Input queries
– Important query features
– Benchmark generation
• LargeRDFBench and extended LargeRDFBench
• Evaluation and results
• Conclusion
SPARQL Endpoints Query Federation
A federation engine answers a single query over multiple SPARQL endpoints (Endpoint 1–4, each exposing an RDF dataset) in four stages:
• Parsing: rewrite the query and extract the individual triple patterns
• Source selection: identify the capable sources for each individual triple pattern (sketched below)
• Optimization: generate an optimized sub-query execution plan
• Execution and integration: execute the sub-queries against the selected endpoints and integrate their results
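To make the source-selection stage concrete, here is a minimal sketch (not the authors' implementation) of index-free source selection that probes each endpoint with one SPARQL ASK request per triple pattern, using the SPARQLWrapper library. The endpoint URLs and triple patterns are illustrative placeholders.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative placeholders, not the benchmark's actual endpoints or queries.
ENDPOINTS = [
    "http://localhost:8890/sparql",
    "http://localhost:8891/sparql",
]
TRIPLE_PATTERNS = [
    "?drug <http://www.w3.org/2000/01/rdf-schema#label> ?label .",
    "?drug <http://www.w3.org/2002/07/owl#sameAs> ?other .",
]

def select_sources(endpoints, triple_patterns):
    """Return the endpoints capable of answering each triple pattern.

    One SPARQL ASK request is sent per (triple pattern, endpoint) pair,
    i.e. the naive, index-free source-selection strategy."""
    relevant = {tp: [] for tp in triple_patterns}
    ask_requests = 0
    for tp in triple_patterns:
        for url in endpoints:
            client = SPARQLWrapper(url)
            client.setQuery(f"ASK {{ {tp} }}")
            client.setReturnFormat(JSON)
            ask_requests += 1
            if client.query().convert()["boolean"]:
                relevant[tp].append(url)
    return relevant, ask_requests

if __name__ == "__main__":
    sources, asks = select_sources(ENDPOINTS, TRIPLE_PATTERNS)
    print(f"{asks} ASK requests sent")  # one of the benchmark's performance metrics
    for tp, urls in sources.items():
        print(tp, "->", urls)
```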
Federated Benchmark Components
• Datasets
• Queries
• Performance metrics
• Execution rules
Important Dataset Features
RDF datasets used in the federation benchmark should vary in:
– Number of triples
– Number of classes
– Number of resources
– Number of properties
– Number of objects
– Average properties per class
– Average instances per class
– Average in-degree and out-degree
– Structuredness or coherence
Important Query Features
SPARQL queries used in the federation benchmark should vary in:
– Number of triple patterns
– Number of join vertices
– Mean join vertex degree
– Number of sources spanned
– Query result set sizes
– Mean triple pattern selectivity
– BGP-restricted triple pattern selectivity
– Join-restricted triple pattern selectivity
– Join vertex types ('star', 'path', 'hybrid', 'sink')
– SPARQL clauses used (e.g., LIMIT, UNION, OPTIONAL, FILTER)
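For intuition, two of these features can be written out as follows; this is a common formulation and may differ in detail from the exact definitions used in the benchmark papers.

```latex
% Mean triple pattern selectivity of a query Q with triple patterns tp_1,...,tp_n
% over a dataset D: the average fraction of D matched by each pattern.
\[
  \overline{\mathrm{sel}}(Q, D) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\mathrm{matches}(tp_i, D)|}{|D|}
\]
% Mean join vertex degree: the average number of triple patterns (hyperedges)
% incident on the join vertices V_J of the query graph.
\[
  \overline{\mathrm{deg}}(Q) = \frac{1}{|V_J|} \sum_{v \in V_J} \deg(v)
\]
```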
SPARQL Queries as Directed Hypergraphs
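In the directed-hypergraph view, every triple pattern is a hyperedge over its subject, predicate, and object vertices, and join vertices are vertices shared by two or more triple patterns. Below is a minimal, illustrative sketch that computes subject/object degrees and flags join vertices for a toy pattern; the vertex-type thresholds are an assumption for illustration, not the paper's exact definition, and predicate vertices are ignored.

```python
from collections import defaultdict

# Toy basic graph pattern: (subject, predicate, object) triple patterns.
BGP = [
    ("?drug", "rdfs:label", "?label"),
    ("?drug", "owl:sameAs", "?dbpediaDrug"),
    ("?dbpediaDrug", "dbo:casNumber", "?cas"),
]

def join_vertices(bgp):
    """Return {vertex: (out_degree, in_degree)} for vertices used by >= 2 patterns."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for s, _p, o in bgp:
        out_deg[s] += 1  # vertex used as a subject (outgoing edge)
        in_deg[o] += 1   # vertex used as an object (incoming edge)
    vertices = set(out_deg) | set(in_deg)
    return {v: (out_deg[v], in_deg[v])
            for v in vertices if out_deg[v] + in_deg[v] >= 2}

def vertex_type(out_degree, in_degree):
    """Rough classification (assumed thresholds): star, sink, path, or hybrid."""
    if out_degree >= 2 and in_degree == 0:
        return "star"
    if in_degree >= 2 and out_degree == 0:
        return "sink"
    if out_degree == 1 and in_degree == 1:
        return "path"
    return "hybrid"

if __name__ == "__main__":
    for v, (o, i) in join_vertices(BGP).items():
        print(v, vertex_type(o, i))  # ?drug -> star, ?dbpediaDrug -> path
```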
Important performance metrics
– Result set completeness and correctness
– Number of sources selected
– Number of SPARQL ASK requests used during source selection
– Source selection time
– Number of endpoint requests
– Number of intermediate results
– Overall query execution time
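Result-set completeness and correctness are usually measured against the gold-standard results shipped with the benchmark, essentially as recall and precision over the returned rows. A minimal sketch with illustrative data:

```python
def completeness_and_correctness(returned, gold):
    """Recall (completeness) and precision (correctness) of a result set,
    comparing result rows as hashable tuples of variable bindings."""
    returned, gold = set(returned), set(gold)
    if not returned or not gold:
        return 0.0, 0.0
    overlap = len(returned & gold)
    return overlap / len(gold), overlap / len(returned)

# Example: the engine misses one expected row and adds one spurious row.
gold = {("db:DB00316",), ("db:DB00945",)}
got = {("db:DB00316",), ("db:DB01234",)}
print(completeness_and_correctness(got, gold))  # (0.5, 0.5)
```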
LargeRDFBench
• SPARQL query federation
benchmark
• 13 interconnected real datasets
– 4 Life sciences
– 6 Cross domain
– 3 Large data
• 32 queries of varying complexities
– 14 simple
– 10 complex
– 8 large data
• Multiple performance metrics
• Contains both SPARQL 1.0 and
SPARQL 1.1 versions of the
same queries
LargeRDFBench Datasets Statistics
Why Extended LargeRDFBench?
• 24 queries span only 2 datasets, i.e., only 2 SPARQL SERVICE clauses are used in the SPARQL 1.1 version of these queries
• Federation engines that optimize
the ordering of SPARQL
SERVICES cannot be fully tested
• Two SERVICE clauses mean only two possible orderings (see the example query below)
• Goal: add more queries which span
over more than two datasets
[Chart: number of relevant sources (#RelevantSources) per LargeRDFBench query, S1–L7]
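For illustration, this is what a SPARQL 1.1 federated query with two SERVICE clauses looks like when executed from Python with SPARQLWrapper; the endpoint URLs and predicates are placeholders, not an actual benchmark query, and the dispatching endpoint is assumed to support SPARQL 1.1 federation. With only two SERVICE clauses, an engine can do no more than swap their order.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoints and predicates; the benchmark queries use the
# LargeRDFBench datasets (DrugBank, DBpedia, LinkedTCGA, ...).
FEDERATED_QUERY = """
SELECT ?drug ?label WHERE {
  SERVICE <http://localhost:8890/sparql> {   # e.g. a life-sciences endpoint
    ?drug <http://www.w3.org/2002/07/owl#sameAs> ?dbpediaDrug .
  }
  SERVICE <http://localhost:8891/sparql> {   # e.g. a cross-domain endpoint
    ?dbpediaDrug <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  }
}
"""

client = SPARQLWrapper("http://localhost:8890/sparql")  # endpoint dispatching the SERVICE calls
client.setQuery(FEDERATED_QUERY)
client.setReturnFormat(JSON)
for row in client.query().convert()["results"]["bindings"]:
    print(row["drug"]["value"], row["label"]["value"])
```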
Extended LargeRDFBench
• SPARQL query federation benchmark
• 13 interconnected real datasets
– 4 Life sciences
– 6 Cross domain
– 3 Large data
• 40 queries of varying complexities
– 14 simple
– 10 complex
– 8 large data
– 8 Complex+High sources (CH1-CH8)
• Multiple performance metrics
• Contains both SPARQL 1.0 and
SPARQL 1.1 versions of the same
queries
Complex+High Data Sources Queries
• CH1:
– 4 LargeRDFBench data sources involved
– Total triple patterns = 16, result size = 384
– High mean join vertex degree
– High number of incoming and outgoing edges of a
join node
– High number of triple patterns in a single SERVICE
– Runtime for this query ranges from 1 second (CostFed) to over 1 hour (SemaGrow)
Complex+High Data Sources Queries
• CH2:
– 4 LargeRDFBench data sources involved
– Total triple patterns = 10, result size = 840
– Low mean triple pattern selectivity (i.e., 0.00005) and high BGP-restricted (i.e., 0.2595) and join-restricted (i.e., 0.58115) triple pattern selectivities
– Individually, the triple patterns of this query are selective, i.e., they produce small result sizes
– They become less selective when involved in joins with other triple patterns
– Choosing a join order that quickly converges to small intermediate result sizes is crucial
– The FILTER clause combined with REGEX makes this query particularly selective
Complex+High Data Sources Queries
• CH3:
– 5 LargeRDFBench data sources involved
– Triple patterns = 11 and result size = 48
– High number of join vertices (i.e., 7 from 11 triple patterns),
moderate join vertex degree (i.e., 2.71), and low mean BGP-
restricted triple pattern selectivity (i.e., 0.0196)
– Mixed values for the important query features, which can challenge the federation engines
Complex+High Data Sources Queries
• CH4:
– 6 LargeRDFBench data sources involved
– Total number of triple patterns = 12, result size = 1248
– High number of join vertices (i.e., 10 join vertices from 12 triple patterns)
– Join order optimisation is challenging due to the high number of joins relative to the number of triple patterns
Complex+High Data Sources Queries
• CH5:
– 7 LargeRDFBench data sources involved
– Triple patterns = 18, result size = 5 using LIMIT clause
– Several bound subjects and objects in the triple patterns
– 4 triple patterns with subject bound and 2 triple patterns with object bound
– One triple pattern contains an unbound predicate
– Challenging to accurately estimate the triple pattern as well as the join cardinalities
Complex+High Data Sources Queries
• CH6:
– 8 LargeRDFBench data sources involved, along with bound subjects and objects
– Triple patterns = 24, result size = 16
– Low mean triple pattern selectivity (i.e., 0.00002) and high BGP-restricted (i.e., 0.2522) and join-restricted (i.e., 0.3186) triple pattern selectivities
Complex+High Data Sources Queries
• CH7:
– 9 LargeRDFBench data sources involved
– Triple patterns = 21, result size = 775 using LIMIT clause
– There are a total of 14 join nodes in this query with 5 Star, 3 Path, 4
Sink, and 2 Hybrid join nodes
– With this many join nodes, join ordering is not a trivial task
Complex+High Data Sources Queries
• CH8:
– 10 LargeRDFBench data sources involved and contains OPTIONAL,
FILTER, and LIMIT
– Highest number of triple patterns = 33, result size = 1
– 19 join nodes in this query with 7 Star, 4 Path, 6 Sink, and 2 Hybrid join
nodes
– None of the federation engines is able to execute this query within the
limit of 1 hour
Evaluation Results
Hard to rank the federation engines due to the many runtime errors (RE), zero results (ZR), and timeouts (TO)
Instability exposed
This work was supported by grants from the EU H2020 Framework Programme
provided for the project HOBBIT (GA no. 688227).
Thanks
Additional Slides
Key Findings
• Queries in the extended LargeRDFBench can be extremely costly or can be executed extremely fast when a properly optimised query plan is selected.
• However, the number of timeouts and runtime errors suggests that choosing optimised query plans for these queries is not a trivial task.
• The results revealed that FedX, CostFed, and ANAPSID can return incomplete or zero results.
Why extended LargeRDFBench
• In LargeRDFBench, 24 out of the total 32 queries require only 2 data sources to get the complete result set.
• Federation engines which optimise the ordering of the execution of
SPARQL SERVICES in federated SPARQL 1.1 queries (e.g. QuWeDa)
cannot be fully tested with existing LargeRDFBench queries.
• This is because if there are only two SERVICES used in the query, there are only two possible orderings of their execution (see the sketch below).
• The goal of this extension was to fill this gap by adding more federated
queries which require more data sources to get the complete result set
of the query.
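As a quick illustration of why more SERVICE clauses make ordering harder, the number of possible (left-deep) execution orders grows factorially with the number of SERVICE clauses:

```python
from math import factorial

# Possible execution orders for n SERVICE clauses, ignoring bushy plans.
for n in (2, 4, 7, 10):  # 10 is the largest source count in the CH queries
    print(n, "SERVICE clauses ->", factorial(n), "orderings")
# 2 -> 2, 4 -> 24, 7 -> 5040, 10 -> 3628800
```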
Number of Triple Patterns
[Chart: number of triple patterns per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 6.15 (LargeRDFBench)
Number of Join Vertices
[Chart: number of join vertices per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 3.63 (LargeRDFBench)
Mean Join Vertex Degree
[Chart: mean join vertex degree per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 6.15 (LargeRDFBench)
Number of Results
[Chart, log scale: result set size per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 2397 (FedBench) vs. ± 104236 (LargeRDFBench)
Mean Triple Pattern Selectivity
[Chart: mean triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.11 (FedBench) vs. ± 0.13 (LargeRDFBench)
Mean BGP-Restricted Triple Pattern Selectivity
[Chart: mean BGP-restricted triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.31 (FedBench) vs. ± 0.22 (LargeRDFBench)
Mean Join-Restricted Triple Pattern Selectivity
[Chart, log scale: mean join-restricted triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.13 (FedBench) vs. ± 0.15 (LargeRDFBench)
