Extending LargeRDFBench for Multi-Source
Data at Scale for SPARQL Endpoint Federation
Ali Hasnain, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo,
Dietrich Rebholz-Schuhmann
Agenda
• SPARQL Endpoints Query Federation
• Federated Queries Benchmark
– Datasets
– Input queries
– Important query features
– Benchmark generation
• LargeRDFBench and extended LargeRDFBench
• Evaluation and results
• Conclusion
SPARQL Endpoints Query Federation
A federation engine answers a single query over multiple SPARQL endpoints (Endpoint 1–4, each exposing an RDF dataset) in four stages:
• Parsing: rewrite the query and extract the individual triple patterns
• Source selection: identify the capable sources for each individual triple pattern (sketched below)
• Optimization: generate an optimized sub-query execution plan
• Execution and integration: execute the sub-queries against the selected endpoints and integrate their results
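To make the source-selection stage concrete, here is a minimal sketch (not the authors' implementation) of index-free source selection that probes each endpoint with one SPARQL ASK request per triple pattern, using the SPARQLWrapper library. The endpoint URLs and triple patterns are illustrative placeholders.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative placeholders, not the benchmark's actual endpoints or queries.
ENDPOINTS = [
    "http://localhost:8890/sparql",
    "http://localhost:8891/sparql",
]
TRIPLE_PATTERNS = [
    "?drug <http://www.w3.org/2000/01/rdf-schema#label> ?label .",
    "?drug <http://www.w3.org/2002/07/owl#sameAs> ?other .",
]

def select_sources(endpoints, triple_patterns):
    """Return the endpoints capable of answering each triple pattern.

    One SPARQL ASK request is sent per (triple pattern, endpoint) pair,
    i.e. the naive, index-free source-selection strategy."""
    relevant = {tp: [] for tp in triple_patterns}
    ask_requests = 0
    for tp in triple_patterns:
        for url in endpoints:
            client = SPARQLWrapper(url)
            client.setQuery(f"ASK {{ {tp} }}")
            client.setReturnFormat(JSON)
            ask_requests += 1
            if client.query().convert()["boolean"]:
                relevant[tp].append(url)
    return relevant, ask_requests

if __name__ == "__main__":
    sources, asks = select_sources(ENDPOINTS, TRIPLE_PATTERNS)
    print(f"{asks} ASK requests sent")  # one of the benchmark's performance metrics
    for tp, urls in sources.items():
        print(tp, "->", urls)
```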
Federated Benchmark Components
• Datasets
• Queries
• Performance metrics
• Execution rules
Important Dataset Features
RDF datasets used in the federation benchmark should vary in:
– Number of triples
– Number of classes
– Number of resources
– Number of properties
– Number of objects
– Average properties per class
– Average instances per class
– Average in-degree and out-degree
– Structuredness or coherence
Important Query Features
SPARQL queries used in the federation benchmark should vary in:
– Number of triple patterns
– Number of join vertices
– Mean join vertex degree
– Number of sources spanned
– Query result set sizes
– Mean triple pattern selectivity
– BGP-restricted triple pattern selectivity
– Join-restricted triple pattern selectivity
– Join vertex types ('star', 'path', 'hybrid', 'sink')
– SPARQL clauses used (e.g., LIMIT, UNION, OPTIONAL, FILTER)
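For intuition, two of these features can be written out as follows; this is a common formulation and may differ in detail from the exact definitions used in the benchmark papers.

```latex
% Mean triple pattern selectivity of a query Q with triple patterns tp_1,...,tp_n
% over a dataset D: the average fraction of D matched by each pattern.
\[
  \overline{\mathrm{sel}}(Q, D) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\mathrm{matches}(tp_i, D)|}{|D|}
\]
% Mean join vertex degree: the average number of triple patterns (hyperedges)
% incident on the join vertices V_J of the query graph.
\[
  \overline{\mathrm{deg}}(Q) = \frac{1}{|V_J|} \sum_{v \in V_J} \deg(v)
\]
```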
SPARQL Queries as Directed Hypergraphs
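In the directed-hypergraph view, every triple pattern is a hyperedge over its subject, predicate, and object vertices, and join vertices are vertices shared by two or more triple patterns. Below is a minimal, illustrative sketch that computes subject/object degrees and flags join vertices for a toy pattern; the vertex-type thresholds are an assumption for illustration, not the paper's exact definition, and predicate vertices are ignored.

```python
from collections import defaultdict

# Toy basic graph pattern: (subject, predicate, object) triple patterns.
BGP = [
    ("?drug", "rdfs:label", "?label"),
    ("?drug", "owl:sameAs", "?dbpediaDrug"),
    ("?dbpediaDrug", "dbo:casNumber", "?cas"),
]

def join_vertices(bgp):
    """Return {vertex: (out_degree, in_degree)} for vertices used by >= 2 patterns."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for s, _p, o in bgp:
        out_deg[s] += 1  # vertex used as a subject (outgoing edge)
        in_deg[o] += 1   # vertex used as an object (incoming edge)
    vertices = set(out_deg) | set(in_deg)
    return {v: (out_deg[v], in_deg[v])
            for v in vertices if out_deg[v] + in_deg[v] >= 2}

def vertex_type(out_degree, in_degree):
    """Rough classification (assumed thresholds): star, sink, path, or hybrid."""
    if out_degree >= 2 and in_degree == 0:
        return "star"
    if in_degree >= 2 and out_degree == 0:
        return "sink"
    if out_degree == 1 and in_degree == 1:
        return "path"
    return "hybrid"

if __name__ == "__main__":
    for v, (o, i) in join_vertices(BGP).items():
        print(v, vertex_type(o, i))  # ?drug -> star, ?dbpediaDrug -> path
```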
Important performance metrics
– Result set completeness and correctness
– Number of sources selected
– Number of SPARQL ASK requests used during source selection
– Source selection time
– Number of endpoint requests
– Number of intermediate results
– Overall query execution time
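Result-set completeness and correctness are usually measured against the gold-standard results shipped with the benchmark, essentially as recall and precision over the returned rows. A minimal sketch with illustrative data:

```python
def completeness_and_correctness(returned, gold):
    """Recall (completeness) and precision (correctness) of a result set,
    comparing result rows as hashable tuples of variable bindings."""
    returned, gold = set(returned), set(gold)
    if not returned or not gold:
        return 0.0, 0.0
    overlap = len(returned & gold)
    return overlap / len(gold), overlap / len(returned)

# Example: the engine misses one expected row and adds one spurious row.
gold = {("db:DB00316",), ("db:DB00945",)}
got = {("db:DB00316",), ("db:DB01234",)}
print(completeness_and_correctness(got, gold))  # (0.5, 0.5)
```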
LargeRDFBench
• SPARQL query federation
benchmark
• 13 interconnected real datasets
– 4 Life sciences
– 6 Cross domain
– 3 Large data
• 32 queries of varying complexities
– 14 simple
– 10 complex
– 8 large data
• Multiple performance metrics
• Contains both SPARQL 1.0 and
SPARQL 1.1 versions of the
same queries
LargeRDFBench Datasets Statistics
Why Extended LargeRDFBench?
• 24 queries span only 2 datasets, i.e., only 2 SPARQL SERVICE clauses are used in the SPARQL 1.1 version of these queries
• Federation engines that optimize
the ordering of SPARQL
SERVICES cannot be fully tested
• Two SERVICE clauses mean only two possible orderings (see the example query below)
• Goal: add more queries which span
over more than two datasets
[Chart: number of relevant sources (#RelevantSources) per LargeRDFBench query, S1–L7]
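For illustration, this is what a SPARQL 1.1 federated query with two SERVICE clauses looks like when executed from Python with SPARQLWrapper; the endpoint URLs and predicates are placeholders, not an actual benchmark query, and the dispatching endpoint is assumed to support SPARQL 1.1 federation. With only two SERVICE clauses, an engine can do no more than swap their order.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoints and predicates; the benchmark queries use the
# LargeRDFBench datasets (DrugBank, DBpedia, LinkedTCGA, ...).
FEDERATED_QUERY = """
SELECT ?drug ?label WHERE {
  SERVICE <http://localhost:8890/sparql> {   # e.g. a life-sciences endpoint
    ?drug <http://www.w3.org/2002/07/owl#sameAs> ?dbpediaDrug .
  }
  SERVICE <http://localhost:8891/sparql> {   # e.g. a cross-domain endpoint
    ?dbpediaDrug <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  }
}
"""

client = SPARQLWrapper("http://localhost:8890/sparql")  # endpoint dispatching the SERVICE calls
client.setQuery(FEDERATED_QUERY)
client.setReturnFormat(JSON)
for row in client.query().convert()["results"]["bindings"]:
    print(row["drug"]["value"], row["label"]["value"])
```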
Extended LargeRDFBench
• SPARQL query federation benchmark
• 13 interconnected real datasets
– 4 Life sciences
– 6 Cross domain
– 3 Large data
• 40 queries of varying complexities
– 14 simple
– 10 complex
– 8 large data
– 8 Complex+High sources (CH1-CH8)
• Multiple performance metrics
• Contains both SPARQL 1.0 and
SPARQL 1.1 versions of the same
queries
Complex+High Data Sources Queries
• CH1:
– 4 LargeRDFBench data sources involved
– Total triple patterns = 16, result size = 384
– High mean join vertex degree
– High number of incoming and outgoing edges of a
join node
– High number of triple patterns in a single SERVICE
– Runtime for this query ranges from 1 second (CostFed) to over 1 hour (SemaGrow)
Complex+High Data Sources Queries
• CH2:
– 4 LargeRDFBench data sources involved
– Total triple patterns = 10, result size = 840
– Low mean triple pattern selectivity (i.e., 0.00005) and high BGP-restricted (i.e., 0.2595) and join-restricted (i.e., 0.58115) triple pattern selectivities
– Individually, the triple patterns of this query are selective, i.e., they produce small result sizes
– They become less selective when involved in joins with other triple patterns
– Choosing a join order that quickly converges to small intermediate result sizes is crucial
– The FILTER clause combined with REGEX makes this query particularly selective
Complex+High Data Sources Queries
• CH3:
– 5 LargeRDFBench data sources involved
– Triple patterns = 11 and result size = 48
– High number of join vertices (i.e., 7 from 11 triple patterns),
moderate join vertex degree (i.e., 2.71), and low mean BGP-
restricted triple pattern selectivity (i.e., 0.0196)
– Mixed values for the important query features, which can challenge the federation engines
Complex+High Data Sources Queries
• CH4:
– 6 LargeRDFBench data sources involved
– Total number of triple patterns = 12, result size = 1248
– High number of join vertices (i.e., 10 join vertices from 12 triple patterns)
– Join order optimisation is challenging due to the high number of joins relative to the number of triple patterns
Complex+High Data Sources Queries
• CH5:
– 7 LargeRDFBench data sources involved
– Triple patterns = 18, result size = 5 using LIMIT clause
– Several bound subjects and objects in the triple patterns
– 4 triple patterns with subject bound and 2 triple patterns with object bound
– One triple pattern contains an unbound predicate
– Challenging to accurately estimate the triple pattern as well as the join cardinalities
Complex+High Data Sources Queries
• CH6:
– 8 LargeRDFBench data sources involved, along with bound subjects and objects
– Triple patterns = 24, result size = 16
– Low mean triple pattern selectivity (i.e., 0.00002) and high BGP-restricted (i.e., 0.2522) and join-restricted (i.e., 0.3186) triple pattern selectivities
Complex+High Data Sources Queries
• CH7:
– 9 LargeRDFBench data sources involved
– Triple patterns = 21, result size = 775 using LIMIT clause
– There are a total of 14 join nodes in this query with 5 Star, 3 Path, 4
Sink, and 2 Hybrid join nodes
– With this many join nodes, join ordering is not a trivial task
Complex+High Data Sources Queries
• CH8:
– 10 LargeRDFBench data sources involved and contains OPTIONAL,
FILTER, and LIMIT
– Highest number of triple patterns = 33, result size = 1
– 19 join nodes in this query with 7 Star, 4 Path, 6 Sink, and 2 Hybrid join
nodes
– None of the federation engines is able to execute this query within the
limit of 1 hour
Evaluation Results
Hard to rank the federation engines due to the many runtime errors (RE), zero results (ZR), and timeouts (TO)
Instability exposed
This work was supported by grants from the EU H2020 Framework Programme
provided for the project HOBBIT (GA no. 688227).
Thanks
Additional Slides
Key Findings
• Queries in the extended LargeRDFBench can be extremely costly or can be executed extremely fast when a properly optimised query plan is selected.
• However, the number of timeouts and runtime errors suggests that choosing optimised query plans for these queries is not a trivial task.
• The results revealed that FedX, CostFed, and ANAPSID can return incomplete or zero results.
Why extended LargeRDFBench
• In LargeRDFBench, 24 out of the total 32 queries require only 2 data sources to get the complete result set.
• Federation engines which optimise the ordering of the execution of
SPARQL SERVICES in federated SPARQL 1.1 queries (e.g. QuWeDa)
cannot be fully tested with existing LargeRDFBench queries.
• This is because if there are only two SERVICES used in the query, there are only two possible orderings of their execution (see the sketch below).
• The goal of this extension was to fill this gap by adding more federated
queries which require more data sources to get the complete result set
of the query.
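As a quick illustration of why more SERVICE clauses make ordering harder, the number of possible (left-deep) execution orders grows factorially with the number of SERVICE clauses:

```python
from math import factorial

# Possible execution orders for n SERVICE clauses, ignoring bushy plans.
for n in (2, 4, 7, 10):  # 10 is the largest source count in the CH queries
    print(n, "SERVICE clauses ->", factorial(n), "orderings")
# 2 -> 2, 4 -> 24, 7 -> 5040, 10 -> 3628800
```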
Number of Triple Patterns
[Chart: number of triple patterns per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 6.15 (LargeRDFBench)
Number of Join Vertices
[Chart: number of join vertices per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 3.63 (LargeRDFBench)
Mean Join Vertex Degree
[Chart: mean join vertex degree per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 1.33 (FedBench) vs. ± 6.15 (LargeRDFBench)
Number of Results
[Chart, log scale: result set size per query, S1–CH8, with FedBench and LargeRDFBench means]
STD. ± 2397 (FedBench) vs. ± 104236 (LargeRDFBench)
Mean Triple Pattern Selectivity
[Chart: mean triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.11 (FedBench) vs. ± 0.13 (LargeRDFBench)
Mean BGP-Restricted Triple Pattern Selectivity
[Chart: mean BGP-restricted triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.31 (FedBench) vs. ± 0.22 (LargeRDFBench)
Mean Join-Restricted Triple Pattern Selectivity
[Chart, log scale: mean join-restricted triple pattern selectivity per query, S1–CH8, FedBench vs. LargeRDFBench]
STD. ± 0.13 (FedBench) vs. ± 0.15 (LargeRDFBench)
