A Machine Learning Approach to SPARQL Query Performance Prediction

Rakebul Hasan
Wimmics Research Team
INRIA Sophia Antipolis
France

Context
Slide derived from Andreas Blumauer’s Linked Data slides
• Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can
look up those names.
3. When someone looks up a URI,
provide useful information, using the
standards (RDF, SPARQL).
4. Include links to other URIs, so that
they can discover more things.

Context
• W3C Linking Open Data (LOD) Initiative
• An initiative to publish open data as Linked Data
• From 2 billion triples in 2007 to 30 billion triples in 2011
• Accessing Linked Data
– Dereferencing URIs
– SPARQL Endpoints

Context
• Querying Linked Data
– SPARQL Endpoints: SPARQL query service via HTTP
implementing SPARQL Protocol
– 68% of the data sets provide SPARQL Endpoints as
of September 2011
– As of today, 98% of the triples in the LOD cloud are
accessible via SPARQL
• 57,856,463,005 out of 58,882,358,557 triples
http://stats.lod2.eu/
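
To make "SPARQL query service via HTTP" concrete, here is a minimal sketch of how a client could form a query request. Under the SPARQL Protocol, a query can be sent as an HTTP GET with the query text in a URL-encoded `query` parameter. The endpoint address and the query below are placeholders chosen for illustration, not taken from the slides, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Build a 'query via GET' URL: the query text goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint and query, used only to show the request shape.
EXAMPLE_QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
url = build_sparql_get_url("http://example.org/sparql", EXAMPLE_QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded)
```

Fetching the resulting URL with any HTTP client would execute the query on an endpoint that supports GET requests.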

Context
• Understanding Query Behavior in the context
of Linked Data
– Workload allocation to ensure specific QoS
requirements are met
– Predicting query performance metrics

Query Performance Prediction
• Traditional approaches use underlying data statistics-based
cost models to predict query performance
• Data statistics are often missing in the Linked Data scenario
– Only 32.20% (95 out of 295) of data sources provide a voiD
description.
• Basic statistics, such as the number of triples, are often not detailed
enough for statistics-based models
– In fact, what makes effective statistics for query cost estimation on RDF is
unclear.
• Challenge
– How to predict query performance without using data statistics?

Understanding performance of
database queries
• Ganapathi et al. predict performance
metrics of database queries prior to query
execution using machine learning.
• Akdere et al. use machine learning for
predicting query execution time.
Ganapathi et al., Predicting multiple metrics for queries: Better decisions enabled by machine learning, ICDE’09
Akdere et al., Learning-based query performance modeling and prediction, ICDE’12

Predicting Query Performance
• Learn query performance from already
executed queries
• Challenge: how to model SPARQL query
characteristics for machine learning
algorithms - feature extraction?

Modeling SPARQL Query Execution
• Two types of features
– Algebra features: extracted from SPARQL algebraic
expression of a query
– Graph pattern features: a vector representation of
the query pattern of a query relative to the
training queries

Modeling SPARQL Query Execution
• Algebra features
– Jena API to extract
SPARQL algebra
expressions
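
As an illustration of what algebra features could look like, the sketch below turns an algebra expression into a numeric vector: one count per operator plus the nesting depth of the expression. The slides only state that Jena is used to obtain the algebra expression; the operator list, the depth feature, and the example expression (written in the parenthesized style Jena prints) are assumptions made here for illustration.

```python
import re
from collections import Counter

# Assumed operator vocabulary; a real feature set would cover more operators.
OPERATORS = ["project", "distinct", "filter", "join", "leftjoin", "union", "bgp", "triple"]


def algebra_features(expr: str) -> list[int]:
    """Return operator counts (in OPERATORS order) followed by the nesting depth."""
    # An operator name is the alphabetic token right after an opening parenthesis.
    counts = Counter(re.findall(r"\(\s*([A-Za-z]+)", expr))
    depth = max_depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return [counts[op] for op in OPERATORS] + [max_depth]


# Hypothetical algebra expression for a query with one OPTIONAL clause.
EXAMPLE_EXPR = """
(project (?s ?o)
  (leftjoin
    (bgp (triple ?s <http://example.org/p> ?o))
    (bgp (triple ?s <http://example.org/q> ?r))))
"""

features = algebra_features(EXAMPLE_EXPR)
print(dict(zip(OPERATORS + ["depth"], features)))
```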

• Graph pattern features
– Find landmarks in training
queries by clustering
• K-medoids with
approximate graph edit
distance
– Compute distance
between landmark queries
and the query in
examination to construct a
graph pattern feature
vector
• Approximate graph edit
distance for distance
computation
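
The two steps above can be sketched as follows: cluster the training queries with k-medoids, keep the medoids as landmarks, then describe any query by its distances to those landmarks. This is a simplified sketch under stated assumptions. The toy distance (size of the symmetric difference between sets of predicate names) merely stands in for the approximate graph edit distance, and the example queries, the simple alternating k-medoids variant, and its initialization are illustrative choices.

```python
def toy_distance(a: frozenset, b: frozenset) -> int:
    """Stand-in for approximate graph edit distance: size of the symmetric difference."""
    return len(a ^ b)


def k_medoids(items, k, dist, iterations=20):
    """Alternating k-medoids; starts from the first k items to stay deterministic."""
    medoids = list(range(k))
    for _ in range(iterations):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(items)):
            nearest = min(medoids, key=lambda m: dist(items[i], items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster emptied out
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist(items[c], items[j]) for j in members))
            )
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [items[m] for m in medoids]


def pattern_features(query, landmarks, dist):
    """Graph pattern feature vector: distance from the query to each landmark."""
    return [dist(query, lm) for lm in landmarks]


# Hypothetical training queries, each reduced to the set of predicates it uses.
training = [
    frozenset({"type"}),
    frozenset({"type", "label"}),
    frozenset({"type", "label", "abstract"}),
    frozenset({"birthPlace", "deathPlace"}),
    frozenset({"birthPlace", "deathPlace", "spouse"}),
    frozenset({"birthPlace"}),
]
landmarks = k_medoids(training, k=2, dist=toy_distance)
vector = pattern_features(frozenset({"type", "label", "comment"}), landmarks, toy_distance)
print(landmarks, vector)
```

The length of the feature vector equals the number of landmarks, so the number of clusters directly controls the dimensionality of the graph pattern features.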

Graph Edit Distance
• Minimum amount of distortion needed to
transform one graph to another
– Bipartite matching based approximated graph edit
distance with
• Previous research shows accurate results with
classification problems
Riesen et al. “A Novel Software Toolkit for Graph Edit Distance Computation”, 9th IAPR-TC-15, GbRPR 2013
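
A rough sketch of the bipartite matching idea: build a square cost matrix that covers node substitutions, deletions, and insertions, then take the cheapest one-to-one assignment as the approximate distance. Everything specific here is an assumption for illustration: unit label costs, degree differences as a proxy for edge costs, and the use of the assignment cost itself as the distance. The assignment is solved by brute force over permutations, which only works for tiny graphs; a practical implementation would use the Hungarian (Munkres) algorithm and is not what the cited toolkit's code looks like.

```python
from itertools import permutations

INF = float("inf")


def degrees(n, edges):
    """Number of incident edges per node."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def approx_ged(labels1, edges1, labels2, edges2):
    """Approximate graph edit distance as the cost of an optimal node assignment."""
    n, m = len(labels1), len(labels2)
    d1, d2 = degrees(n, edges1), degrees(m, edges2)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                # Substitution: label mismatch plus difference in local edge structure.
                cost[i][j] = (labels1[i] != labels2[j]) + abs(d1[i] - d2[j])
            elif i < n:
                # Deletion of node i is only allowed in its own column m + i.
                cost[i][j] = 1 + d1[i] if j - m == i else INF
            elif j < m:
                # Insertion of node j is only allowed in its own row n + j.
                cost[i][j] = 1 + d2[j] if i - n == j else INF
            # Remaining block pairs dummy rows with dummy columns at zero cost.
    # Brute-force assignment; stands in for the Hungarian algorithm on tiny inputs.
    return min(
        sum(cost[i][p[i]] for i in range(size)) for p in permutations(range(size))
    )


# Hypothetical query graphs: nodes labelled by kind, edges as index pairs.
g1 = (["var", "uri", "var"], [(0, 1), (1, 2)])
g2 = (["var", "uri"], [(0, 1)])
print(approx_ged(*g1, *g1), approx_ged(*g1, *g2))
```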

Experiments
• 1260 training, 420 validation, and 420 test
queries generated from DBPSB benchmark query
templates
– DBPSB templates cover the most commonly used SPARQL
query features in the queries sent to DBpedia
• DBpedia as RDF data set
• Predicting query execution time
– k-NN regression with k-D tree
– SVM with nu-SVR for regression
DBpedia: http://dbpedia.org/
DBPSB: http://aksw.org/Projects/DBPSB.html
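
To illustrate the k-NN regression step, the sketch below predicts the execution time of a query as the mean execution time of its k nearest training queries in feature space. It is a minimal sketch, not the experimental setup: it uses a linear scan with Euclidean distance instead of a k-D tree, and the feature vectors and timings are made-up values. In practice a library such as scikit-learn offers both models named above (`KNeighborsRegressor` with `algorithm="kd_tree"`, and `NuSVR`).

```python
import math


def knn_predict(train_X, train_y, x, k=3):
    """Predict a value for x as the mean target of its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up feature vectors and execution times (milliseconds), for illustration only.
train_X = [[1, 0, 2], [1, 0, 3], [2, 1, 5], [3, 1, 8]]
train_y = [10.0, 12.0, 40.0, 90.0]

prediction = knn_predict(train_X, train_y, [1, 0, 2], k=2)
print(prediction)
```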

Algebra and graph pattern features

Time Required for Training and
Predictions

Summary
• Understanding SPARQL query behavior in the
Linked Data scenario
– Predicting query performance metrics
• Learn query execution times from already
executed queries
– without using statistics about the underlying RDF
data.
– Modeling (vector representation) SPARQL queries for
machine learning algorithms
• Feature extraction
– Highly accurate predictions for common Linked Data
queries

Future Work on QPP
• Incorporating bandwidth related features.
• Query optimization for Linked Data applications:
– in place of selectivity estimation for alternative
queries?
• How to accurately predict performance for single
triple patterns?
– Alternative query construction for Linked Data
applications – join order optimization. E.g. Federated
Query Processing over Linked Data
• How to generate training queries?
– Next slide

Statistical Analysis of Query Logs
• Approach to systematically generating training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
