A Machine Learning Approach to
SPARQL Query Performance
Prediction
Rakebul Hasan
Wimmics Research Team
INRIA Sophia Antipolis
France
Context
2
Slide derived from Andreas Blumauer’s Linked Data slides
• Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can
look up those names.
3. When someone looks up a URI,
provide useful information, using the
standards (RDF, SPARQL).
4. Include links to other URIs, so that
they can discover more things.
Context
• W3C Linking Open Data (LOD) Initiative
• An initiative to publish open data as Linked Data
• From 2 billion triples in 2007 to 30 billion triples in 2011
• Accessing Linked Data
– Dereferencing URIs
– SPARQL Endpoints
3
Context
• Querying Linked Data
– SPARQL Endpoints: SPARQL query services over HTTP
that implement the SPARQL Protocol
– 68% of the data sets provided SPARQL Endpoints as
of September 2011
– As of today, 98% of the triples in the LOD cloud are
accessible via SPARQL
• 57,856,463,005 out of 58,882,358,557 triples
http://stats.lod2.eu/
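A minimal sketch of how a client might address a SPARQL Endpoint over HTTP. The SPARQL Protocol carries the query text in the `query` URL parameter of a GET request; the endpoint URL, the query, and the helper name below are hypothetical and only illustrate the idea. The request is built but not sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Return the URL for a SPARQL Protocol 'query via GET' request."""
    # The protocol passes the (URL-encoded) query text in the 'query' parameter.
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical endpoint and query, for illustration only.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

url = build_sparql_get_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == QUERY)
```

The same URL could then be fetched with any HTTP client, typically with an Accept header naming the desired result format.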
4
Context
• Understanding Query Behavior in the context
of Linked Data
– Workload allocation to ensure specific QoS
requirements are met
– Predicting query performance metrics
5
Query Performance Prediction
• Traditional approaches predict query performance with cost models
based on statistics about the underlying data
• Data statistics are often missing in the Linked Data scenario
– Only 32.20% (95 out of 295) of data sources provide a voiD
description.
• Basic statistics, such as the number of triples, are often not detailed
enough for statistics-based models
– In fact, it is unclear which statistics are effective for query cost
estimation on RDF.
• Challenge
– How to predict query performance without using data statistics?
6
Understanding performance of
database queries
• Ganapathi et al. predict performance
metrics of database queries prior to query
execution using machine learning.
• Akdere et al. use machine learning to
predict query execution time.
Ganapathi et al., Predicting multiple metrics for queries: Better decisions enabled by machine learning, ICDE’09
Akdere et al., Learning-based query performance modeling and prediction, ICDE’12
7
Predicting Query Performance
• Learn query performance from already
executed queries
• Challenge: how to model SPARQL query
characteristics for machine learning
algorithms (feature extraction)?
8
Modeling SPARQL Query Execution
• Two types of features
– Algebra features: extracted from the SPARQL algebra
expression of a query
– Graph pattern features: a vector representation of
a query's graph pattern relative to the
training queries
9
Modeling SPARQL Query Execution
• Algebra features
– Jena API to extract
SPARQL algebra
expressions
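A rough illustration of what algebra features could look like. The slides do not list the exact features, so this sketch assumes (hypothetically) that a feature vector holds the number of occurrences of each algebra operator plus the depth of the expression tree. The nested-tuple tree and the operator list are stand-ins for illustration and are not Jena's API.

```python
from collections import Counter

# Hypothetical algebra tree: (operator, child, child, ...); leaves are strings.
algebra = ("project",
           ("filter",
            ("leftjoin",
             ("bgp", "?s rdf:type ?c", "?s rdfs:label ?l"),
             ("bgp", "?s foaf:name ?n"))))

def operator_counts(node, counts=None):
    """Count how often each operator occurs in the tree."""
    counts = Counter() if counts is None else counts
    if isinstance(node, tuple):
        counts[node[0]] += 1
        for child in node[1:]:
            operator_counts(child, counts)
    return counts

def depth(node):
    """Depth of the operator tree; leaves (triple patterns) have depth 0."""
    if not isinstance(node, tuple):
        return 0
    return 1 + max((depth(child) for child in node[1:]), default=0)

# Assumed, fixed operator order so every query maps to the same vector layout.
OPERATORS = ["project", "filter", "join", "leftjoin", "union", "bgp"]

def algebra_features(node):
    counts = operator_counts(node)
    return [counts[op] for op in OPERATORS] + [depth(node)]

features = algebra_features(algebra)
print(features)  # [1, 1, 0, 1, 0, 2, 4]
```

A fixed operator order matters here: a learning algorithm needs every query to produce a vector of the same length and meaning.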
10
• Graph pattern features
– Find landmarks in training
queries by clustering
• K-medoids with
approximate graph edit
distance
– Compute the distance
between each landmark query
and the query under
examination to construct a
graph pattern feature
vector
• Approximate graph edit
distance for distance
computation
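The two steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the k-medoids loop uses a naive deterministic initialisation, and the distance is a hypothetical stand-in (size of the symmetric difference between sets of triple-pattern labels) used where the slides use approximate graph edit distance.

```python
def toy_distance(p, q):
    """Stand-in for approximate graph edit distance (assumption)."""
    return len(p ^ q)

def k_medoids(items, k, dist, max_iter=100):
    """Return k landmark items chosen by a simple alternating k-medoids."""
    medoids = list(range(k))  # deterministic initialisation: first k items
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(len(items)):
            nearest = min(medoids, key=lambda m: dist(items[i], items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total distance in its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(items[c], items[j]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [items[m] for m in medoids]

def pattern_features(query, landmarks, dist):
    """Graph pattern feature vector: one distance per landmark query."""
    return [dist(query, landmark) for landmark in landmarks]

# Made-up training "queries", each reduced to a set of pattern labels.
training = [frozenset(s) for s in
            (["a", "b"], ["a", "b", "c"], ["x", "y"], ["x", "y", "z"], ["a"], ["x"])]

landmarks = k_medoids(training, 2, toy_distance)
vector = pattern_features(frozenset(["a", "b", "d"]), landmarks, toy_distance)
print(landmarks, vector)
```

The length of the feature vector equals the number of landmarks, so the choice of k fixes the dimensionality of the graph pattern features.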
11
Graph Edit Distance
• Minimum amount of distortion needed to
transform one graph into another
– Bipartite-matching-based approximate graph edit
distance with
• Previous research shows accurate results on
classification problems
Riesen et al. “A Novel Software Toolkit for Graph Edit Distance Computation”, 9th IAPR-TC-15, GbRPR 2013
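A simplified sketch of the bipartite-matching idea, under stated assumptions. Each node of one graph is assigned to a node of the other graph (substitution) or to a dummy slot (deletion or insertion), and the cheapest assignment gives the approximate distance. The cost function here is hypothetical: a label mismatch costs 1, and differences in node degree stand in for edge edits. The assignment is solved by brute force over permutations, which only works for tiny graphs; the cited toolkit's actual costs and solver are not described in the slides.

```python
from itertools import permutations

INF = float("inf")

def degree(graph, node):
    """Number of edges touching `node`."""
    return sum(1 for edge in graph["edges"] if node in edge)

def cost_matrix(g1, g2):
    """Square (n+m) x (n+m) matrix of substitution, deletion, insertion costs."""
    n1, n2 = list(g1["nodes"]), list(g2["nodes"])
    n, m = len(n1), len(n2)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:    # substitute node i of g1 by node j of g2
                label = 0 if g1["nodes"][n1[i]] == g2["nodes"][n2[j]] else 1
                cost[i][j] = label + abs(degree(g1, n1[i]) - degree(g2, n2[j]))
            elif i < n:            # delete node i (allowed only on its own slot)
                cost[i][j] = 1 + degree(g1, n1[i]) if j - m == i else INF
            elif j < m:            # insert node j (allowed only on its own slot)
                cost[i][j] = 1 + degree(g2, n2[j]) if i - n == j else INF
            # otherwise dummy-to-dummy, which stays at cost 0
    return cost

def approx_ged(g1, g2):
    """Cost of the cheapest node assignment (brute force, tiny graphs only)."""
    cost = cost_matrix(g1, g2)
    size = len(cost)
    return min(sum(cost[i][p[i]] for i in range(size))
               for p in permutations(range(size)))

# Made-up query graphs: node name -> label, plus undirected edges.
g1 = {"nodes": {"s": "var", "c": "uri"}, "edges": [("s", "c")]}
g2 = {"nodes": {"s": "var", "c": "uri", "l": "var"},
      "edges": [("s", "c"), ("s", "l")]}

print(approx_ged(g1, g1), approx_ged(g1, g2))
```

For realistic graph sizes the permutation search would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.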
12
Experiments
• 1260 training, 420 validation, and 420 test
queries generated from DBPSB benchmark query
templates
– DBPSB templates cover the most commonly used SPARQL
query features in the queries sent to DBpedia
• DBpedia as the RDF data set
• Predicting query execution time
– k-NN regression with k-d tree
– SVM with nu-SVR for regression
13
DBpedia: http://dbpedia.org/
DBPSB: http://aksw.org/Projects/DBPSB.html
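To make the k-NN regression step concrete, here is a minimal sketch. It uses a brute-force neighbour search rather than a k-d tree (a k-d tree is an index that speeds up the same nearest-neighbour lookup), and it does not cover nu-SVR. The feature vectors and execution times are made-up numbers, not results from the experiments.

```python
import math

def knn_predict(train_x, train_y, query_x, k=3):
    """Predict a value as the mean target of the k nearest training vectors."""
    # Rank training examples by Euclidean distance to the query vector.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query_x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Made-up feature vectors for already executed queries ...
train_x = [[1, 0, 1], [1, 0, 2], [2, 1, 4], [3, 1, 5], [3, 2, 6]]
# ... and their made-up execution times in milliseconds.
train_y = [10.0, 12.0, 50.0, 80.0, 95.0]

predicted = knn_predict(train_x, train_y, [3, 1, 6], k=2)
print(predicted)  # 87.5
```

The prediction needs only the feature vector of the new query and the recorded times of past queries, which is why no statistics about the underlying RDF data are required.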
Algebra Features
14
Algebra and graph pattern features
15
Time Required for Training and
Predictions
16
Summary
• Understanding SPARQL query behavior in the
Linked Data scenario
– Predicting query performance metrics
• Learn query execution times from already
executed queries
– Without using statistics about the underlying RDF
data
– Modeling SPARQL queries as vector representations for
machine learning algorithms
• Feature extraction
– Highly accurate predictions for common Linked Data
queries
17
Future Work on QPP
• Incorporating bandwidth-related features.
• Query optimization for Linked Data applications:
– Predicted performance in place of selectivity estimation for
alternative queries?
• How to accurately predict performance for single
triple patterns?
– Alternative query construction for Linked Data
applications (join order optimization), e.g. federated
query processing over Linked Data
• How to generate training queries?
– Next slide
18
Statistical Analysis of Query Logs
• Approach to systematically generating training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
19
• Thank you
20

Editor's Notes

  • #3: Web of Documents -> documents were described using HTML and globally identified using URLs; the retrieval mechanism was the HTTP protocol. Together these created a single global data space. Data on the Web -> many formats and APIs, proprietary interfaces, and no single global data space (no hyperlinks between data items in different data sources). Web of Data -> a single global data space that uses RDF to publish data on the Web, with links between data items in different data sources.
  • #7: Histograms -> it is unclear on which data to create histograms for effective estimation.
  • #13: The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another. The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
  • #20: Refining training queries from query logs by considering the statistically significant characteristics. Bootstrapping: start with an initial set of properties, resources, and literals, then generate training queries by permutations and combinations of the statistically significant features. Simplifying the pattern features: join features, triple pattern features, and pattern graph features to represent query patterns.
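The bootstrapping idea in the last note can be sketched roughly as below, assuming a single hypothetical query template with two placeholders. The template, the property and resource URIs, and the function name are all illustrative; the notes do not specify how the combinations are actually formed.

```python
from itertools import product

# Hypothetical template with named placeholders.
TEMPLATE = "SELECT ?s WHERE { ?s <%(property)s> <%(resource)s> }"

# Hypothetical initial sets of properties and resources.
properties = ["http://example.org/p1", "http://example.org/p2"]
resources = ["http://example.org/r1", "http://example.org/r2",
             "http://example.org/r3"]

def generate_queries(template, properties, resources):
    """Instantiate the template with every property/resource combination."""
    return [template % {"property": p, "resource": r}
            for p, r in product(properties, resources)]

queries = generate_queries(TEMPLATE, properties, resources)
print(len(queries))  # 6
print(queries[0])
```

Each generated query would then be executed once to record its execution time, giving one labelled training example per combination.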