Towards efficient processing of RDF data streams
Alejandro Llaves 
Javier D. Fernández 
Oscar Corcho 
Ontology Engineering Group 
Universidad Politécnica de Madrid 
Madrid, Spain 
allaves@fi.upm.es 
OrdRing workshop - ISWC 2014 
Riva del Garda 
October 20th, 2014
Efficient?? Scalable? 
http://blog.mikiobraun.de/
Outline 
• Introduction
• Background: Storm and Lambda Architecture
• Efficient processing of queries over RDF streams
• Architecture overview
• Storm-based operators for querying RDF streams
• Adaptive query processing for data streams
• RDF stream compression
• Conclusions & future work
• Open questions
Introduction 
• Origins of Linked Stream Data
• Extracting information from data streams is complex: heterogeneity, rate of generation, volume, provenance, …
• Challenges
  • C1. Efficient processing of user queries over RDF streams
  • C2. Continuous transmission of data increases latency
  • C3. Integration of historical and real-time data with background knowledge
Source: http://webnotations.com
Background 
Storm - http://storm.incubator.apache.org/
• Distributed system for real-time processing of streams
• Why Storm?
  • Simple processing model (parallelization)
  • Open source community backing the project
  • Used by well-known companies, e.g. Twitter
Lambda Architecture (Marz 2013)
• Batch layer: stores ALL the incoming data in an immutable master dataset and pre-computes batch views on historic data.
• Serving layer: indexes views on the master dataset.
• Real-time processing layer: requests data views depending on incoming queries.
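A minimal sketch of how the three layers could fit together, assuming a toy in-memory store and a per-sensor observation count as the only view. The class and method names (`LambdaSketch`, `ingest`, `run_batch`, `query`) are hypothetical and are not part of Storm or of the engine described here. Following Marz's description, the real-time part covers the records that arrived after the last batch run.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving and real-time layers (illustrative only)."""

    def __init__(self):
        self.master = []             # batch layer: append-only master dataset
        self.batch_view = Counter()  # serving layer: view pre-computed from master
        self.batch_size = 0          # number of master records covered by batch_view

    def ingest(self, sensor_id, value):
        # Incoming data is only ever appended, never updated in place.
        self.master.append((sensor_id, value))

    def run_batch(self):
        # Recompute the view from scratch over ALL the stored data.
        view = Counter()
        for sensor_id, _ in self.master:
            view[sensor_id] += 1
        self.batch_view = view
        self.batch_size = len(self.master)

    def query(self, sensor_id):
        # Real-time part: count records not yet covered by the batch view,
        # then merge with the pre-computed batch view.
        recent = sum(1 for s, _ in self.master[self.batch_size:] if s == sensor_id)
        return self.batch_view[sensor_id] + recent
```

A query issued between two batch runs still sees every ingested record, because the recent records are counted on the fly and added to the batch view.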
Efficient processing of RDF streams 
Goal: to develop a stream processing engine capable of 
adapting to variable conditions, such as changing rates of 
input data, failure of processing nodes, or distribution of 
workload, while serving complex continuous queries. 
Methodology 
• Review the state of the art in (RDF) stream processing
• Evaluate how to parallelize SPARQLStream queries
• Implement RDF query operators
• Optimize parallelization for common queries
• Design self-adaptive strategies that allow the engine to react to changes
Architecture overview
Storm-based operators for querying RDF streams 
• Triple2Graph operator
• Time Window operator
• Simple Join operator
• Projection operator
Operator (parameters): input → output
Triple2Graph (graph starter): Stream<s, p, o> → Stream<Graph, t>
Windowing (window size, emission time): Stream<Graph, t> → Stream<Set<Graph>>
Simple Join (join attribute): Stream<Set<Tuple>, Set<Tuple>> → Stream<Set<Tuple>>
Project (input, output): Stream<Tuple<i1, i2,..., in>> → Stream<Tuple<o1, o2,..., on>>
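The first two operators can be sketched as plain Python generators. This is an illustration under stated assumptions, not the Storm implementation: the "graph starter" is assumed to be a predicate that marks the first triple of each new graph, timestamps are assumed to come from a caller-supplied `clock`, and windows are assumed to be driven by in-order event timestamps. The function names `triple2graph` and `time_window` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Graph = List[Triple]


def triple2graph(triples: Iterable[Triple],
                 is_starter: Callable[[Triple], bool],
                 clock: Callable[[], int]) -> Iterator[Tuple[Graph, int]]:
    """Stream<s, p, o> -> Stream<Graph, t>: cut the triple stream at each starter."""
    current: Graph = []
    for triple in triples:
        if is_starter(triple) and current:
            yield current, clock()   # close the previous graph and timestamp it
            current = []
        current.append(triple)
    if current:
        yield current, clock()       # flush the last graph of a finite input


def time_window(stream: Iterable[Tuple[Graph, int]],
                size: int, emit_every: int) -> Iterator[List[Graph]]:
    """Stream<Graph, t> -> Stream<Set<Graph>>: emit the graphs of the last
    `size` time units every `emit_every` time units (assumes in-order input)."""
    buffer: List[Tuple[Graph, int]] = []
    next_emit = None
    for graph, t in stream:
        if next_emit is None:
            next_emit = t + emit_every
        while t >= next_emit:
            low = next_emit - size
            buffer = [(g, ts) for g, ts in buffer if ts >= low]  # expire old graphs
            yield [g for g, _ in buffer]   # window covers [low, next_emit)
            next_emit += emit_every
        buffer.append((graph, t))
    # Graphs still buffered when a finite input ends are not emitted here.
```

With `size` equal to `emit_every` the windows are tumbling; a smaller `emit_every` gives overlapping (sliding) windows.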
Storm-based operators for querying RDF streams 
Storm topology example (4 nodes)
SELECT ?obs.value ?sensors.location
FROM NAMED STREAM <obs> [60 SEC TO NOW]
FROM NAMED STREAM <sensors> [60 SEC TO NOW]
WHERE obs.sensorId = sensors.id ;
[Figure: Storm topology with SPOUT <obs> and SPOUT <sensors>, two Triple2Graph operators, four Windowing instances (two handling windows <t0, t2...> and two handling <t1, t3...>), four Simple Join instances, four Project instances, and the Output, over a timeline t0 to t3.]
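The join and projection steps of the example query can be sketched as follows. This is a simplified, single-process illustration that assumes each window has already been turned into a list of dictionaries keyed by `stream.attribute` names; the functions `simple_join` and `project` are hypothetical, and the distribution over four Storm nodes is not modelled.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def simple_join(left: List[Row], right: List[Row],
                left_attr: str, right_attr: str) -> List[Row]:
    """Equi-join two windows on one attribute using a hash index on `right`."""
    index = defaultdict(list)
    for r in right:
        index[r[right_attr]].append(r)
    # Merge every left row with each right row that shares the join value.
    return [{**l, **r} for l in left for r in index.get(l[left_attr], [])]


def project(rows: List[Row], attrs: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Keep only the requested attributes, in the requested order."""
    return [tuple(row[a] for a in attrs) for row in rows]


# One 60-second window per stream (made-up sample values).
obs_window = [{"obs.sensorId": "s1", "obs.value": 20.5},
              {"obs.sensorId": "s3", "obs.value": 7.0}]
sensors_window = [{"sensors.id": "s1", "sensors.location": "Madrid"},
                  {"sensors.id": "s2", "sensors.location": "Riva del Garda"}]

joined = simple_join(obs_window, sensors_window, "obs.sensorId", "sensors.id")
result = project(joined, ["obs.value", "sensors.location"])
```

Only the observation from sensor `s1` finds a matching sensor description, so `result` holds a single (value, location) pair.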
RDF stream compression (1/2) 
Efficient RDF Interchange (ERI) format 
• Based on Efficient XML Interchange (EXI)
• Main assumption: RDF streams have regular structure and are redundant
• Information encoded at 2 levels
  • Structural dictionary
  • Presets (values)
• Example: SSN observations
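The two-level idea can be illustrated with a toy encoder. This is not the ERI format itself, only a sketch of the principle under simplifying assumptions: triples are already grouped by subject, the structural dictionary maps each distinct predicate sequence to a small integer, and the presets keep one value list per predicate. The `ex:` property names are made up for the example.

```python
def encode(molecules):
    """molecules: list of (subject, [(predicate, object), ...])."""
    structures = {}      # structural dictionary: predicate tuple -> structure id
    structure_ids = []   # one (subject, structure id) entry per molecule
    presets = {}         # predicate -> values in stream order
    for subject, props in molecules:
        preds = tuple(p for p, _ in props)
        sid = structures.setdefault(preds, len(structures))
        structure_ids.append((subject, sid))
        for p, o in props:
            presets.setdefault(p, []).append(o)
    return structures, structure_ids, presets


def decode(structures, structure_ids, presets):
    """Rebuild the original triples from the two levels."""
    by_id = {sid: preds for preds, sid in structures.items()}
    cursors = {p: 0 for p in presets}   # next unread value per predicate
    triples = []
    for subject, sid in structure_ids:
        for p in by_id[sid]:
            triples.append((subject, p, presets[p][cursors[p]]))
            cursors[p] += 1
    return triples


observations = [
    ("obs1", [("ex:observedBy", "s1"), ("ex:value", "20.5")]),
    ("obs2", [("ex:observedBy", "s1"), ("ex:value", "21.0")]),
]
structures, structure_ids, presets = encode(observations)
```

Because both observations share the same predicate sequence, the structure is stored once and each observation only contributes a structure id plus its values, which is where a regular stream gains from this kind of encoding.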
Where can we apply RDF stream compression?
Conclusions and future work 
Conclusions 
• We have addressed challenges C1 (scalability) and C2 (transmission)
  • Catalogue of Storm-based operators to parallelize query processing over RDF streams.
  • New format for RDF stream compression called ERI.
• Challenge C3 (integration) involves storage of historical data and the deployment of batch and serving layers OR the migration to a more general system, e.g. Apache Spark.
Future work
• Finish the implementation of RDF query operators
• Test the parallelization of a set of common queries → SRBench
• Adaptive strategies based on Adaptive Query Processing
• Evaluation → Benchmarking: comparison to CQELS Cloud
• Integration of ERI into our engine
Open questions 
• How does the order of tuple arrival affect the parallelization of join processing tasks?
• Are the spatial (or spatio-temporal) properties of a tuple a dimension to take into account for ordering? If so, what influence does it have on reasoning tasks? And on parallelization tasks?
• How do out-of-order tuples affect the processing of streams? If out-of-order tuples are discarded, how should this be communicated in the results?
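To make the last question concrete, here is one possible policy, sketched under assumptions that the slides do not make: a tuple is treated as out of order when its timestamp falls more than `allowed_lateness` behind the highest timestamp seen so far, and the number of discarded tuples is returned alongside the results. The function name `drop_out_of_order` is hypothetical.

```python
def drop_out_of_order(stream, allowed_lateness=0):
    """stream: iterable of (item, timestamp). Returns (kept, dropped_count)."""
    watermark = None   # highest timestamp seen so far
    kept, dropped = [], 0
    for item, t in stream:
        if watermark is not None and t < watermark - allowed_lateness:
            dropped += 1            # too late: discard, but keep count
            continue
        kept.append((item, t))
        watermark = t if watermark is None else max(watermark, t)
    return kept, dropped


kept, dropped = drop_out_of_order([("a", 1), ("b", 3), ("c", 2), ("d", 4)])
```

Returning the drop count with each result is only one way to communicate the loss; whether it is sufficient is exactly the open question.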
Thanks! 
The research leading to these results has received funding from the
EU's Seventh Framework Programme (FP7/2007-2013) under 
grant agreement no. 257641, PlanetData network of excellence, 
from Ministerio de Economía y Competitividad (Spain) under the 
project “4V: Volumen, Velocidad, Variedad y Validez en la Gestión 
Innovadora de Datos” (TIN2013-46238-C4-2-R), and has been 
supported by an AWS in Education Research Grant award. 
Alejandro Llaves 
allaves@fi.upm.es
Adaptive Query Processing (AQP) for data streams 
• Traditional databases include a query optimizer that designs the execution plan based on data statistics.
• AQP (Deshpande 2007) techniques allow adjusting the query execution plan to varying conditions of the data input, the incoming queries, and the system.
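One simple form of adaptation is to reorder filter operators at run time from observed selectivities. The sketch below is an illustration of that idea only, not a technique taken from the slides or a specific algorithm from Deshpande 2007; the class name `AdaptiveFilterChain` and its parameters are hypothetical.

```python
class AdaptiveFilterChain:
    """Applies predicates in sequence and periodically moves the most
    selective one (lowest observed pass rate) to the front."""

    def __init__(self, predicates, reorder_every=100):
        self.stats = [{"pred": p, "seen": 0, "passed": 0} for p in predicates]
        self.reorder_every = reorder_every
        self.processed = 0

    def accept(self, item):
        self.processed += 1
        ok = True
        for s in self.stats:
            s["seen"] += 1
            if s["pred"](item):
                s["passed"] += 1
            else:
                ok = False
                break   # short-circuit: later predicates are not evaluated
        if self.processed % self.reorder_every == 0:
            # Adjust the plan: cheapest rejection first, based on run-time statistics.
            self.stats.sort(
                key=lambda s: s["passed"] / s["seen"] if s["seen"] else 1.0)
        return ok
```

The output is the same whatever the order of the predicates; only the amount of work per tuple changes, which is why the plan can be adjusted while the stream is being processed.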
RDF stream compression (2/2) 
Evaluation 
• Datasets: streaming, statistical, and general static.
• Metrics: compression ratio, compression time, and parsing throughput (transmission + decompression)
• Comparison to other formats, such as N-Triples, Turtle, RDSZ, and HDT, with different configurations of ERI w.r.t. transmitted data block (1K – 4K) and the presence of dictionary.
• Conclusion: ERI produces state-of-the-art compression for RDF streams and excels for regularly-structured static RDF datasets. ERI compression ratios remain competitive in general datasets and the time overheads for ERI processing are relatively low.
http://dataweb.infor.uva.es/wp-content/uploads/2014/07/iswc14.pdf
