Towards efficient processing of RDF data streams
Alejandro Llaves 
Javier D. Fernández 
Oscar Corcho 
Ontology Engineering Group 
Universidad Politécnica de Madrid 
Madrid, Spain 
allaves@fi.upm.es 
OrdRing workshop - ISWC 2014 
Riva del Garda 
October 20th, 2014
Efficient?? Scalable? 
Source: http://blog.mikiobraun.de/
Outline
• Introduction
• Background: Storm and Lambda Architecture
• Efficient processing of queries over RDF streams
• Architecture overview
• Storm-based operators for querying RDF streams
• Adaptive query processing for data streams
• RDF stream compression
• Conclusions & future work
• Open questions
Introduction
• Origins of Linked Stream Data
• Extracting information from data streams is complex: heterogeneity, rate of generation, volume, provenance, …
• Challenges
  • C1. Efficient processing of user queries over RDF streams
  • C2. Continuous transmission of data increases latency
  • C3. Integration of historical and real-time data with background knowledge
Source: http://webnotations.com
Background
Storm - http://storm.incubator.apache.org/
• Distributed system for real-time processing of streams
• Why Storm?
  • Simple processing model (parallelization)
  • Open source community backing the project
  • Used by relevant companies, e.g. Twitter
Lambda Architecture (Marz 2013)
• Batch layer: stores ALL the incoming data in an immutable master dataset and pre-computes batch views on historical data.
• Serving layer: indexes views on the master dataset.
• Real-time processing layer: requests data views depending on incoming queries.
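The three layers can be illustrated with a toy sketch. It follows the common description of the Lambda Architecture, in which the real-time layer covers data that has not yet been absorbed into a batch view; that reading, the class name LambdaSketch, and the per-sensor counting task are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda Architecture that counts observations per sensor."""

    def __init__(self):
        self.master = []                # batch layer: append-only master dataset
        self.batch_view = Counter()     # serving layer: view pre-computed from master
        self.realtime_view = Counter()  # real-time layer: data newer than the batch view

    def ingest(self, sensor_id):
        # Every incoming record is stored and also counted in the real-time view.
        self.master.append(sensor_id)
        self.realtime_view[sensor_id] += 1

    def run_batch(self):
        # Recompute the batch view from all stored data, then reset the
        # real-time view because the batch view now covers everything.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, sensor_id):
        # A query merges the batch view with the real-time view.
        return self.batch_view[sensor_id] + self.realtime_view[sensor_id]
```

In a real deployment the batch recomputation runs periodically over a distributed store; here it is a single in-memory call so that the merge step in query() is easy to see.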
Efficient processing of RDF streams
Goal: to develop a stream processing engine capable of adapting to variable conditions, such as changing rates of input data, failure of processing nodes, or distribution of workload, while serving complex continuous queries.
Methodology
• Review the state of the art of (RDF) stream processing
• Evaluate how to parallelize SPARQLStream queries
• Implement RDF query operators
• Optimize parallelization for common queries
• Design self-adaptive strategies that allow the engine to react to changes
Architecture overview
Storm-based operators for querying RDF streams
• Triple2Graph operator
• Time Window operator
• Simple Join operator
• Projection operator

Operator      Parameters                   Input                           Output
Triple2Graph  graph starter                Stream<s, p, o>                 Stream<Graph, t>
Windowing     window size, emission time   Stream<Graph, t>                Stream<Set<Graph>>
Simple Join   join attribute               Stream<Set<Tuple>, Set<Tuple>>  Stream<Set<Tuple>>
Project       input, output                Stream<Tuple<i1, i2,..., in>>   Stream<Tuple<o1, o2,..., on>>
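A minimal sketch of the first two operators, written as plain Python generators rather than Storm bolts. The function names, the is_starter predicate (one reading of the "graph starter" parameter), the injectable clock, and the choice to drive windows by the timestamps carried in the stream are all assumptions for illustration, not the engine's actual implementation.

```python
import time


def triple2graph(triples, is_starter, clock=time.time):
    """Group consecutive (s, p, o) triples into graphs.

    A triple for which is_starter() is true opens a new graph; each
    finished graph is emitted together with a timestamp from clock().
    """
    graph = []
    for triple in triples:
        if is_starter(triple) and graph:
            yield tuple(graph), clock()
            graph = []
        graph.append(triple)
    if graph:
        yield tuple(graph), clock()


def time_window(stream, size, emit_every):
    """Turn a stream of (graph, t) pairs into a stream of graph lists.

    Every emit_every time units a window is emitted that holds the graphs
    whose timestamp lies in [end - size, end). Time advances only when a
    new element arrives, so the last partial window is not flushed.
    """
    buffer = []
    next_emit = None
    for graph, t in stream:
        if next_emit is None:
            next_emit = t + emit_every
        while t >= next_emit:
            yield [g for g, ts in buffer if next_emit - size <= ts < next_emit]
            # Keep only graphs that can still fall into the next window.
            keep_from = next_emit + emit_every - size
            buffer = [(g, ts) for g, ts in buffer if ts >= keep_from]
            next_emit += emit_every
        buffer.append((graph, t))
```

With size equal to emit_every the windows are tumbling; a larger size gives overlapping (sliding) windows.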
Storm-based operators for querying RDF streams
Storm topology example (4 nodes)
SELECT ?obs.value ?sensors.location
FROM NAMED STREAM <obs> [60 SEC TO NOW]
FROM NAMED STREAM <sensors> [60 SEC TO NOW]
WHERE obs.sensorId = sensors.id ;
[Figure: Storm topology with two spouts (<obs>, <sensors>), each followed by Triple2Graph; four Windowing instances, two labelled <t0, t2...> and two labelled <t1, t3...>; four Simple Join instances; four Project instances; and an Output node. Time axis: t0, t1, t2, t3.]
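The join and projection steps of the example query can be sketched over two already-windowed inputs. Representing tuples as dictionaries with keys such as "obs.sensorId", the hash-join strategy, and the round-robin helper instance_for (one way to read the <t0, t2...> / <t1, t3...> labels) are assumptions for illustration; the sample rows are made up.

```python
def simple_join(left, right, left_key, right_key):
    """Equi-join two windows of tuples (dicts) on the given attributes."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Merge each left row with every right row that shares the join value.
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def project(rows, fields):
    """Keep only the requested fields, in the requested order."""
    return [tuple(row[f] for f in fields) for row in rows]


def instance_for(window_index, parallelism):
    """Assign windows to operator instances in round-robin fashion."""
    return window_index % parallelism


# Made-up contents of one 60-second window per stream.
obs_window = [
    {"obs.sensorId": "s1", "obs.value": 21.5},
    {"obs.sensorId": "s2", "obs.value": 19.0},
    {"obs.sensorId": "s3", "obs.value": 5.0},
]
sensors_window = [
    {"sensors.id": "s1", "sensors.location": "Madrid"},
    {"sensors.id": "s2", "sensors.location": "Riva del Garda"},
]

joined = simple_join(obs_window, sensors_window, "obs.sensorId", "sensors.id")
result = project(joined, ["obs.value", "sensors.location"])
```

Because each window is joined independently of the others, windows can be handed to different join instances, which is what makes this step easy to parallelize.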
RDF stream compression (1/2)
Efficient RDF Interchange (ERI) format
• Based on Efficient XML Interchange (EXI)
• Main assumption: RDF streams have regular structure and are redundant
• Information encoded at 2 levels
  • Structural dictionary
  • Presets (values)
• Example: SSN observations
Where can we apply RDF stream compression?
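The two-level idea behind ERI (structure on one side, values on the other) can be illustrated with a deliberately simplified sketch. This is not the ERI format: grouping triples by subject, using the sorted predicate list as the "structure", and the function names encode and decode are assumptions for illustration, and the sketch assumes each subject uses a predicate at most once.

```python
def encode(triples):
    """Split triples into a structural dictionary and per-subject values."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))

    structures = {}  # predicate tuple -> structure id
    encoded = []
    for s, pairs in by_subject.items():
        pairs.sort()
        structure = tuple(p for p, _ in pairs)
        # Reuse the id if this predicate combination was seen before.
        sid = structures.setdefault(structure, len(structures))
        encoded.append((s, sid, [o for _, o in pairs]))

    # Ids were assigned in insertion order, so the list index equals the id.
    dictionary = list(structures)
    return dictionary, encoded


def decode(dictionary, encoded):
    """Rebuild the triples from the dictionary and the value lists."""
    return [
        (s, p, o)
        for s, sid, values in encoded
        for p, o in zip(dictionary[sid], values)
    ]
```

When a stream is regular, as sensor observations tend to be, many subjects share one structure, so the predicates are stored once and only the values travel with each item.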
Conclusions and future work
Conclusions
• We have addressed challenges C1 (scalability) and C2 (transmission)
  • Catalogue of Storm-based operators to parallelize query processing over RDF streams.
  • New format for RDF stream compression called ERI.
• Challenge C3 (integration) involves storage of historical data and the deployment of batch and serving layers, or the migration to a more general system, e.g. Apache Spark.
Future work
• Finish the implementation of RDF query operators
• Test the parallelization of a set of common queries → SRBench
• Adaptive strategies based on Adaptive Query Processing
• Evaluation → Benchmarking: comparison to CQELS Cloud
• Integration of ERI into our engine
Open questions
• How does the order of tuple arrival affect the parallelization of join processing tasks?
• Are the spatial (or spatio-temporal) properties of a tuple a dimension to take into account for ordering? If so, what influence does this have on reasoning tasks? And on parallelization tasks?
• How do out-of-order tuples affect the processing of streams? If out-of-order tuples are discarded, how should this be communicated in the results?
Thanks!
The research leading to these results has received funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257641, PlanetData network of excellence, from Ministerio de Economía y Competitividad (Spain) under the project “4V: Volumen, Velocidad, Variedad y Validez en la Gestión Innovadora de Datos” (TIN2013-46238-C4-2-R), and has been supported by an AWS in Education Research Grant award.
Alejandro Llaves
allaves@fi.upm.es
Adaptive Query Processing (AQP) for data streams
• Traditional databases include a query optimizer that designs the execution plan based on the data statistics.
• AQP (Deshpande 2007) techniques allow adjusting the query execution plan to varying conditions of the data input, the incoming queries, and the system.
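One classic family of adaptive techniques reorders operators at run time according to observed selectivity. The sketch below shows that general idea on a chain of filters; it is not the adaptive strategy of the engine presented here (which the slides list as future work), and the class name AdaptiveFilterChain and the reorder_every parameter are hypothetical.

```python
class AdaptiveFilterChain:
    """Apply filters in sequence, periodically moving selective ones first."""

    def __init__(self, predicates, reorder_every=100):
        self.stats = [{"pred": p, "seen": 0, "passed": 0} for p in predicates]
        self.reorder_every = reorder_every
        self.count = 0

    def process(self, item):
        """Return True if the item passes every filter."""
        self.count += 1
        accepted = True
        for stat in self.stats:
            stat["seen"] += 1
            if stat["pred"](item):
                stat["passed"] += 1
            else:
                accepted = False
                break  # later filters are skipped, which saves work
        if self.count % self.reorder_every == 0:
            self._reorder()
        return accepted

    def _reorder(self):
        # Lowest observed pass rate first: items are rejected as early as possible.
        self.stats.sort(
            key=lambda s: s["passed"] / s["seen"] if s["seen"] else 0.0
        )
```

The result of the query does not change when filters are reordered; only the amount of work per item does, which is why this kind of adaptation is safe to apply while the stream is running.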
RDF stream compression (2/2)
Evaluation
• Datasets: streaming, statistical, and general static.
• Compression ratio, compression time, and parsing throughput (transmission + decompression)
• Comparison to other formats, such as N-Triples, Turtle, RDSZ, HDT, with different configurations of ERI w.r.t. transmitted data block (1K – 4K) and the presence of dictionary.
• Conclusion: ERI produces state-of-the-art compression for RDF streams and excels for regularly-structured static RDF datasets. ERI compression ratios remain competitive in general datasets and the time overheads for ERI processing are relatively low.
http://dataweb.infor.uva.es/wp-content/uploads/2014/07/iswc14.pdf
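For readers who want to reproduce this kind of measurement on their own data, the sketch below computes two of the metrics for an arbitrary compressor. It uses zlib from the Python standard library purely as a stand-in (it is not ERI), defines the compression ratio as compressed size divided by original size (an assumption; lower is better under this definition), and the helper name evaluate and the sample N-Triples line are made up.

```python
import time
import zlib


def evaluate(compress, data):
    """Measure compression ratio and compression time for one input."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "ratio": len(compressed) / len(data),  # compressed / original size
        "seconds": elapsed,
    }


# A highly regular, redundant input, similar in spirit to a sensor stream.
line = b'<http://example.org/obs1> <http://example.org/value> "21.5" .\n'
sample = line * 1000

metrics = evaluate(zlib.compress, sample)
```

Parsing throughput would additionally require timing transmission and decompression, which depends on the network setup and is left out here.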
