Data Streaming (in a Nutshell)...
... and Spark’s window operations
Vincenzo Gulisano, Ph.D.
Chalmers University of Technology
Agenda
• Who am I?
• Introduction
– Motivation
– System Model
• Spark’s window operations
• References
https://vincenzogulisano.com/
Assistant Professor
Distributed Computing and Systems Research Group
Department of Computer Science and Engineering
Chalmers University of Technology
At our research team: research expertise & projects
• Cyber security
• Efficient parallel & stream computing
• Distributed systems
• IoT & sensor networks
Motivation
• Since the year 2000, applications such as:
– Sensor networks
– Network Traffic Analysis
– Financial tickers
– Transaction Log Analysis
– Fraud Detection
• Require:
– Continuous processing of data streams
– In a real-time fashion
Motivation
• Relying 100% on store-then-process (i.e., DBs) is not feasible
– High-speed networks: nanoseconds to handle a packet
– ISP router: gigabytes of headers every hour, …
• Data Streaming:
– In memory
– Bounded resources
– Efficient one-pass analysis
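Motivation
• DBMS vs. DSMS
– DBMS: (1) data is stored on disk, (2) a query is issued, (3) query processing in main memory returns the query results
– DSMS: a continuous query is registered; data flows through query processing in main memory, which produces the query results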
What about
?
Stonebraker, Michael, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing (2005)
1. Keep the data moving
2. Query interface, e.g., extended SQL
3. Handle imperfections
4. Generate predictable outcomes
5. Integrate stored and streaming data
6. Guarantee data safety and availability
7. Partition and scale applications automatically
8. Process and respond instantaneously
System Model
• Data Stream: unbounded sequence of tuples
– Example: Call Description Record (CDR)
Field         Type
Caller        text
Callee        text
Time (secs)   int
Price (€)     double
Tuples over time: (A, B, 8:00, 3), (C, D, 8:20, 7), (A, E, 8:35, 6)
System Model
• Operators:
– Stateless: 1 input tuple → 1 output tuple
– Stateful: 1+ input tuple(s) → 1 output tuple
System Model
Stateless Operators
• Map: transform tuples' schema
– Example: convert price € → $
• Filter: discard / route tuples
– Example: route depending on price
• Union: merge multiple streams (sharing the same schema)
– Example: merge CDRs from different sources
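The three stateless operators can be sketched in plain Python as generator functions over a stream of CDR tuples. This is only an illustration under stated assumptions, not the API of any streaming engine: the `CDR` type, the function names, the sample tuples and the exchange rate `EUR_TO_USD` are all made up for the example.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class CDR(NamedTuple):
    """Call Description Record, following the schema of the example."""
    caller: str
    callee: str
    time_secs: int
    price: float


EUR_TO_USD = 1.1  # hypothetical exchange rate, for illustration only


def map_op(stream: Iterable[CDR], fn: Callable[[CDR], CDR]) -> Iterator[CDR]:
    """Map: emit one transformed tuple per input tuple."""
    for t in stream:
        yield fn(t)


def filter_op(stream: Iterable[CDR], pred: Callable[[CDR], bool]) -> Iterator[CDR]:
    """Filter: forward only the tuples that satisfy the predicate."""
    for t in stream:
        if pred(t):
            yield t


def union_op(*streams: Iterable[CDR]) -> Iterator[CDR]:
    """Union: merge streams sharing the same schema.

    Here the streams are simply concatenated; an engine would interleave
    tuples as they arrive.
    """
    for s in streams:
        yield from s


# Made-up sample tuples from two sources.
source_1 = [CDR("A", "B", 3, 10.0), CDR("C", "D", 7, 2.0)]
source_2 = [CDR("A", "E", 6, 20.0)]

merged = union_op(source_1, source_2)
in_usd = map_op(merged, lambda t: t._replace(price=t.price * EUR_TO_USD))
expensive = list(filter_op(in_usd, lambda t: t.price > 10))
```

None of the three functions keeps anything between one tuple and the next, which is what makes them stateless.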
System Model
Stateful Operators
• Aggregate: compute aggregate functions (group-by)
– Example: compute avg. call duration
• Join: match tuples from 2 streams (equality predicate)
– Example: match CDRs with prices in the same range
System Model
• Continuous Query: graph of operators/streams
Map (convert € → $) → Filter (only > 10$) → Agg (count calls made by each Caller number)
Schemas along the query:
– Map input: Caller, Callee, Time (secs), Price (€)
– Map output: Caller, Callee, Time (secs), Price ($)
– Filter output: Caller, Callee, Time (secs), Price ($)
– Agg output: Caller, Calls, Time (secs)
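A minimal sketch of this Map → Filter → Agg query, chaining Python's lazy built-in `map` and `filter` with a small stateful generator. The `CDR` type, the sample tuples and the rate `EUR_TO_USD` are assumptions made for the example. The aggregate here keeps a running count per caller and ignores windows, so its state grows with the number of callers.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class CDR(NamedTuple):
    caller: str
    callee: str
    time_secs: int
    price: float


EUR_TO_USD = 1.1  # hypothetical exchange rate, for illustration only


def to_usd(t: CDR) -> CDR:
    """Map step: convert the price from € to $."""
    return t._replace(price=t.price * EUR_TO_USD)


def count_calls(stream: Iterable[CDR]) -> Iterator[Tuple[str, int]]:
    """Agg step: after each input tuple, emit the updated (caller, calls) pair."""
    calls = Counter()  # operator state: one counter per caller
    for t in stream:
        calls[t.caller] += 1
        yield t.caller, calls[t.caller]


# Made-up sample tuples.
cdrs = [CDR("A", "B", 3, 12.0), CDR("C", "D", 7, 1.0), CDR("A", "E", 6, 15.0)]

# The query is a chain of operators; nothing runs until results are pulled.
query = count_calls(filter(lambda t: t.price > 10, map(to_usd, cdrs)))
results = list(query)
```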
System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: 1 hour windows
time: [8:00,9:00), [8:20,9:20), [8:40,9:40)
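The three windows in the example are one hour long and start 20 minutes apart. The sketch below computes which windows a given timestamp falls in. It assumes that timestamps are minutes since midnight and that window starts are aligned to multiples of the advance; the function names are hypothetical.

```python
from typing import List, Tuple


def windows_for(ts: int, size: int = 60, advance: int = 20) -> List[Tuple[int, int]]:
    """Return the [start, end) windows, in minutes since midnight, that contain ts."""
    latest_start = (ts // advance) * advance           # latest window starting at or before ts
    starts = range(latest_start, ts - size, -advance)  # every start s with ts - size < s <= ts
    return [(s, s + size) for s in reversed(starts) if s >= 0]


def fmt(window: Tuple[int, int]) -> str:
    """Render a window as [h:mm,h:mm)."""
    start, end = window
    return f"[{start // 60}:{start % 60:02d},{end // 60}:{end % 60:02d})"


labels = [fmt(w) for w in windows_for(8 * 60 + 45)]  # a tuple with timestamp 8:45
```

With these parameters every tuple belongs to size / advance = 3 windows, so a tuple at 8:45 falls in all three windows of the example.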
System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: count tuples - 1 hour windows
time: 8:05 8:15 8:22 8:45 9:05
[8:00,9:00) Output: 4
[8:20,9:20)
What about out-of-order tuples?
Spark’s window operations
(source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)
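Java:
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Reduce function adding two integers, defined separately for clarity
Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() {
  @Override public Integer call(Integer i1, Integer i2) {
    return i1 + i2;
  }
};
// Reduce last 30 seconds of data, every 10 seconds
// (pairs is an existing JavaPairDStream<String, Integer> of (word, 1) pairs)
JavaPairDStream<String, Integer> windowedWordCounts =
    pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10));

Python:
# Reduce last 30 seconds of data, every 10 seconds
# (the second lambda is the inverse reduce function)
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)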
Spark’s window operations
(source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)
• countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
• reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...]
• reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]
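The idea of reducing the data that enters the window and inverse reducing the data that leaves it can be emulated in plain Python. This is a sketch of the principle only, not Spark's implementation: `windowed_reduce` is a hypothetical helper, the stream is modelled as a list of batches of (key, value) pairs, the window length is given in batches, and the window slides by one batch.

```python
from collections import deque


def windowed_reduce(batches, func, inv_func, window_len):
    """Yield, after each batch, the per-key reduction over the last window_len batches."""
    in_window = deque()  # per-batch partial reductions currently inside the window
    current = {}         # key -> reduced value over the whole window
    for batch in batches:
        # Reduce the batch on its own first.
        partial = {}
        for key, value in batch:
            partial[key] = func(partial[key], value) if key in partial else value
        # "Reduce" the new data entering the window.
        for key, value in partial.items():
            current[key] = func(current[key], value) if key in current else value
        in_window.append(partial)
        # "Inverse reduce" the old data leaving the window.
        if len(in_window) > window_len:
            for key, value in in_window.popleft().items():
                current[key] = inv_func(current[key], value)
        yield dict(current)


# Made-up batches; with 10-second batches, window_len=3 models a 30-second window.
batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 1)], [("a", 1)]]
outputs = list(windowed_reduce(batches, lambda x, y: x + y, lambda x, y: x - y, 3))
```

Each slide touches only the entering and the leaving batch instead of re-reducing the whole window. In this sketch a key whose contributions have all left the window stays in the result with its residual value (0 for counts).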
Maintaining tuples or windows?
time: 8:05 8:15 8:22 8:45 9:05
windows: [8:00,9:00), [8:20,9:20)
Maintain tuples
When the window shifts:
1. Remove contribution of stale tuples
2. Go on adding new incoming tuples
• Need to maintain a single window instance
• Need to maintain all the tuples (how many?)
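The two steps above can be sketched for a counting aggregate as follows. The sketch assumes timestamps in minutes since midnight that arrive in order, a first window starting at 8:00 (`first_start=480`), and no tuple earlier than that; `count_maintaining_tuples` is a hypothetical name.

```python
from collections import deque


def count_maintaining_tuples(timestamps, size=60, advance=20, first_start=480):
    """Yield (window_start, count) for each window closed by an arriving tuple."""
    buffered = deque()  # every tuple of the current window is kept
    count = 0           # state of the single window instance
    start = first_start
    for ts in timestamps:
        while ts >= start + size:  # the window must shift before ts fits
            yield start, count
            start += advance
            while buffered and buffered[0] < start:  # 1. remove stale tuples
                buffered.popleft()
                count -= 1
        buffered.append(ts)        # 2. go on adding new incoming tuples
        count += 1


# 8:05, 8:15, 8:22, 8:45, 9:05 as minutes since midnight
closed = list(count_maintaining_tuples([485, 495, 502, 525, 545]))
```

The tuple at 9:05 closes [8:00,9:00) with a count of 4. Memory grows with the number of tuples per window, which is the "how many?" concern.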
Maintaining tuples or windows?
time: 8:05 8:15 8:22 8:45 9:05
[8:00,9:00) – 3 (so far...)
[8:20,9:20) – 1 (so far...)
Maintain windows
When a tuple arrives:
1. Add its contribution to all the windows it falls in
• No need to maintain tuples
• Need to maintain all windows to which each tuple contributes
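A sketch of the same count when windows are maintained instead of tuples. It assumes timestamps in minutes since midnight and a first window starting at 8:00 (`first_start=480`); `count_maintaining_windows` is a hypothetical name. Emitting and discarding closed windows is left out.

```python
from collections import defaultdict


def count_maintaining_windows(timestamps, size=60, advance=20, first_start=480):
    """Return {window_start: count}; only one counter per window is stored."""
    counts = defaultdict(int)
    for ts in timestamps:
        start = (ts // advance) * advance  # latest window containing ts
        while start > ts - size and start >= first_start:
            counts[start] += 1             # add the contribution to each window ts falls in
            start -= advance
    return dict(counts)


# After the tuples at 8:05, 8:15 and 8:22 have arrived:
so_far = count_maintaining_windows([485, 495, 502])
```

After these three tuples the counters match the "so far" values above: 3 for [8:00,9:00) and 1 for [8:20,9:20). The state is one counter per open window, independent of how many tuples arrive.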
References (non-exhaustive list)
Bedtime reading about Data Streaming
1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream
Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012.
Shared-nothing parallelism / Elasticity
1. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano,
Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th
International Conference on Distributed Computing Systems (ICDCS) 2010
2. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo
Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente,
Patrick Valduriez. IEEE Transactions on Parallel and Distributed Systems
(TPDS)
References (non-exhaustive list)
Shared-memory parallelism / fine-grained synchronization
1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis
Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data
(IEEE Big Data 2015)
2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate
Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas
Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015)
3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman,
Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014
Streaming + Security / Privacy / Cyber-physical systems
1. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo
Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16)
2. BES – Differentially Private and Distributed Event Aggregation in Advanced Metering
Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd
ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16],
2016.
3. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano,
Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in
Communication Networks (SecureComm) 2014
References (non-exhaustive list)
• Motivation / System Model
1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues
in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM.
2. Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream
processing. SIGMOD Rec., 34(4), December 2005.
3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM,
MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers,
EDBT ’02, London, UK, 2002. Springer-Verlag.
References (non-exhaustive list)
• Centralized Stream Processing Engines
1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh
Srivastava, and Jennifer Widom. Stream: The Stanford data stream management system. Springer, 2004.
2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic
foundations and query execution. The VLDB Journal, 15(2), June 2006.
3. Daniel J. Abadi, Don Carney, Uǧur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee,
Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data
stream management. The VLDB Journal, 12(2), August 2003.
4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data
streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06.
VLDB Endowment, 2006.
References (non-exhaustive list)
• Distributed Stream Processing Engines
1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uǧur Çetintemel, Mitch Cherniack, Jeong-Hyon
Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and
Stanley B. Zdonik. The design of the borealis stream processing engine. In CIDR, pages 277–289, 2005.
2. Magdalena Balazinska, Hari Balakrishnan, Samuel R Madden, and Michael Stonebraker. Fault-tolerance
in the borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008.
ACM ID: 1331907.
3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In
Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London,
UK, 2001. Springer-Verlag.
4. Jeong-hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and
Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS,
2003.
5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker,
and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering,
International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
References (non-exhaustive list)
• Parallel Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. Streamcloud:
A large scale data streaming system. In ICDCS 2010: International Conference on Distributed
Computing Systems, pages 126–137, June 2010.
2. Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive
partitioning operator for continuous query systems. In ICDE, 2002.
References (non-exhaustive list)
• Elastic Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, and Patrick
Valduriez. Streamcloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel
and Distributed Systems, 99(PrePrints), 2012.
2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral
Symposium, MDS ’11, New York, NY, USA, 2011. ACM.
3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly
available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops,
EDBT-ICDT ’12, New York, NY, USA, 2012. ACM.
4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of
data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.

More Related Content

PPTX
The benefits of fine-grained synchronization in deterministic and efficient ...
PPTX
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join
PPTX
The data streaming processing paradigm and its use in modern fog architectures
PPTX
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
PPTX
Crash course on data streaming (with examples using Apache Flink)
PPTX
Data Streaming in Big Data Analysis
PDF
The data streaming paradigm and its use in Fog architectures
PPTX
Mining high speed data streams: Hoeffding and VFDT
The benefits of fine-grained synchronization in deterministic and efficient ...
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join
The data streaming processing paradigm and its use in modern fog architectures
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Crash course on data streaming (with examples using Apache Flink)
Data Streaming in Big Data Analysis
The data streaming paradigm and its use in Fog architectures
Mining high speed data streams: Hoeffding and VFDT

What's hot (18)

PDF
A Brief History of Stream Processing
PPTX
20220201_semi dynamic STAQ application on BBMB.pptx
PPTX
From Trill to Quill and Beyond
PPTX
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
PDF
PPTX
The lifecycle of reproducible science data and what provenance has got to do ...
PDF
Asymmetry in Large-Scale Graph Analysis, Explained
PDF
Streaming SQL Foundations: Why I ❤ Streams+Tables
PDF
Mining Big Data in Real Time
PDF
Introduction to transport resilience
PDF
Capacity Planning for Linux Systems
PDF
High-Performance Analysis of Streaming Graphs
PDF
Building Conclave: a decentralized, real-time collaborative text editor
PDF
Quantum algorithms for pattern matching in genomic sequences - 2018-06-22
PDF
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
PPTX
ReComp: challenges in selective recomputation of (expensive) data analytics t...
PPTX
Traffic Modeling for Aggregated Periodic IoT Data
PPTX
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
A Brief History of Stream Processing
20220201_semi dynamic STAQ application on BBMB.pptx
From Trill to Quill and Beyond
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
The lifecycle of reproducible science data and what provenance has got to do ...
Asymmetry in Large-Scale Graph Analysis, Explained
Streaming SQL Foundations: Why I ❤ Streams+Tables
Mining Big Data in Real Time
Introduction to transport resilience
Capacity Planning for Linux Systems
High-Performance Analysis of Streaming Graphs
Building Conclave: a decentralized, real-time collaborative text editor
Quantum algorithms for pattern matching in genomic sequences - 2018-06-22
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
ReComp: challenges in selective recomputation of (expensive) data analytics t...
Traffic Modeling for Aggregated Periodic IoT Data
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Ad

Viewers also liked (20)

PDF
Stream Processing Everywhere - What to use?
PDF
Introduction to Streaming Analytics
PDF
Introduction to Real-time data processing
PPTX
Real-Time Event & Stream Processing on MS Azure
PPTX
Introduction To Streaming Data and Stream Processing with Apache Kafka
PPTX
Hive Poster
PDF
Real Time Analytics with Apache Cassandra - Cassandra Day Munich
PDF
Building Big Data Streaming Architectures
PPTX
KDD 2016 Streaming Analytics Tutorial
PDF
Real-time Stream Processing with Apache Flink @ Hadoop Summit
PDF
RBea: Scalable Real-Time Analytics at King
PDF
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
PDF
Large-Scale Stream Processing in the Hadoop Ecosystem
PDF
Real-time analytics as a service at King
PDF
Streaming Analytics
PPTX
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
PPTX
Stream Analytics in the Enterprise
PDF
Reliable Data Intestion in BigData / IoT
PPTX
Ingest and Stream Processing - What will you choose?
PDF
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Stream Processing Everywhere - What to use?
Introduction to Streaming Analytics
Introduction to Real-time data processing
Real-Time Event & Stream Processing on MS Azure
Introduction To Streaming Data and Stream Processing with Apache Kafka
Hive Poster
Real Time Analytics with Apache Cassandra - Cassandra Day Munich
Building Big Data Streaming Architectures
KDD 2016 Streaming Analytics Tutorial
Real-time Stream Processing with Apache Flink @ Hadoop Summit
RBea: Scalable Real-Time Analytics at King
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Large-Scale Stream Processing in the Hadoop Ecosystem
Real-time analytics as a service at King
Streaming Analytics
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Stream Analytics in the Enterprise
Reliable Data Intestion in BigData / IoT
Ingest and Stream Processing - What will you choose?
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Ad

Similar to Data Streaming (in a Nutshell) ... and Spark's window operations (20)

PDF
Stream Processing Overview
PPTX
Trivento summercamp masterclass 9/9/2016
PPTX
Trivento summercamp fast data 9/9/2016
PDF
Data Stream Processing - Concepts and Frameworks
PDF
Towards Data Operations
PDF
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
PPTX
Software architecture for data applications
PDF
Reflections on Almost Two Decades of Research into Stream Processing
PPTX
Data streaming fundamentals
PDF
The Live: Stream Computing
PPTX
Data Stream Management
PDF
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
PDF
Productizing Structured Streaming Jobs
PDF
Introduction to Apache Apex by Thomas Weise
PPTX
How to extract valueable information from real time data feeds
PDF
Introduction to Data streaming - 05/12/2014
PDF
Streaming analytics state of the art
PPTX
Architectual Comparison of Apache Apex and Spark Streaming
PPTX
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
PDF
AI-Powered Streaming Analytics for Real-Time Customer Experience
Stream Processing Overview
Trivento summercamp masterclass 9/9/2016
Trivento summercamp fast data 9/9/2016
Data Stream Processing - Concepts and Frameworks
Towards Data Operations
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
Software architecture for data applications
Reflections on Almost Two Decades of Research into Stream Processing
Data streaming fundamentals
The Live: Stream Computing
Data Stream Management
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Productizing Structured Streaming Jobs
Introduction to Apache Apex by Thomas Weise
How to extract valueable information from real time data feeds
Introduction to Data streaming - 05/12/2014
Streaming analytics state of the art
Architectual Comparison of Apache Apex and Spark Streaming
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
AI-Powered Streaming Analytics for Real-Time Customer Experience

Recently uploaded (20)

PPTX
The KM-GBF monitoring framework – status & key messages.pptx
PPTX
2. Earth - The Living Planet earth and life
PDF
The scientific heritage No 166 (166) (2025)
PDF
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PPTX
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PDF
HPLC-PPT.docx high performance liquid chromatography
PDF
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
PPTX
neck nodes and dissection types and lymph nodes levels
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
PDF
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
PDF
An interstellar mission to test astrophysical black holes
PPTX
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
PPTX
Microbiology with diagram medical studies .pptx
PDF
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
PPTX
TOTAL hIP ARTHROPLASTY Presentation.pptx
PPTX
2. Earth - The Living Planet Module 2ELS
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PDF
. Radiology Case Scenariosssssssssssssss
The KM-GBF monitoring framework – status & key messages.pptx
2. Earth - The Living Planet earth and life
The scientific heritage No 166 (166) (2025)
VARICELLA VACCINATION: A POTENTIAL STRATEGY FOR PREVENTING MULTIPLE SCLEROSIS
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
HPLC-PPT.docx high performance liquid chromatography
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
neck nodes and dissection types and lymph nodes levels
Classification Systems_TAXONOMY_SCIENCE8.pptx
ELS_Q1_Module-11_Formation-of-Rock-Layers_v2.pdf
An interstellar mission to test astrophysical black holes
Protein & Amino Acid Structures Levels of protein structure (primary, seconda...
Microbiology with diagram medical studies .pptx
Unveiling a 36 billion solar mass black hole at the centre of the Cosmic Hors...
TOTAL hIP ARTHROPLASTY Presentation.pptx
2. Earth - The Living Planet Module 2ELS
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
. Radiology Case Scenariosssssssssssssss

Data Streaming (in a Nutshell) ... and Spark's window operations

  • 1. Data Streaming (in a Nutshell)... ... and Spark’s window operations 1 Vincenzo Gulisano, Ph.D. Chalmers University of technology
  • 2. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References 2
  • 3. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References 3
  • 4. https://vincenzogulisano.com/ Assistant Professor Distributed Computing and Systems Research Group Department of Computer Science and engineering Chalmers University of Technology 4
  • 5. At our research team: Research expertise & projects Cyber Security Efficient parallel & stream computing Distributed systems IoT &Sensor Networks 5
  • 6. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References 6
  • 7. Motivation • Since the year 2000, applications such as: – Sensor networks – Network Traffic Analysis – Financial tickers – Transaction Log Analysis – Fraud Detection • Require: – Continuous processing of data streams – Real Time Fashion 7
  • 8. Motivation • Relying 100% on store and process (i.e., DBs) is not feasible – high-speed networks, nanoseconds to handle a packet – ISP router: gigabytes of headers every hour,… • Data Streaming: – In memory – Bounded resources – Efficient one-pass analysis 8
  • 9. Main Memory Motivation • DBMS vs. DSMS Disk 1 Data Query Processing 3 Query results 2 Query Main Memory Query Processing Continuous Query Data Query results 9 What about ?
  • 10. 10 Stonebraker, Michael, Uǧur Çetintemel and Stan Zdonik. The 8 requirements of real-time stream processing. (2005) 1. Keep the data moving 2. Query interface, e.g., extended SQL 3. Handle imperfections 4. Generate predictable outcomes 5. Integrate stored and streaming data 6. Guarantee data safety and availability 7. Partition and scale applications automatically 8. Process and respond instantaneously
  • 11. System Model • Data Stream: unbounded sequence of tuples – Example: Call Description Record (CDR) time Field Field Caller text Callee text Time (secs) int Price (€) double A B 8:00 3 C D 8:20 7 A E 8:35 6 11
  • 12. System Model • Operators: OP Stateless 1 input tuple 1 output tuple OP Stateful 1+ input tuple(s) 1 output tuple 12
  • 13. Stateless Operators Map: transform tuples schema Example: convert price €  $ Filter: discard / route tuples Example: route depending on price Union: merge multiple streams (sharing the same schema) Example: merge CDRs from different sources System Model 13 Map Filter Union … …
  • 14. Stateful Operators Aggregate: compute aggregate functions (group-by) Example: compute avg. call duration Join: match tuples from 2 streams (equality predicate) Example: match CDRs with prices in the same range System Model 14 Aggregate Join2
  • 15. System Model • Continuous Query: graph operators/streams Convert €  $ Only > 10$ Count calls made by each Caller number Map Filter Agg 15 Field Caller Callee Time (secs) Price (€) Field Caller Callee Time (secs) Price ($) Field Caller Callee Time (secs) Price ($) Field Caller Calls Time (secs)
  • 16. System Model • Infinite sequence of tuples / bounded memory  windows • Example: 1 hour windows time [8:00,9:00) [8:20,9:20) [8:40,9:40) 16
  • 17. System Model • Infinite sequence of tuples / bounded memory  windows • Example: count tuples - 1 hour windows time [8:00,9:00) 8:05 8:15 8:22 8:45 9:05 Output: 4 17 [8:20,9:20) What about out-of-order tuples?
  • 18. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References 18
  • 19. Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html) 19
  • 20. 20 Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html) // Reduce function adding two integers, defined separately for clarity Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() { @Override public Integer call(Integer i1, Integer i2) { return i1 + i2; } }; // Reduce last 30 seconds of data, every 10 seconds JavaPairDStream<String, Integer> windowedWordCounts = pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10)); # Reduce last 30 seconds of data, every 10 seconds windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
  • 21. 21 Spark’s window operations (source: http://spark.apache.org/docs/latest/streaming-programming-guide.html) countByWindow(windowLength,slideInterval) Return a sliding window count of elements in the stream. reduceByWindow(func, windowLength,slideInterval) Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. reduceByKeyAndWindow(func,windowLength, slideInterval, [numTasks]) When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...] reduceByKeyAndWindow(func, invFunc,windowLength, slideInterval, [numTasks]) A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]
  • 22. Maintaining tuples or windows? 22 time [8:00,9:00) 8:05 8:15 8:22 8:45 9:05 [8:20,9:20) Maintain tuples When the window shifts: 1. Remove contribution of stale tuples 2. Go on adding new incoming tuples Need to maintain a single window instance Need to maintain all the tuples (how many?)
  • 23. Maintaining tuples or windows? 23 time [8:00,9:00) – 3 (so far...) 8:05 8:15 8:22 8:45 9:05 [8:20,9:20) – 1 (so far...) Maintain windows When a tuple arrives: 1. Add its contribution to all the windows it falls in No need to maintain tuples Need to maintain all windows to which each tuple contributes to
  • 24. Agenda • Who am I? • Introduction – Motivation – System Model • Spark’s window operations • References (non exhaustive list) 24
  • 25. References (non exhaustive list) Bed time reading about Data Streaming 1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012. Shared-nothing parallelism / Elasticity 1. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th International Conference on Distributed Computing Systems (ICDCS) 2010 2. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, Patrick Valduriez. IEEE Transactions on Parallel and Distributed Processing (TPDS) 25
  • 26. References (non exhaustive list) Shared-memory parallelism / fine-grained synchronization 1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data (IEEE Big Data 2015) 2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015) 3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman, Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014 Streaming + Security / Privacy / Cyber-physical systems 1. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16) 2. BES – Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16], 2016. 3. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano, Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in Communication Networks (SecureComm) 2014 26
  • 27. References (non exhaustive list) • Motivation / System Model 1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM. 2. Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4), December 2005. 3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, EDBT ’02, London, UK, UK, 2002. Springer-Verlag. 27
  • 28. References (non-exhaustive list) • Centralized Stream Processing Engines 1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom. STREAM: The Stanford data stream management system. Springer, 2004. 2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2), June 2006. 3. Daniel J. Abadi, Don Carney, Uǧur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2), August 2003. 4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06. VLDB Endowment, 2006.
  • 29. References (non-exhaustive list) • Distributed Stream Processing Engines 1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uǧur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277–289, 2005. 2. Magdalena Balazinska, Hari Balakrishnan, Samuel R Madden, and Michael Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008. ACM ID: 1331907. 3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London, UK, 2001. Springer-Verlag. 4. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS, 2003. 5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering, International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
  • 30. References (non-exhaustive list) • Parallel Stream Processing Engines 1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. StreamCloud: A large scale data streaming system. In ICDCS 2010: International Conference on Distributed Computing Systems, pages 126–137, June 2010. 2. Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, 2002.
  • 31. References (non-exhaustive list) • Elastic Stream Processing Engines 1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, Claudio Soriente, and Patrick Valduriez. StreamCloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2012. 2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral Symposium, MDS ’11, New York, NY, USA, 2011. ACM. 3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT-ICDT ’12, New York, NY, USA, 2012. ACM. 4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.

Editor's Notes

  • #13: These are the original definitions; they have evolved and been modified over time
  • #21: Interesting: one or two functions?
  • #23: Orthogonal issue: when to compute the final results
  • #24: Orthogonal issue: when to compute the final results