Data Streaming (in a Nutshell) ... and Spark's window operations

Vincenzo Gulisano, Ph.D.
Chalmers University of Technology

Agenda
• Who am I?
• Introduction
– Motivation
– System Model
• Spark’s window operations
• References


https://vincenzogulisano.com/
Assistant Professor
Distributed Computing and Systems Research Group
Department of Computer Science and Engineering
Chalmers University of Technology

At our research team:
Research expertise & projects
• Cyber security
• Efficient parallel & stream computing
• Distributed systems
• IoT & sensor networks


Motivation
• Since the year 2000, applications such as:
– Sensor networks
– Network traffic analysis
– Financial tickers
– Transaction log analysis
– Fraud detection
• Require:
– Continuous processing of data streams
– Results in a real-time fashion

Motivation
• Relying 100% on store-then-process (i.e., DBs) is not feasible
– High-speed networks: only nanoseconds to handle a packet
– ISP router: gigabytes of headers every hour, …
• Data Streaming:
– In memory
– Bounded resources
– Efficient one-pass analysis

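Motivation
• DBMS vs. DSMS
– DBMS: (1) data is stored on disk, (2) a query is submitted, (3) query processing in main memory returns the query results
– DSMS: a continuous query sits in main memory; data flows through query processing, which keeps producing query results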
What about
?

Stonebraker, Michael, Uǧur Çetintemel and Stan Zdonik. The 8 requirements of real-time stream processing. (2005)
1. Keep the data moving
2. Query interface, e.g., extended SQL
3. Handle imperfections
4. Generate predictable outcomes
5. Integrate stored and streaming data
6. Guarantee data safety and availability
7. Partition and scale applications automatically
8. Process and respond instantaneously

System Model
• Data Stream: unbounded sequence of tuples
– Example: Call Description Record (CDR)

Field        Type
Caller       text
Callee       text
Time (secs)  int
Price (€)    double

Example tuples along the time axis:
<A, B, 8:00, 3>   <C, D, 8:20, 7>   <A, E, 8:35, 6>
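A minimal Python sketch of the CDR stream above, assuming a generator stands in for the unbounded source. The `CDR` class, the `cdr_stream` function and the encoding of time as minutes since midnight are illustrative choices, not part of the original slides.

```python
from typing import Iterator, NamedTuple


class CDR(NamedTuple):
    """One tuple of the stream, following the schema on the slide."""
    caller: str
    callee: str
    time: int      # assumption: minutes since midnight (8:00 -> 480)
    price: float   # price in euros


def cdr_stream() -> Iterator[CDR]:
    """Yield the three example tuples; a real stream would never end."""
    yield CDR("A", "B", 8 * 60, 3.0)
    yield CDR("C", "D", 8 * 60 + 20, 7.0)
    yield CDR("A", "E", 8 * 60 + 35, 6.0)


first = next(cdr_stream())
prices = [t.price for t in cdr_stream()]
```

Consumers read the stream one tuple at a time, which matches the one-pass, bounded-memory processing described in the motivation.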

System Model
• Operators:
– Stateless: 1 input tuple → 1 output tuple
– Stateful: 1+ input tuple(s) → 1 output tuple

System Model
Stateless Operators
• Map: transform the schema of tuples
– Example: convert price € → $
• Filter: discard / route tuples
– Example: route depending on price
• Union: merge multiple streams (sharing the same schema)
– Example: merge CDRs from different sources
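The three stateless operators can be sketched as Python generators. This is an illustration under stated assumptions: tuples are plain `(caller, callee, time, price)` tuples, each input stream is already ordered by time, and `EUR_TO_USD` is a made-up exchange rate. The names `map_op`, `filter_op` and `union_op` are hypothetical.

```python
import heapq
from typing import Callable, Iterable, Iterator, Tuple

# Assumed tuple layout: (caller, callee, time, price)
Tup = Tuple[str, str, int, float]

EUR_TO_USD = 1.1  # assumed exchange rate, for illustration only


def map_op(stream: Iterable[Tup], fn: Callable[[Tup], Tup]) -> Iterator[Tup]:
    """Map: produce one transformed tuple per input tuple."""
    for t in stream:
        yield fn(t)


def filter_op(stream: Iterable[Tup], pred: Callable[[Tup], bool]) -> Iterator[Tup]:
    """Filter: forward only the tuples that satisfy the predicate."""
    for t in stream:
        if pred(t):
            yield t


def union_op(*streams: Iterable[Tup]) -> Iterator[Tup]:
    """Union: merge time-ordered streams with the same schema, by time."""
    return heapq.merge(*streams, key=lambda t: t[2])


source_1 = [("A", "B", 480, 3.0), ("A", "E", 515, 6.0)]
source_2 = [("C", "D", 500, 7.0)]

merged = list(union_op(source_1, source_2))
in_usd = list(map_op(merged, lambda t: t[:3] + (t[3] * EUR_TO_USD,)))
expensive = list(filter_op(in_usd, lambda t: t[3] > 5.0))
```

None of the operators keeps anything between tuples, which is what makes them stateless.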

System Model
Stateful Operators
• Aggregate: compute aggregate functions (group-by)
– Example: compute avg. call duration
• Join: match tuples from 2 streams (equality predicate)
– Example: match CDRs with prices in the same range
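To illustrate why a join needs state, here is a small Python sketch of a join over two streams. It assumes the two streams arrive interleaved in timestamp order as `(side, tuple)` pairs, that the first field of each tuple is its timestamp, and that tuples only match when they are less than `window` time units apart. The function name `window_join` and this input encoding are hypothetical.

```python
from collections import deque


def window_join(events, window, match):
    """Join two streams given as (side, tup) events in timestamp order.

    side is "L" or "R"; tup[0] is the timestamp. Each stream keeps a buffer
    of recent tuples, which is the state of the operator.
    """
    buffers = {"L": deque(), "R": deque()}
    for side, tup in events:
        other = "R" if side == "L" else "L"
        ts = tup[0]
        # Purge tuples of the opposite stream that are too old to match.
        while buffers[other] and buffers[other][0][0] <= ts - window:
            buffers[other].popleft()
        # Compare the new tuple with what is left of the opposite stream.
        for candidate in buffers[other]:
            if match(tup, candidate):
                yield (tup, candidate) if side == "L" else (candidate, tup)
        buffers[side].append(tup)


# Tuples are (time, caller, price); match on equal price.
events = [
    ("L", (0, "A", 5)),
    ("R", (10, "B", 5)),
    ("R", (20, "C", 7)),
    ("L", (4000, "D", 5)),
]
pairs = list(window_join(events, 3600, lambda a, b: a[2] == b[2]))
```

The last tuple finds no partner because everything in the opposite buffer is more than 3600 time units old by the time it arrives.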

System Model
• Continuous Query: graph of operators/streams
– Map: convert € → $
– Filter: only > $10
– Agg: count calls made by each Caller number
• Schemas along the query (Map → Filter → Agg):
– Input: Caller, Callee, Time (secs), Price (€)
– After Map: Caller, Callee, Time (secs), Price ($)
– After Filter: Caller, Callee, Time (secs), Price ($)
– After Agg: Caller, Calls, Time (secs)
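The Map → Filter → Agg query can be sketched by chaining Python generators. This is a simplified illustration: it keeps a running count per caller and ignores windows, and the exchange rate `EUR_TO_USD` and the function names are assumptions.

```python
from collections import Counter

EUR_TO_USD = 1.1  # assumed exchange rate, for illustration only


def to_usd(stream):
    """Map: convert the price field from euros to dollars."""
    for caller, callee, time, price_eur in stream:
        yield caller, callee, time, price_eur * EUR_TO_USD


def only_above(stream, threshold_usd):
    """Filter: keep calls whose price exceeds the threshold."""
    for t in stream:
        if t[3] > threshold_usd:
            yield t


def count_calls(stream):
    """Agg: emit (caller, calls so far, time) for every forwarded call."""
    counts = Counter()
    for caller, _callee, time, _price in stream:
        counts[caller] += 1
        yield caller, counts[caller], time


cdrs = [("A", "B", 480, 12.0), ("C", "D", 500, 7.0), ("A", "E", 515, 10.0)]
results = list(count_calls(only_above(to_usd(cdrs), 10.0)))
```

Each stage changes the schema exactly as listed above: the Map replaces the price unit, the Filter leaves the schema untouched, and the Agg replaces Callee and Price with the call count.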

System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: 1 hour windows (advancing every 20 minutes)
time → [8:00,9:00)  [8:20,9:20)  [8:40,9:40)
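A short Python sketch of how a timestamp maps to the sliding windows that contain it. It assumes times are encoded as minutes since midnight and that window starts are aligned to multiples of the advance; `windows_for` is a hypothetical helper name.

```python
def windows_for(ts, size, advance):
    """Return the [start, end) pairs of all windows that contain ts."""
    start = (ts // advance) * advance   # latest window starting at or before ts
    found = []
    while start > ts - size and start >= 0:
        found.append((start, start + size))
        start -= advance
    return list(reversed(found))


# 8:45 is 525 minutes after midnight; 1 hour windows advancing every 20 minutes.
containing = windows_for(525, 60, 20)
```

With these parameters every tuple falls into size / advance = 3 windows, here [8:00,9:00), [8:20,9:20) and [8:40,9:40).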

System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: count tuples - 1 hour windows
– Tuples arrive at: 8:05 8:15 8:22 8:45 9:05
– Window [8:00,9:00): Output: 4
– Next window: [8:20,9:20)
• What about out-of-order tuples?
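The count in the example can be checked with a few lines of Python. This is a plain definition of the result, not a streaming implementation, and it assumes times are encoded as minutes since midnight.

```python
def count_in_window(timestamps, start, end):
    """Count the timestamps that fall in the half-open interval [start, end)."""
    return sum(1 for ts in timestamps if start <= ts < end)


arrivals = [485, 495, 502, 525, 545]   # 8:05 8:15 8:22 8:45 9:05

first_window = count_in_window(arrivals, 480, 540)    # [8:00,9:00)
second_window = count_in_window(arrivals, 500, 560)   # [8:20,9:20)
shuffled = count_in_window([545, 485, 525, 495, 502], 480, 540)
```

Computed over the full list, the count does not depend on arrival order. A streaming engine, however, has to decide when a window is complete, and a tuple that arrives after that decision is where out-of-order data becomes a problem.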


Spark’s window operations
(source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)

Java:

// Reduce function adding two integers, defined separately for clarity
Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() {
  @Override public Integer call(Integer i1, Integer i2) {
    return i1 + i2;
  }
};

// Reduce last 30 seconds of data, every 10 seconds
JavaPairDStream<String, Integer> windowedWordCounts = pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10));

Python:

# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
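To make the role of the two lambdas in the Python call concrete, here is a plain-Python model of the idea. It is not Spark code and not Spark's implementation: the function below is a hypothetical stand-in that takes a list of batches (one per batch interval) and a window length counted in batches.

```python
from collections import deque


def reduce_by_key_and_window(batches, func, inv_func, window_batches):
    """Yield one {key: value} dict per batch, covering the last window_batches batches."""
    window = deque()   # per-batch partial aggregates currently inside the window
    state = {}         # running aggregate per key

    for batch in batches:
        # Pre-aggregate the new batch on its own.
        partial = {}
        for key, value in batch:
            partial[key] = func(partial[key], value) if key in partial else value

        # "Reduce" the new data entering the window.
        for key, value in partial.items():
            state[key] = func(state[key], value) if key in state else value
        window.append(partial)

        # "Inverse reduce" the old data leaving the window.
        if len(window) > window_batches:
            for key, value in window.popleft().items():
                state[key] = inv_func(state[key], value)

        # Note: keys whose value drops back to 0 stay in the state in this sketch.
        yield dict(state)


# A 30 second window over 10 second batches is 3 batches long.
batches = [[("a", 1), ("b", 1)], [("a", 1)], [("a", 1)], [("b", 1)]]
results = list(
    reduce_by_key_and_window(batches, lambda x, y: x + y, lambda x, y: x - y, 3)
)
```

Each step touches only the newest and the oldest batch, so the cost per slide does not grow with the window length.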

countByWindow(windowLength, slideInterval)
  Return a sliding window count of elements in the stream.

reduceByWindow(func, windowLength, slideInterval)
  Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...]

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
  A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]

Maintaining tuples or windows?
• Maintain tuples
– Tuples: 8:05 8:15 8:22 8:45 9:05
– Windows: [8:00,9:00), then [8:20,9:20)
• When the window shifts:
1. Remove contribution of stale tuples
2. Go on adding new incoming tuples
• Need to maintain a single window instance
• Need to maintain all the tuples (how many?)
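The "maintain tuples" strategy can be sketched as follows, using a count as the aggregate. The sketch assumes in-order arrival and times encoded as minutes since midnight; the class name `TupleBuffer` and its interface are hypothetical.

```python
from collections import deque


class TupleBuffer:
    """A single sliding-window instance that keeps the tuples it covers."""

    def __init__(self, start, size, advance):
        self.start, self.size, self.advance = start, size, advance
        self.tuples = deque()   # timestamps currently inside the window
        self.count = 0          # aggregate over the stored tuples

    def add(self, ts):
        """Add a timestamp; return the (start, end, count) results it closes."""
        closed = []
        while ts >= self.start + self.size:
            closed.append((self.start, self.start + self.size, self.count))
            self.start += self.advance          # the window shifts
            # 1. Remove the contribution of stale tuples.
            while self.tuples and self.tuples[0] < self.start:
                self.tuples.popleft()
                self.count -= 1
        # 2. Go on adding new incoming tuples.
        self.tuples.append(ts)
        self.count += 1
        return closed


window = TupleBuffer(start=480, size=60, advance=20)   # [8:00,9:00)
outputs = []
for ts in [485, 495, 502, 525, 545]:                   # 8:05 ... 9:05
    outputs.extend(window.add(ts))
```

After 9:05 arrives, the window has shifted to [8:20,9:20) and still stores 8:22, 8:45 and 9:05. Memory grows with the number of tuples per window, which is the cost the slide points at.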

Maintaining tuples or windows?
• Maintain windows
– Tuples: 8:05 8:15 8:22 8:45 9:05
– [8:00,9:00) – 3 (so far...)
– [8:20,9:20) – 1 (so far...)
• When a tuple arrives:
1. Add its contribution to all the windows it falls in
• No need to maintain tuples
• Need to maintain all the windows to which each tuple contributes
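The "maintain windows" strategy can be sketched with one counter per window and no stored tuples. As before, this assumes times in minutes since midnight and window starts aligned to multiples of the advance; `add_to_windows` is a hypothetical helper name.

```python
def add_to_windows(counts, ts, size, advance):
    """Add one tuple's contribution to every window that contains ts."""
    start = (ts // advance) * advance   # latest window starting at or before ts
    while start > ts - size and start >= 0:
        counts[start] = counts.get(start, 0) + 1
        start -= advance


counts = {}   # window start -> count so far
for ts in [485, 495, 502]:   # 8:05 8:15 8:22
    add_to_windows(counts, ts, 60, 20)
```

After the first three tuples, `counts[480]` is 3 and `counts[500]` is 1, matching the "so far" values on the slide. The dictionary also holds the earlier windows starting at 7:20 and 7:40, which shows that the state now grows with the number of open windows instead of the number of tuples.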


References (non-exhaustive list)
Bedtime reading about Data Streaming
1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream
Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012.
Shared-nothing parallelism / Elasticity
1. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano,
Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th
International Conference on Distributed Computing Systems (ICDCS) 2010
2. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo
Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente,
Patrick Valduriez. IEEE Transactions on Parallel and Distributed Processing
(TPDS)

Shared-memory parallelism / fine-grained synchronization
1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis
Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data
(IEEE Big Data 2015)
2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate
Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas
Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015)
3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman,
Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014
Streaming + Security / Privacy / Cyber-physical systems
1. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo
Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16)
2. BES – Differentially Private and Distributed Event Aggregation in Advanced Metering
Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd
ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16],
2016.
3. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano,
Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in
Communication Networks (SecureComm) 2014

• Motivation / System Model
1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues
in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM.
2. Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream
processing. SIGMOD Rec., 34(4), December 2005.
3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM,
MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers,
EDBT ’02, London, UK, 2002. Springer-Verlag.

• Centralized Stream Processing Engines
1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh
Srivastava, and Jennifer Widom. Stream: The Stanford data stream management system. Springer, 2004.
2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic
foundations and query execution. The VLDB Journal, 15(2), June 2006.
3. Daniel J. Abadi, Don Carney, Uǧur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee,
Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data
stream management. The VLDB Journal, 12(2), August 2003.
4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data
streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06.
VLDB Endowment, 2006.

• Distributed Stream Processing Engines
1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uǧur Çetintemel, Mitch Cherniack, Jeong-Hyon
Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and
Stanley B. Zdonik. The design of the borealis stream processing engine. In CIDR, pages 277–289, 2005.
2. Magdalena Balazinska, Hari Balakrishnan, Samuel R Madden, and Michael Stonebraker. Fault-tolerance
in the borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008.
ACM ID: 1331907.
3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In
Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London,
UK, 2001. Springer-Verlag.
4. Jeong-hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and
Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS,
2003.
5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker,
and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering,
International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

• Parallel Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. Streamcloud:
A large scale data streaming system. In ICDCS 2010: International Conference on Distributed
Computing Systems, pages 126–137, June 2010.
2. Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive
partitioning operator for continuous query systems. In ICDE, 2002.

• Elastic Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, and Patrick
Valduriez. Streamcloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel
and Distributed Systems, 99(PrePrints), 2012.
2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral
Symposium, MDS ’11, New York, NY, USA, 2011. ACM.
3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly
available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops,
EDBT-ICDT ’12, New York, NY, USA, 2012. ACM.
4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of
data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.
