Flink Forward Europe
8 October 2019
VASILIKI KALAVRI
vasia@apache.org
SELF-MANAGED AND
AUTOMATICALLY RECONFIGURABLE
STREAM PROCESSING
@vkalavri
[Timeline]
1992: Tapestry · 2000: NiagaraCQ, Aurora, TelegraphCQ, STREAM · 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Stream Database Systems (Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM): single-node execution, synopses and sketches
Dataflow Systems (Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow): distributed execution, partitioned state
[Timeline, continued] 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Next-gen streaming: Re-configurable Systems
Automatic scaling · Adaptive scheduling · Straggler mitigation · Query optimization
[Control loop: instrumented stream processor → performance metrics → Profiler → Analyzer → decision → invoke re-configuration of the job]
SNAILTRAIL: GENERALIZING CRITICAL PATHS FOR ONLINE ANALYSIS OF DISTRIBUTED DATAFLOWS (NSDI’18)
CONVENTIONAL PROFILING TELLS ONLY PART OF THE STORY
Duration · Aggregate data exchange · Dataflow graph · Custom aggregate metrics
PROFILING SPARK SCHEDULING
[Activity swimlanes for DRIVER, W1, W2, W3 showing processing and scheduling; plots of CP and %weight per snapshot (0–15), broken down into Processing and Scheduling]
[Activity timelines for worker 1–3: receive message, deserialization, processing, serialization, send message, waiting]
OPTIMIZING PROCESSING…
…INCREASED WAITING
[Worker 1–3 timelines: shortening processing alone shifts the time into waiting]
CRITICAL PATH ANALYSIS
CRITICAL PATH: LONGEST EXECUTION PATH
(not considering waiting activities)
[W1–W3 timelines with labeled activities a, b, c, d; the critical path is traced step by step across workers]
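To make the definition concrete, here is a minimal sketch (a toy activity graph with invented labels and durations, not SnailTrail's code) that finds the longest execution path while giving waiting edges zero weight:

```python
# A minimal sketch of critical-path extraction on a toy activity graph. Waiting
# activities get zero weight, so the "longest execution path" ignores them.
# The graph, labels, and durations below are invented for illustration.
from functools import lru_cache

# Edges: (src event, dst event, activity label, duration, is_waiting)
edges = [
    ("e0", "e1", "a", 4.0, False),
    ("e0", "e2", "c", 1.0, False),
    ("e1", "e3", "wait", 5.0, True),
    ("e2", "e3", "b", 2.0, False),
    ("e3", "e4", "d", 3.0, False),
]

succ = {}
for s, d, label, dur, waiting in edges:
    succ.setdefault(s, []).append((d, label, 0.0 if waiting else dur))

@lru_cache(maxsize=None)
def longest(node):
    """Return (execution time, path) of the heaviest path starting at node."""
    best = (0.0, [])
    for d, label, w in succ.get(node, []):
        t, path = longest(d)
        if w + t > best[0]:
            best = (w + t, [label] + path)
    return best

print(longest("e0"))  # -> (7.0, ['a', 'wait', 'd']); the wait edge is traversed but adds no time
```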
OPTIMIZING CRITICAL ACTIVITIES CAN REDUCE LATENCY
[W1–W3 timelines with activities a, b, c, d: shortening a critical activity leads to reduced execution time]
ONLINE CRITICAL PATH ANALYSIS
ONLINE ANALYSIS OF TRACE SNAPSHOTS
[The running dataflow consumes the input stream and produces the output stream; periodic snapshots of its execution trace form a trace snapshot stream, which the analyzer consumes to produce a stream of performance summaries]
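As a rough sketch of the snapshot idea — generic fixed-length windowing, not SnailTrail's actual ingestion logic — trace events can be bucketed into snapshots and each snapshot handed to the analyzer as soon as it closes:

```python
# A minimal sketch of turning a stream of trace events into a stream of trace
# snapshots for online analysis. Event fields and the 1-second snapshot length
# are assumptions; this is not SnailTrail's ingestion code.
from collections import defaultdict

SNAPSHOT_LEN = 1.0  # seconds

def snapshots(trace_events):
    """trace_events: iterable of dicts with a 'ts' field, in timestamp order.
    Yields (snapshot_index, [events]) as soon as each snapshot closes."""
    buckets = defaultdict(list)
    current = None
    for ev in trace_events:
        idx = int(ev["ts"] // SNAPSHOT_LEN)
        if current is not None and idx > current:
            yield current, buckets.pop(current)
        buckets[idx].append(ev)
        current = idx
    if current is not None:
        yield current, buckets.pop(current)

# Example: the analyzer (e.g. CP computation) runs once per closed snapshot.
events = [{"ts": 0.2, "op": "map"}, {"ts": 0.9, "op": "shuffle"}, {"ts": 1.3, "op": "map"}]
for idx, snap in snapshots(events):
    print(f"snapshot {idx}: {len(snap)} events -> analyze")
```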
[Snapshot [ts, te] over workers W1–W3 with labeled activities a, b, c, d and cross-worker edges x, u, v, z]
All paths are potentially part of an evolving critical path:
▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
How to rank activities with regard to criticality?
Intuition: the more paths an activity appears on, the more likely it is to be critical.
[Example snapshot [ts, te] over W1–W3 containing 9 transient paths (numbered 1–9); the annotated centralities of individual activities are 9, 0, 0, 6, and 6 — the number of the 9 paths on which each activity appears]
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path — and it can be computed without path enumeration!
▸ centrality: the number of paths this activity appears on
▸ activity duration: the edge weight
▸ total number of paths in the snapshot

Definition (Transient Path Centrality). Let $P = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N\}$ be the set of $N$ transient paths of snapshot $G_{[t_s,t_e]}$. The transient path centrality of an edge $e \in G_{[t_s,t_e]}$ is defined as
$$c(e) = \sum_{i=1}^{N} c_i(e), \quad \text{where } c_i(e) = \begin{cases} 0 & e \notin \vec{p}_i \\ 1 & e \in \vec{p}_i \end{cases}$$
The following holds:
$$CP_a = \frac{TPC(a) \cdot a_w}{N\,(t_e - t_s)} \tag{3}$$
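The formula can be evaluated by counting, per edge, how many snapshot paths pass through it. A minimal sketch on a toy snapshot graph (invented nodes and durations; not the SnailTrail implementation) counts paths with two DAG passes instead of enumerating them:

```python
# A minimal sketch of the CP metric on a toy snapshot graph within [ts, te].
# Nodes, durations, and topology are invented; in a real PAG, waiting edges
# would carry zero weight. This is not SnailTrail's code.
from functools import lru_cache

ts, te = 0.0, 10.0
# Activity edges of the snapshot graph: (src, dst, duration)
edges = [("a0", "a1", 4.0), ("a0", "a2", 2.0), ("a1", "a3", 6.0), ("a2", "a3", 8.0)]

succ, pred = {}, {}
for s, d, _ in edges:
    succ.setdefault(s, []).append(d)
    pred.setdefault(d, []).append(s)
nodes = {n for s, d, _ in edges for n in (s, d)}
starts = [n for n in nodes if n not in pred]  # snapshot entry nodes

@lru_cache(maxsize=None)
def paths_to_end(n):      # number of paths from n to any exit node
    return 1 if n not in succ else sum(paths_to_end(m) for m in succ[n])

@lru_cache(maxsize=None)
def paths_from_start(n):  # number of paths from any entry node to n
    return 1 if n not in pred else sum(paths_from_start(m) for m in pred[n])

N = sum(paths_to_end(s) for s in starts)  # total number of transient paths

for s, d, w in edges:
    centrality = paths_from_start(s) * paths_to_end(d)  # paths through edge (s, d)
    cp = centrality * w / (N * (te - ts))               # CP_a = TPC(a) * a_w / (N (te - ts))
    print(f"{s}->{d}: centrality={centrality}, CP={cp:.2f}")
```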
SNAILTRAIL IN ACTION
Reference application (Apache Flink, Apache Spark, TensorFlow, Heron, Timely Dataflow, ...): profiling and trace generation → trace streams →
SnailTrail (built on Timely): trace ingestion → PAG construction → CP computation and activity ranking → CP-based performance summaries
DRIVER SCHEDULING IS CRITICAL
[Activity swimlanes for DRIVER, W1, W2, W3 (processing vs. scheduling); plots of CP and %weight per snapshot (0–15) show that scheduling on the driver is critical]
SNAILTRAIL V.2 DEMO
[Timeline recap] 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Next-gen streaming: Re-configurable Systems
Automatic scaling · Adaptive scheduling · Straggler mitigation · Query optimization
[Control loop: instrumented stream processor → performance metrics → Profiler → Analyzer → decision → invoke re-configuration of the job]
FAST AND ACCURATE AUTOMATIC SCALING DECISIONS FOR DISTRIBUTED STREAMING DATAFLOWS (OSDI’18)
Streaming systems must be capable of adapting the level of parallelism when conditions change at runtime.
[Plots of events/s over time comparing input rate and throughput: data loss, SLO violations, idle resources]
AUTOMATIC SCALING OVERVIEW
[Scaling controller loop: metrics → detect symptoms; policy → decide whether to scale; scaling action → decide how much to scale]
HEURISTIC SCALING APPROACHES
(Borealis, StreamCloud, Seep, IBM Streams, Spark Streaming, Google Dataflow, Dhalion)
▸ metrics: CPU utilization, backlog, tuples/s, backpressure signal — problematic under interference and multi-tenancy
▸ policy: threshold and rule-based, e.g. if CPU > 80% => scale — sensitive to noise, manual, hard to tune
▸ scaling action: small changes, one operator at a time — non-predictive, speculative steps
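For contrast with DS2 later, here is what such a threshold and rule-based policy boils down to — a minimal sketch with invented thresholds and metric names, mirroring no specific system from the list above:

```python
# A minimal sketch of a threshold/rule-based scaling policy, as criticized on
# this slide: reactive, non-predictive, one operator at a time. Thresholds and
# metric names are invented for illustration.

def heuristic_scaling_step(metrics):
    """metrics: {operator: {"cpu": float 0..1, "backlog": int, "backpressure": bool}}
    Returns at most one (operator, delta) action per invocation."""
    for op, m in metrics.items():
        if m["backpressure"] or m["cpu"] > 0.80 or m["backlog"] > 10_000:
            return (op, +1)   # speculative step: add one parallel instance
        if m["cpu"] < 0.20 and m["backlog"] == 0:
            return (op, -1)   # scale down cautiously
    return None               # no action this round

# Example: only o1 is flagged, so o2's needs are not even considered this round.
print(heuristic_scaling_step({
    "o1": {"cpu": 0.93, "backlog": 50_000, "backpressure": True},
    "o2": {"cpu": 0.55, "backlog": 0, "backpressure": False},
}))
```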
Effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow
[src → o1 → o2; target: 40 rec/s; observed rates of 10 rec/s and 100 rec/s; back-pressure!]
Which operator is the bottleneck? What if we scale o1 x4? How much to scale o2?
Which operator is the bottleneck? What if we scale o1 x4? How much to scale o2?
[Two possible explanations for the observed back-pressure, shown as src/o1/o2 timelines: o1 cannot keep up (src waits to produce output, o2 waits for input) vs. o2 cannot keep up]
THE DS2 MODEL
Intuition: use the dataflow graph to extract operator dependencies and system instrumentation to collect accurate, representative metrics.
[Example figure: src → o1 → o2, target: 40 rec/s; annotations over a 0.5s observation window: 10 recs, 100 recs, sub-intervals 1–4; True rate = 200 recs/s]
=> x4 o1 instances to keep up with the src rate, and x2 o2 instances to keep up with the x4 o1 instances.
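A minimal sketch of this intuition (made-up metric values and a simple linear topology; not the actual DS2 code): the true rate of one operator instance is its observed rate divided by its useful-time fraction, and one topological pass over the dataflow graph yields the required parallelism for every operator:

```python
# A minimal sketch of DS2-style scaling decisions, assuming a linear dataflow
# src -> o1 -> o2 and invented metric values. This is not the DS2 codebase.
import math

# Per-operator aggregates for one observation window:
#   observed_rate: records/s emitted by the whole operator,
#   useful_frac:   fraction of the window its instances spent doing useful work,
#   parallelism:   current number of instances,
#   selectivity:   output records produced per input record.
metrics = {
    "src": {"observed_rate": 10.0, "useful_frac": 1.0, "parallelism": 1, "selectivity": 1.0},
    "o1":  {"observed_rate": 10.0, "useful_frac": 1.0, "parallelism": 1, "selectivity": 1.0},
    "o2":  {"observed_rate": 10.0, "useful_frac": 0.5, "parallelism": 1, "selectivity": 1.0},
}
upstream = {"o1": "src", "o2": "o1"}
topo_order = ["src", "o1", "o2"]
target_source_rate = 40.0  # rec/s the pipeline must sustain

# True rate of ONE instance = observed rate / (useful-time fraction * parallelism).
true_rate = {op: m["observed_rate"] / (m["useful_frac"] * m["parallelism"])
             for op, m in metrics.items()}

# Walk the dataflow topologically: each operator must keep up with the rate its
# re-scaled upstream operator will produce; its own output follows from selectivity.
out_rate, decision = {}, {}
for op in topo_order:
    if op not in upstream:                      # source: externally imposed rate
        out_rate[op] = target_source_rate
        decision[op] = metrics[op]["parallelism"]
    else:
        required_in = out_rate[upstream[op]]
        decision[op] = math.ceil(required_in / true_rate[op])
        out_rate[op] = required_in * metrics[op]["selectivity"]

print(decision)  # -> {'src': 1, 'o1': 4, 'o2': 2}
```

With these invented numbers the single pass mirrors the slide's conclusion: four o1 instances to keep up with the source, two o2 instances to keep up with the scaled o1.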
DS2 MAKES LINEAR PREDICTIONS
If operator scaling is linear, then:
▸ no overshoot when scaling up
▸ no undershoot when scaling down
[Plots of rate vs. parallelism for scaling up and scaling down: initial rate at p0, linear prediction p1 toward the target, actual rate curve, and p’]
Ideal rates act as an upper bound when scaling up and as a lower bound when scaling down:
▸ DS2 will converge monotonically to the target rate
DS2 MINIMIZES THE ERROR UNTIL CONVERGENCE
[Plot of rate vs. parallelism: the actual rate achieved at the predicted parallelism p1 misses the target by an error; a new prediction p1’ is made from the updated measurements — DS2 gradually minimizes the error]
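Putting the last two slides together, a small sketch with a made-up, slightly sub-linear scaling curve shows how repeated linear predictions shrink the error and converge to the target rate:

```python
# A minimal sketch of DS2's iterative convergence, assuming a made-up scaling
# curve. measured_rate() stands in for re-measuring a real operator after each
# re-scale; it scales sub-linearly, so a single linear prediction undershoots.
import math

def measured_rate(parallelism):
    # Hypothetical: each added instance contributes a bit less (90% efficiency).
    return 100.0 * parallelism ** 0.9

target = 1000.0
p = 1
for step in range(1, 6):
    rate = measured_rate(p)
    if rate >= target:
        print(f"step {step}: p={p}, rate={rate:.0f} -> converged")
        break
    per_instance = rate / p               # true rate of one instance at this p
    p = math.ceil(target / per_instance)  # linear prediction for the next step
    print(f"step {step}: rate={rate:.0f}, predict p={p}")
```

With this made-up curve the loop converges on its third measurement, and because each prediction undershoots, convergence is monotone.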
EVALUATION
[Architecture: the instrumented stream processor (Timely dataflow, Apache Flink) reports metrics to a Metrics Repository; the Scaling Manager monitors the job, pulls metrics, invokes the Scaling Policy for a decision, and re-scales the job]
DS2 VS. STATE-OF-THE-ART ON HERON
Initially under-provisioned wordcount dataflow; target rate: 16,700 rec/s.
▸ DS2 converges in a single step for both operators (+12 mappers, +10 counts) and converges in 60s, as soon as it receives the Heron metrics.
▸ Dhalion scales one operator at a time, needs six steps in total (steps 1–6 in the plot), and converges in 2000s.
DS2 ON APACHE FLINK
Initially under-provisioned wordcount; target rate: 2,000,000 rec/s, dropping to half at 800s.
▸ DS2 converges in 2 steps for both operators, with transient under-provisioning by 1 instance.
▸ DS2 reacts within 3s when the target rate drops.
github.com/strymon-system

Kalavri V., Liagouris J., Hoffmann M., Dimitrova D., Forshaw M., Roscoe T. Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows. OSDI ’18.

Hoffmann M., Lattuada A., Liagouris J., Kalavri V., Dimitrova D., Wicki S., Chothia Z., Roscoe T. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows. NSDI ’18.

github.com/li1/snailtrail
Zaheer Chothia, Andrea Lattuada, Timothy Roscoe, Moritz Hoffmann, Desislava Dimitrova, John Liagouris, Malte Sandstede, Matthew Forshaw, Sebastian Wicki
strymon.systems.ethz.ch
Let’s work on streaming research together — www.bu.edu/cs/phd-program/phd/
Flink Forward Europe
8 October 2019
VASILIKI KALAVRI
vasia@apache.org
SELF-MANAGED AND
AUTOMATICALLY RECONFIGURABLE
STREAM PROCESSING
@vkalavri
