Flink Forward Europe
8 October 2019
VASILIKI KALAVRI
vasia@apache.org
SELF-MANAGED AND
AUTOMATICALLY RECONFIGURABLE
STREAM PROCESSING
@vkalavri
[Timeline]
1992: Tapestry · 2000: NiagaraCQ, Aurora, TelegraphCQ, STREAM · 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Stream Database Systems (Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM): single-node execution, synopses and sketches
Dataflow Systems (Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow): distributed execution, partitioned state
[Timeline, continued] 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Next-gen streaming: Re-configurable Systems
Automatic scaling · Adaptive scheduling · Straggler mitigation · Query optimization
[Control loop: instrumented stream processor → performance metrics → Profiler → Analyzer → decision → invoke re-configuration of the job]
SNAILTRAIL: GENERALIZING CRITICAL PATHS FOR ONLINE ANALYSIS OF DISTRIBUTED DATAFLOWS (NSDI’18)
CONVENTIONAL PROFILING TELLS ONLY PART OF THE STORY
Duration · Aggregate data exchange · Dataflow graph · Custom aggregate metrics
PROFILING SPARK SCHEDULING
[Activity swimlanes for DRIVER, W1, W2, W3 showing processing and scheduling; plots of CP and %weight per snapshot (0–15), broken down into Processing and Scheduling]
[Activity timelines for worker 1–3: receive message, deserialization, processing, serialization, send message, waiting]
OPTIMIZING PROCESSING…
…INCREASED WAITING
[Worker 1–3 timelines: shortening processing alone shifts the time into waiting]
CRITICAL PATH ANALYSIS
CRITICAL PATH: LONGEST EXECUTION PATH
(not considering waiting activities)
[W1–W3 timelines with labeled activities a, b, c, d; the critical path is traced step by step across workers]
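To make the definition concrete, here is a minimal sketch (a toy activity graph with invented labels and durations, not SnailTrail's code) that finds the longest execution path while giving waiting edges zero weight:

```python
# A minimal sketch of critical-path extraction on a toy activity graph. Waiting
# activities get zero weight, so the "longest execution path" ignores them.
# The graph, labels, and durations below are invented for illustration.
from functools import lru_cache

# Edges: (src event, dst event, activity label, duration, is_waiting)
edges = [
    ("e0", "e1", "a", 4.0, False),
    ("e0", "e2", "c", 1.0, False),
    ("e1", "e3", "wait", 5.0, True),
    ("e2", "e3", "b", 2.0, False),
    ("e3", "e4", "d", 3.0, False),
]

succ = {}
for s, d, label, dur, waiting in edges:
    succ.setdefault(s, []).append((d, label, 0.0 if waiting else dur))

@lru_cache(maxsize=None)
def longest(node):
    """Return (execution time, path) of the heaviest path starting at node."""
    best = (0.0, [])
    for d, label, w in succ.get(node, []):
        t, path = longest(d)
        if w + t > best[0]:
            best = (w + t, [label] + path)
    return best

print(longest("e0"))  # -> (7.0, ['a', 'wait', 'd']); the wait edge is traversed but adds no time
```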
OPTIMIZING CRITICAL ACTIVITIES CAN REDUCE LATENCY
[W1–W3 timelines with activities a, b, c, d: shortening a critical activity leads to reduced execution time]
ONLINE CRITICAL PATH ANALYSIS
ONLINE ANALYSIS OF TRACE SNAPSHOTS
[The running dataflow consumes the input stream and produces the output stream; periodic snapshots of its execution trace form a trace snapshot stream, which the analyzer consumes to produce a stream of performance summaries]
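As a rough sketch of the snapshot idea — generic fixed-length windowing, not SnailTrail's actual ingestion logic — trace events can be bucketed into snapshots and each snapshot handed to the analyzer as soon as it closes:

```python
# A minimal sketch of turning a stream of trace events into a stream of trace
# snapshots for online analysis. Event fields and the 1-second snapshot length
# are assumptions; this is not SnailTrail's ingestion code.
from collections import defaultdict

SNAPSHOT_LEN = 1.0  # seconds

def snapshots(trace_events):
    """trace_events: iterable of dicts with a 'ts' field, in timestamp order.
    Yields (snapshot_index, [events]) as soon as each snapshot closes."""
    buckets = defaultdict(list)
    current = None
    for ev in trace_events:
        idx = int(ev["ts"] // SNAPSHOT_LEN)
        if current is not None and idx > current:
            yield current, buckets.pop(current)
        buckets[idx].append(ev)
        current = idx
    if current is not None:
        yield current, buckets.pop(current)

# Example: the analyzer (e.g. CP computation) runs once per closed snapshot.
events = [{"ts": 0.2, "op": "map"}, {"ts": 0.9, "op": "shuffle"}, {"ts": 1.3, "op": "map"}]
for idx, snap in snapshots(events):
    print(f"snapshot {idx}: {len(snap)} events -> analyze")
```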
[Snapshot [ts, te] over workers W1–W3 with labeled activities a, b, c, d and cross-worker edges x, u, v, z]
All paths are potentially part of an evolving critical path:
▸ All paths have the same length: te - ts
▸ Choosing a random path might miss critical activities
How to rank activities with regard to criticality?
Intuition: the more paths an activity appears on, the more likely it is to be critical.
[Example snapshot [ts, te] over W1–W3 containing 9 transient paths (numbered 1–9); the annotated centralities of individual activities are 9, 0, 0, 6, and 6 — the number of the 9 paths on which each activity appears]
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path — and it can be computed without path enumeration!
▸ centrality: the number of paths this activity appears on
▸ activity duration: the edge weight
▸ total number of paths in the snapshot

Definition (Transient Path Centrality). Let $P = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N\}$ be the set of $N$ transient paths of snapshot $G_{[t_s,t_e]}$. The transient path centrality of an edge $e \in G_{[t_s,t_e]}$ is defined as
$$c(e) = \sum_{i=1}^{N} c_i(e), \quad \text{where } c_i(e) = \begin{cases} 0 & e \notin \vec{p}_i \\ 1 & e \in \vec{p}_i \end{cases}$$
The following holds:
$$CP_a = \frac{TPC(a) \cdot a_w}{N\,(t_e - t_s)} \tag{3}$$
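The formula can be evaluated by counting, per edge, how many snapshot paths pass through it. A minimal sketch on a toy snapshot graph (invented nodes and durations; not the SnailTrail implementation) counts paths with two DAG passes instead of enumerating them:

```python
# A minimal sketch of the CP metric on a toy snapshot graph within [ts, te].
# Nodes, durations, and topology are invented; in a real PAG, waiting edges
# would carry zero weight. This is not SnailTrail's code.
from functools import lru_cache

ts, te = 0.0, 10.0
# Activity edges of the snapshot graph: (src, dst, duration)
edges = [("a0", "a1", 4.0), ("a0", "a2", 2.0), ("a1", "a3", 6.0), ("a2", "a3", 8.0)]

succ, pred = {}, {}
for s, d, _ in edges:
    succ.setdefault(s, []).append(d)
    pred.setdefault(d, []).append(s)
nodes = {n for s, d, _ in edges for n in (s, d)}
starts = [n for n in nodes if n not in pred]  # snapshot entry nodes

@lru_cache(maxsize=None)
def paths_to_end(n):      # number of paths from n to any exit node
    return 1 if n not in succ else sum(paths_to_end(m) for m in succ[n])

@lru_cache(maxsize=None)
def paths_from_start(n):  # number of paths from any entry node to n
    return 1 if n not in pred else sum(paths_from_start(m) for m in pred[n])

N = sum(paths_to_end(s) for s in starts)  # total number of transient paths

for s, d, w in edges:
    centrality = paths_from_start(s) * paths_to_end(d)  # paths through edge (s, d)
    cp = centrality * w / (N * (te - ts))               # CP_a = TPC(a) * a_w / (N (te - ts))
    print(f"{s}->{d}: centrality={centrality}, CP={cp:.2f}")
```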
SNAILTRAIL IN ACTION
Reference application (Apache Flink, Apache Spark, TensorFlow, Heron, Timely Dataflow, ...): profiling and trace generation → trace streams →
SnailTrail (built on Timely): trace ingestion → PAG construction → CP computation and activity ranking → CP-based performance summaries
DRIVER SCHEDULING IS CRITICAL
[Activity swimlanes for DRIVER, W1, W2, W3 (processing vs. scheduling); plots of CP and %weight per snapshot (0–15) show that scheduling on the driver is critical]
SNAILTRAIL V.2 DEMO
[Timeline recap] 2004: MapReduce · 2013: Naiad, Spark Streaming, Samza, Flink, MillWheel, Storm, S4, Google Dataflow · Now: next-gen streaming

Next-gen streaming: Re-configurable Systems
Automatic scaling · Adaptive scheduling · Straggler mitigation · Query optimization
[Control loop: instrumented stream processor → performance metrics → Profiler → Analyzer → decision → invoke re-configuration of the job]
FAST AND ACCURATE AUTOMATIC SCALING DECISIONS FOR DISTRIBUTED STREAMING DATAFLOWS (OSDI’18)
Streaming systems must be capable of adapting the level of parallelism when conditions change at runtime.
[Plots of events/s over time comparing input rate and throughput: data loss, SLO violations, idle resources]
AUTOMATIC SCALING OVERVIEW
[Scaling controller loop: metrics → detect symptoms; policy → decide whether to scale; scaling action → decide how much to scale]
HEURISTIC SCALING APPROACHES
(Borealis, StreamCloud, Seep, IBM Streams, Spark Streaming, Google Dataflow, Dhalion)
▸ metrics: CPU utilization, backlog, tuples/s, backpressure signal — problematic under interference and multi-tenancy
▸ policy: threshold and rule-based, e.g. if CPU > 80% => scale — sensitive to noise, manual, hard to tune
▸ scaling action: small changes, one operator at a time — non-predictive, speculative steps
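For contrast with DS2 later, here is what such a threshold and rule-based policy boils down to — a minimal sketch with invented thresholds and metric names, mirroring no specific system from the list above:

```python
# A minimal sketch of a threshold/rule-based scaling policy, as criticized on
# this slide: reactive, non-predictive, one operator at a time. Thresholds and
# metric names are invented for illustration.

def heuristic_scaling_step(metrics):
    """metrics: {operator: {"cpu": float 0..1, "backlog": int, "backpressure": bool}}
    Returns at most one (operator, delta) action per invocation."""
    for op, m in metrics.items():
        if m["backpressure"] or m["cpu"] > 0.80 or m["backlog"] > 10_000:
            return (op, +1)   # speculative step: add one parallel instance
        if m["cpu"] < 0.20 and m["backlog"] == 0:
            return (op, -1)   # scale down cautiously
    return None               # no action this round

# Example: only o1 is flagged, so o2's needs are not even considered this round.
print(heuristic_scaling_step({
    "o1": {"cpu": 0.93, "backlog": 50_000, "backpressure": True},
    "o2": {"cpu": 0.55, "backlog": 0, "backpressure": False},
}))
```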
Effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow
[src → o1 → o2; target: 40 rec/s; observed rates of 10 rec/s and 100 rec/s; back-pressure!]
Which operator is the bottleneck? What if we scale o1 x4? How much to scale o2?
Which operator is the bottleneck? What if we scale o1 x4? How much to scale o2?
[Two possible explanations for the observed back-pressure, shown as src/o1/o2 timelines: o1 cannot keep up (src waits to produce output, o2 waits for input) vs. o2 cannot keep up]
THE DS2 MODEL
Intuition: use the dataflow graph to extract operator dependencies and system instrumentation to collect accurate, representative metrics.
[Example figure: src → o1 → o2, target: 40 rec/s; annotations over a 0.5s observation window: 10 recs, 100 recs, sub-intervals 1–4; True rate = 200 recs/s]
=> x4 o1 instances to keep up with the src rate, and x2 o2 instances to keep up with the x4 o1 instances.
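A minimal sketch of this intuition (made-up metric values and a simple linear topology; not the actual DS2 code): the true rate of one operator instance is its observed rate divided by its useful-time fraction, and one topological pass over the dataflow graph yields the required parallelism for every operator:

```python
# A minimal sketch of DS2-style scaling decisions, assuming a linear dataflow
# src -> o1 -> o2 and invented metric values. This is not the DS2 codebase.
import math

# Per-operator aggregates for one observation window:
#   observed_rate: records/s emitted by the whole operator,
#   useful_frac:   fraction of the window its instances spent doing useful work,
#   parallelism:   current number of instances,
#   selectivity:   output records produced per input record.
metrics = {
    "src": {"observed_rate": 10.0, "useful_frac": 1.0, "parallelism": 1, "selectivity": 1.0},
    "o1":  {"observed_rate": 10.0, "useful_frac": 1.0, "parallelism": 1, "selectivity": 1.0},
    "o2":  {"observed_rate": 10.0, "useful_frac": 0.5, "parallelism": 1, "selectivity": 1.0},
}
upstream = {"o1": "src", "o2": "o1"}
topo_order = ["src", "o1", "o2"]
target_source_rate = 40.0  # rec/s the pipeline must sustain

# True rate of ONE instance = observed rate / (useful-time fraction * parallelism).
true_rate = {op: m["observed_rate"] / (m["useful_frac"] * m["parallelism"])
             for op, m in metrics.items()}

# Walk the dataflow topologically: each operator must keep up with the rate its
# re-scaled upstream operator will produce; its own output follows from selectivity.
out_rate, decision = {}, {}
for op in topo_order:
    if op not in upstream:                      # source: externally imposed rate
        out_rate[op] = target_source_rate
        decision[op] = metrics[op]["parallelism"]
    else:
        required_in = out_rate[upstream[op]]
        decision[op] = math.ceil(required_in / true_rate[op])
        out_rate[op] = required_in * metrics[op]["selectivity"]

print(decision)  # -> {'src': 1, 'o1': 4, 'o2': 2}
```

With these invented numbers the single pass mirrors the slide's conclusion: four o1 instances to keep up with the source, two o2 instances to keep up with the scaled o1.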
DS2 MAKES LINEAR PREDICTIONS
If operator scaling is linear, then:
▸ no overshoot when scaling up
▸ no undershoot when scaling down
[Plots of rate vs. parallelism for scaling up and scaling down: initial rate at p0, linear prediction p1 toward the target, actual rate curve, and p’]
Ideal rates act as an upper bound when scaling up and as a lower bound when scaling down:
▸ DS2 will converge monotonically to the target rate
DS2 MINIMIZES THE ERROR UNTIL CONVERGENCE
[Plot of rate vs. parallelism: the actual rate achieved at the predicted parallelism p1 misses the target by an error; a new prediction p1’ is made from the updated measurements — DS2 gradually minimizes the error]
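Putting the last two slides together, a small sketch with a made-up, slightly sub-linear scaling curve shows how repeated linear predictions shrink the error and converge to the target rate:

```python
# A minimal sketch of DS2's iterative convergence, assuming a made-up scaling
# curve. measured_rate() stands in for re-measuring a real operator after each
# re-scale; it scales sub-linearly, so a single linear prediction undershoots.
import math

def measured_rate(parallelism):
    # Hypothetical: each added instance contributes a bit less (90% efficiency).
    return 100.0 * parallelism ** 0.9

target = 1000.0
p = 1
for step in range(1, 6):
    rate = measured_rate(p)
    if rate >= target:
        print(f"step {step}: p={p}, rate={rate:.0f} -> converged")
        break
    per_instance = rate / p               # true rate of one instance at this p
    p = math.ceil(target / per_instance)  # linear prediction for the next step
    print(f"step {step}: rate={rate:.0f}, predict p={p}")
```

With this made-up curve the loop converges on its third measurement, and because each prediction undershoots, convergence is monotone.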
EVALUATION
[Architecture: the instrumented stream processor (Timely dataflow, Apache Flink) reports metrics to a Metrics Repository; the Scaling Manager monitors the job, pulls metrics, invokes the Scaling Policy for a decision, and re-scales the job]
DS2 VS. STATE-OF-THE-ART ON HERON
Initially under-provisioned wordcount dataflow; target rate: 16,700 rec/s.
▸ DS2 converges in a single step for both operators (+12 mappers, +10 counts) and converges in 60s, as soon as it receives the Heron metrics.
▸ Dhalion scales one operator at a time, needs six steps in total (steps 1–6 in the plot), and converges in 2000s.
DS2 ON APACHE FLINK
Initially under-provisioned wordcount; target rate: 2,000,000 rec/s, dropping to half at 800s.
▸ DS2 converges in 2 steps for both operators, with transient under-provisioning by 1 instance.
▸ DS2 reacts within 3s when the target rate drops.
github.com/strymon-system

Kalavri V., Liagouris J., Hoffmann M., Dimitrova D., Forshaw M., Roscoe T. Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows. OSDI ’18.

Hoffmann M., Lattuada A., Liagouris J., Kalavri V., Dimitrova D., Wicki S., Chothia Z., Roscoe T. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows. NSDI ’18.

github.com/li1/snailtrail
Zaheer Chothia, Andrea Lattuada, Timothy Roscoe, Moritz Hoffmann, Desislava Dimitrova, John Liagouris, Malte Sandstede, Matthew Forshaw, Sebastian Wicki
strymon.systems.ethz.ch
Let’s work on streaming research together — www.bu.edu/cs/phd-program/phd/
Flink Forward Europe
8 October 2019
VASILIKI KALAVRI
vasia@apache.org
SELF-MANAGED AND
AUTOMATICALLY RECONFIGURABLE
STREAM PROCESSING
@vkalavri
