ONLINE
PERFORMANCE
ANALYSIS
OF DISTRIBUTED DATAFLOW SYSTEMS
Vasia Kalavri
kalavriv@inf.ethz.ch
Support:
O’Reilly Velocity London
20 October 2017
ABOUT ME
▸ Postdoc at ETH Zürich
▸ Systems Group: https://www.systems.ethz.ch/
▸ PMC member of Apache Flink
▸ Research interests
▸ Large-scale graph processing
▸ Streaming dataflow engines
▸ Current project
▸ Predictive datacenter analytics and
management
@vkalavri
STRYMON: ONLINE DATACENTER ANALYTICS AND MANAGEMENT
[Figure: the datacenter sends event streams (traces, configuration, topology updates, …) to Strymon, which maintains a datacenter model and supports queries, complex analytics, simulations, …, as well as policy enforcement, what-if scenarios, …]
strymon.systems.ethz.ch
STRYMON IS BUILT ON TIMELY
▸ A streaming framework for data-parallel computations
▸ Arbitrary cyclic dataflows
▸ Logical timestamps (epochs)
▸ Asynchronous execution
▸ Low latency
D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi. Naiad: A Timely Dataflow System. In SOSP, 2013.
https://github.com/frankmcsherry/timely-dataflow
Strymon
▸ What-if scenarios
▸ Real-time datacenter analytics
▸ Incremental network routing
▸ Online critical path analysis
▸ Explaining simulations
Apache Flink
SnailTrail: Strymon’s component for online performance analysis of distributed dataflows
DATAFLOW PROGRAMMING
[Figure: a dataflow graph of sources, operators, and sinks connected by data exchange channels]
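To make the four building blocks concrete, here is a minimal single-process sketch in Python of a word count dataflow (the same kind of job as the WordCount benchmark used later). The function names and the hash-based routing are illustrative assumptions, not Flink or Timely APIs: a source emits lines, a `flat_map` operator splits them into words, an `exchange` step routes each word to one of `workers` partitions by hashing it, a `count` operator runs per partition, and a sink merges the partial results.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from typing import Iterable, Iterator


def source(lines: list[str]) -> Iterator[str]:
    """Source: emits input records (here, lines of text)."""
    yield from lines


def flat_map(lines: Iterable[str]) -> Iterator[str]:
    """Operator: splits each line into words."""
    for line in lines:
        yield from line.split()


def exchange(words: Iterable[str], workers: int) -> dict[int, list[str]]:
    """Data exchange: route each word to a worker by hashing it,
    so that all occurrences of a word meet at the same worker."""
    partitions: dict[int, list[str]] = defaultdict(list)
    for word in words:
        partitions[hash(word) % workers].append(word)
    return partitions


def count(words: Iterable[str]) -> Counter:
    """Operator: per-worker word count."""
    return Counter(words)


def sink(partials: Iterable[Counter]) -> Counter:
    """Sink: merges the per-worker counts."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(lines: list[str], workers: int = 2) -> Counter:
    """Wire source -> flat_map -> exchange -> count -> sink."""
    partitions = exchange(flat_map(source(lines)), workers)
    return sink(count(p) for p in partitions.values())
```

Hash partitioning is what makes the per-worker counts correct without further merging per word; it is also why data exchange creates the cross-worker dependencies discussed next.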
UNDERSTANDING THE PERFORMANCE OF DISTRIBUTED DATAFLOWS
Hard to troubleshoot:
▸ long-running, dynamic workloads
▸ many tasks, activities, operators, dependencies
▸ bottleneck causes are usually not isolated but span multiple processes
[Figure: a client, a scheduler, and workers]
PERFORMANCE METRICS
[Figure: a dataflow graph annotated with duration, aggregate data exchange, and custom aggregate metrics]
A PARALLEL EXECUTION
[Figure: timeline of workers W1, W2, W3 showing serialization, processing, waiting, and deserialization activities, and messages between workers]
Processing is the most time-consuming activity. What if we optimize it?
[Figure: the same execution with shorter processing activities]
No performance benefit for the parallel execution!
CONVENTIONAL PROFILING CAN BE MISLEADING
CRITICAL PATH ANALYSIS
THE PROGRAM ACTIVITY GRAPH (PAG)
[Figure: PAG of workers W1, W2, W3 between logical times t=k and t=k+1, with activities a, b, c, d and events x, u, v, z]
▸ Nodes are timestamped events: the start or end of a worker activity, e.g.
u = { timestamp: k+1, worker: 2 }
▸ Edges represent activities, annotated with a type and a duration, e.g.
(u, v) = { type: serialization, duration: 1 }
The Program Activity Graph captures computational dependencies among parallel workers.
▸ Which activities delay the overall execution?
▸ That is, which activities lie on the critical path of execution?
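A PAG can be represented with two record types. The sketch below is a hypothetical Python encoding, not SnailTrail's own data model: `Event` mirrors the node example u = { timestamp: k+1, worker: 2 }, `ActivityEdge` mirrors (u, v) = { type: serialization, duration: 1 } with the duration derived from the two timestamps, and `build_pag` assumes traces arrive as per-worker (worker, start, end, kind) tuples plus (src_worker, send_ts, dst_worker, recv_ts) message tuples. Message edges, given the assumed type "communication", are what link the per-worker timelines into one graph.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """PAG node: the start or end of an activity on one worker."""
    timestamp: float
    worker: int


@dataclass(frozen=True)
class ActivityEdge:
    """PAG edge: an activity between two events, annotated with its type."""
    src: Event
    dst: Event
    kind: str  # e.g. "processing", "serialization", "waiting"

    @property
    def duration(self) -> float:
        # The duration is derived from the timestamps of the two events.
        return self.dst.timestamp - self.src.timestamp


def build_pag(
    activities: list[tuple[int, float, float, str]],
    messages: list[tuple[int, float, int, float]],
) -> list[ActivityEdge]:
    """Build PAG edges from traces.

    activities: (worker, start, end, kind), one tuple per local activity
    messages:   (src_worker, send_ts, dst_worker, recv_ts), one per message
    """
    # Local edges: consecutive activities on a worker share an Event node,
    # because frozen dataclasses compare and hash by value.
    edges = [ActivityEdge(Event(s, w), Event(e, w), kind) for w, s, e, kind in activities]
    # Cross-worker edges; "communication" is an assumed type name.
    edges += [
        ActivityEdge(Event(ts, sw), Event(tr, dw), "communication")
        for sw, ts, dw, tr in messages
    ]
    return edges
```

For example, with k = 10, a serialization on worker 2 from time 11 to 12 yields an edge from Event(11, 2) of duration 1, and a message sent by worker 2 at time 12 starts at that edge's end event.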
CRITICAL PATH
[Figure: PAG of workers W1, W2, W3 with activities a, b, c, d]
The longest path in the execution history (not counting waiting activities)
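On a finished PAG, finding the critical path is a longest-path computation over a DAG, which takes linear time after a topological sort. The Python sketch below assumes that "not counting waiting activities" means waiting edges contribute zero weight; `Activity` and `critical_path` are illustrative names, not SnailTrail's API.

```python
from __future__ import annotations

from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+
from typing import NamedTuple


class Activity(NamedTuple):
    src: str         # event where the activity starts
    dst: str         # event where the activity ends
    kind: str        # e.g. "processing", "serialization", "waiting"
    duration: float


def critical_path(edges: list[Activity]) -> tuple[float, list[Activity]]:
    """Longest path through a DAG of activities; waiting edges weigh zero."""
    preds: dict[str, set[str]] = defaultdict(set)
    incoming: dict[str, list[Activity]] = defaultdict(list)
    for e in edges:
        preds[e.dst].add(e.src)
        preds.setdefault(e.src, set())  # register start events too
        incoming[e.dst].append(e)

    best: dict[str, float] = {}                 # longest path length ending at node
    best_edge: dict[str, Activity | None] = {}  # last edge of that path
    for node in TopologicalSorter(preds).static_order():
        best[node], best_edge[node] = 0.0, None
        for e in incoming[node]:
            weight = 0.0 if e.kind == "waiting" else e.duration
            if best[e.src] + weight > best[node]:
                best[node], best_edge[node] = best[e.src] + weight, e

    # Walk back from the node where the longest path ends.
    end = max(best, key=best.__getitem__)
    path: list[Activity] = []
    edge = best_edge[end]
    while edge is not None:
        path.append(edge)
        edge = best_edge[edge.src]
    return best[end], path[::-1]
```

Consider edges a→b (processing, 2), b→d (serialization, 3), a→c (processing, 4), c→d (waiting, 1). The critical path is processing then serialization, of length 5, even though a→c is the single longest activity; shortening a→c to 1 leaves the length at 5, which mirrors the parallel-execution example above.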
[Figure: the critical path across workers W1, W2, W3]
Optimizing activities on the critical path reduces execution time.
POST-MORTEM CRITICAL PATH ANALYSIS IS EASY
1. Collect traces during execution: a profiler records the job from job start to job end.
2. Analyze traces offline: an analyzer turns the traces into performance summaries.
How to compute the critical path for continuously running, distributed streaming applications, with potentially unbounded input?
▸ There might be no “job end”
▸ The PAG and critical path are continuously evolving
▸ Stale profiling information is not useful
ONLINE CRITICAL PATH ANALYSIS
ONLINE ANALYSIS OF TRACE SNAPSHOTS
[Figure: a streaming application turns an input stream into an output stream; periodic snapshots of its traces form a trace snapshot stream, which an analyzer turns into a stream of performance summaries]
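Cutting trace snapshots amounts to tumbling windows over activity timestamps. The sketch below is a batch simplification under stated assumptions: activities arrive as hypothetical (worker, start, end, kind) tuples, snapshots are fixed windows [k*period, (k+1)*period), and an activity that crosses a window boundary is clipped into every window it overlaps. A real online analyzer would instead emit each snapshot as soon as its window closes.

```python
from __future__ import annotations

from collections import defaultdict

Activity = tuple[int, float, float, str]  # (worker, start, end, kind)


def periodic_snapshots(
    activities: list[Activity], period: float
) -> dict[tuple[float, float], list[Activity]]:
    """Group activities into consecutive windows [k*period, (k+1)*period),
    clipping each activity to every window it overlaps."""
    windows: dict[tuple[float, float], list[Activity]] = defaultdict(list)
    for worker, start, end, kind in activities:
        k = int(start // period)  # first window the activity touches
        while k * period < end:
            ts, te = k * period, (k + 1) * period
            windows[(ts, te)].append((worker, max(start, ts), min(end, te), kind))
            k += 1
    return dict(sorted(windows.items()))
```

Clipping keeps each snapshot self-contained: every activity in a snapshot lies inside [ts, te], which is what lets the analysis treat each snapshot as its own PAG.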
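PROGRAM ACTIVITY GRAPH SNAPSHOT
[Figure: PAG snapshot of workers W1, W2, W3 over the window [ts, te], between logical times t=k and t=k+1, with activities a, b, c, d and events x, u, v, z]
▸ All paths have the same length: te - ts (every path spans the snapshot from ts to te, and edge durations are timestamp differences)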
▸ Choosing a random path might miss critical activities
▸ Enumerating all paths is impractical
How to rank activities with regard to criticality?
All paths are potentially part of the evolving critical path.
Intuition: the more paths an activity appears on, the more likely it is that this activity is critical.
[Figure: the snapshot's paths, numbered 1 to 9, and the number of paths each activity appears on]
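The intuition can be checked by brute force on a small graph: enumerate every path from a snapshot's start events to its end events and count how often each activity occurs. The Python sketch below does exactly that, assuming edges are (src_event, dst_event) pairs of a DAG. Its running time grows with the number of paths, which can be exponential in the size of the graph, so it only serves to illustrate the counts.

```python
from __future__ import annotations

from collections import Counter, defaultdict

Edge = tuple[str, str]  # (src_event, dst_event)


def count_paths_by_enumeration(edges: list[Edge]) -> tuple[Counter, int]:
    """Enumerate every start-to-end path of a DAG; return, per edge,
    the number of paths it appears on, plus the total number of paths."""
    succs: dict[str, list[str]] = defaultdict(list)
    has_pred: set[str] = set()
    for u, v in edges:
        succs[u].append(v)
        has_pred.add(v)
    nodes = set(succs) | has_pred
    sources = [n for n in nodes if n not in has_pred]  # start events

    per_edge: Counter = Counter()
    n_paths = 0

    def walk(node: str, path: list[Edge]) -> None:
        nonlocal n_paths
        if not succs[node]:  # reached an end event: one complete path
            n_paths += 1
            per_edge.update(path)
            return
        for nxt in succs[node]:
            walk(nxt, path + [(node, nxt)])

    for s in sources:
        walk(s, [])
    return per_edge, n_paths
```

On a two-worker graph with edges x1→y1, y1→z1, x2→y2, y2→z2 and a message y1→y2, this finds three paths, with x1→y1 and y2→z2 each on two of them.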
CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path:
$$CP_a = \frac{TPC(a) \cdot a_w}{N\,(t_e - t_s)}$$
▸ TPC(a), the transient path centrality: the number of paths activity a appears on
▸ a_w, the activity duration: the edge weight
▸ N: the total number of paths in the snapshot
Transient Path Centrality: let $P = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N\}$ be the set of N transient paths of snapshot $G[t_s, t_e]$. The transient path centrality of an edge $e \in G[t_s, t_e]$ is defined as
$$c(e) = \sum_{i=1}^{N} c_i(e), \quad \text{where } c_i(e) = \begin{cases} 0 & e \notin \vec{p}_i \\ 1 & e \in \vec{p}_i \end{cases}$$
so that TPC(a) = c(a).
Can be computed without path enumeration!
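Path counts per edge do not require enumeration: the number of paths through an edge (u, v) equals the number of paths from any start event to u times the number of paths from v to any end event, and both counts follow from one pass each in topological and reverse topological order. The Python sketch below computes TPC this way and plugs it into the CP formula. It assumes the snapshot is a DAG given as a dict mapping (src_event, dst_event) pairs to durations a_w (so at most one activity per event pair), and that start events are exactly the nodes without predecessors and end events those without successors; it illustrates the technique and is not SnailTrail's implementation.

```python
from __future__ import annotations

from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Collection

Edge = tuple[str, str]  # (src_event, dst_event)


def transient_path_centrality(edges: Collection[Edge]) -> tuple[dict[Edge, int], int]:
    """Return TPC for every edge and N, the number of start-to-end paths."""
    preds: dict[str, set[str]] = defaultdict(set)
    succs: dict[str, set[str]] = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
        preds.setdefault(u, set())
        succs[u].add(v)
    order = list(TopologicalSorter(preds).static_order())

    fwd: dict[str, int] = {}  # number of paths from any start event to n
    for n in order:
        fwd[n] = sum(fwd[p] for p in preds[n]) if preds[n] else 1
    bwd: dict[str, int] = {}  # number of paths from n to any end event
    for n in reversed(order):
        bwd[n] = sum(bwd[s] for s in succs[n]) if succs[n] else 1

    n_paths = sum(fwd[n] for n in order if not succs[n])
    return {(u, v): fwd[u] * bwd[v] for u, v in edges}, n_paths


def cp_metric(weights: dict[Edge, float], ts: float, te: float) -> dict[Edge, float]:
    """CP_a = TPC(a) * a_w / (N * (te - ts)) for every activity a."""
    tpc, n_paths = transient_path_centrality(weights)
    return {a: tpc[a] * w / (n_paths * (te - ts)) for a, w in weights.items()}
```

For example, with edges x1→y1, y1→z1, x2→y2, y2→z2 of duration 2 each and a zero-duration message y1→y2 over the window [0, 4], there are N = 3 paths and x1→y1 gets CP = 2 · 2 / (3 · 4) = 1/3. When every path spans the window, summing TPC(a) · a_w over all activities gives N (te - ts), so the CP values of a snapshot add up to 1.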
ONLINE PERFORMANCE ANALYSIS WITH SNAILTRAIL
▸ Reference application (Apache Flink, Apache Spark, TensorFlow, Heron, Timely Dataflow, ...): profiling and trace generation
▸ Trace streams flow into SnailTrail, built on Timely
▸ SnailTrail: trace ingestion, PAG construction, CP computation and activity ranking
▸ Output: CP-based performance summaries
SNAILTRAIL IN ACTION
EXAMPLE: TASK SCHEDULING IN APACHE SPARK
[Figure: a DRIVER schedules tasks on workers W1, W2, W3]
Venkataraman, Shivaram, et al. "Drizzle: Fast and adaptable stream processing at scale." Spark Summit (2016).
SCHEDULING BOTTLENECK IN APACHE SPARK
[Figure: Processing vs. Scheduling over snapshots 0 to 15, as seen by Conventional Profiling (%weight) and SnailTrail Profiling (CP)]
Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots
SNAILTRAIL CP-BASED SUMMARIES
▸ Activity Summary
▸ which activity type is a bottleneck?
[Figure: Activity Summary, CP per snapshot (0 to 15) broken down by activity type: Processing, Serialization, Deserialization, Buffer, Unknown, DataMessage]
Apache Flink: Dhalion WordCount Benchmark, 4 workers, 1s snapshots
Optimize serialization!
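Each CP-based summary aggregates the CP values of one snapshot's activities by a different key. A minimal Python sketch, assuming each activity edge carries its worker, activity type, and a hypothetical operator name: grouping by `kind` gives the Activity Summary, by `worker` the Straggler Summary, and by `operator` the Operator Summary; running it once per snapshot yields one bar of each chart.

```python
from __future__ import annotations

from collections import defaultdict
from collections.abc import Callable, Hashable
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    worker: int
    kind: str      # activity type, e.g. "processing", "serialization"
    operator: str  # hypothetical: dataflow operator the activity belongs to


def summarize(cp: dict[Edge, float], key: Callable[[Edge], Hashable]) -> dict[Hashable, float]:
    """Sum one snapshot's CP values per group (activity type, worker, operator, ...)."""
    summary: dict[Hashable, float] = defaultdict(float)
    for edge, value in cp.items():
        summary[key(edge)] += value
    return dict(summary)


def bottleneck(summary: dict[Hashable, float]) -> Hashable:
    """The group with the highest aggregate CP in this snapshot."""
    return max(summary, key=summary.__getitem__)
```

Keeping the aggregation key as a parameter means one pass over the CP values serves all three summaries.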
▸ Straggler Summary
▸ which worker is a bottleneck?
[Figure: Straggler Summary, CP per snapshot (0 to 15) for workers W1 to W4]
Apache Flink: Dhalion WordCount Benchmark, 4 workers, 1s snapshots
Steal work from W1
▸ Operator Summary
▸ which operator is a bottleneck?
[Figure: Operator Summary]
Apache Flink: Dhalion WordCount Benchmark, 10 workers, 1s snapshots
Increase flatmap’s parallelism!
▸ Communication Summary
▸ which communication channels are bottlenecks?
[Figure: Communication Summary, communication criticality between each pair of workers 1 to 12]
Collocate worker 5 with workers 2, 9, 10, 11
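The Communication Summary can be sketched the same way, keyed by the pair of workers at the two ends of each communication edge. In this hypothetical Python sketch, workers are indexed from 0, and `critical_partners` is an assumed heuristic (not SnailTrail's) that lists the workers whose channels with a given worker exceed a CP threshold, the kind of signal behind a collocation hint.

```python
from __future__ import annotations


def communication_matrix(comm_edges: list[tuple[int, int, float]], workers: int) -> list[list[float]]:
    """m[src][dst] = total CP of communication edges from src to dst
    (workers indexed 0..workers-1)."""
    m = [[0.0] * workers for _ in range(workers)]
    for src, dst, cp in comm_edges:
        m[src][dst] += cp
    return m


def critical_partners(m: list[list[float]], worker: int, threshold: float) -> list[int]:
    """Hypothetical heuristic: workers whose channels with `worker`, in either
    direction, carry a combined CP of at least `threshold`."""
    return [
        j for j in range(len(m))
        if j != worker and m[worker][j] + m[j][worker] >= threshold
    ]
```

Summing both directions treats a channel as critical regardless of which side sends, which suits a placement decision such as collocating two workers.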
SNAILTRAIL PERFORMANCE
▸ Low instrumentation overhead
▸ < 10% for all reference systems
▸ High throughput
▸ 1.2 million events per second
▸ Always online
▸ 1s of traces in 6ms, 256s of traces in < 25s
Traces from Apache Flink Sessionization, 48 workers, 1s-256s snapshots
SnailTrail on Intel Xeon E5-4640, 2.40GHz, 32 cores, 512GB RAM
RECAP
▸ Strymon: online datacenter analytics and management
▸ Conventional profiling is misleading
▸ CP-metric: online critical path analysis
▸ SnailTrail: online CP-based summaries
THE STRYMON TEAM & FRIENDS
Vasiliki Kalavri
Zaheer Chothia
Andrea Lattuada
Prof. Timothy Roscoe
Sebastian Wicki
Moritz Hoffmann
Desislava Dimitrova
John Liagouris
Frank McSherry
IT COULD BE YOU!
strymon.systems.ethz.ch
www.systems.ethz.ch/positions