Experiences in running Apache Flink® at large scale
Stephan Ewen (@StephanEwen)
2
Lessons learned from running Flink at large scale
Including various things we never expected to become a problem and evidently still did…
Also a preview of various fixes coming in Flink…
What is large scale?
3
Large Data Volume (events / sec)
Large Application State (GBs / TBs)
High Parallelism (1000s of subtasks)
Complex Dataflow Graphs (many operators)
Distributed Coordination
4
Deploying Tasks
5
Happens during initial deployment and recovery
[Diagram: JobManager (Akka / RPC, Blob Server) → Deployment RPC Call → TaskManager (Akka / RPC, Blob Server)]
Contains:
- Job Configuration
- Task Code and Objects
- Recover State Handle
- Correlation IDs
Deploying Tasks
6
Happens during initial deployment and recovery
[Diagram: JobManager (Akka / RPC, Blob Server) → Deployment RPC Call → TaskManager (Akka / RPC, Blob Server)]
Contains:
- Job Configuration (KBs)
- Task Code and Objects (up to MBs)
- Recover State Handle (KBs)
- Correlation IDs (few bytes)
RPC volume during deployment
7
(back of the napkin calculation)
number of tasks (10) × parallelism (1000) × size of task objects (2 MB) = RPC volume ≈ 20 GB
~20 seconds on a full 10 GBit/s net
> 1 min with an avg. of 3 GBit/s net
> 3 min with an avg. of 1 GBit/s net
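The napkin math above is easy to check in a few lines; the task count, parallelism, and message size are the slide's example figures, not measurements:

```python
# Back-of-the-napkin RPC volume for task deployment, using the example
# numbers from the slide (10 tasks, parallelism 1000, ~2 MB of
# serialized task objects per deployment message).
num_tasks = 10
parallelism = 1000
task_object_size_mb = 2

rpc_volume_mb = num_tasks * parallelism * task_object_size_mb  # 20,000 MB ≈ 20 GB

def transfer_seconds(volume_mb: float, net_gbit_per_s: float) -> float:
    """Seconds to push the deployment messages at a given network rate."""
    return volume_mb * 8 / 1000 / net_gbit_per_s

# ~16 s at a full 10 GBit/s, ~53 s at 3 GBit/s, ~160 s at 1 GBit/s,
# in the same ballpark as the slide's "~20 s / > 1 min / > 3 min"
```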
Timeouts and Failure detection
8
~20 seconds on a full 10 GBit/s net
> 1 min with an avg. of 3 GBit/s net
> 3 min with an avg. of 1 GBit/s net
Default RPC timeout: 10 secs
Default settings lead to failed deployments with RPC timeouts
Solution: Increase the RPC timeout
Caveat: Increasing the timeout makes failure detection slower
Future: Reduce RPC load (next slides)
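For the Flink versions this talk targets (1.2/1.3), the timeout in question is the Akka ask timeout; a hedged flink-conf.yaml sketch (verify the key name and value format against your version's configuration docs):

```yaml
# flink-conf.yaml
# Default is 10 s; raising it avoids deployments failing on large
# deployment messages, at the cost of slower failure detection.
akka.ask.timeout: 60 s
```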
Dissecting the RPC messages
9
Message part            Size        Variance across subtasks and redeploys
Job Configuration       KBs         constant
Task Code and Objects   up to MBs   constant
Recover State Handle    KBs         variable
Correlation IDs         few bytes   variable
Upcoming: Deploying Tasks
10
Out-of-band transfer and caching of large and constant message parts
[Diagram: JobManager (Akka / RPC, Blob Server) and TaskManager (Akka / RPC, Blob Cache)]
(1) Deployment RPC Call: Recover State Handle, Correlation IDs, BLOB pointers (KBs)
(2) Download and cache BLOBs: Job Config, Task Objects (MBs)
Checkpoints at scale
11
12
Robustly checkpointing…
…is the most important part of running a large Flink program
Review: Checkpoints
13
[Diagram: trigger checkpoint → inject checkpoint barrier at the sources; the barrier flows through source / transform tasks toward the stateful operations]
Review: Checkpoints
14
[Diagram: the barrier triggers the state snapshot; the stateful operation takes a snapshot of its state]
Review: Checkpoint Alignment
15
[Diagram: an operator receives checkpoint barrier n on one input and begins aligning; while aligning, records from the already-barriered input are held in the input buffer until the barrier arrives on the other inputs]
Review: Checkpoint Alignment
16
[Diagram: once barrier n has arrived on all inputs, the operator emits barrier n downstream, continues with the checkpoint, and first drains the records buffered in the input buffer]
Understanding Checkpoints
17
Understanding Checkpoints
18
How well does the alignment behave? (lower is better)
How long do snapshots take?
delay = end_to_end – sync – async
Understanding Checkpoints
19
How well does the alignment behave? (lower is better) The delay is the most important metric:
delay = end_to_end – sync – async
A long delay means the job is under backpressure; constant backpressure means the application is under-provisioned.
How long do snapshots take? Too long means either too much state per node, or a snapshot store that cannot keep up with the load (low bandwidth). This changes with incremental checkpoints.
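The delay metric is simply the end-to-end checkpoint duration minus the measured synchronous and asynchronous snapshot parts; a minimal sketch with hypothetical timings:

```python
def alignment_delay_ms(end_to_end_ms: float, sync_ms: float, async_ms: float) -> float:
    """Time not accounted for by the snapshot itself, i.e. spent waiting
    on barrier alignment and transport (lower is better)."""
    return end_to_end_ms - sync_ms - async_ms

# Hypothetical checkpoint timings: 4.2 s end-to-end, 120 ms sync part,
# 1.9 s async part -> ~2.18 s of alignment delay, hinting at backpressure.
```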
Alignments: Limit in-flight data
▪ In-flight data is data "between" operators
• On the wire or in the network buffers
• Amount depends mainly on network buffer memory
▪ Some in-flight data is needed to buffer out network fluctuations / transient backpressure
▪ The max amount of in-flight data is the max amount buffered during alignment
20
[Diagram: records in flight between two operators while a checkpoint barrier travels toward the downstream operator]
Alignments: Limit in-flight data
▪ Flink 1.2: Global pool that distributes network buffers across all tasks
• Rule-of-thumb: set to 4 * num_shuffles * parallelism * num_slots
▪ Flink 1.3: Limits the max in-flight data automatically
• Heuristic based on the number of channels and connections involved in a transfer step
21
[Diagram: records in flight between two operators while a checkpoint barrier travels toward the downstream operator]
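The Flink 1.2 rule of thumb can be sketched as a quick calculation; the example figures below are hypothetical:

```python
def recommended_network_buffers(num_shuffles: int, parallelism: int, num_slots: int) -> int:
    """Rule of thumb from the slide for sizing Flink 1.2's global
    network buffer pool."""
    return 4 * num_shuffles * parallelism * num_slots

# e.g. 2 shuffle steps, parallelism 100, 4 slots per TaskManager:
# 4 * 2 * 100 * 4 = 3200 buffers (~100 MB at the default 32 KB buffer size)
```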
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path
22
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path
23
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path (e.g., a GC stall)
24
Catching up from heavy alignments
▪ Operators that did a heavy alignment need to catch up again
▪ Otherwise, the next checkpoint will have a heavy alignment as well
25
[Diagram: after the checkpoint completes, the operator first consumes the records that were buffered during alignment before reading new input]
Catching up from heavy alignments
▪ Give the computation time to catch up before starting the next checkpoint
• Useful: set the min-time-between-checkpoints
▪ Asynchronous checkpoints help a lot!
• Shorter stalls in the pipeline mean less build-up of in-flight data
• Catch-up already happens concurrently with state materialization
26
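The effect of a minimum pause between checkpoints can be sketched as a tiny scheduling rule (illustrative only; in Flink the setting lives in the job's checkpoint configuration):

```python
def next_checkpoint_trigger(prev_start: float, prev_end: float,
                            interval: float, min_pause: float) -> float:
    """Earliest time to trigger the next checkpoint: the regular interval,
    but never sooner than min_pause after the previous one *finished*,
    which gives the job time to catch up after a heavy alignment."""
    return max(prev_start + interval, prev_end + min_pause)

# A checkpoint that started at t=0 s but dragged on until t=90 s:
# with interval=60 and min_pause=30, the next one waits until t=120.
```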
Asynchronous Checkpoints
27
Durably persist snapshots asynchronously while the processing pipeline continues
[Diagram: source / transform tasks keep processing while the stateful operation's snapshot is written in the background]
Asynchrony of different state types
28
State                   Flink 1.2               Flink 1.3   Flink 1.3+
Keyed state (RocksDB)   ✔                       ✔           ✔
Keyed state on heap     ✘ (✔ hidden in 1.2.1)   ✔           ✔
Timers                  ✘                       ✔/✘         ✔
Operator state          ✘                       ✔           ✔
When to use which state backend?
29
(a bit simplified)
State ≥ memory? → yes: RocksDB
no → complex objects (expensive serialization)? → yes: Async. Heap
no → high data rate? → yes: Async. Heap, no: RocksDB
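A plausible reading of this decision tree as code; the branch order is an assumption reconstructed from the slide, not the talk's exact flowchart:

```python
def choose_state_backend(state_fits_in_memory: bool,
                         complex_objects: bool,
                         high_data_rate: bool) -> str:
    """'A bit simplified' backend choice: RocksDB when state exceeds
    memory, async heap when (de)serialization cost would dominate."""
    if not state_fits_in_memory:
        return "RocksDB"        # state >= memory: must spill to disk
    if complex_objects:
        return "Async. Heap"    # expensive serialization: keep objects on heap
    if high_data_rate:
        return "Async. Heap"    # avoid per-record serialization cost
    return "RocksDB"
```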
File Systems, Object Stores, and Checkpointed State
30
Exceeding FS request capacity
▪ Job size: 4 operators
▪ Parallelism: 100s to 1000
▪ State backend: FsStateBackend
▪ State size: few KBs per operator, 100s to 1000s of files
▪ Checkpoint interval: few secs
▪ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
31
Exceeding FS request capacity
What happened?
▪ Operators prepare state writes,

ensure parent directory exists
▪ Via the S3 FS (from Hadoop), each mkdirs causes

2 HEAD requests
▪ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.)
▪ Flink 1.3: Core state backends reduce the assumption of directories (just PUT/GET/DEL); rich file systems support them as fast paths
32
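Plugging the slide's numbers into the "2 HEAD requests per mkdirs" observation shows how quickly the S3 request limit is reached; the 5 s checkpoint interval below is an assumed example value:

```python
def head_requests_per_sec(parallelism: int, operators: int,
                          checkpoint_interval_s: float,
                          heads_per_mkdirs: int = 2) -> float:
    """Every subtask of every operator ensures its parent directory
    exists on each checkpoint; via the Hadoop S3 FS each mkdirs costs
    about 2 HEAD requests."""
    return parallelism * operators * heads_per_mkdirs / checkpoint_interval_s

# parallelism 1000, 4 operators, checkpointing every 5 s:
# 1000 * 4 * 2 / 5 = 1600 HEAD requests / sec, well into throttling territory
```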
Reducing FS stress for small state
33
[Diagram: the Checkpoint Coordinator on the JobManager writes the root checkpoint file (metadata); tasks on the TaskManagers write the checkpoint data files]
Fs/RocksDB state backend for most states
Reducing FS stress for small state
34
[Diagram: tasks ack the checkpoint and ship small state along with the ack (ack+data); the checkpoint data goes directly into the metadata file written by the Checkpoint Coordinator]
Fs/RocksDB state backend for small states
Increasing the small-state threshold reduces the number of files (default: 1 KB)
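The threshold mentioned above is a configuration key; a hedged flink-conf.yaml sketch (the key name matches the Flink 1.x docs, verify for your version):

```yaml
# flink-conf.yaml
# State chunks at or below this size (bytes) are inlined into the
# checkpoint metadata file instead of written as separate files.
# Default is 1024 (1 KB); raising it reduces the number of small files.
state.backend.fs.memory-threshold: 4096
```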
Lagging state cleanup
35
[Diagram: many TaskManagers create files; one JobManager deletes them]
Symptom: Checkpoints get cleaned up too slowly → state accumulates over time
Lagging state cleanup
▪ Problem: file systems and object stores offer only synchronous requests to delete state objects → the time to delete a checkpoint may accumulate to minutes
▪ Flink 1.2: Concurrent checkpoint deletes on the JobManager
▪ Flink 1.3: For FileSystems with actual directory structure, use
recursive directory deletes (one request per directory)
36
Orphaned Checkpoint State
37
Who owns state objects at what time?
(1) TaskManager writes state
(2) Ack and transfer ownership of state
(3) JobManager (Checkpoint Coordinator) records the state reference
Orphaned Checkpoint State
38
fs:///checkpoints/job-61776516/
  chk-113/
  chk-129/
  chk-221/
  chk-271/
  chk-272/  (latest retained)
It gets more complicated with incremental checkpoints…
Upcoming: Searching for orphaned state by periodically sweeping the checkpoint directory for leftover dirs
Conclusion & General Recommendations
39
40
The closer your application is to saturating either network, CPU, memory, FS throughput, etc., the sooner an extraordinary situation causes a regression.
Enough headroom in provisioned capacity means fast catch-up after temporary regressions.
Be aware that certain operations are spiky (like aligned windows).
Production test always with checkpoints ;-)
Recommendations (part 1)
Be aware of the inherent scalability of primitives
▪ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc.
▪ Broadcasting does not scale: adding more nodes does not help. Don't use it for high-volume joins
▪ Putting very large objects into a ValueState may mean big serialization effort on access / checkpoint
▪ If the state can be mappified, use MapState – it performs much better
41
Recommendations (part 2)
If you care about recovery time
▪ Having spare TaskManagers helps bridge the time until backup TaskManagers come online
▪ Having a spare JobManager can be useful
• Future: JobManager failures are non-disruptive
42
Recommendations (part 3)
If you care about CPU efficiency, watch your serializers
▪ JSON is a flexible but awfully inefficient data format
▪ Kryo does okay – make sure you register the types
▪ Flink's directly supported types have good performance: basic types, arrays, tuples, …
▪ Nothing ever beats a custom serializer ;-)
43
44
Thank you!
Questions?

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale
