Experiences in running Apache Flink® at large scale
Stephan Ewen (@StephanEwen)
2
Lessons learned from running Flink at large scale
Including various things we never expected to become a problem and evidently still did…
Also a preview of various fixes coming in Flink…
What is large scale?
3
Large Data Volume (events / sec)
Large Application State (GBs / TBs)
High Parallelism (1000s of subtasks)
Complex Dataflow Graphs (many operators)
Distributed Coordination
4
Deploying Tasks
5
Happens during initial deployment and recovery
[Diagram: JobManager (Akka / RPC, Blob Server) → Deployment RPC Call → TaskManager (Akka / RPC, Blob Server)]
Contains:
- Job Configuration
- Task Code and Objects
- Recover State Handle
- Correlation IDs
Deploying Tasks
6
Happens during initial deployment and recovery
[Diagram: JobManager (Akka / RPC, Blob Server) → Deployment RPC Call → TaskManager (Akka / RPC, Blob Server)]
Contains:
- Job Configuration (KBs)
- Task Code and Objects (up to MBs)
- Recover State Handle (KBs)
- Correlation IDs (few bytes)
RPC volume during deployment
7
(back of the napkin calculation)
number of tasks (10) × parallelism (1000) × size of task objects (2 MB) = RPC volume ≈ 20 GB
~20 seconds on a full 10 GBit/s net
> 1 min with an avg. of 3 GBit/s net
> 3 min with an avg. of 1 GBit/s net
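The napkin math above is easy to check in a few lines; the task count, parallelism, and message size are the slide's example figures, not measurements:

```python
# Back-of-the-napkin RPC volume for task deployment, using the example
# numbers from the slide (10 tasks, parallelism 1000, ~2 MB of
# serialized task objects per deployment message).
num_tasks = 10
parallelism = 1000
task_object_size_mb = 2

rpc_volume_mb = num_tasks * parallelism * task_object_size_mb  # 20,000 MB ≈ 20 GB

def transfer_seconds(volume_mb: float, net_gbit_per_s: float) -> float:
    """Seconds to push the deployment messages at a given network rate."""
    return volume_mb * 8 / 1000 / net_gbit_per_s

# ~16 s at a full 10 GBit/s, ~53 s at 3 GBit/s, ~160 s at 1 GBit/s,
# in the same ballpark as the slide's "~20 s / > 1 min / > 3 min"
```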
Timeouts and Failure detection
8
~20 seconds on a full 10 GBit/s net
> 1 min with an avg. of 3 GBit/s net
> 3 min with an avg. of 1 GBit/s net
Default RPC timeout: 10 secs
Default settings lead to failed deployments with RPC timeouts
Solution: Increase the RPC timeout
Caveat: Increasing the timeout makes failure detection slower
Future: Reduce RPC load (next slides)
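For the Flink versions this talk targets (1.2/1.3), the timeout in question is the Akka ask timeout; a hedged flink-conf.yaml sketch (verify the key name and value format against your version's configuration docs):

```yaml
# flink-conf.yaml
# Default is 10 s; raising it avoids deployments failing on large
# deployment messages, at the cost of slower failure detection.
akka.ask.timeout: 60 s
```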
Dissecting the RPC messages
9
Message part            Size        Variance across subtasks and redeploys
Job Configuration       KBs         constant
Task Code and Objects   up to MBs   constant
Recover State Handle    KBs         variable
Correlation IDs         few bytes   variable
Upcoming: Deploying Tasks
10
Out-of-band transfer and caching of large and constant message parts
[Diagram: JobManager (Akka / RPC, Blob Server) and TaskManager (Akka / RPC, Blob Cache)]
(1) Deployment RPC Call: Recover State Handle, Correlation IDs, BLOB pointers (KBs)
(2) Download and cache BLOBs: Job Config, Task Objects (MBs)
Checkpoints at scale
11
12
Robustly checkpointing…
…is the most important part of running a large Flink program
Review: Checkpoints
13
[Diagram: trigger checkpoint → inject checkpoint barrier at the sources; the barrier flows through source / transform tasks toward the stateful operations]
Review: Checkpoints
14
[Diagram: the barrier triggers the state snapshot; the stateful operation takes a snapshot of its state]
Review: Checkpoint Alignment
15
[Diagram: an operator receives checkpoint barrier n on one input and begins aligning; while aligning, records from the already-barriered input are held in the input buffer until the barrier arrives on the other inputs]
Review: Checkpoint Alignment
16
[Diagram: once barrier n has arrived on all inputs, the operator emits barrier n downstream, continues with the checkpoint, and first drains the records buffered in the input buffer]
Understanding Checkpoints
17
Understanding Checkpoints
18
How well does the alignment behave? (lower is better)
How long do snapshots take?
delay = end_to_end – sync – async
Understanding Checkpoints
19
How well does the alignment behave? (lower is better) The delay is the most important metric:
delay = end_to_end – sync – async
A long delay means the job is under backpressure; constant backpressure means the application is under-provisioned.
How long do snapshots take? Too long means either too much state per node, or a snapshot store that cannot keep up with the load (low bandwidth). This changes with incremental checkpoints.
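The delay metric is simply the end-to-end checkpoint duration minus the measured synchronous and asynchronous snapshot parts; a minimal sketch with hypothetical timings:

```python
def alignment_delay_ms(end_to_end_ms: float, sync_ms: float, async_ms: float) -> float:
    """Time not accounted for by the snapshot itself, i.e. spent waiting
    on barrier alignment and transport (lower is better)."""
    return end_to_end_ms - sync_ms - async_ms

# Hypothetical checkpoint timings: 4.2 s end-to-end, 120 ms sync part,
# 1.9 s async part -> ~2.18 s of alignment delay, hinting at backpressure.
```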
Alignments: Limit in-flight data
▪ In-flight data is data "between" operators
• On the wire or in the network buffers
• Amount depends mainly on network buffer memory
▪ Some in-flight data is needed to buffer out network fluctuations / transient backpressure
▪ The max amount of in-flight data is the max amount buffered during alignment
20
[Diagram: records in flight between two operators while a checkpoint barrier travels toward the downstream operator]
Alignments: Limit in-flight data
▪ Flink 1.2: Global pool that distributes network buffers across all tasks
• Rule-of-thumb: set to 4 * num_shuffles * parallelism * num_slots
▪ Flink 1.3: Limits the max in-flight data automatically
• Heuristic based on the number of channels and connections involved in a transfer step
21
[Diagram: records in flight between two operators while a checkpoint barrier travels toward the downstream operator]
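The Flink 1.2 rule of thumb can be sketched as a quick calculation; the example figures below are hypothetical:

```python
def recommended_network_buffers(num_shuffles: int, parallelism: int, num_slots: int) -> int:
    """Rule of thumb from the slide for sizing Flink 1.2's global
    network buffer pool."""
    return 4 * num_shuffles * parallelism * num_slots

# e.g. 2 shuffle steps, parallelism 100, 4 slots per TaskManager:
# 4 * 2 * 100 * 4 = 3200 buffers (~100 MB at the default 32 KB buffer size)
```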
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path
22
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path
23
Heavy alignments
▪ A heavy alignment typically happens at some point → different load on different paths
▪ Big window emission concurrent to a checkpoint
▪ Stall of one operator on the path (e.g., a GC stall)
24
Catching up from heavy alignments
▪ Operators that did a heavy alignment need to catch up again
▪ Otherwise, the next checkpoint will have a heavy alignment as well
25
[Diagram: after the checkpoint completes, the operator first consumes the records that were buffered during alignment before reading new input]
Catching up from heavy alignments
▪ Give the computation time to catch up before starting the next checkpoint
• Useful: set the min-time-between-checkpoints
▪ Asynchronous checkpoints help a lot!
• Shorter stalls in the pipeline mean less build-up of in-flight data
• Catch-up already happens concurrently with state materialization
26
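The effect of a minimum pause between checkpoints can be sketched as a tiny scheduling rule (illustrative only; in Flink the setting lives in the job's checkpoint configuration):

```python
def next_checkpoint_trigger(prev_start: float, prev_end: float,
                            interval: float, min_pause: float) -> float:
    """Earliest time to trigger the next checkpoint: the regular interval,
    but never sooner than min_pause after the previous one *finished*,
    which gives the job time to catch up after a heavy alignment."""
    return max(prev_start + interval, prev_end + min_pause)

# A checkpoint that started at t=0 s but dragged on until t=90 s:
# with interval=60 and min_pause=30, the next one waits until t=120.
```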
Asynchronous Checkpoints
27
Durably persist snapshots asynchronously while the processing pipeline continues
[Diagram: source / transform tasks keep processing while the stateful operation's snapshot is written in the background]
Asynchrony of different state types
28
State                   Flink 1.2               Flink 1.3   Flink 1.3+
Keyed state (RocksDB)   ✔                       ✔           ✔
Keyed state on heap     ✘ (✔ hidden in 1.2.1)   ✔           ✔
Timers                  ✘                       ✔/✘         ✔
Operator state          ✘                       ✔           ✔
When to use which state backend?
29
(a bit simplified)
State ≥ memory? → yes: RocksDB
no → complex objects (expensive serialization)? → yes: Async. Heap
no → high data rate? → yes: Async. Heap, no: RocksDB
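A plausible reading of this decision tree as code; the branch order is an assumption reconstructed from the slide, not the talk's exact flowchart:

```python
def choose_state_backend(state_fits_in_memory: bool,
                         complex_objects: bool,
                         high_data_rate: bool) -> str:
    """'A bit simplified' backend choice: RocksDB when state exceeds
    memory, async heap when (de)serialization cost would dominate."""
    if not state_fits_in_memory:
        return "RocksDB"        # state >= memory: must spill to disk
    if complex_objects:
        return "Async. Heap"    # expensive serialization: keep objects on heap
    if high_data_rate:
        return "Async. Heap"    # avoid per-record serialization cost
    return "RocksDB"
```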
File Systems, Object Stores, and Checkpointed State
30
Exceeding FS request capacity
▪ Job size: 4 operators
▪ Parallelism: 100s to 1000
▪ State backend: FsStateBackend
▪ State size: few KBs per operator, 100s to 1000s of files
▪ Checkpoint interval: few secs
▪ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
31
Exceeding FS request capacity
What happened?
▪ Operators prepare state writes,

ensure parent directory exists
▪ Via the S3 FS (from Hadoop), each mkdirs causes

2 HEAD requests
▪ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.)
▪ Flink 1.3: Core state backends reduce the assumption of directories (just PUT/GET/DEL); rich file systems support them as fast paths
32
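Plugging the slide's numbers into the "2 HEAD requests per mkdirs" observation shows how quickly the S3 request limit is reached; the 5 s checkpoint interval below is an assumed example value:

```python
def head_requests_per_sec(parallelism: int, operators: int,
                          checkpoint_interval_s: float,
                          heads_per_mkdirs: int = 2) -> float:
    """Every subtask of every operator ensures its parent directory
    exists on each checkpoint; via the Hadoop S3 FS each mkdirs costs
    about 2 HEAD requests."""
    return parallelism * operators * heads_per_mkdirs / checkpoint_interval_s

# parallelism 1000, 4 operators, checkpointing every 5 s:
# 1000 * 4 * 2 / 5 = 1600 HEAD requests / sec, well into throttling territory
```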
Reducing FS stress for small state
33
[Diagram: the Checkpoint Coordinator on the JobManager writes the root checkpoint file (metadata); tasks on the TaskManagers write the checkpoint data files]
Fs/RocksDB state backend for most states
Reducing FS stress for small state
34
[Diagram: tasks ack the checkpoint and ship small state along with the ack (ack+data); the checkpoint data goes directly into the metadata file written by the Checkpoint Coordinator]
Fs/RocksDB state backend for small states
Increasing the small-state threshold reduces the number of files (default: 1 KB)
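The threshold mentioned above is a configuration key; a hedged flink-conf.yaml sketch (the key name matches the Flink 1.x docs, verify for your version):

```yaml
# flink-conf.yaml
# State chunks at or below this size (bytes) are inlined into the
# checkpoint metadata file instead of written as separate files.
# Default is 1024 (1 KB); raising it reduces the number of small files.
state.backend.fs.memory-threshold: 4096
```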
Lagging state cleanup
35
[Diagram: many TaskManagers create files; one JobManager deletes them]
Symptom: Checkpoints get cleaned up too slowly → state accumulates over time
Lagging state cleanup
▪ Problem: file systems and object stores offer only synchronous requests to delete state objects → the time to delete a checkpoint may accumulate to minutes
▪ Flink 1.2: Concurrent checkpoint deletes on the JobManager
▪ Flink 1.3: For FileSystems with actual directory structure, use
recursive directory deletes (one request per directory)
36
Orphaned Checkpoint State
37
Who owns state objects at what time?
(1) TaskManager writes state
(2) Ack and transfer ownership of state
(3) JobManager (Checkpoint Coordinator) records the state reference
Orphaned Checkpoint State
38
fs:///checkpoints/job-61776516/
  chk-113/
  chk-129/
  chk-221/
  chk-271/
  chk-272/  (latest retained)
It gets more complicated with incremental checkpoints…
Upcoming: Searching for orphaned state by periodically sweeping the checkpoint directory for leftover dirs
Conclusion & General Recommendations
39
40
The closer your application is to saturating either network, CPU, memory, FS throughput, etc., the sooner an extraordinary situation causes a regression.
Enough headroom in provisioned capacity means fast catch-up after temporary regressions.
Be aware that certain operations are spiky (like aligned windows).
Production test always with checkpoints ;-)
Recommendations (part 1)
Be aware of the inherent scalability of primitives
▪ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc.
▪ Broadcasting does not scale: adding more nodes does not help. Don't use it for high-volume joins
▪ Putting very large objects into a ValueState may mean big serialization effort on access / checkpoint
▪ If the state can be mappified, use MapState – it performs much better
41
Recommendations (part 2)
If you care about recovery time
▪ Having spare TaskManagers helps bridge the time until backup TaskManagers come online
▪ Having a spare JobManager can be useful
• Future: JobManager failures are non-disruptive
42
Recommendations (part 3)
If you care about CPU efficiency, watch your serializers
▪ JSON is a flexible but awfully inefficient data format
▪ Kryo does okay – make sure you register the types
▪ Flink's directly supported types have good performance: basic types, arrays, tuples, …
▪ Nothing ever beats a custom serializer ;-)
43
44
Thank you!
Questions?

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large Scale
