Where is my bottleneck?
Performance troubleshooting in Apache Flink
Piotr Nowojski
About me
Open source
● Apache Flink contributor/committer since 2017
● Member of the project management committee (PMC)
● Among core architects of the Flink Runtime
Career
● Co-Founder, Engineer @ Immerok
○ immerok.com
● Before that: Runtime team @ DataArtisans/Ververica (acquired by Alibaba)
● Even before that: working on Presto (now Trino) runtime
Agenda
● Understanding Flink Job basics
● Where to start performance analysis?
● What about checkpointing or recovery process?
● Tips & Tricks
Understanding the basics
Job on Task Managers
Performance troubleshooting
What are we troubleshooting?
● Processing records
○ Throughput is too low?
○ Resource usage is too high?
● Checkpointing
○ Are checkpoints failing?
○ End-to-end exactly-once latency too long?
○ Reprocessing too many records after failover?
● Recovery
○ Long downtime?
Processing records
WebUI
Where is my bottleneck? TL;DR
[Figure: the bottleneck is marked "HERE"]
Parallel subtasks can have different load profiles
Varying load
Where is my bottleneck? TL;DRv2
● Rule of thumb
○ Start from the sources
○ Follow any backpressured subtasks downstream to the first ~100% busy subtask(s)
○ That is your bottleneck
● Keep potential data skew and varying load in mind
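The rule of thumb above can be sketched as a small graph walk. This is only an illustration, not how Flink or the WebUI works internally: the Subtask class, the find_bottlenecks function, and the 900/100 thresholds are hypothetical, and the two numbers per subtask are assumed to come from Flink's busyTimeMsPerSecond and backPressuredTimeMsPerSecond metrics (0 to 1000 ms per second).

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    name: str
    busy_ms_per_s: float           # assumed: value of busyTimeMsPerSecond
    backpressured_ms_per_s: float  # assumed: value of backPressuredTimeMsPerSecond
    downstream: list = field(default_factory=list)  # names of downstream subtasks


def find_bottlenecks(subtasks, sources, busy_threshold=900.0, bp_threshold=100.0):
    """Walk downstream from the sources through backpressured subtasks and
    return the first ~100% busy subtasks reached on each path."""
    by_name = {s.name: s for s in subtasks}
    found, seen, stack = [], set(), list(sources)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        task = by_name[name]
        if task.backpressured_ms_per_s >= bp_threshold:
            # Backpressured: the cause is further downstream, keep following.
            stack.extend(task.downstream)
        elif task.busy_ms_per_s >= busy_threshold:
            # Not backpressured but almost fully busy: bottleneck candidate.
            found.append(name)
    return sorted(found)


# Hypothetical four-stage job with made-up metric values.
graph = [
    Subtask("source", 50, 950, ["map"]),
    Subtask("map", 100, 900, ["window"]),
    Subtask("window", 990, 0, ["sink"]),
    Subtask("sink", 200, 0, []),
]
print(find_bottlenecks(graph, ["source"]))  # prints ['window']
```

Because load can be skewed or vary over time, a single snapshot of these metrics may point at different subtasks at different moments; running such a check per parallel subtask and over several samples is the safer reading.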
I found the bottleneck! Now what?
● What to do next might be obvious
● Check machine and JVM process vitals
○ CPU usage
○ GC pauses
● Might require further investigation:
○ Looking into the code
○ Testing out various changes
○ Profiling
Not enough?
● You know which subtask(s) are causing problems
● Attach a code profiler to the Task Manager running that subtask
● Beware of other threads
○ Filter/Focus profiler results
○ Threads are named after the subtask that they are running
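Filtering profiler output down to one subtask's thread might look like the sketch below. Everything here is an assumption made for illustration: the sample format (thread name plus a list of frames, leaf last), the function name, the frame names, and the exact thread-name pattern, which can differ between Flink versions and profilers.

```python
from collections import Counter


def top_frames_for_subtask(samples, subtask_name, limit=5):
    """samples: iterable of (thread_name, frames) with the leaf frame last.
    Keep only threads whose name contains subtask_name and count leaf frames."""
    counts = Counter(
        frames[-1]
        for thread_name, frames in samples
        if subtask_name in thread_name and frames
    )
    return counts.most_common(limit)


# Hypothetical stack samples taken from one Task Manager.
samples = [
    ("Window(Tumbling) -> Sink (2/4)#0", ["Thread.run", "MyAggregator.add"]),
    ("Window(Tumbling) -> Sink (2/4)#0", ["Thread.run", "MyAggregator.add"]),
    ("Window(Tumbling) -> Sink (2/4)#0", ["Thread.run", "KryoSerializer.copy"]),
    ("flink-akka.actor.default-dispatcher-3", ["Thread.run", "Mailbox.run"]),
    ("Source: Events (1/4)#0", ["Thread.run", "KafkaConsumer.poll"]),
]
print(top_frames_for_subtask(samples, "Window(Tumbling) -> Sink (2/4)"))
# prints [('MyAggregator.add', 2), ('KryoSerializer.copy', 1)]
```

Most profilers offer an equivalent thread filter in their own UI; the point is only that samples from unrelated threads on the same Task Manager should not be mixed into the picture.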
Flame Graphs!
Checkpointing
● Checkpoints are failing?
● End-to-end exactly-once latency too long?
● Reprocessing too many records after failover?
Checkpoint Barriers
Alignment
Checkpoints taking too long?
Long alignment duration/start delay
● Most likely caused by backpressure
○ Scale up
○ Optimise Job to increase throughput
○ Buffer debloating (reduces amount of in-flight data in Flink 1.14+)
○ Unaligned checkpoints
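As a rough sketch, the last two options above could be switched on in flink-conf.yaml as follows. The key names are the ones documented for Flink 1.14 and later 1.x releases; they have changed between releases, so check the configuration reference for your version before copying them.

```yaml
# Buffer debloating: let Flink shrink in-flight network buffers automatically,
# so checkpoint barriers queue behind less data under backpressure.
taskmanager.network.memory.buffer-debloat.enabled: true

# Unaligned checkpoints: let barriers overtake buffered records; the overtaken
# in-flight data is stored as part of the checkpoint instead.
execution.checkpointing.unaligned: true
```

Neither setting removes the backpressure itself. They only reduce its effect on checkpoint alignment, so scaling up or optimising the job is still needed if throughput is the real problem.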
Long sync phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● State backends
○ RocksDB flushing to disks
○ Tuning RocksDB advanced options
● Operator/Function-specific code
○ CheckpointedFunction#snapshotState call
○ For example: sink flushing/committing records
Long async phase
● Might be general cluster overload (CPU, Memory, IO)
○ Checkpointing adds extra load to the cluster
● Uploading state backend files
○ FileSystem-specific things
■ Make sure to fully utilize your FS (S3 Entropy)
○ Checkpointed state might be too large
■ Scale up?
■ Reduce state size?
■ Enable incremental checkpoints?
○ Too many small files
■ Increase state.storage.fs.memory-threshold?
● Experimental feature: enabling state backend changelog (Flink 1.14+)
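A hedged flink-conf.yaml sketch of the upload-related options above. The bucket, the path, and the 100kb value are placeholders chosen for illustration, and the key names follow the Flink 1.x documentation, so they may differ in your version.

```yaml
# Incremental checkpoints (RocksDB state backend): upload only new or changed files.
state.backend.incremental: true

# State chunks smaller than this are stored inline in the checkpoint metadata
# instead of as separate small files. 100kb is an arbitrary example value.
state.storage.fs.memory-threshold: 100kb

# S3 entropy injection: the marker in the checkpoint path is replaced with
# random characters to spread checkpoint files across key prefixes.
s3.entropy.key: _entropy_
s3.entropy.length: 4
state.checkpoints.dir: s3://my-bucket/checkpoints/_entropy_/my-job   # placeholder path
```

Raising the memory threshold trades fewer small files for a larger checkpoint metadata file, so it is worth increasing gradually.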
Recovery
Long recovery
● Analyse Flink (debug) logs
● Use incremental checkpoints and/or native savepoints
● Similar issues to checkpointing but in reverse
● Potential solutions
○ Enabling local recovery might help
○ Reduce state size
○ Scale up
○ Tuning RocksDB advanced options
Closing words
● What is the main problem?
○ Processing records?
■ First locate the bottleneck subtask
○ Checkpointing?
■ Look into checkpoint statistics
○ Recovery?
■ Flink logs
Thanks
Piotr Nowojski
@PiotrNowojski
piotr@immerok.com
