Tuning Flink Clusters for
stability and efficiency
Divye Kapoor, Pinterest
Flink Forward 2023 ©
Starting with the end in mind…
By the end of this talk, you’ll know how we tuned our Flink clusters to reduce per-job costs by 50-90% (~75% typical) and how we were able to absorb ~40% additional workload at no extra cost for Pinterest.
25%
Cost of a job after we
were done with our work
Hi! I’m Divye Kapoor, I’m the TL for the
Stream Processing Platform at Pinterest and
I’m here presenting the work that our teams at
Pinterest have done over the past 2 years on
stability and efficiency.
Credits
Teja T. - SRE & EM
CGroups, Cluster HW, Rollouts &
optimizations at several levels of
the stack.
Thank you
Credits: Leadership & Partners
30+ teams
200+ partners
4 orgs
Actual spend vs Budgeted spend for the company ($ terms).
Let’s just say that a fair bit of money was printed for Pinterest…
Our clusters run on YARN (today), everything that follows is in that context.
So what’s challenging about running
and tuning a multi-tenant Flink cluster?
● Job sizes: 2000+ cores on a job
vs jobs with < 10 cores.
● Job tiering: small jobs that must not
fail alongside jobs that can tolerate failure.
● Multitenant efficiency: resource
use that isn’t wasteful.
● Multitenant priority: in an incident,
keep the right jobs working.
● Noisy neighbors
● Data skew
CGroups
● CGroups was our must-have for
everything that follows. Teja led the
charge.
1. We upgraded YARN and then
configured it to support soft CGroup
limits. (The limits only kick in if the
host is running out of capacity)
2. We verified that if a host is at capacity,
the resources are fairly shared.
3. We started running the cluster hotter
(no CPU starvation!).
CGroups
● Hard limits don’t work well for Flink jobs.
● Most Flink jobs want to burst on CPU on
deploys and this setup allows for the catch
up to take place without throttling.
● Hard limits can trigger OOMs, back
pressure and other stability issues.
Generally, it’s not clear if the job will come
back after a restart.
Lesson 1: Always configure
your YARN or K8s cluster to
avoid hard limits / throttles.
Container Placement: Stability & Cost Opt.
● No Hot Nodes please!
● Container Placement is critical to
keeping a stable cluster running.
● We want all applications to be well
behaved and work well with our job
schedulers.
● Bad container scheduling = host
running out of capacity at peak.
Container Placement: Option 1
CPU Aware: Schedule on hosts where CPU utilization is < 50th percentile
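The CPU-aware heuristic can be sketched as follows. The helper below is hypothetical (the real decision lives in the cluster scheduler); it simply filters the fleet to hosts below the 50th-percentile CPU utilization:

```python
from statistics import median

def eligible_hosts(cpu_by_host):
    """Option 1 sketch: only hosts below the fleet's median (50th
    percentile) CPU utilization are candidates for new containers."""
    cutoff = median(cpu_by_host.values())
    return sorted(h for h, cpu in cpu_by_host.items() if cpu < cutoff)

# Hot hosts are skipped; cool hosts receive new containers.
fleet = {"node-1": 92.0, "node-2": 35.0, "node-3": 61.0, "node-4": 20.0}
print(eligible_hosts(fleet))  # ['node-2', 'node-4']
```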
Container Placement: Option 2
Config: yarn.nodemanager.resource.cpu-vcores = 75% of cores on host
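As an example, on a hypothetical 48-core host this config would advertise 36 vcores to YARN (48 × 0.75), leaving the remaining quarter as headroom for the OS, daemons and CPU bursts:

```xml
<!-- Option 2 sketch: advertise only 75% of physical cores to YARN.
     48-core host example: 48 * 0.75 = 36 vcores. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>36</value>
</property>
```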
No traffic-peak stability issues were
seen after the container placement
strategy was implemented.
Stability is a prerequisite for optimization
Job Optimization
● Source of significant wins: task placements & vertical sharding.
● Required a full round of re-optimization of our job configurations.
● Mass migrations & rollouts - we got good at them.
● 70%+ reduction in cross-host network traffic for jobs.
● Jobs became 50-90%+ cheaper to run.
● Serialization & traffic overhead drops.
● The magic: removing Slot Sharing Groups (SSGs), aligning parallelism across
operators, forcing “ColocationConstraints” and optimizing Flink 1.11 task placements.
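As a toy model of why aligning parallelism helps: adjacent operators with equal parallelism can be chained into one task, so records stay in-process instead of crossing the network. The sketch below illustrates the effect only; it is not Flink’s actual scheduler logic:

```python
def network_exchanges(parallelisms):
    """Count network hops in a linear pipeline: a parallelism change
    between adjacent operators forces a network exchange (rebalance);
    equal parallelism lets the operators be chained in-process."""
    return sum(1 for a, b in zip(parallelisms, parallelisms[1:]) if a != b)

# Misaligned pipeline source(30) -> map(60) -> sink(30): two network hops.
print(network_exchanges([30, 60, 30]))  # 2
# Aligned pipeline, every operator at 60: all hops chained away.
print(network_exchanges([60, 60, 60]))  # 0
```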
Job Optimization
Before: CPU utilization showing skewed load. This is wasteful because the lightly loaded
Task Managers are asking for the same resources as the heavily loaded ones.
Hardware optimization: i3 to i4i (AWS instance families)
~40% reduction in CPU utilization per job.
Our last wins:
Input Data optimization:
Only read the data the job needs from Kafka. Where
appropriate, we split the Kafka topics.
Autotuning: We built an in-house autotuner so that
we don’t need to keep re-tuning our jobs for CPU
utilization.
These will be covered separately in other talks in the future.
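As an illustration of the autotuning idea (a minimal sketch; the in-house autotuner’s actual design isn’t described in this talk), one feedback step could scale a job’s parallelism toward a CPU-utilization target:

```python
import math

def propose_parallelism(current, observed_cpu_pct, target_cpu_pct=60, min_p=1):
    """One step of a simple feedback autotuner: scale parallelism so
    per-task CPU utilization lands near the target percentage.
    Hypothetical helper names; the 60% target is an assumption."""
    proposed = math.ceil(current * observed_cpu_pct / target_cpu_pct)
    return max(min_p, proposed)

# A job running 40 tasks at 90% CPU is scaled out toward ~60% per task.
print(propose_parallelism(40, 90))  # 60
# A job idling at 15% CPU is scaled in.
print(propose_parallelism(40, 15))  # 10
```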
Recap:
● Stage 1: CGroups, soft limits, running the clusters hotter.
● Stage 2: Container placement strategy, job re-tuning, job optimization.
● Stage 3: Job re-tuning, hardware upgrades.
● Stage 4: Input data optimization, job autotuning.
Our total wins were fairly large.
The end result is a nice cleanup of
the costs on the streaming stack.
Job costs on Flink were a discussion
point. After optimizations, these
concerns have melted away.
75%
Job cost reduction through
improved placement of Tasks
on Task Managers.
40%
Job cost reduction through
hardware upgrades.
20%
Cluster cost reduction through
CGroups and the ability to run
the clusters hotter.
Percentages don’t sum to 100 because each is measured against a different baseline.
Actual spend vs Budgeted spend for the company ($ terms).
CGroups
Job Optimizations
Hw upgrade
Data opt.
Thank you
http://divye.me - to connect on LinkedIn

