Running Kafka For Maximum Pain
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn
To All The Tech Debt I’ve Loved Before
Technical Debt
The cost of the rework required by choosing an easy solution now.
SRE vs SWE
SRE ❤️s SWE
• Both roles are critical
• Work together to balance operability and features
• SRE’s job is to enable SWE to move as quickly as possible while meeting SLOs
How Big?
2 Trillion Messages
• Produced
• Every day
5 Gbps Inbound
• Single cluster
• Unique data
18 Gbps Outbound
• Average 3x consumption
• Before mirroring
2.5M Partitions
• Largest clusters are 250k
• Up to 10k partitions per broker
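To put these figures in more familiar units, here is a rough back-of-the-envelope sketch. It assumes the daily message count is spread evenly over the day and that the partition total can be divided by the per-broker ceiling to get a lower bound on broker count; neither assumption comes from the slides, and the function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 2_000_000_000_000   # "2 Trillion Messages", produced every day
total_partitions = 2_500_000           # "2.5M Partitions"
max_partitions_per_broker = 10_000     # "Up to 10k partitions per broker"


def average_messages_per_second(per_day: int) -> float:
    """Average rate, ignoring daily peaks and troughs (an assumption)."""
    return per_day / SECONDS_PER_DAY


def minimum_brokers(partitions: int, per_broker: int) -> int:
    """Lower bound on broker count if every broker sat at the per-broker cap."""
    return -(-partitions // per_broker)  # ceiling division


if __name__ == "__main__":
    rate = average_messages_per_second(messages_per_day)
    print(f"{rate:,.0f} messages per second on average")
    brokers = minimum_brokers(total_partitions, max_partitions_per_broker)
    print(f"at least {brokers} brokers")
```

The average works out to roughly 23 million messages per second, and the partition figures imply at least 250 brokers under these assumptions.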
Sources of Pain
Multitenancy
• Exponentially increase your problems by sharing them
Infrastructure
• Kafka’s great! Everything else around it sucks
Management
• What do you mean I have to do it myself?
Multitenancy
Sharing is Caring
• Reduces the hardware footprint
• Less administrative overhead
• One bad actor makes everyone’s life hard
Types of Data
Tracking
• Member-related activity
• Data schemas are managed by DMRC
• Aggregated to some datacenters
Metrics
• Application metrics, service calls, logs
• Mostly produced by application containers
• Only aggregated to backend datacenters
Queuing
• Internal application data, messaging
• Largest users are Samza and Search
• Limited aggregation in production only
Logging
• Dedicated cluster for application logs going to ELK
• High volume, low retention
• Not aggregated
Multitenancy Woes
Ownership
• Auto topic creation means nobody knows who created it
• Multiple producers further cloud the issue
• Who makes decisions?
• Who is responsible for problems?
Capacity
• No controls means it’s free!
• Getting one person to project growth is hard
• Getting 100 people to do it is impossible
• Storage hardware is not commodity
Security
• Started with zero security
• Impossible to handle sensitive data
Improvements
Ownership
• Added an ownership metadata service
• One committee with control over shared data schemas
• Moving to disable automatic topic creation
Capacity
• Quotas to limit bandwidth
• Retention by both time and bytes to restrict disk usage
• Also forces customers to talk to us about data usage
Security
• Move all clients to SSL
• Add ACLs for existing usage (after review)
• Starting to evaluate encryption
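The "retention by both time and bytes" point can be illustrated with a small model. This is a simplified sketch, loosely based on how Kafka's `retention.ms` and `retention.bytes` topic settings behave, and not Kafka's actual code. The `Segment` class and `retained_segments` function are hypothetical names, and the segment list is assumed to be ordered oldest first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    newest_record_ms: int  # timestamp of the newest record in the segment
    size_bytes: int


def retained_segments(segments: List[Segment], now_ms: int,
                      retention_ms: int, retention_bytes: int) -> List[Segment]:
    """Return the segments that survive both limits.

    Simplified model: a segment is dropped once its newest record is older
    than retention_ms; after that, the oldest segments are dropped for as
    long as the remaining segments still add up to at least retention_bytes.
    """
    # Time limit: keep only segments whose newest record is recent enough.
    kept = [s for s in segments if now_ms - s.newest_record_ms <= retention_ms]

    # Size limit: drop from the oldest end while the rest still meets the cap.
    excess = sum(s.size_bytes for s in kept) - retention_bytes
    while kept and excess - kept[0].size_bytes >= 0:
        excess -= kept[0].size_bytes
        kept.pop(0)
    return kept
```

With both limits set, whichever one is hit first bounds disk usage, which is what makes per-topic disk consumption predictable enough to plan capacity around.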
Infrastructure
Mirror Maker
• Every change requires a restart
• Grows as n² with the number of sites
• Inefficient since 0.8
• Loses key to partition affinity
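The n² growth is easy to see with a quick count. The sketch below assumes a full mesh in which every site mirrors directly from every other site; the slides do not describe the actual topology, so treat this as an illustration of the growth rate only.

```python
def mirror_pipelines(sites: int) -> int:
    """Directed site-to-site mirroring pipelines in a full mesh of `sites`."""
    return sites * (sites - 1)


for n in (2, 3, 4, 5, 10):
    print(f"{n} sites -> {mirror_pipelines(n)} pipelines")
```

Going from 4 sites to 5 adds 8 pipelines, not 1, which is why each new datacenter gets more expensive to add than the last.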
Mirror Maker Performance
• Added identity handler for fixed partition mapping
• Eliminated compression
• Finally off old consumer
• Coming soon to a KIP near you
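A fixed partition mapping can be sketched as follows. This is not LinkedIn's identity handler; the function names are hypothetical, and CRC32 merely stands in for whatever hash a key-based partitioner would use. The idea is that re-hashing a key on the mirror side can land a message in a different partition than it had at the source, while an identity mapping keeps the partition number unchanged.

```python
import zlib


def hashed_partition(key: bytes, num_partitions: int) -> int:
    """Stand-in for a key-hashing partitioner (CRC32 here, purely illustrative)."""
    return zlib.crc32(key) % num_partitions


def identity_partition(source_partition: int, target_partitions: int) -> int:
    """Fixed mapping: a message keeps the partition number it had at the source."""
    if not 0 <= source_partition < target_partitions:
        raise ValueError("source partition does not exist on the target topic")
    return source_partition


if __name__ == "__main__":
    # The source producer may have placed the message with its own rule.
    source_partition = 5
    print("re-hashed on the mirror:", hashed_partition(b"member-42", 8))
    print("identity mapping:", identity_partition(source_partition, 8))
```

The identity mapping only works when the target topic has at least as many partitions as the source, which is why the sketch raises an error otherwise.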
Message Auditing
• Required to assure mirroring works
• Makes infrastructure care about data schema
• Only tracks producers (mostly)
• Relational database doesn’t cut it for storing audit data
Streaming Audit
• Moving audit data to headers
• Utilizing Samza for processing counts
• Adding “cost to serve” information
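Count-based auditing of this kind can be sketched in a few lines. The version below is a minimal illustration, not LinkedIn's audit system: it assumes messages carry a topic and a timestamp, buckets them into a hypothetical 10-minute window, and compares what was produced against what a downstream tier received.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BUCKET_MS = 10 * 60 * 1000  # hypothetical 10-minute audit window

Key = Tuple[str, int]  # (topic, bucket start in ms)


def count_by_bucket(events: Iterable[Tuple[str, int]]) -> Counter:
    """Count messages per (topic, time bucket) from (topic, timestamp_ms) pairs."""
    counts: Counter = Counter()
    for topic, ts_ms in events:
        counts[(topic, ts_ms - ts_ms % BUCKET_MS)] += 1
    return counts


def missing(produced: Counter, mirrored: Counter) -> Dict[Key, int]:
    """Buckets where the downstream tier saw fewer messages than were produced."""
    return {
        key: n - mirrored.get(key, 0)
        for key, n in produced.items()
        if mirrored.get(key, 0) < n
    }
```

Because only counts per bucket are compared, the audit never needs the message payloads themselves, only a topic name and a timestamp for each message.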
Management
Topic Configuration
• No way to manage configs across multiple clusters
• Creating a new datacenter is a manual process
• Changes need to be propagated in a specific order
• Administrative commands are not protected
Nuage
• One-stop shop for Data Infrastructure
• Allows creation of topics with ownership and ACLs
• Uses our Kafka REST interface for CRUD
Cluster Membership
• No tool to remove brokers
• New brokers take no traffic
• Partition reassignment is basic
• Automatic leader election kills the cluster
Round 1: kafka-tools
• kafka-assigner:
  • Remove broker
  • Rebalance replicas
  • Fix replication factor
• Protocol CLI tool
• Adding an admin client
• github.com/linkedin/kafka-tools
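To make the "remove broker" operation concrete, here is a minimal sketch of the underlying idea. It is not the kafka-assigner implementation; the `remove_broker` function and the shape of its inputs are assumptions for illustration. Every replica held by the departing broker is handed to the least-loaded remaining broker that does not already hold a copy of that partition.

```python
from collections import Counter
from typing import Dict, List


def remove_broker(assignment: Dict[str, List[int]], broker: int,
                  brokers: List[int]) -> Dict[str, List[int]]:
    """Return a new assignment with `broker` replaced in every replica list.

    `assignment` maps a partition name such as "topic-0" to its replica
    broker ids. Each replacement goes to the least-loaded remaining broker
    that does not already hold a replica of that partition.
    """
    remaining = [b for b in brokers if b != broker]

    # Current replica count per remaining broker.
    load = Counter({b: 0 for b in remaining})
    for replicas in assignment.values():
        for b in replicas:
            if b != broker:
                load[b] += 1

    result: Dict[str, List[int]] = {}
    for partition, replicas in sorted(assignment.items()):
        new_replicas = list(replicas)
        if broker in new_replicas:
            candidates = [b for b in remaining if b not in new_replicas]
            if not candidates:
                raise ValueError(f"no broker available to take over {partition}")
            # Lowest load wins; ties are broken by broker id for determinism.
            target = min(candidates, key=lambda b: (load[b], b))
            new_replicas[new_replicas.index(broker)] = target
            load[target] += 1
        result[partition] = new_replicas
    return result
```

Replacing the broker in place keeps the replica order, so a partition's preferred leader changes only when the removed broker was first in its list.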
Round 2: Cruise Control
• Dynamic workload rebalancing
• Self-healing clusters
• Manages multiple goals (network, disk, CPU, rack)
• Requires no additional code
• Open source now!
What Else?
What Needs Attention?
Log Compaction
• Very few metrics
• One bad partition breaks it
Client Config
• Client and broker cannot negotiate
• Configurations are essentially shared secrets
• No information on the version of clients connecting
Upgrading
• Message format changes are still troubling
• Broker upgrades must be carefully ordered
• Often no clear way to roll back
Make It Easier
LinkedIn Open Source
• Cruise Control
  • https://github.com/linkedin/cruise-control
• Kafka Monitor
  • https://github.com/linkedin/kafka-monitor
• Burrow
  • https://github.com/linkedin/Burrow
• kafka-tools
  • https://github.com/linkedin/kafka-tools
Get Involved
• Community
  • users@kafka.apache.org
  • dev@kafka.apache.org
• Bugs and Work:
  • https://issues.apache.org/jira/projects/KAFKA
Thank you
