1
A year supporting Kafka
Dustin Cote (Customer Operations, Confluent)
Ryan Pridgeon (Customer Operations, Confluent)
2
Prerequisites
● Medium experience with Kafka
● Cursory knowledge of
• Configuring Kafka
• Replication
• Request lifecycles
● Interest in Kafka Ops
● Don’t have these things?
• Kafka: The Definitive Guide
• http://kafka.apache.org/documentation.html
3
Agenda
● Quick Flyover of Concepts
● Discussion of some general-purpose troubleshooting techniques we use
● Three things we’ve seen trouble with
● For each one
○ What happened
○ Why it happened
○ What we’re doing to make it not happen again
● Wrap up and questions
4
Background
How did we get here?
● Supporting Kafka!
● Subscription customers, mailing list subscribers, our own sweat/blood/tears
Why does it matter?
● Avoid the mistakes of others
● Reduce time to a stable production deployment
● Help improve Kafka!
OK, but who should really care?
● Admins (what should I look out for/why should I upgrade?)
● Developers (how can I be a good citizen/why does my admin look at me like that?)
● Architects (what are good deployment strategies/how have we addressed problem use cases?)
5
Concept Overview: How Requests Flow in Kafka
● Replication: copying messages to other brokers for durability
● ISR: “In-sync Replica” -- is this replica up to date?
● Brokers are both servers and clients
● Coherence matters
6
Troubleshooting (JMX)
● Why JMX?
○ Lightweight for the broker, lightweight for your storage
○ Designed for historical information and pattern recognition
○ Easily shared (could even publish them to Kafka!) and moved to a new (not local) device
● Critical metrics (http://kafka.apache.org/documentation.html#monitoring)
○ Alert on these
○ Alert != restart
● How hard is it to set up?
○ Plenty of solutions of varying detail and price
○ Find what works for your org
● But what do all of these metrics mean??
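The "alert on these" and "alert != restart" points can be sketched as a small rule evaluator. This is a minimal illustration, not a production monitor: the `AlertRule` type, the `evaluate` function, and the thresholds are hypothetical, and the samples are assumed to have been collected from JMX already, with the idle metrics expressed as fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class AlertRule:
    metric: str                     # metric name as reported by your collector
    fires: Callable[[float], bool]  # predicate over the latest sample
    message: str


# Hypothetical thresholds; tune them against your own cluster's baseline.
RULES: Sequence[AlertRule] = (
    AlertRule("UnderReplicatedPartitions", lambda v: v > 0,
              "partitions are under-replicated"),
    AlertRule("RequestHandlerAvgIdlePercent", lambda v: v < 0.3,
              "request handlers are mostly busy"),
    AlertRule("NetworkProcessorAvgIdlePercent", lambda v: v < 0.3,
              "network processors are mostly busy"),
)


def evaluate(samples: Dict[str, float],
             rules: Sequence[AlertRule] = RULES) -> List[str]:
    """Return human-readable alerts. Nothing here restarts a broker."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and rule.fires(value):
            alerts.append(f"{rule.metric}={value}: {rule.message}")
    return alerts
```

The evaluator only reports; deciding what to do with an alert is left to a person or to a separate, deliberately designed process.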
7
Troubleshooting (JMX) - Key Broker Resources
8
Example 1 -- ISR Shrink/Expand
● Initial problem description
○ Under-replicated partitions are growing
● Scenario
○ Issue self heals
○ NetworkProcessorAvgIdlePercent stabilizes at 60%
○ Brokers are 0.10.0 with some 0.9.x clients
○ kafkacat -L (metadata listing) requests time out occasionally
● Cause
○ 0.9.x clients were slow to receive responses
○ A blocking call was used to send down converted messages to older clients
○ This tied up network processor threads
9
Example 1 -- ISR Shrink/Expand
10
Example 1 -- ISR Shrink/Expand
● Prevention
○ Warn on ISR Shrinks/Expands
○ Warn on high Network and Request handler utilization/saturation
○ Be mindful of increasing request latency
● Solution
○ Upgrade to 0.10.0.1 with the permanent fix
○ Alternatively you could upgrade the clients
● Moral
○ Treat each issue like a new one and make no assumptions about the cause. Use the available metrics to limit the scope of your investigation.
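One way to act on "warn on ISR shrinks/expands" is to look for windows in which ISRs both shrink and expand repeatedly, which matches the self-healing pattern in this example. The sketch below assumes per-interval event counts have already been derived from the broker's IsrShrinksPerSec and IsrExpandsPerSec metrics; the function name and the `min_events` threshold are hypothetical.

```python
from typing import Sequence


def isr_is_flapping(shrinks: Sequence[int],
                    expands: Sequence[int],
                    min_events: int = 3) -> bool:
    """Flag a window in which ISRs both shrank and expanded repeatedly.

    `shrinks` and `expands` are per-interval event counts covering the
    same time window (one entry per scrape interval).
    """
    if len(shrinks) != len(expands):
        raise ValueError("shrink and expand windows must have the same length")
    # Shrinks alone suggest a replica that fell behind and stayed behind;
    # shrinks *and* expands suggest replicas repeatedly dropping out and
    # catching up again, which is worth a warning even if it self-heals.
    return sum(shrinks) >= min_events and sum(expands) >= min_events
```

A warning from this check is a prompt to look at network and request handler idle metrics, not a diagnosis on its own.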
11
Example 2 -- Failed automation
● Initial problem description
○ 1 broker goes “down” repeatedly
○ Full cluster restart, stabilizing for > 1 hour
○ After whole cluster is up, some partitions are permanently under-replicated
● Scenario
○ Environment: Cloud, Docker
○ For any failure, destroy/rebuild containers
○ Failure = ELB to broker connection failure
● Cause
○ Single broker lost connectivity with the ELB
○ Full cluster restart crushed the controller upon startup (8000+ partitions across 5 brokers).
○ Repeated automatic restarts during stabilization exacerbated problem
12
Example 2 - Methodology
13
Example 2 -- Failed automation
● Prevention
○ Go to the source of truth for broker liveness, ZooKeeper
○ Alert and analyze upon “broker down” instead of triggering a container rebuild
○ Avoid “system reset” as a debugging tool
● Solution
○ Near term: disable controlled shutdown to avoid exposure
○ Long term: reduce the number of partitions and take preventative measures above
● Moral
○ Implement monitoring with JMX and rely on it
○ If you aren’t sure what action to take automatically, tell a human
○ Distributed systems and blind restarts do not mix
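The prevention steps above can be sketched as a decision function that consults ZooKeeper registration before trusting a load balancer health check, and that only ever produces alerts. The names `Action` and `decide` are hypothetical. `registered_ids` is assumed to be the set of broker IDs currently registered in ZooKeeper (brokers register themselves under `/brokers/ids`), fetched by whatever client your tooling already uses.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    NONE = "no action"
    ALERT_NETWORK = "alert: load balancer cannot reach a live broker"
    ALERT_BROKER_DOWN = "alert: broker is not registered in ZooKeeper"


def decide(broker_id: int, lb_healthy: bool, registered_ids: Set[int]) -> Action:
    """Pick an alert for a human; never rebuild a container from here."""
    if broker_id not in registered_ids:
        # ZooKeeper is the source of truth: the broker really is down.
        return Action.ALERT_BROKER_DOWN
    if not lb_healthy:
        # The broker is alive; the failure is between it and the load balancer.
        return Action.ALERT_NETWORK
    return Action.NONE
```

In the incident described here, the failing check would have mapped to `ALERT_NETWORK`: the broker was still registered, so a rebuild was the wrong response.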
14
Example 3 -- Reassignment Storm
● Initial problem description
○ Bad performance, producing is slow, consuming is slow, ISRs are shrinking
● Scenario
○ Adding a new broker
○ Partition reassignment done manually
○ Reassignment tool requires some knowledge of how replication works
● Cause
○ A cluster-wide partition reassignment was started
○ Brokers’ network processors overwhelmed
○ Crushed network processors == everything slows down
○ Prior to 0.10.1, the reassignment process cannot be throttled
15
Example 3 - Methodology
16
Example 3 -- Reassignment Storm
● Prevention
○ Take into account the number of partitions being moved
● Solution
○ Move a small number of partitions at a time
○ Upgrade to 0.10.1 or higher to take advantage of replica throttling (http://kafka.apache.org/documentation.html#rep-throttle)
○ Confluent Rebalancer
● Moral
○ Monitor the cluster with JMX to understand loading
○ Anytime you change how data is flowing, test in a staging environment first if possible
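"Move a small number of partitions at a time" can be sketched as splitting one large reassignment plan into several small ones. The sketch assumes the plan uses the JSON layout accepted by the partition reassignment tool (a `version` field and a `partitions` list of topic/partition/replicas entries); the function name and the batch size are hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def batch_reassignment(plan: Dict[str, Any], batch_size: int) -> Iterator[str]:
    """Yield smaller reassignment plans, each moving at most batch_size partitions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    moves = plan["partitions"]
    for start in range(0, len(moves), batch_size):
        yield json.dumps({
            "version": plan.get("version", 1),
            "partitions": moves[start:start + batch_size],
        })
```

Each batch would be submitted on its own and confirmed complete, with the cluster's metrics back at baseline, before the next one starts.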
17
What did we learn...
● Implement monitoring with JMX and rely on it
● If you aren’t sure what action to take automatically, tell a human
● Stateful distributed systems and blind restarts do not mix
● Monitor the cluster with JMX to understand loading
● Anytime you change how data is flowing, test in a staging environment first if possible
● Not all problems have a single solution; use metrics to tease out the root cause before acting
18
Troubleshooting (JMX) - Utilization/Saturation
[Diagram: resource utilization metrics, keyed by Replica Manager, Request Handler Pool, and Network Processor Threads]
Metrics shown: UnderReplicatedPartitions, RequestHandlerAvgIdlePercent, NetworkProcessorAvgIdlePercent, ResponseQueueSize, IdlePercent, RequestsPerSec, ResponseSendTimeMs, RequestQueueSize, RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs
19
Troubleshooting (Logging/Errors)
● Should not drive investigation
● Supplements observed metrics
● Provides context to the observed metrics for further investigation
● Exception stack traces are useful for spotting bugs
20
Troubleshooting (Methodology) - USE
Summary:
Check Utilization, Saturation and Errors for each resource
Definitions:
● Utilization: How much work is being performed
● Saturation: No additional work can be performed
● Errors: Error, possibly Warn level messages in the logs
Reasoning:
● Avoid needless work
● Expedite TTR
● Accurate RCAs
Acknowledgments:
“Systems Performance: Enterprise and the Cloud”, Brendan Gregg
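A USE check for one broker resource could be sketched as below. The mapping is an assumption made for illustration, not something the slides prescribe: utilization is taken as one minus the average idle fraction, and saturation is approximated by how full the resource's queue is. `UseReading` and `use_reading` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseReading:
    resource: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # fraction of queue capacity in use, 0.0 to 1.0
    errors: int         # count of ERROR (possibly WARN) log lines in the window


def use_reading(resource: str, avg_idle_fraction: float,
                queue_size: int, queue_capacity: int,
                error_count: int) -> UseReading:
    """Summarize utilization, saturation, and errors for one resource."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    return UseReading(
        resource=resource,
        utilization=1.0 - avg_idle_fraction,
        # Assumption: a filling queue is a proxy for work that cannot be
        # serviced yet; cap at 1.0 in case the sample exceeds capacity.
        saturation=min(queue_size / queue_capacity, 1.0),
        errors=error_count,
    )
```

For the request handler pool, for example, the inputs might be RequestHandlerAvgIdlePercent, RequestQueueSize, and the broker's configured request queue limit.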
21
In Summary...
● Get those JMX metrics monitoring systems in place!
● Understand what your metrics are telling you before taking action
● Only restart if you have a reason to believe it will fix the problem
● When adding clients or brokers, test in a staging environment
● Looking for more Kafka?
• Stream me up, Scotty: Transitioning to the cloud using a streaming data platform -- Gwen Shapira/Bob Lehmann, Today, 2:40PM 230A
• Ask Me Anything -- Gwen Shapira, Tomorrow 4:20PM 212 A-B
• Kafka Summit → https://kafka-summit.org/ (5/8 NYC, 8/28 SF)
• Confluent University Training → https://www.confluent.io/training/
• Docs → http://docs.confluent.io/current
• Confluent Enterprise (built on Kafka) → https://www.confluent.io/product/
22
Thank You!
Dustin Cote | dustin@confluent.io | @TrudgeDMC
Ryan Pridgeon | ryan@confluent.io
Also check out:
Stream me up, Scotty: Transitioning to the cloud using a
streaming data platform -- Gwen Shapira/Bob Lehmann,
Today, 2:40PM 230A
Ask Me Anything -- Gwen Shapira, Tomorrow 4:20PM
212 A-B
23

Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
