1
Metrics Are Not Enough
Gwen Shapira, Product Manager
@gwenshap
Monitoring Apache Kafka and Streaming Applications
2
Monitoring Distributed Systems is hard
“Google SRE team with 10–12 members
typically has one or sometimes two members
whose primary assignment is to build and maintain
monitoring systems for their service.”
https://www.oreilly.com/ideas/monitoring-distributed-systems
3
Apache Kafka is a distributed system and has many components
4
Many Moving Parts to Watch
• Producers
• Consumers
• Consumer Groups
• Brokers
• Controller
• Zookeeper
• Topics
• Partitions
• Messages
• …
5
And many metrics to monitor
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases each
• Topic partition status counts: online, under-replicated, offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …​
6
Every Service that uses Kafka is a Distributed System
[Diagram: Orders Service, Stock Service, Fulfilment Service, Fraud Detection Service, Mobile App, Kafka]
7
It is all CRITICAL to your business
• Real-time applications mean very little room for errors
• Is Kafka available and performing well? You need to know before your users do.
• You must detect and act on small problems before they escalate
• The business cares a lot about accuracy and SLAs
• It is 8:05am, does the dashboard reflect the status of the system up to 8am?
• Continuously improve performance
• Monitor Kafka cluster performance
• Identify and act on leading indicators of future problems
• Quick triage – can you identify likely causes of a problem quickly and effectively?
8
So you may need a bit of help
• Operators must have visibility into the health of the Kafka cluster
• The business must have visibility into completeness and latency of message delivery
• Everyone needs to focus on the most meaningful metrics
9
Types of monitoring
• Tailing logs
• OS metrics
• Kafka / Client metrics
• Tracing applications
• Event level sampling
• APM – Application performance from user perspective
• …
11
Monitor System Health of Your Cluster
12
The basics
• Whatever else you do: Check that the broker process is running
• External agent
• Or alert on stale metrics
• Don’t alert on everything. Fewer, high level alerts are better.
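A minimal sketch of the "alert on stale metrics" idea, assuming you already record the time each broker last delivered a metric sample. The function name, the `last_seen` mapping, and the 120-second default are all hypothetical choices for illustration, not part of any Kafka tool.

```python
import time


def stale_brokers(last_seen, now=None, max_age_seconds=120.0):
    """Return the ids of brokers whose newest metric sample is too old.

    last_seen maps broker id -> Unix timestamp (seconds) of the most
    recent sample received from that broker (hypothetical input).
    """
    if now is None:
        now = time.time()
    return sorted(
        broker for broker, ts in last_seen.items()
        if now - ts > max_age_seconds
    )


# Example: broker 3 has been silent for ten minutes.
now = 1_000_000.0
last_seen = {1: now - 15, 2: now - 40, 3: now - 600}
print(stale_brokers(last_seen, now=now))
```

A silent broker and a dead broker look the same to this check, which is the point: one alert covers both a crashed process and a broken metrics pipeline.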
13
First Things First
14
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster wide:
• Out of resources
• Imbalance
• Broker:
• Hardware
• Noisy neighbor
• Configuration
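The "is it a specific broker?" question can be sketched as a small triage function. This assumes you have already counted, from partition placement, how many out-of-sync replicas sit on each broker. That input, the function name, and the 80% cut-off are hypothetical and only illustrate the decision.

```python
def triage_urp(lagging_replicas_by_broker, broker_share=0.8):
    """Classify under-replication as broker-specific or cluster-wide.

    lagging_replicas_by_broker maps broker id -> number of out-of-sync
    replicas hosted on that broker (hypothetical input).
    Returns (scope, broker_id); broker_id is None unless scope is "broker".
    """
    total = sum(lagging_replicas_by_broker.values())
    if total == 0:
        return ("healthy", None)
    worst, count = max(lagging_replicas_by_broker.items(), key=lambda kv: kv[1])
    # If one broker holds most of the lagging replicas, look at that
    # broker first: hardware, noisy neighbor, configuration.
    if count / total >= broker_share:
        return ("broker", worst)
    # Otherwise the problem is spread out: resources or imbalance.
    return ("cluster", None)


print(triage_urp({1: 1, 2: 18, 3: 1}))
print(triage_urp({1: 5, 2: 6, 3: 4}))
```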
15
Drill Down into Broker and Topic: Do we see a problem right here?
16
Check partition placement - is the issue specific to one broker?
17
Don’t watch the dashboard
• Control Center detects anomalous events in monitoring data
• Users can define triggers
• Control Center performs customizable actions when triggers occur
• When troubleshooting Kafka issues, users can view previous alerts and historical message delivery data at the time the alert occurred
18
Capacity Planning – Be Proactive
• Capacity planning ensures that your cluster can continue to meet business demands
• Control Center provides indicators if a cluster may need more brokers
• Key metrics that indicate a cluster is near capacity:
• CPU
• Network and thread pool usage
• Request latencies
• Network utilization - Throughput, per broker and per cluster
• Disk utilization - Disk space used by all log segments, per broker
19
Multi-Cluster Deployments
• Monitor all clusters in one place
20
Monitor End to End Message Delivery
21
Are You Meeting SLAs?
• Stream monitoring helps you determine if all messages are delivered end-to-end in a timely manner
• This is important for several reasons:
• Ensure producers and consumers are not losing messages
• Check if consumers are consuming more than expected
• Verify low latency for real-time applications
• Identify slow consumers
22
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are consumed
• Compare the results
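The four audit steps above can be sketched in a few lines. This is an illustration of the counting idea only, not LinkedIn's implementation. It assumes every message carries its produce timestamp and that counts are grouped into one-minute buckets; the function names and the bucket width are hypothetical.

```python
from collections import Counter


def bucket(ts_ms, width_ms=60_000):
    """Map a timestamp in milliseconds to the start of its time bucket."""
    return ts_ms - ts_ms % width_ms


def audit(produced_ts, consumed):
    """Compare produced and consumed counts per time bucket.

    produced_ts: iterable of produce timestamps (ms).
    consumed: iterable of (produce_ts_ms, consume_ts_ms) pairs.
    """
    produced = Counter(bucket(ts) for ts in produced_ts)
    consumed_counts = Counter()
    max_latency = {}
    for p_ts, c_ts in consumed:
        b = bucket(p_ts)  # bucket by produce time so both sides line up
        consumed_counts[b] += 1
        max_latency[b] = max(max_latency.get(b, 0), c_ts - p_ts)
    report = {}
    for b in sorted(set(produced) | set(consumed_counts)):
        report[b] = {
            "produced": produced[b],
            "consumed": consumed_counts[b],
            "max_latency_ms": max_latency.get(b),
        }
    return report


# Three messages produced, two consumed: the second bucket is short by one.
for b, row in audit([0, 1_000, 61_000], [(0, 500), (1_000, 4_000)]).items():
    print(b, row)
```

Bucketing the consumed side by produce time, rather than consume time, is what makes the two counts comparable even when consumers lag.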
23
Message delivery metrics
Streaming message delivery metrics are available:
• Aggregate
• Per-consumer group
• Per-topic
24
Under Consumption
• Reasons for under consumption:
• Producers not handling errors and retries correctly
• Misbehaving consumers, perhaps the consumer did not follow the shutdown sequence
• Real-time apps intentionally skipping messages
• Red bars indicate some messages were not consumed
• Herringbone pattern can indicate error in measurement
• Usually improper shutdown of client
25
Over Consumption
• Reasons for over consumption
• Consumers may be processing a set of messages more than once, which may affect their applications
• Consumption bars are higher than the expected consumption lines
• Latency may be higher
26
Slow Consumers
• Identify consumers and consumer groups that are not keeping up with data production
• Use the per-consumer and per-consumer group metrics
• Compare a slow, lagging consumer (left) to a good consumer (right)
• The slow consumer (left) is processing all the messages, but with high latency
• Slow consumers may also process fewer messages in a given time window, so monitor "Expected consumption" (the top line)
27
Optimize Performance
28
Identify Performance Bottlenecks
• Real-time applications require high throughput or low latency
• Need to baseline where you are
• Monitor for changes to get ahead of the problem
• You may need to identify performance bottlenecks
• Break down the times for the end-to-end dataflow to see where streams spend the most processing time
• The key metrics to look at include:
• Request latencies
• Network pool usage
• Request pool usage
29
Produce and Fetch Request Latencies
Break down produce and fetch latencies across the entire request lifecycle
Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
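To make the percentile views concrete, here is a small sketch that computes them from raw latency samples using the nearest-rank method. That method is an assumption chosen for simplicity; monitoring tools may use other estimators, and the sample values are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds; one slow outlier.
latencies_ms = [3, 4, 4, 5, 5, 6, 6, 7, 9, 250]
for p in (50, 95, 99, 99.9):
    print(p, percentile(latencies_ms, p))
```

The median here hides the outlier entirely while the high percentiles expose it, which is why tail percentiles matter for latency-sensitive applications.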
30
Request Latencies Explained (1)
• Total request latency (center)
• Total time of an entire request lifecycle, from the broker point of view
• Request queue
• The time the request is in the request queue waiting for an IO thread
• A high value can indicate there are not enough IO threads or CPU is a bottleneck
• Also check: What are those IO threads doing?
• Request local
• The time the request is being processed locally by the leader
• A high value can imply slow disk so monitor broker disk IO
31
Request Latencies Explained (2)
• Response remote
• The time the request is waiting on other brokers
• Higher times are expected on high-reliability or high-throughput systems
• A high value can indicate a slow network connection, or the consumer is caught up to the end of the log
• Response queue
• The time the request is in the response queue waiting for a network thread
• A high value can imply there are not enough network threads
• Response send
• The time the request is being sent back to the consumer
• A high value can imply the CPU or network is a bottleneck
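The phase-by-phase reading above can be sketched as a lookup from the slowest phase to the first thing to check. In Apache Kafka's broker JMX metrics these phases are reported under `kafka.network:type=RequestMetrics` as `RequestQueueTimeMs`, `LocalTimeMs`, `RemoteTimeMs`, `ResponseQueueTimeMs`, and `ResponseSendTimeMs`. The dictionary keys and function name below are hypothetical, and the hints simply restate the bullets.

```python
# First thing to check when a phase dominates (restating the slides).
HINTS = {
    "request_queue": "not enough IO threads, or CPU is a bottleneck",
    "request_local": "slow disk; monitor broker disk IO",
    "response_remote": "slow network, or consumer caught up to end of log",
    "response_queue": "not enough network threads",
    "response_send": "CPU or network is a bottleneck",
}


def dominant_phase(phase_ms):
    """Return (phase, hint) for the phase with the largest latency.

    phase_ms maps a phase name from HINTS to a latency in milliseconds,
    e.g. the 99th percentile of each phase for one request type.
    """
    phase, _ = max(phase_ms.items(), key=lambda kv: kv[1])
    return phase, HINTS[phase]


sample = {
    "request_queue": 2.0,
    "request_local": 40.0,
    "response_remote": 5.0,
    "response_queue": 1.0,
    "response_send": 3.0,
}
print(dominant_phase(sample))
```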
32
Network and Request Handler Threads
• Network pool usage
• Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle
• If network pool usage is above 70%, isolate bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.network.threads, especially if Response queue metric is high and you have resources
• Request pool usage
• Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle
• If request pool usage is above 70%, isolate bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.io.threads, especially if Request queue metric is high
• Why are all your handlers busy? Check GC, access patterns and disk IO
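A sketch of the 70% rule, assuming usage is derived from an average idle fraction between 0 and 1. Kafka brokers expose such values as `NetworkProcessorAvgIdlePercent` and `RequestHandlerAvgIdlePercent`. The pool names, the function, and the input shape are hypothetical; the configuration parameters are the ones named above.

```python
# Broker setting to revisit for each pool (from the slides).
TUNING = {
    "network": "num.network.threads",
    "request": "num.io.threads",
}


def busy_pools(avg_idle_fraction, threshold=0.70):
    """Return (pool, usage, setting) for pools whose usage exceeds threshold.

    avg_idle_fraction maps "network" / "request" to the fraction of time
    the pool's threads are idle (0.0 to 1.0); usage is 1 minus that.
    """
    flagged = []
    for pool, idle in sorted(avg_idle_fraction.items()):
        usage = 1.0 - idle
        if usage > threshold:
            flagged.append((pool, usage, TUNING[pool]))
    return flagged


# Request handlers are 75% busy; network threads are 50% busy.
print(busy_pools({"network": 0.5, "request": 0.25}))
```

A flagged pool is a prompt to open the request latency breakdown, not a reason to add threads immediately; the threads may be busy because of GC or disk IO.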
33
Summary
34
A few things to remember…
• Monitor Kafka
• Work with your developers to monitor critical applications end-to-end
• More data is better: Metrics + logs + OS + APM + …
• But fewer alerts are better
• Alert on what’s important – Under-Replicated Partitions is a good start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
35
And as you start your Production Kafka Journey…
Plan
Validate
Deploy
Observe
Analyze
36
Thank You!
