1
Running Apache Kafka
as a Service at Scale
Sriram Subramanian, Director of Engineering
2
52% of Kafka users in the cloud
3
Availability
Latency
Durability
4
Managing Distributed Systems is really really hard
5
Failure is inevitable – Cloud makes it worse
6
Need to adapt to change
7
Observability is a dark art
You can’t fix what you can’t see
8
Observability is a dark art
Observability is a dark art perfected only with time and experience!
9
Kafka is no different
10
Takes years of iteration
= v1.0
11
Takes years of iteration
&
12
Stories to Tell
13
A stitch in time saves nine
test data
14
Types of testing we do today
15
Testing numbers
4K+ test hours
8K+ total tests
75% coverage
16
17
Failed tests involved correlated failures
[Diagram: Kafka Cluster of Broker 1, Broker 2 ... Broker N; producers write to and consumers read from topic partitions, which are replicated between brokers]
18
State of the logs
Offset   Replica A   Replica B
0        m1          m1
1        m3          m2
2        m4          m4
3        m5          m5
19
How Does Replication Work?
[Diagram: Replica A and Replica B each hold m1 at offset 0; HW is at offset 0 on both; L marks the leader]
HW = High Water Mark: tracks the committed messages
20
How Does Replication Work?
[Diagram: Replica A and Replica B each hold m1 at offset 0; HW is at offset 0 on both; a new message m2 arrives at the leader (L)]
21
How Does Replication Work?
[Diagram: m2 is appended at offset 1 on the leader (L); the other replica still holds only m1; HW is at offset 0 on both]
22
How Does Replication Work?
[Diagram: m2 is at offset 1 on the leader (L) and a copy of m2 is on its way to the other replica; HW is at offset 0 on both]
23
How Does Replication Work?
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; HW is still at offset 0 on both; L marks the leader]
24
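How Does Replication Work?
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; an ACK is shown for m2; HW is still at offset 0 on both; L marks the leader]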
25
How Does Replication Work?
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; the two HW markers are no longer level; L marks the leader]
26
How Does Replication Work?
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; the two HW markers are not level; L marks the leader]
Updated HW to 1
27
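How Does Replication Work?
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; the HW markers are level again on both replicas; L marks the leader]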
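The frames above can be read as a simple loop. What follows is a minimal sketch under stated assumptions, not Kafka's implementation: `Replica` and `fetch` are hypothetical names, only two replicas are modelled, and HW follows the slide convention of pointing at the last committed offset.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a partition log plus its high water mark (HW)."""
    name: str
    log: list = field(default_factory=list)
    hw: int = -1  # offset of the last committed message (slide convention); -1 = none


def fetch(leader: Replica, follower: Replica) -> None:
    """One simplified fetch round trip between a follower and its leader."""
    hw_in_response = leader.hw                           # leader's HW when it answers
    follower.log.extend(leader.log[len(follower.log):])  # follower copies what it is missing
    follower.hw = min(hw_in_response, len(follower.log) - 1)
    # With the follower caught up, the leader can commit everything both replicas hold.
    leader.hw = min(len(leader.log), len(follower.log)) - 1


leader = Replica("leader", ["m1"], hw=0)
follower = Replica("follower", ["m1"], hw=0)
leader.log.append("m2")        # a producer writes m2 to the leader at offset 1
fetch(leader, follower)        # follower copies m2; only the leader's HW moves to 1
print(leader.hw, follower.hw)
fetch(leader, follower)        # the next fetch carries HW = 1 to the follower
print(leader.hw, follower.hw)
```

The point to notice is the gap between the two `fetch` calls: the follower already holds m2 but still believes HW is 0.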
28
29
Root Cause
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; the two HW markers are not level; L marks the leader]
30
Root Cause
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1; the two HW markers are not level; L marks the leader]
31
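Root Cause
[Diagram: m2 remains at offset 1 on one replica; the other replica's copy of m2 is marked "Truncate"; the two HW markers are not level; L marks the leader]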
32
Root Cause
[Diagram: m1 at offset 0 on both replicas; m2 at offset 1 on one replica only; the two HW markers are not level; L marks the leader]
33
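Root Cause
[Diagram: m1 at offset 0 on both replicas; m2 at offset 1 on one replica only; a "Get m2" request is shown; the two HW markers are not level; L marks the leader]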
34
Root Cause
[Diagram: m1 at offset 0 on both replicas; m2 at offset 1 on one replica only; the two HW markers are not level; L marks the leader]
35
Root Cause
[Diagram: m1 at offset 0 on both replicas; m2 at offset 1 on one replica only; the two HW markers are not level; L marks the leader]
Message m2 is lost
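One way to read the Root Cause frames is sketched below. It assumes that the replica with the stale HW is a follower that restarts and truncates its log to that HW, and that the leader fails before the follower can fetch m2 again. The function names are hypothetical and the model is deliberately tiny.

```python
def truncate_to_hw(log, hw):
    """Restart behaviour suggested by the slides: keep only offsets 0..hw."""
    return log[: hw + 1]


def simulate_hw_truncation():
    # Both replicas hold m1 and m2, but only the leader knows that HW is 1.
    leader_log = ["m1", "m2"]
    follower_log, follower_hw = ["m1", "m2"], 0

    # The follower restarts and truncates to its stale HW, dropping m2.
    follower_log = truncate_to_hw(follower_log, follower_hw)

    # Assumption: the leader fails before the follower re-fetches m2,
    # so the follower takes over with the log it has.
    new_leader_log = follower_log
    return [m for m in leader_log if m not in new_leader_log]


print(simulate_hw_truncation())
```

The returned list holds the messages that were committed on the old leader but are missing from the new one.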
36
Introducing Leader Generation
37
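Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the two HW markers are not level; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0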
38
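Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the two HW markers are not level; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0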
39
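Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the two HW markers are not level; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0
Leader Generation Request: LG0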
40
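Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the two HW markers are not level; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0
Offset = 1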
41
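Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the HW markers are level on both replicas; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0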
42
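Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the HW markers are level on both replicas; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0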
43
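Zero Data Loss
[Diagram: Replica A and Replica B both hold m1 at offset 0 and m2 at offset 1, each message tagged LG0; the HW markers are level on both replicas; L marks the leader]
LEADER GENERATION MAP: LG0 -> 0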
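The Zero Data Loss frames suggest that a restarting replica asks the leader where a leader generation ends instead of trusting its own HW. Below is a minimal sketch of that idea. It assumes the map stores the start offset of each generation (the reading of "LG0 0" used here), and both function names are hypothetical.

```python
def last_offset_of_generation(leader_log, generation_map, generation):
    """Leader side: answer a Leader Generation Request.

    generation_map is a list of (generation, start_offset) pairs, oldest first.
    """
    for i, (gen, _start) in enumerate(generation_map):
        if gen == generation:
            if i + 1 < len(generation_map):
                return generation_map[i + 1][1] - 1   # ends where the next one starts
            return len(leader_log) - 1                # still the current generation
    raise KeyError(f"unknown leader generation {generation}")


def truncate_on_restart(follower_log, leader_last_offset):
    """Follower side: keep offsets 0..leader_last_offset instead of truncating to HW."""
    return follower_log[: leader_last_offset + 1]


leader_log = ["m1", "m2"]
generation_map = [(0, 0)]        # "LG0 0": generation 0 starts at offset 0
follower_log = ["m1", "m2"]      # the follower's HW is still 0, but it is not consulted

offset = last_offset_of_generation(leader_log, generation_map, 0)
follower_log = truncate_on_restart(follower_log, offset)
print(offset, follower_log)
```

Because the leader answers with offset 1, the follower keeps m2, so a later leader change does not lose it.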
44
When availability became more important than ability
45
Where did my topic partitions go?
[Diagram: Kafka Cluster of Broker 1, Broker 2 ... Broker N with a Controller; producers write to and consumers read from topic partitions, which are replicated between brokers]
46
● Leader Election
● Replica Reassignment
● Create Topic
● Delete Topic
● Add Partitions
● Broker start and shutdown
Responsibilities of the Controller
No Controller - No Cluster
47
● Controller state - topic creation, topic deletion, etc.
● Time taken to perform an operation
● Rate at which an admin operation is performed
● Queue sizes within the controller
Lack of Metrics
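The four missing metrics can be pictured on a toy event queue. This is only a sketch: `InstrumentedController` and its methods are hypothetical names and do not correspond to Kafka's controller code or to its metric names.

```python
import time
from collections import defaultdict, deque


class InstrumentedController:
    """Toy single-threaded event queue that records the metrics listed above."""

    def __init__(self):
        self.queue = deque()
        self.state = "Idle"                      # what the controller is doing now
        self.count = defaultdict(int)            # operations completed, per kind
        self.total_seconds = defaultdict(float)  # time spent, per kind
        self.started_at = time.monotonic()

    def submit(self, kind, action):
        self.queue.append((kind, action))

    def queue_size(self):
        return len(self.queue)

    def run_next(self):
        kind, action = self.queue.popleft()
        self.state = kind
        began = time.monotonic()
        try:
            action()
        finally:
            self.total_seconds[kind] += time.monotonic() - began
            self.count[kind] += 1
            self.state = "Idle"

    def rate_per_second(self, kind):
        elapsed = time.monotonic() - self.started_at
        return self.count[kind] / elapsed if elapsed > 0 else 0.0


controller = InstrumentedController()
controller.submit("TopicDeletion", lambda: None)
print(controller.queue_size())   # queue size before the event is processed
controller.run_next()
print(controller.state, controller.count["TopicDeletion"])
```

With state, per-operation time, rate, and queue size exposed, a stuck or slow controller becomes visible instead of silent.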
48
Root Cause
Partition 1
Partition 2
Partition 3
Partition 4
49
Root Cause
Upgrades!
● Synchronous per-partition ZooKeeper writes
● Sequential per-partition controller-to-broker requests
● Complicated concurrency semantics
● No separation of control plane from data plane
50
How did we fix it?
Partitions 1, 2, 3, 4
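The contrast between the Root Cause slides (one partition at a time) and this one (partitions 1 to 4 together) looks like batching. A minimal sketch of the effect, assuming a fixed cost per round trip; the function names and the 5 ms figure are illustrative assumptions, not measurements from the talk.

```python
def per_partition_requests(partitions):
    """One request per partition, handled one after another."""
    return [[p] for p in partitions]


def batched_requests(partitions, max_batch=None):
    """Group partitions so that one request covers many of them."""
    size = max_batch or len(partitions) or 1
    return [partitions[i:i + size] for i in range(0, len(partitions), size)]


def total_latency_ms(requests, round_trip_ms):
    """Sequential cost model: every request pays one full round trip."""
    return len(requests) * round_trip_ms


partitions = [1, 2, 3, 4]
print(total_latency_ms(per_partition_requests(partitions), 5))
print(total_latency_ms(batched_requests(partitions), 5))
```

Under this model the cost grows with the number of requests rather than the number of partitions, which is why grouping helps as partition counts rise.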
51
Zero controller downtime!
● Highly available cluster
● 10x faster leader elections
● More topic partitions per cluster
● Faster broker shutdown and upgrades
52
When everything is INFINITE nothing is ever ENOUGH
53
Can I get my latency?
[Diagram: Kafka Cluster of Broker 1, Broker 2 ... Broker N; producers write to and consumers read from topic partitions, which are replicated between brokers]
54
But I have bytes quota set
● Throttle byte rate per second on the broker
● Response delayed on exceeding threshold
● Prevents bad clients from consuming all the bandwidth
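How long a response is delayed can be sketched as follows. The formula is an assumption chosen for illustration: delay the client just long enough that its average rate over the window falls back to the quota. `throttle_delay_ms` is a hypothetical helper, not a Kafka API.

```python
def throttle_delay_ms(observed_bytes_per_sec, quota_bytes_per_sec, window_ms):
    """Delay needed to bring the observed byte rate back down to the quota."""
    if observed_bytes_per_sec <= quota_bytes_per_sec:
        return 0.0                      # under quota: respond immediately
    excess = observed_bytes_per_sec - quota_bytes_per_sec
    return excess / quota_bytes_per_sec * window_ms


# A client allowed 1 MB/s that is sending 2 MB/s, measured over a 1 s window.
print(throttle_delay_ms(2_000_000, 1_000_000, 1000))
```

The delay scales with how far over quota the client is, so a well-behaved client sees no added latency.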
55
Root Cause
Byte rate quotas are useful but not sufficient
56
Root Cause
● Too many small requests
● DDoS attack from clients
● Decompression on the server takes a long time
● With more consumer instances, more requests
57
Predictable Latency
bin/kafka-configs --zookeeper localhost:2181 --alter --add-config 'request_percentage=50' --entity-name user1 --entity-type users
Request Quota: percentage of time a client can spend on request handler (I/O) threads and network threads within each quota window
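The request quota can be pictured as simple arithmetic over one quota window. This is a sketch with hypothetical helper names. It assumes, as the Kafka documentation describes, that each unit of `request_percentage` is a percentage of one thread's time, so broker capacity is the thread count times 100.

```python
def request_time_percentage(thread_time_ms, window_ms):
    """Share of one thread's time that a client used during the quota window."""
    return 100.0 * thread_time_ms / window_ms


def exceeds_request_quota(thread_time_ms, window_ms, request_percentage):
    """True when the client should be throttled for this window."""
    return request_time_percentage(thread_time_ms, window_ms) > request_percentage


def total_capacity_percentage(num_io_threads, num_network_threads):
    """Upper bound on the sum of all request quotas on one broker."""
    return (num_io_threads + num_network_threads) * 100


# user1 has request_percentage=50 and used 600 ms of thread time in a 1 s window.
print(exceeds_request_quota(600, 1000, 50))
```

Unlike a byte-rate quota, this also catches clients that send many small or expensive requests, because it measures the time spent rather than the bytes moved.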
58
Apache Kafka + Tools + Automation
59
Apache Kafka as a Service
60
confluent.io/confluent-cloud
How do you get on Confluent Cloud?
61
Thank You!

Kafka Summit SF 2017 - Running Kafka as a Service at Scale