@allenxwang
Multi-cluster, Multi-tenant and Hierarchical Kafka Messaging Service
Allen Wang
Growing Pains for A Kafka Cluster
● A few brokers, a handful of topics, tens of partitions
○ Wonderful!
● Tens of brokers, tens of topics, hundreds of partitions
○ Life is good!
● A hundred brokers, a hundred topics, thousands of partitions
○ … OK
● Hundreds of brokers, hundreds of topics, one hundred thousand partitions
○ ???
Why a Huge Kafka Cluster Does Not Work
● Significant time increase on operations
○ Rolling binary update
■ Three minutes per broker × 500 brokers = 25 hours, about 1 whole day
○ Rolling AMI (image) update with data copying
■ One hour per broker × 500 brokers = 500 hours, about 21 days
● Increased latency due to the number of partitions
○ https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
● Vulnerability to ZK/Controller failures
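The rolling-update estimates above are simple multiplication, assuming brokers are updated strictly one at a time. A minimal sketch of that arithmetic follows; the class and method names are illustrative and not from the talk.

```java
import java.time.Duration;

class RollingUpdateEstimate {
    // Total wall-clock time when brokers are updated strictly one after another.
    static Duration total(Duration perBroker, int brokers) {
        return perBroker.multipliedBy(brokers);
    }

    public static void main(String[] args) {
        Duration binary = total(Duration.ofMinutes(3), 500);
        Duration image = total(Duration.ofHours(1), 500);
        System.out.println("Binary update: " + binary.toHours() + " hours");
        System.out.println("Image update: " + image.toHours() + " hours");
    }
}
```

Updating several brokers in parallel would shorten these totals, but that trades against availability and is outside what the slide assumes.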
Scaling and Data Balancing Challenge
● The problem with partition reassignment
○ Time consuming
○ Replication traffic taking bandwidth
○ Complexity of bin packing for data balancing
The Consumer Fan-out Problem
BytesOut = (numberOfConsumers + replicationFactor - 1) ✕ BytesIn
● A single cluster may easily handle the bytes in, but not necessarily the bytes out
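To make the fan-out formula concrete, here is a small sketch that evaluates it for a few consumer counts. The inbound rate of 100 MB/s and the replication factor of 3 are made-up inputs, and each consumer is assumed to read the full stream.

```java
class FanOut {
    // BytesOut = (numberOfConsumers + replicationFactor - 1) * BytesIn
    static double bytesOut(int numberOfConsumers, int replicationFactor, double bytesIn) {
        return (numberOfConsumers + replicationFactor - 1) * bytesIn;
    }

    public static void main(String[] args) {
        double bytesIn = 100.0;    // hypothetical inbound rate in MB/s
        int replicationFactor = 3; // hypothetical replication factor
        for (int consumers : new int[] {1, 5, 20}) {
            System.out.printf("consumers=%d -> bytesOut=%.0f MB/s%n",
                    consumers, bytesOut(consumers, replicationFactor, bytesIn));
        }
    }
}
```

The inbound side stays constant while the outbound side grows linearly with the number of consumers, which is why a cluster sized for producers can still run out of outbound bandwidth.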
Solve Consumer Fan-out with Hierarchies
Inevitability of Multi-cluster
The Idea
● Create many small and mostly “immutable” clusters
● Organize them in a topology with a routing service connecting the clusters
Multi-Cluster Kafka Service At Netflix
[Diagram components: Event Producer, HTTP Proxy, Fronting Kafka, Router (w/ simple ETL), Consumer Kafka, Consumers, Management]
Multi-cluster Producers
● Support producing to multiple clusters at the same time
● High-level producer API implemented by multiple embedded Kafka producers
public interface KsProducer<V> {
    // ...
    <T extends V> CompletableFuture<SendResult> send(T obj);
}
● Dynamic topic-to-cluster mapping
○ Enabled by NetflixOSS/Archaius
"t1, t2" : {
    "where" : [{
        "sink" : "fronting-kafka-1"
    }]
},
"t3" : {
    "where" : [{
        "sink" : "fronting-kafka-2"
    }]
},
"__default__" : {
    "where" : [{
        "sink" : "fronting-kafka-2"
    }]
}
@Stream("foo") // send to topic "foo"
public class Foo {
    // ...
}
@Stream("bar") // send to topic "bar"
public class Bar {
    // ...
}
KsProducer<Object> producer = // …
producer.send(new Foo()); // Send to the Kafka cluster that has the "foo" topic
producer.send(new Bar()); // Send to the Kafka cluster that has the "bar" topic
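One way the annotation and the topic-to-cluster mapping could fit together is sketched below. This is not the Netflix implementation: the `Stream` annotation is redeclared locally as a stand-in, `clusterFor` is a hypothetical helper, the mapping is a plain in-memory `Map` instead of an Archaius property, and the entry for "foo" is invented for the example.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Map;

class RoutingSketch {
    // Hypothetical stand-in for the @Stream annotation shown in the slides.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Stream { String value(); }

    @Stream("foo") static class Foo {}
    @Stream("bar") static class Bar {}

    // Resolve the topic from the annotation, then the cluster from the mapping,
    // falling back to the "__default__" entry when the topic is not listed.
    static String clusterFor(Object event, Map<String, String> topicToCluster) {
        Stream stream = event.getClass().getAnnotation(Stream.class);
        if (stream == null) {
            throw new IllegalArgumentException("Missing @Stream on " + event.getClass());
        }
        String cluster = topicToCluster.get(stream.value());
        return cluster != null ? cluster : topicToCluster.get("__default__");
    }

    public static void main(String[] args) {
        // Made-up mapping in the spirit of the configuration above.
        Map<String, String> mapping = Map.of(
                "foo", "fronting-kafka-1",
                "__default__", "fronting-kafka-2");
        System.out.println("Foo -> " + clusterFor(new Foo(), mapping));
        System.out.println("Bar -> " + clusterFor(new Bar(), mapping));
    }
}
```

A real producer would hold one embedded Kafka producer per sink and hand the record to the one selected this way; because the lookup happens on every send, a mapping update takes effect without restarting the application.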
Fronting Kafka
● For data collection and buffering
● Optimized for producers
○ The only consumers are the routers
Scaling of Fronting Kafka
● Creating / destroying Kafka clusters
○ E.g., create new topics on new clusters and update the topic-to-cluster mapping
● No partition reassignment
Data Balancing
● Assign the same number of partitions of any topic to every broker
○ E.g., for clusters of 12 brokers, create topics with 12, 24, or 36 partitions
○ Guaranteed even distribution of data (aside from occasional leader imbalance)
● Balance data among clusters by moving topics
○ Must dynamically update the topic-to-cluster mapping
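The "multiple of the broker count" rule can be checked with a few lines of code. The sketch below assumes plain round-robin placement of partitions across brokers, which is a simplification: it ignores replicas, racks, and leader election, and the class name is illustrative.

```java
import java.util.Arrays;

class PartitionBalance {
    // Count how many partitions land on each broker under round-robin placement.
    static int[] partitionsPerBroker(int partitions, int brokers) {
        int[] counts = new int[brokers];
        for (int p = 0; p < partitions; p++) {
            counts[p % brokers]++;
        }
        return counts;
    }

    // True when every broker holds the same number of partitions.
    static boolean isEven(int partitions, int brokers) {
        int[] counts = partitionsPerBroker(partitions, brokers);
        for (int c : counts) {
            if (c != counts[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int brokers = 12;
        for (int partitions : new int[] {12, 24, 30, 36}) {
            System.out.println(partitions + " partitions: even=" + isEven(partitions, brokers)
                    + " " + Arrays.toString(partitionsPerBroker(partitions, brokers)));
        }
    }
}
```

With 30 partitions on 12 brokers, some brokers carry three partitions and others two, which is the imbalance the slide's rule avoids.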
Topic Move
[Diagram components: Event Producer, Fronting Kafka, Router, Consumer Kafka, Consumer; step label: Create topic “foo”; topic “foo” appears on two clusters]
Consumer Kafka
● Scaling
○ For small clusters, add brokers and partitions for non-keyed topics
○ Create the same topics on a new cluster and move consumers
Future Plan
● Cross-cluster topic
○ Load sharing beyond a single cluster
○ Auto-scale
○ Consumer/producer support needed
Multi-Cluster Consumer (Ongoing work)
● Same Kafka consumer interface
● Consume from multiple clusters with dynamic topic-to-cluster mapping
○ Keep subscription state
○ Receive mapping updates
○ Create and delegate to an underlying Kafka consumer for each associated cluster on the fly
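The three behaviours listed above (keep subscription state, receive mapping updates, delegate per cluster) can be sketched without a Kafka dependency. Everything here is hypothetical: `ClusterConsumer` is a minimal stand-in for a real Kafka consumer, `fake` is an in-memory test double, and the class is an illustration of the idea, not the work in progress described in the talk.

```java
import java.util.*;
import java.util.function.Function;

class MultiClusterConsumerSketch {
    // Hypothetical minimal stand-in for a per-cluster Kafka consumer.
    interface ClusterConsumer {
        void subscribe(Collection<String> topics);
        List<String> poll();
    }

    private final Function<String, ClusterConsumer> factory;
    private final Map<String, ClusterConsumer> delegates = new TreeMap<>();
    private final Set<String> subscription = new LinkedHashSet<>();
    private Map<String, List<String>> mapping = Map.of();

    MultiClusterConsumerSketch(Function<String, ClusterConsumer> factory) {
        this.factory = factory;
    }

    // Keep subscription state so it can be re-applied after a mapping change.
    void subscribe(Collection<String> topics) {
        subscription.clear();
        subscription.addAll(topics);
        rebuild();
    }

    // Receive a new topic-to-cluster mapping.
    void onMappingUpdate(Map<String, List<String>> newMapping) {
        mapping = newMapping;
        rebuild();
    }

    // Create delegates for newly needed clusters and drop the ones no longer needed.
    private void rebuild() {
        Map<String, Set<String>> topicsByCluster = new TreeMap<>();
        for (String topic : subscription) {
            for (String cluster : mapping.getOrDefault(topic, List.of())) {
                topicsByCluster.computeIfAbsent(cluster, c -> new TreeSet<>()).add(topic);
            }
        }
        delegates.keySet().retainAll(topicsByCluster.keySet()); // a real consumer would be closed here
        topicsByCluster.forEach((cluster, topics) ->
                delegates.computeIfAbsent(cluster, factory).subscribe(topics));
    }

    // Poll every delegate and return the aggregated records.
    List<String> poll() {
        List<String> out = new ArrayList<>();
        for (ClusterConsumer c : delegates.values()) {
            out.addAll(c.poll());
        }
        return out;
    }

    // In-memory fake: each poll returns one record per subscribed topic, tagged with the cluster.
    static ClusterConsumer fake(String cluster) {
        return new ClusterConsumer() {
            private List<String> topics = List.of();
            public void subscribe(Collection<String> t) { topics = new ArrayList<>(t); }
            public List<String> poll() {
                List<String> records = new ArrayList<>();
                for (String t : topics) {
                    records.add(cluster + "/" + t);
                }
                return records;
            }
        };
    }

    public static void main(String[] args) {
        MultiClusterConsumerSketch consumer = new MultiClusterConsumerSketch(MultiClusterConsumerSketch::fake);
        consumer.subscribe(List.of("foo"));
        consumer.onMappingUpdate(Map.of("foo", List.of("cluster1", "cluster2"), "bar", List.of("cluster2")));
        System.out.println(consumer.poll());
        consumer.onMappingUpdate(Map.of("foo", List.of("cluster2")));
        System.out.println(consumer.poll());
    }
}
```

The sketch leaves out offsets, rebalancing, and draining an old cluster before dropping it, all of which a production consumer would have to handle.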
Multi-Cluster Consumer Topic to Cluster Mapping and Code Example
{
    "foo": [
        {"vip": "cluster1"},
        {"vip": "cluster2"}
    ],
    "bar": [
        {"vip": "cluster2"}
    ]
}
// Create a multi-cluster consumer
Consumer<String, String> multiClusterConsumer = ...
// Subscribe as usual; the consumer keeps the subscription state
multiClusterConsumer.subscribe(Collections.singletonList("foo"));
while (...) {
    // Fetch from both clusters for topic "foo" and
    // return the aggregated records
    ConsumerRecords<String, String> records =
        multiClusterConsumer.poll(2000);
    process(records);
}
Topic move for Multi-cluster Consumers
[Diagram components: Producer, Multi-cluster Consumer, cluster1, cluster2]
[Producer mapping labels: “foo”: “cluster1”; “foo”: “cluster2”]
[Multi-cluster Consumer mapping labels: “foo”: [“cluster1”]; “foo”: [“cluster1”, “cluster2”]; “foo”: [“cluster2”]]
Our Vision
[Diagram components: Producers; Fronting Kafka w/ Cross-cluster Topics (topics “foo” and “bar”); Router; Consumer Kafka; Multi-cluster Consumer; Advanced Consumer]
What About Keyed Messages
● Few topics at Netflix require keyed messages
● A word of caution for keyed messages
○ Inflexible/skewed load balancing
○ Difficult to scale
● Handling of keyed messages
○ Currently only produced by routers to consumer Kafka
○ Hard to guarantee message ordering in a multi-cluster setting
○ Key-consumer affinity is guaranteed
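One reason keyed topics are difficult to scale is that the partition for a key is typically derived from a hash of the key modulo the partition count, so adding partitions remaps existing keys. The sketch below illustrates the effect with `String.hashCode` as a simplified stand-in; Kafka's default partitioner uses a murmur2 hash of the serialized key, and the key names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class KeyedPartitioning {
    // Simplified partitioner: non-negative hash modulo the partition count.
    // String.hashCode is an illustrative stand-in, not Kafka's murmur2 hash.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Count keys whose partition changes when the partition count changes.
    static int movedKeys(List<String> keys, int before, int after) {
        int moved = 0;
        for (String key : keys) {
            if (partitionFor(key, before) != partitionFor(key, after)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("user-" + i); // hypothetical keys
        }
        System.out.println("Keys remapped going from 12 to 24 partitions: "
                + movedKeys(keys, 12, 24) + " of " + keys.size());
    }
}
```

Any key that moves to a different partition loses its ordering relative to earlier messages with the same key and may be handled by a different consumer instance, which is why non-keyed topics are much easier to expand.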
Think Differently on Scaling Kafka
              | The “broker” way                                 | The “cluster” way
Scale up      | Add brokers                                      | Add clusters
Data balance  | Move partitions to different brokers             | Move/expand topics to different clusters
Producer      | Produce to different brokers at the same time    | Produce to different clusters at the same time
Consumer      | Consume from different brokers at the same time  | Consume from different clusters at the same time
Thank You
https://medium.com/netflix-techblog
https://jobs.netflix.com/
