1
Disaster Recovery Plans
for Apache Kafka
Scale and Availability of Apache Kafka in Multiple Data Centers
@gwenshap
2
3
Bad Things
• Kafka cluster failure
• Major storage / network outage
• Entire DC is demolished
• Floods and Earthquakes
4
Disaster Recovery Plan:
“When in trouble
or in doubt
run in circles,
scream and shout”
5
Disaster Recovery Plan:
When This Happens | Do That
Kafka cluster failure | Fail over to a second cluster in the same data center
Major storage / network outage | Fail over to a second cluster in another “zone” in the same building
Entire data center is demolished | Single Kafka cluster running in multiple nearby data centers / buildings
Floods and earthquakes | Fail over to a second cluster in another region
6
There is no such thing
as a free lunch
Anyone who tells you differently
is selling something
7
Reality:
The same event will not appear
in two DCs at the exact same
time.
8
Things to ask:
• What are the guarantees in the event of an unplanned failover?
• What are the guarantees in the event of a planned failover?
• What is the process for failing back?
• How many data-centers are required?
• How does the solution impact my production performance?
• What are the bandwidth requirements between the data-centers?
9
Every solution needs to balance
these trade-offs
Kafka takes a DIY approach
10
The inherent complexity of multi data-center replication
There is a diversity of approaches
And diversity of problems
Kafka gives you the flexibility and tools to work
And we’ll give you an example and inspire you to build your own
List tradeoffs here
Here are things to watch out for:
How to do your homework
Tweet me :)
11
Stretch Cluster
The easy way
• Take 3 nearby data centers.
• Single digit ms latency is good
• Install at least one ZooKeeper node in each
• Install at least one Kafka broker in each
• Configure each DC as a “rack”
• Configure acks=all, min.insync.replicas=2
• Enjoy
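The reasoning behind the last two settings can be sketched in a few lines of Python. This is an illustrative model, not Kafka code: it assumes replication factor 3 with one replica per data center, and the names `dc1`..`dc3` and `can_accept_writes` are made up for the example. It checks whether a producer using acks=all can still write after some data centers are lost.

```python
# Toy model of one partition in a stretch cluster: one replica per data center ("rack").
# All names are hypothetical; this only illustrates the minimum in-sync replicas rule.

MIN_INSYNC_REPLICAS = 2  # the minimum in-sync replicas value from the list above


def can_accept_writes(replica_dcs, failed_dcs, min_isr=MIN_INSYNC_REPLICAS):
    """Return True if an acks=all produce request can still succeed.

    With acks=all, writes are rejected when fewer than the configured
    minimum number of replicas are in sync.
    """
    in_sync = [dc for dc in replica_dcs if dc not in failed_dcs]
    return len(in_sync) >= min_isr


replicas = ["dc1", "dc2", "dc3"]

# Losing any single data center leaves two in-sync replicas: writes continue.
print(can_accept_writes(replicas, failed_dcs={"dc2"}))         # True
# Losing two data centers leaves one replica: writes are rejected rather than lost.
print(can_accept_writes(replicas, failed_dcs={"dc1", "dc2"}))  # False
```

In this model the cluster keeps accepting writes through the loss of one data center, which is why failover is “business as usual”.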
12
Diagram!
13
Pros
• Easy to set up
• Failover is “business as usual”
• Sync replication – only method to guarantee
no loss of data.
Cons
• Need 3 data centers nearby
• Cluster failure is still a disaster
• Higher latency, lower throughput compared
to “normal” cluster
• Traffic between DCs can be a bottleneck
• Costly infrastructure
14
Want sync replication but only two
data centers?
15
Solutions I can’t recommend:
Solution | I hesitate because…
2 ZK nodes in each DC and an “observer” somewhere else | Did anyone do this before?
3 ZK nodes in each DC and manually reconfigure quorum for failover | You may lose ZK updates during failover; requires manual intervention
2 separate ZK clusters + replication |
16
Most companies don’t do stretch.
Because:
• Only 2 data centers
• Data centers are far
• One cluster isn’t safe enough
• Not into “high latency”
17
So you want to run
2 Kafka clusters
And replicate events
between them?
18
Basic async replication
19
Replication Lag
20
Demo #1
Monitoring Replication Lag
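One way to think about the number being monitored: per partition, lag is the gap between the newest offset in the source cluster and the last offset the replication process has copied. The sketch below is a minimal illustration with hard-coded offsets. In practice the two inputs would come from your monitoring tooling, and the threshold is an arbitrary example value.

```python
# Minimal sketch: per-partition replication lag from two offset snapshots.
# The offsets and the alert threshold are made-up example values.

def replication_lag(source_end_offsets, replicated_offsets):
    """Lag = source log-end offset minus last replicated offset, per partition."""
    return {
        partition: end - replicated_offsets.get(partition, 0)
        for partition, end in source_end_offsets.items()
    }


def partitions_over_threshold(lag, threshold):
    """Partitions whose lag exceeds the alerting threshold, sorted for stable output."""
    return sorted(p for p, value in lag.items() if value > threshold)


source_end = {("orders", 0): 1500, ("orders", 1): 980}
replicated = {("orders", 0): 1490, ("orders", 1): 200}

lag = replication_lag(source_end, replicated)
print(lag)                                  # {('orders', 0): 10, ('orders', 1): 780}
print(partitions_over_threshold(lag, 100))  # [('orders', 1)]
```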
21
Active-Active or
Active-Passive?
• Active-Active is efficient:
you use both DCs
• Active-Active is easier
because both clusters are
equivalent
• Active-Passive has lower
network traffic
• Active-Passive requires less
monitoring
22
Active-Active Setup
23
Disaster Strikes
24
Desired Post-Disaster State
25
Only one question left:
What does it consume next?
26
Kafka
consumers
normally use
offsets
27
In an ideal world…
28
Unfortunately, it is not that simple
1. There is no guarantee that offsets are identical in the two data centers.
   An event with offset 26 in NYC can be offset 6 or offset 30 in ATL.
2. Replication of each topic and partition is independent, so:
   a. Offset metadata may arrive ahead of the events themselves
   b. Offset metadata may arrive late
Nothing prevents you from replicating the offsets topic and using it. Just be realistic
about the guarantees.
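Point 1 is easy to reproduce on paper. The sketch below models a log as a list of events and shows two plausible ways the same event ends up at a different offset in the destination: replication that starts after older events have aged out of the source, and retries that write some events twice. Both scenarios and all numbers are illustrative assumptions chosen to match the 26 / 6 / 30 example.

```python
# Toy model: why the same event gets different offsets in two clusters.
# Events are identified by name; an offset is just a position in the log.

def replicate(source_events, first_retained_offset=0, duplicated=()):
    """Copy events to a new log and return {event: destination offset}.

    first_retained_offset: source events before this offset are already deleted.
    duplicated: events written twice (for example after a retried send).
    """
    destination = []
    for offset, event in enumerate(source_events):
        if offset < first_retained_offset:
            continue  # aged out of the source before replication started
        destination.append(event)
        if event in duplicated:
            destination.append(event)  # the retry lands as a second copy
    # Report the last position of each event in the destination log.
    return {event: offset for offset, event in enumerate(destination)}


nyc = [f"event-{i}" for i in range(40)]  # event-26 has offset 26 in NYC

# Scenario A: the first 20 events were gone when replication began.
atl_a = replicate(nyc, first_retained_offset=20)
print(atl_a["event-26"])  # 6

# Scenario B: four earlier events were each written twice.
atl_b = replicate(nyc, duplicated={"event-3", "event-9", "event-12", "event-20"})
print(atl_b["event-26"])  # 30
```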
29
If accuracy is no big-deal…
1. If duplicates are cool – start from the beginning.
Use Cases:
• Writing to a DB
• Anything idempotent
• Sending emails or alerts to people inside the company
2. If lost events are cool – jump to the latest event.
Use Cases:
• Clickstream analytics
• Log analytics
• “Big data” and analytics use-cases
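Why “writing to a DB” tolerates duplicates: if each event carries a stable key and the write is an upsert, replaying the topic from the beginning converges to the same state. A minimal sketch, using a dict as a stand-in for the database and a made-up event shape:

```python
# Stand-in for a database table keyed by order id; the event fields are hypothetical.

def apply_event(table, event):
    """Upsert: writing the same event twice leaves the table unchanged."""
    table[event["order_id"]] = event["status"]


events = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
    {"order_id": 1, "status": "shipped"},
]

once = {}
for e in events:
    apply_event(once, e)

# After failover the consumer restarts from the beginning and sees everything again.
replayed = dict(once)
for e in events:
    apply_event(replayed, e)

print(once == replayed)  # True
print(replayed)          # {1: 'shipped', 2: 'placed'}
```

Sending emails is different: a replay sends every message again, which is why the slide limits that case to recipients inside the company.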
30
Personal Favorite – Time-based Failover
• Offsets are not identical, but…
3pm is 3pm (within clock drift)
• Relies on new features:
• Timestamps in events! 0.10.0.0
• Time-based indexes! 0.10.1.0
• Tool to reset consumer offsets to a timestamp! 0.11.0.0
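The idea behind a time-based lookup can be sketched with a binary search: given the timestamps of the events in a partition, find the first offset whose timestamp is at or after the target time. This is only a model of the lookup, assuming timestamps are non-decreasing within the partition (which is not guaranteed when producers set their own timestamps). The function name is hypothetical.

```python
import bisect


def offset_for_time(timestamps_ms, target_ms):
    """Return the first offset whose timestamp >= target_ms, or None if there is none."""
    offset = bisect.bisect_left(timestamps_ms, target_ms)
    return offset if offset < len(timestamps_ms) else None


# Offsets differ between clusters, but the same wall-clock target works in both.
nyc_timestamps = [1000, 1010, 1020, 1030, 1040, 1050]
atl_timestamps = [1020, 1030, 1040, 1050]  # ATL only holds the later events

print(offset_for_time(nyc_timestamps, 1030))  # 3
print(offset_for_time(atl_timestamps, 1030))  # 1
print(offset_for_time(atl_timestamps, 2000))  # None
```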
31
How do we do it?
1. Detect Kafka in NYC is down. Check the time of the incident.
• Even better:
Use an interceptor to track timestamps of events as they are consumed.
Now you know “last consumed time-stamp”
2. Run Consumer Groups tool in ATL and set the offsets for “following-orders”
consumer to time of incident (or “last consumed time”)
3. Start the “following-orders” consumer in ATL
4. Have a beer. You just aced your annual failover drill.
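The “even better” option in step 1 amounts to remembering the largest event timestamp each consumer group has processed and turning it into a value the offset-reset tool accepts. The sketch below shows that bookkeeping in plain Python. It is not a Kafka interceptor (those are Java classes), the class and method names are invented, and it assumes the tool reads a zone-less datetime as UTC, which is worth checking for your Kafka version.

```python
from datetime import datetime, timezone


class LastConsumedTracker:
    """Remembers the highest event timestamp (ms since epoch) seen per group."""

    def __init__(self):
        self._last_ms = {}

    def on_consume(self, group, event_timestamp_ms):
        # Keep the maximum so out-of-order events never move the marker backwards.
        current = self._last_ms.get(group)
        if current is None or event_timestamp_ms > current:
            self._last_ms[group] = event_timestamp_ms

    def reset_datetime(self, group):
        """Format the marker as YYYY-MM-DDTHH:mm:SS.sss in UTC."""
        ms = self._last_ms[group]
        moment = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
        return moment.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms % 1000:03d}"


tracker = LastConsumedTracker()
tracker.on_consume("following-orders", 1503381633236)
tracker.on_consume("following-orders", 1503381600000)  # older event, ignored

print(tracker.reset_datetime("following-orders"))  # 2017-08-22T06:00:33.236
```

For the marker to survive the outage, it would have to be stored somewhere outside the failed data center.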
32
bin/kafka-consumer-groups \
  --bootstrap-server localhost:29092 \
  --reset-offsets \
  --topic NYC.orders \
  --group following-orders \
  --execute \
  --to-datetime 2017-08-22T06:00:33.236
33
A few practicalities
• Above all – practice
• Constantly monitor replication lag. High enough lag and everything is useless.
• Also monitor replicator for liveness, errors, etc.
• Chances are the line to the remote DC is both high latency and low throughput.
Prepare to do some work to tune the producers/consumers of the replicator.
• RTFM: http://docs.confluent.io/3.3.0/multi-dc/replicator-tuning.html
• Replicator plays nice with containers and auto-scale. Give it a try.
• Call your legal dept. You may be required to encrypt everything you replicate.
• Watch different versions of this talk. We discuss more architectures and more ops concerns.
34
Thank You!
