Using Chaos Engineering to Level up Apache Kafka Skills
ZipRecruiter, Inc. Proprietary and Confidential.
Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved.
Kafka Chaos Engineering
Shlomi Hassan & Yaniv Ranen
It’s a lovely day, let’s upgrade the production cluster
We didn’t even get a chance to say goodbye
Where did we go wrong?
Chaos engineering
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Hello!
Yaniv Ranen
Shlomi Hassan
ZipRecruiter’s mission
We actively connect people to their next great opportunity.
Logging infrastructure scheme
Logging Kafka cluster
Cluster spec
1) Kafka
a) 8 EC2 instances - m4.4xlarge
b) EBS - io1
2) ZooKeeper
a) 5 EC2 instances - m4.large
b) EBS - io1
3) Cluster volume - 3 TB/day
4) Broker data spread across multiple AZs
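As a rough sanity check on the spec, the daily volume can be turned into an average ingest rate. This is a back-of-the-envelope sketch under assumptions the slide does not state: decimal units, load spread evenly over 24 hours and over all 8 brokers, and replication traffic ignored.

```python
# Rough average ingest rate for the logging cluster above.
# Assumptions (not stated on the slide): decimal units (1 TB = 10**12 bytes),
# load spread evenly over 24 hours and over all brokers, replication traffic
# and peak hours ignored.

SECONDS_PER_DAY = 24 * 60 * 60


def average_throughput_mb_s(tb_per_day: float) -> float:
    """Average cluster-wide ingest rate in MB/s for a daily volume in TB."""
    return tb_per_day * 10**12 / SECONDS_PER_DAY / 10**6


def per_broker_mb_s(tb_per_day: float, brokers: int) -> float:
    """Average ingest rate per broker, assuming an even spread of leaders."""
    return average_throughput_mb_s(tb_per_day) / brokers


if __name__ == "__main__":
    print(f"cluster: {average_throughput_mb_s(3):.1f} MB/s")
    print(f"per broker: {per_broker_mb_s(3, 8):.1f} MB/s")
```

With 3 TB/day and 8 brokers this prints about 34.7 MB/s for the cluster and 4.3 MB/s per broker; peak hours and replica traffic push the real per-broker load higher.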
Basic Kafka terminology
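[Diagram: a Kafka cluster of three brokers (broker1, broker2, broker3) coordinated by ZooKeeper, with a producer writing to it and a consumer reading from it. Replicas R1 and R2 of partitions P0 to P3 are spread across the brokers. Legend: "Px Ry" is replica y of partition x; the leader replica and the active controller are shown in bold.]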
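The partition/replica/leader relationship can be captured in a few lines. The sketch below is a simplified toy model, not Kafka code: it assumes every replica on a running broker is in sync, and the example layout is hypothetical. It reflects the rule that the first replica in a partition's list is the preferred leader, and that on failure the first remaining live (in-sync) replica in that list takes over.

```python
# Toy model of the terminology above. Each partition has an ordered list of
# replicas (broker IDs). The first entry is the "preferred leader"; when the
# current leader dies, the controller picks the first replica in that list that
# is still alive and in sync. Simplification: every live replica is treated as
# in sync.

from typing import Dict, List, Optional, Set

Assignment = Dict[int, List[int]]  # partition id -> replica broker ids


def elect_leader(replicas: List[int], live_brokers: Set[int]) -> Optional[int]:
    """Return the first live replica, or None if the partition is offline."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


def leaders(assignment: Assignment, live_brokers: Set[int]) -> Dict[int, Optional[int]]:
    """Leader per partition for a given set of running brokers."""
    return {p: elect_leader(r, live_brokers) for p, r in assignment.items()}


if __name__ == "__main__":
    # Hypothetical layout: P0..P3 with two replicas each on brokers 1 and 2,
    # alternating the preferred leader so the load is balanced.
    assignment = {0: [1, 2], 1: [2, 1], 2: [1, 2], 3: [2, 1]}
    print(leaders(assignment, {1, 2, 3}))  # {0: 1, 1: 2, 2: 1, 3: 2}
    print(leaders(assignment, {2, 3}))     # {0: 2, 1: 2, 2: 2, 3: 2}
```

Stopping broker 1 moves every leadership to broker 2, but nothing re-creates the missing copies, which is what Scenario #1 explores.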
Chaos engineering method
+ Ask a question
+ Write it down as a scenario in a collaborative document
+ Check the behaviour in a controlled environment
+ Document the exact commands and output
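The last step is the easiest to skip under time pressure. A small wrapper like the following runs each command of a scenario and appends the exact command line, exit code and output to a file that can be pasted into the shared document. This is a sketch: the function name, log layout and file name are our own assumptions, not part of the authors' process.

```python
# Run each command of a chaos scenario and append the exact command, its exit
# code and its output to a log file for the shared scenario document.
# The log format below is an illustrative assumption.

import datetime
import shlex
import subprocess
from typing import List


def run_and_record(cmd: List[str], log_path: str) -> subprocess.CompletedProcess:
    """Run cmd, capture stdout/stderr as text, and append a record to log_path."""
    started = datetime.datetime.now(datetime.timezone.utc).isoformat()
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"## {started}\n")
        log.write(f"$ {shlex.join(cmd)}\n")
        log.write(f"exit code: {result.returncode}\n")
        log.write(result.stdout)
        if result.stderr:
            log.write("--- stderr ---\n")
            log.write(result.stderr)
        log.write("\n")
    return result
```

For example, on the ZooKeeper-based Kafka versions of that era, run_and_record(["kafka-topics.sh", "--zookeeper", "zk1:2181", "--describe", "--under-replicated-partitions"], "scenario-1.log") would record which partitions are under-replicated after a broker is stopped. The host name is hypothetical, and newer Kafka releases take --bootstrap-server instead of --zookeeper.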
Scenario #1
+ What happens if we kill one broker?
+ Will Kafka self-heal?
Scenario #1: Stopping a broker
[Diagram sequence: a three-broker cluster (Broker 1 to Broker 3, each with EBS volumes) coordinated by ZooKeeper, with a producer and a consumer; partitions P0 to P3 each have two replicas (R1 and R2). The cluster health panel starts at 0 under-replicated partitions, shows 54 under-replicated partitions after a broker is stopped, and returns to 0 only after the partitions are reassigned.]
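The two counters on the cluster health panel can be derived from the replica assignment and the set of running brokers. In a real cluster they come from broker metrics (UnderReplicatedPartitions on each broker, OfflinePartitionsCount on the active controller) or from kafka-topics.sh --describe with --under-replicated-partitions or --unavailable-partitions. The sketch below is a simplified model that assumes any replica on a running broker is in sync; the topic name is hypothetical.

```python
# Sketch of the two "Cluster health" counters on the slides, computed from a
# (topic, partition) -> replicas assignment and the set of running brokers.
# Simplification: a replica on a running broker is assumed to be in sync.

from typing import Dict, List, Set, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def cluster_health(assignment: Dict[Partition, List[int]], live: Set[int]) -> Dict[str, int]:
    """Count under-replicated and offline partitions."""
    under_replicated = 0
    offline = 0
    for replicas in assignment.values():
        alive = [b for b in replicas if b in live]
        if not alive:
            offline += 1           # no replica left to act as leader
        elif len(alive) < len(replicas):
            under_replicated += 1  # still served, but missing a copy
    return {"under_replicated": under_replicated, "offline": offline}


if __name__ == "__main__":
    # Hypothetical topic "logs" with replication factor 2 over three brokers.
    assignment = {("logs", 0): [1, 2], ("logs", 1): [2, 3], ("logs", 2): [3, 1]}
    print(cluster_health(assignment, {1, 2, 3}))  # {'under_replicated': 0, 'offline': 0}
    print(cluster_health(assignment, {1, 2}))     # {'under_replicated': 2, 'offline': 0}
```

Stopping a broker moves its partitions into the under-replicated count as long as another replica survives; the count only falls back to 0 once the partitions are reassigned or the broker returns.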
Demonstration
Scenario #1: Stopping a broker
+ Kafka is not self-healing: partitions stay under-replicated after a broker is lost
+ Partitions must be reassigned manually
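In the ZooKeeper-based Kafka of that era, a manual reassignment meant writing a JSON plan and passing it to kafka-reassign-partitions.sh with --reassignment-json-file and --execute, then checking progress with --verify. The sketch below generates such a plan for a single dead broker. The rule for picking a replacement broker and the example topic are assumptions made for illustration, not the authors' procedure. It also reports partitions that have no live replica at all, because, as Scenario #2 found, reassignment cannot rescue those.

```python
# Sketch: replace a dead broker in every replica list and emit the JSON that
# kafka-reassign-partitions.sh accepts via --reassignment-json-file.
# Choosing the replacement by rotating through the live brokers is an
# illustrative assumption, not the authors' procedure.

import json
from typing import Dict, List, Optional, Set, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def pick_replacement(replicas: List[int], live: List[int], offset: int) -> Optional[int]:
    """First live broker (starting at offset) that does not already hold a replica."""
    for i in range(len(live)):
        candidate = live[(offset + i) % len(live)]
        if candidate not in replicas:
            return candidate
    return None


def reassignment_plan(assignment: Dict[Partition, List[int]], dead: int,
                      live: Set[int]) -> Tuple[dict, List[Partition]]:
    """Return (plan, skipped). A partition with no live replica is skipped:
    there is no copy of its data left to move."""
    live_sorted = sorted(live - {dead})
    moves: List[dict] = []
    skipped: List[Partition] = []
    for index, ((topic, partition), replicas) in enumerate(sorted(assignment.items())):
        if dead not in replicas:
            continue
        if not any(b in live_sorted for b in replicas):
            skipped.append((topic, partition))
            continue
        target = pick_replacement(replicas, live_sorted, index)
        if target is None:  # not enough brokers to keep the replication factor
            skipped.append((topic, partition))
            continue
        new_replicas = [target if b == dead else b for b in replicas]
        moves.append({"topic": topic, "partition": partition, "replicas": new_replicas})
    return {"version": 1, "partitions": moves}, skipped


if __name__ == "__main__":
    assignment = {("logs", 0): [1, 2], ("logs", 1): [2, 3], ("logs", 2): [3]}
    plan, skipped = reassignment_plan(assignment, dead=3, live={1, 2})
    print(json.dumps(plan, indent=2))
    print("cannot move (offline):", skipped)
```

Keeping the replacement in the dead broker's position preserves the order of the replica list, so the preferred leader only changes for partitions the dead broker used to lead.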
Scenario #2
+ What can we do to revive an offline partition?
Scenario #2: Reviving an offline partition
[Diagram sequence: partitions P0 to P3 each have a single replica (R1), all on Broker 1, which is backed by EBS volumes; Broker 2 and Broker 3 hold none of them. When Broker 1 goes down, the cluster health panel shows 4 offline partitions, and reassigning the partitions leaves all 4 offline. Broker 1 is then removed and recreated empty: the partitions come back online (0 offline partitions), but as a new, empty topic, which means data loss.]
Scenario #2: Reviving an offline partition
+ Restarting the original broker brings the offline partitions back online
+ Partition reassignment doesn’t work while a partition is offline
+ Recreating the broker from scratch causes:
○ Data loss
○ Inconsistent state
Scenario #3
+ How can we recover data from an offline partition?
Scenario #3: Recovering offline partitions without data loss
[Diagram sequence: Broker 1 holds replicas of partitions P0 to P3 on its EBS volumes, and the cluster health panel shows 0 offline partitions. When Broker 1 is lost, it shows 4 offline partitions. A replacement node, labelled "Broker 1", is started, and once the partitions reappear on it the panel shows 4 under-replicated partitions instead of 4 offline ones.]
Scenario #3: How can we recover from a lost AZ?
+ Spawn a replacement node using the same EBS volumes
+ Keep the same broker ID
+ Consumer group offsets are kept
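Keeping the broker ID matters because Kafka records it on disk: each log directory contains a meta.properties file with a broker.id entry, and a broker whose configured broker.id differs from the stored one refuses to start. A pre-flight check along these lines can confirm that the re-attached EBS volume belongs to the broker ID you are about to reuse. It is a sketch: the function names are hypothetical, and the .properties parsing is simplified to key=value lines.

```python
# Pre-flight check before starting a replacement broker on re-attached EBS
# volumes: the broker.id stored in <log dir>/meta.properties must match the
# broker.id we are about to configure. Simplified .properties parsing
# (key=value lines only).

from pathlib import Path
from typing import Dict


def read_properties(path: Path) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    props: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def stored_broker_id(log_dir: str) -> int:
    """broker.id recorded by Kafka in the log directory."""
    return int(read_properties(Path(log_dir) / "meta.properties")["broker.id"])


def check_volume(log_dir: str, expected_broker_id: int) -> None:
    """Raise if the volume was written by a different broker."""
    found = stored_broker_id(log_dir)
    if found != expected_broker_id:
        raise ValueError(
            f"{log_dir} belongs to broker {found}, not {expected_broker_id}")
```

On the replacement node you would call something like check_volume("/mnt/kafka-data", 1) before starting Kafka, where the mount point is a hypothetical example that should match log.dirs in server.properties.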
Our Conclusions
+ Chaos engineering helps us gain knowledge of the system
+ Kafka is not self-healing
+ Offline partitions can be brought back using the original EBS volumes
+ We found problems with the health check
+ Running different Kafka versions in the same cluster might introduce lag
+ Consider freezing the old protocol version during upgrades (inter.broker.protocol.version, log.message.format.version)
+ Upgrade the active controller last
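The last two conclusions can be turned into a small checklist. Kafka's own upgrade notes describe pinning inter.broker.protocol.version and log.message.format.version to the old version while brokers are rolled, then bumping the protocol version once every broker runs the new release, and bumping the message format after clients are upgraded. The sketch below covers the pinning step and a restart order with the active controller last. Version strings and broker IDs are hypothetical; in ZooKeeper-based clusters the active controller's ID can be read from the /controller znode (for example with zookeeper-shell.sh).

```python
# Sketch of the upgrade advice above. While brokers run mixed versions, keep the
# inter-broker protocol and message format pinned to the old version; bump them
# only after every broker runs the new binaries. Restart the active controller
# last. Version strings and broker IDs are hypothetical examples.

from typing import Dict, List


def pinned_overrides(old_version: str) -> Dict[str, str]:
    """server.properties entries to keep on every broker during the rolling upgrade."""
    return {
        "inter.broker.protocol.version": old_version,
        "log.message.format.version": old_version,
    }


def rolling_order(broker_ids: List[int], controller_id: int) -> List[int]:
    """Restart order for a rolling upgrade, with the active controller last."""
    order = sorted(b for b in broker_ids if b != controller_id)
    if controller_id in broker_ids:
        order.append(controller_id)
    return order


if __name__ == "__main__":
    for key, value in pinned_overrides("1.0").items():
        print(f"{key}={value}")
    print(rolling_order([1, 2, 3, 4, 5, 6, 7, 8], controller_id=3))
```

For the 8-broker logging cluster with broker 3 as controller, this gives the order [1, 2, 4, 5, 6, 7, 8, 3].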
Deployment strategies
+ In-place deployment might prove risky
Blue-green deployment
[Diagram: producers and consumers connected to the old cluster (three brokers and ZooKeeper; partitions P0 to P3 with replicas R1 and R2).]

Disconnecting producers and consumers
[Diagram: the old cluster with its producers and consumers, next to a new cluster with the same layout.]

Draining the old cluster into the new one
[Diagram: a replication tool copies data from the old cluster to the new cluster.]

Reconnecting the producers and consumers
[Diagram: producers and consumers connected to the new cluster, with the replication tool still linking the two clusters.]

The new cluster becomes the production one
[Diagram: only the new cluster remains, serving producers and consumers.]
Replication = data + offsets metadata
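Copying the data alone is not enough for a blue-green switch. The same record generally sits at a different offset in the target cluster (the copy may start later, or after retention has deleted older segments), so a consumer group's committed offsets cannot be copied verbatim. The sketch below shows one way to translate them. It assumes a hypothetical list of (source offset, target offset) checkpoints per partition recorded by the replication tool; it illustrates the idea and is not how any specific tool in the table works.

```python
# Translate a consumer group's committed offset from the source cluster to the
# target cluster. Hypothetical input: (source_offset, target_offset) pairs for
# one partition, sorted by source offset, each pair pointing at the same record.

import bisect
from typing import List, Tuple


def translate_offset(checkpoints: List[Tuple[int, int]], committed: int) -> int:
    """Return a target-cluster offset that never skips unread records.

    Uses the last checkpoint at or before the committed offset, so a few
    records may be delivered twice (at-least-once), but none are lost.
    """
    source_offsets = [source for source, _ in checkpoints]
    i = bisect.bisect_right(source_offsets, committed) - 1
    if i < 0:
        return 0  # no usable checkpoint: fall back to the beginning (assumes target offsets start at 0)
    return checkpoints[i][1]


if __name__ == "__main__":
    checkpoints = [(0, 0), (100, 40), (200, 140)]
    print(translate_offset(checkpoints, 150))  # 40
    print(translate_offset(checkpoints, 200))  # 140
```

Rounding down to the previous checkpoint trades a few duplicate deliveries for the guarantee that no record is skipped during the switch.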
Replication tools
+ MirrorMaker
Pros: open source; relatively easy to use
Cons: not a real mirror (offsets are not preserved)
+ uReplicator
Pros: creates a real mirror; scalable; open source
Cons: maintained mainly (or only) by Uber
+ Confluent Replicator (Kafka Connect)
Pros: supports smart replication; creates a real mirror; based on the Kafka Connect ecosystem
Cons: paid solution
Blue-green as a failover solution
[Diagram sequence: producers and consumers use the active (old) cluster while a replication tool keeps a passive cluster with the same layout in sync, so that producers and consumers can be switched to the passive cluster.]
Kafka Pro Tips
+ Be part of the community
○ Join the Confluent Slack team
○ Follow or suggest new KIPs (Kafka Improvement Proposals)
○ Contribute fixes to Kafka and its ecosystem
+ Use smart metrics (like a health check) for better visibility
+ Try chaos engineering at home using our GitHub repo
+ Don’t stay behind
○ Use up-to-date Kafka consumers and producers
○ Update your cluster regularly
+ Follow the Confluent white papers
+ Kafka health check repo
Q & A
Thank you
https://www.linkedin.com/in/yaniv-ranen-284b003/
https://www.linkedin.com/in/shlomihassan/

More Related Content

PPTX
SIP: Call Id, Cseq, Via-branch, From & To-tag role play
PPTX
SQL for pattern matching (Oracle 12c)
PDF
Choosing Mikrotik Platform x86 vs chr
PDF
Building the Ultimate Device Matrix
PDF
Routing Information Protocol (RIP)
PPTX
Enable DPDK and SR-IOV for containerized virtual network functions with zun
PDF
BPF: Tracing and more
PDF
VXLAN and FRRouting
SIP: Call Id, Cseq, Via-branch, From & To-tag role play
SQL for pattern matching (Oracle 12c)
Choosing Mikrotik Platform x86 vs chr
Building the Ultimate Device Matrix
Routing Information Protocol (RIP)
Enable DPDK and SR-IOV for containerized virtual network functions with zun
BPF: Tracing and more
VXLAN and FRRouting

What's hot (20)

PDF
LSFMM 2019 BPF Observability
PDF
MUM Melbourne : Build Enterprise Wireless with CAPsMAN
PDF
From Mediasoup WebRTC to Livekit Self-Hosted .pdf
PDF
Kamailio, FreeSWITCH, and You
PPTX
NGINX High-performance Caching
DOCX
Packet Tracer: Nat protocol
PPTX
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
PDF
A whirlwind tour of the LLVM optimizer
PDF
OAuth and STUN, TURN in WebRTC context RFC7635
PDF
Cisco Live! :: Introduction to IOS XR for Enterprises and Service Providers
DOCX
Packet Tracer: Load Balancing with GLBP and FHRP
PDF
Introduction to gRPC: A general RPC framework that puts mobile and HTTP/2 fir...
PPTX
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
PDF
Routing fundamentals with mikrotik
PDF
BGP Techniques for Network Operators
PDF
BGP on mikrotik
PDF
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
PDF
Linux BPF Superpowers
PPTX
The TCP/IP Stack in the Linux Kernel
PDF
How to Handle Asynchronous Behaviors Using SVA
LSFMM 2019 BPF Observability
MUM Melbourne : Build Enterprise Wireless with CAPsMAN
From Mediasoup WebRTC to Livekit Self-Hosted .pdf
Kamailio, FreeSWITCH, and You
NGINX High-performance Caching
Packet Tracer: Nat protocol
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
A whirlwind tour of the LLVM optimizer
OAuth and STUN, TURN in WebRTC context RFC7635
Cisco Live! :: Introduction to IOS XR for Enterprises and Service Providers
Packet Tracer: Load Balancing with GLBP and FHRP
Introduction to gRPC: A general RPC framework that puts mobile and HTTP/2 fir...
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...
Routing fundamentals with mikrotik
BGP Techniques for Network Operators
BGP on mikrotik
Unrevealed Story Behind Viettel Network Cloud Hotpot | Đặng Văn Đại, Hà Mạnh ...
Linux BPF Superpowers
The TCP/IP Stack in the Linux Kernel
How to Handle Asynchronous Behaviors Using SVA
Ad

Similar to Using Chaos Engineering to Level up Apache Kafka Skills (20)

PDF
" Breaking Extreme Networks WingOS: How to own millions of devices running on...
PDF
Breaking Extreme Networks WingOS: How to own millions of devices running on A...
PDF
Integrating Vert.x
PDF
[CB19] MalConfScan with Cuckoo: Automatic Malware Configuration Extraction Sy...
PPTX
How to Get Going with Kubernetes
PDF
Big Data LDN 2018: PROGRESS FOR BIG DATA IN KUBERNETES
PPTX
Dataviz in a collaborative mixed reality space
PDF
HITCON FreeTalk 20240726 - Dark side of the Force - 探索暗網威脅【 議題三:藍隊的暗網事件應變守則】
PDF
Trouble with memory
PDF
Webinar: Why evasive zero day attacks are killing traditional sandboxing
PDF
C-SEC|2016 Session 2 The Security Game : You Failed at the Beginning By Incog...
PDF
Rooted2020 the day i_ruled_the_world_deceiving_software_developers_through_op...
PDF
The day I ruled the world (RootedCON 2020)
PDF
Visão Geral de Inteligência Artificial
PPTX
Conférence - Arbor Edge Defense, Première et dernière ligne de défense intell...
PDF
Black Clouds and Silver Linings in Node.js Security - Liran Tal Snyk OWASP Gl...
PPTX
Advanced Threat Hunting - Botconf 2017
PPTX
Advanced Threat Hunting - BotConf 2017
PPTX
My Baby Done Bad Crypto - My Sweet Baby Done Me Wrong
PDF
Apache Pulsar at Yahoo! Japan
" Breaking Extreme Networks WingOS: How to own millions of devices running on...
Breaking Extreme Networks WingOS: How to own millions of devices running on A...
Integrating Vert.x
[CB19] MalConfScan with Cuckoo: Automatic Malware Configuration Extraction Sy...
How to Get Going with Kubernetes
Big Data LDN 2018: PROGRESS FOR BIG DATA IN KUBERNETES
Dataviz in a collaborative mixed reality space
HITCON FreeTalk 20240726 - Dark side of the Force - 探索暗網威脅【 議題三:藍隊的暗網事件應變守則】
Trouble with memory
Webinar: Why evasive zero day attacks are killing traditional sandboxing
C-SEC|2016 Session 2 The Security Game : You Failed at the Beginning By Incog...
Rooted2020 the day i_ruled_the_world_deceiving_software_developers_through_op...
The day I ruled the world (RootedCON 2020)
Visão Geral de Inteligência Artificial
Conférence - Arbor Edge Defense, Première et dernière ligne de défense intell...
Black Clouds and Silver Linings in Node.js Security - Liran Tal Snyk OWASP Gl...
Advanced Threat Hunting - Botconf 2017
Advanced Threat Hunting - BotConf 2017
My Baby Done Bad Crypto - My Sweet Baby Done Me Wrong
Apache Pulsar at Yahoo! Japan
Ad

More from confluent (20)

PDF
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
PPTX
Webinar Think Right - Shift Left - 19-03-2025.pptx
PDF
Migration, backup and restore made easy using Kannika
PDF
Five Things You Need to Know About Data Streaming in 2025
PDF
Data in Motion Tour Seoul 2024 - Keynote
PDF
Data in Motion Tour Seoul 2024 - Roadmap Demo
PDF
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
PDF
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
PDF
Data in Motion Tour 2024 Riyadh, Saudi Arabia
PDF
Build a Real-Time Decision Support Application for Financial Market Traders w...
PDF
Strumenti e Strategie di Stream Governance con Confluent Platform
PDF
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
PDF
Building Real-Time Gen AI Applications with SingleStore and Confluent
PDF
Unlocking value with event-driven architecture by Confluent
PDF
Il Data Streaming per un’AI real-time di nuova generazione
PDF
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
PDF
Break data silos with real-time connectivity using Confluent Cloud Connectors
PDF
Building API data products on top of your real-time data infrastructure
PDF
Speed Wins: From Kafka to APIs in Minutes
PDF
Evolving Data Governance for the Real-time Streaming and AI Era
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
Webinar Think Right - Shift Left - 19-03-2025.pptx
Migration, backup and restore made easy using Kannika
Five Things You Need to Know About Data Streaming in 2025
Data in Motion Tour Seoul 2024 - Keynote
Data in Motion Tour Seoul 2024 - Roadmap Demo
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
Data in Motion Tour 2024 Riyadh, Saudi Arabia
Build a Real-Time Decision Support Application for Financial Market Traders w...
Strumenti e Strategie di Stream Governance con Confluent Platform
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
Building Real-Time Gen AI Applications with SingleStore and Confluent
Unlocking value with event-driven architecture by Confluent
Il Data Streaming per un’AI real-time di nuova generazione
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
Break data silos with real-time connectivity using Confluent Cloud Connectors
Building API data products on top of your real-time data infrastructure
Speed Wins: From Kafka to APIs in Minutes
Evolving Data Governance for the Real-time Streaming and AI Era

Recently uploaded (20)

DOCX
The AUB Centre for AI in Media Proposal.docx
PPTX
A Presentation on Artificial Intelligence
PDF
Empathic Computing: Creating Shared Understanding
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Encapsulation theory and applications.pdf
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Approach and Philosophy of On baking technology
PDF
Modernizing your data center with Dell and AMD
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
MYSQL Presentation for SQL database connectivity
PPTX
Understanding_Digital_Forensics_Presentation.pptx
The AUB Centre for AI in Media Proposal.docx
A Presentation on Artificial Intelligence
Empathic Computing: Creating Shared Understanding
“AI and Expert System Decision Support & Business Intelligence Systems”
Encapsulation_ Review paper, used for researhc scholars
Encapsulation theory and applications.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Diabetes mellitus diagnosis method based random forest with bat algorithm
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
Unlocking AI with Model Context Protocol (MCP)
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Spectral efficient network and resource selection model in 5G networks
Approach and Philosophy of On baking technology
Modernizing your data center with Dell and AMD
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Review of recent advances in non-invasive hemoglobin estimation
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
MYSQL Presentation for SQL database connectivity
Understanding_Digital_Forensics_Presentation.pptx

Using Chaos Engineering to Level up Apache Kafka Skills

  • 1. 1 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Kafka Chaos Engineering Shlomi Hassan & Yaniv Ranen
  • 2. 2 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. It’s a lovely day, let’s upgrade the production cluster
  • 3. 3 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. 3 We didn’t even get a chance to say goodbye
  • 4. 4 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Where did we go wrong?
  • 5. 5 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering
  • 6. 6 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Hello! Yaniv Ranen Shlomi Hassan
  • 7. 7 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. 77 ZipRecruiter’s mission We actively connect people to their next great opportunity.
  • 8. 8 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Logging infrastructure scheme
  • 9. 9 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Logging Kafka cluster Cluster spec 1) Kafka a) 8 EC2 - m4.4xlarge b) EBS - Io1 2) Zookeeper a) 5 EC2 - m4.large b) EBS - Io1 3) Cluster volume - 3TB/day 4) Broker data spread on Multi AZ KafkaZookeeper
  • 10. 10 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Basic Kafka terminology
  • 11. 11 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Basic Kafka terminology P0 R1 P1 R1 broker1 P2 R1 P3 R1 P0 R2 P1 R2 broker2 P2 R2 P3 R2 broker3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x Leader Replica is in bold Active controller is in bold
  • 12. 12 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved.
  • 13. 13 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Chaos engineering method + Ask a question + Write it down as a scenario in a collaborative document + Check the behaviour in a controlled environment + Document the exact commands and output
  • 14. 14 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. 1414 Scenario #1 + What happens if we kill one broker? + Will Kafka self-heal?
  • 15. 15 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #1: Stopping a broker P0 R1 P1 R1 Broker 1 P2 R1 P3 R1 P0 R2 P1 R2 Broker 2 P2 R2 P3 R2 Broker 3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x EBS volumes Cluster health 0Under replicated
  • 16. 16 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #1: Stopping a broker P0 R1 P1 R1 Broker 1 P2 R1 P3 R1 P0 R2 P1 R2 Broker 2 P2 R2 P3 R2 Broker 3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x EBS volumes Cluster health 54 Under replicated
  • 17. 17 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #1: Stopping a broker P0 R1 P1 R1 Broker 1 P2 R1 P3 R1 P0 R2 P1 R2 Broker 2 P2 R2 P3 R2 Broker 3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x EBS volumes
  • 18. 18 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #1: Stopping a broker P0 R1 P1 R1 Broker 1 P2 R1 P3 R1 P0 R2 P1 R2 Broker 2 P2 R2 P3 R2 Broker 3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x EBS volumes
  • 19. 19 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #1: Stopping a broker P0 R1 P1 R1 Broker 1 P2 R1 P3 R1 P0 R2 P1 R2 Broker 2 P2 R2 P3 R2 Broker 3 ZooKeeper Kafka cluster Producer Consumer Px Ry Replica y of partition x EBS volumes Cluster health 0Under replicatedReassign partitions
  • 20. 20 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. 2020 Demonstration
  • 21. 21 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. 2121 Scenario #1: Stopping a broker + Kafka is not self-healing + Manual reassign partitions is needed
  • 22. 22 ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. ZipRecruiter, Inc. Proprietary and Confidential. Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved. Scenario #2 + What can we do to revive an offline partition?
  • 23. Scenario #2: Reviving an offline partition [Diagram: three brokers and ZooKeeper, with a producer and a consumer. Only replica 1 (R1) of partitions P0-P3 is shown, on Broker 1; Brokers 2 and 3 hold no replicas. Data is stored on EBS volumes. Cluster health: 0 offline partitions.]
  • 24. Scenario #2: Reviving an offline partition [Diagram: as on slide 23. Cluster health: 4 offline partitions.]
  • 25. Scenario #2: Reviving an offline partition [Diagram: as on slide 23, annotated "Reassign partitions". Cluster health: 4 offline partitions.]
  • 26. Scenario #2: Reviving an offline partition [Diagram: same as slide 25.]
  • 27. Scenario #2: Reviving an offline partition [Diagram: as on slide 23. Cluster health: 4 offline partitions.]
  • 28. Scenario #2: Reviving an offline partition [Diagram: Broker 1 and its replicas are gone; only Brokers 2 and 3 remain. Cluster health: 4 offline partitions.]
  • 29. Scenario #2: Reviving an offline partition [Diagram: a new Broker 1 appears, holding no replicas. Cluster health: 4 offline partitions.]
  • 30. Scenario #2: Reviving an offline partition [Diagram: Broker 1 again holds R1 of P0-P3, annotated "New topic" and "Data loss". Cluster health: 0 offline partitions.]
  • 31. Scenario #2: Reviving an offline partition + Restarting the broker revives the offline partition + Partition reassignment doesn't work while the partition is offline + Recreating a broker leads to: ○ Data loss ○ Inconsistent state
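The health counters on these slides separate two states: a partition is offline when none of its in-sync replicas sits on a live broker, so no leader can be elected, and under-replicated when it has a leader but fewer live in-sync replicas than assigned replicas. Reassignment cannot rescue an offline partition because new replicas copy their data from the leader. A minimal sketch of that classification over a hypothetical in-memory model, assuming unclean leader election is disabled; `PartitionState`, `classify` and the sample data are illustrative, and a real check would read cluster metadata or the brokers' `OfflinePartitionsCount` and `UnderReplicatedPartitions` metrics instead:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class PartitionState:
    topic: str
    partition: int
    replicas: List[int]  # broker IDs assigned to hold a copy
    isr: List[int]       # broker IDs last known to be in sync


def classify(partitions: Iterable[PartitionState],
             live_brokers: Set[int]) -> Dict[str, int]:
    """Count offline and under-replicated partitions."""
    counts = {"offline": 0, "under_replicated": 0}
    for p in partitions:
        live_isr = [b for b in p.isr if b in live_brokers]
        if not live_isr:
            # No live in-sync replica, so no leader can be elected.
            counts["offline"] += 1
        elif len(live_isr) < len(p.replicas):
            # A leader exists, but some assigned replicas are missing or lagging.
            counts["under_replicated"] += 1
    return counts


if __name__ == "__main__":
    # Loosely like Scenario #2: one replica per partition on broker 1, broker 1 down.
    rf1 = [PartitionState("logs", p, [1], [1]) for p in range(4)]
    print(classify(rf1, live_brokers={2, 3}))
    # Loosely like Scenario #1: replicas on brokers 1 and 2, broker 2 down.
    rf2 = [PartitionState("logs", p, [1, 2], [1, 2]) for p in range(4)]
    print(classify(rf2, live_brokers={1, 3}))
```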
  • 32. Scenario #3 + How can we recover data from an offline partition?
  • 33. Scenario #3: Recovering offline partitions without data loss [Diagram: three brokers and ZooKeeper, with a producer and a consumer. Broker 1 holds R1 of P0-P3; R2 of P0-P3 is also shown. Data is stored on EBS volumes. Cluster health: 0 offline partitions.]
  • 34. Scenario #3: Recovering offline partitions without data loss [Diagram: as on slide 33. Cluster health: 4 offline partitions.]
  • 35. Scenario #3: Recovering offline partitions without data loss [Diagram: same as slide 34.]
  • 36. Scenario #3: Recovering offline partitions without data loss [Diagram: a replacement node labelled "Broker 1" (in quotes) appears next to Broker 2, with entries "P0 -", "P1 -", "P2 -", "P3 -". Cluster health: 4 offline partitions.]
  • 37. Scenario #3: Recovering offline partitions without data loss [Diagram: the entries "P0 -" to "P3 -" are replaced by a second set of R2 replicas of P0-P3. Cluster health: 4 under-replicated partitions.]
  • 38. Scenario #3: How can we recover from a lost AZ? + Spawn a replacement node using the same EBS volumes + Keep the same broker ID + Consumer group offsets are kept
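Keeping the broker ID matters because a ZooKeeper-based broker records its ID in a `meta.properties` file inside each log directory and refuses to start if the `broker.id` in its configuration disagrees. A minimal pre-flight sketch, assuming the old EBS volume is already mounted at the replacement node's log directory; `stored_broker_id` and `check_reattached_volume` are hypothetical helpers, and KRaft-mode clusters store `node.id` instead of `broker.id`:

```python
import tempfile
from pathlib import Path


def stored_broker_id(log_dir):
    """Return the broker.id recorded in <log_dir>/meta.properties."""
    meta = Path(log_dir) / "meta.properties"
    for raw in meta.read_text().splitlines():
        line = raw.strip()
        # Skip blanks and comments (Java writes a timestamp comment at the top).
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "broker.id":
            return int(value.strip())
    raise ValueError(f"no broker.id found in {meta}")


def check_reattached_volume(log_dir, configured_broker_id):
    """Fail fast if the re-attached volume belongs to a different broker."""
    found = stored_broker_id(log_dir)
    if found != configured_broker_id:
        raise RuntimeError(
            f"volume at {log_dir} belongs to broker {found}, "
            f"but this node is configured as broker {configured_broker_id}")
    return found


if __name__ == "__main__":
    # Hypothetical demo: fake a log dir whose meta.properties says broker 1.
    with tempfile.TemporaryDirectory() as log_dir:
        (Path(log_dir) / "meta.properties").write_text(
            "#comment\nversion=0\nbroker.id=1\n")
        print(check_reattached_volume(log_dir, configured_broker_id=1))
```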
  • 39. Our Conclusions + Chaos engineering helps us gain knowledge + Kafka is not self-healing + Offline partitions can be brought back using EBS volumes + Problems with the health check + Running different Kafka versions in the same cluster might introduce lag + Consider freezing the old protocol versions (inter.broker.protocol.version, log.message.format.version) + Upgrade the active controller last
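The last two conclusions describe a rolling upgrade: keep `inter.broker.protocol.version` and `log.message.format.version` pinned to the old release while binaries are mixed, and restart the active controller last. A minimal sketch of both pieces, assuming the controller's broker ID is already known (for example from the `/controller` znode in ZooKeeper); the helper names and the `"1.0"` version string are hypothetical:

```python
def rolling_upgrade_order(broker_ids, controller_id):
    """Return the restart order: every other broker first, the active controller last."""
    if controller_id not in broker_ids:
        raise ValueError(f"controller {controller_id} is not in {broker_ids}")
    return sorted(b for b in broker_ids if b != controller_id) + [controller_id]


def pinned_protocol_overrides(old_version):
    """server.properties entries that keep brokers on the old inter-broker protocol
    and message format until every broker runs the new binaries."""
    return {
        "inter.broker.protocol.version": old_version,
        "log.message.format.version": old_version,
    }


if __name__ == "__main__":
    print(rolling_upgrade_order([3, 1, 2], controller_id=2))
    for key, value in pinned_protocol_overrides("1.0").items():
        print(f"{key}={value}")
```

Once every broker runs the new binaries, the pinned values would be raised in further rolling restarts; the exact sequence differs between releases, so follow the upgrade notes for your versions.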
  • 40. Deployment strategies + In-place deployment might prove risky
  • 41. Blue-green deployment [Diagram: the old cluster (Brokers 1-3 and ZooKeeper, holding R1 and R2 of P0-P3) serving producers and consumers.]
  • 42. Disconnecting producers and consumers [Diagram: the old cluster and a new cluster with the same layout side by side; producers and consumers.]
  • 43. Draining the old cluster into the new one [Diagram: old and new clusters connected by a replication tool; producers and consumers.]
  • 44. Reconnecting the producers and consumers [Diagram: old and new clusters connected by the replication tool; producers and consumers shown on both sides.]
  • 45. The new cluster becomes the production one [Diagram: the new cluster alone, serving producers and consumers.]
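The drain step is finished when the replication tool has copied everything the old cluster holds, which can be checked on the old cluster itself: once producers are disconnected, the tool's consumer group should have committed the log-end offset of every partition. A minimal sketch of that check, assuming the end offsets and the tool's committed offsets have already been fetched (for example with `kafka-consumer-groups.sh --describe`) and that the tool commits only after writing to the new cluster; the function names are hypothetical:

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]


def replication_lag(end_offsets: Dict[TopicPartition, int],
                    committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Per-partition lag of the replication tool's group on the old cluster.
    A partition with no committed offset is treated as starting at 0, which
    can overstate the lag but never hides un-copied data."""
    return {tp: end - committed.get(tp, 0) for tp, end in end_offsets.items()}


def is_drained(end_offsets: Dict[TopicPartition, int],
               committed: Dict[TopicPartition, int]) -> bool:
    """True when every partition's lag is zero, i.e. it looks safe to cut over."""
    return all(lag <= 0 for lag in replication_lag(end_offsets, committed).values())


if __name__ == "__main__":
    ends = {("logs", 0): 100, ("logs", 1): 50}
    print(is_drained(ends, {("logs", 0): 100, ("logs", 1): 40}))
    print(is_drained(ends, {("logs", 0): 100, ("logs", 1): 50}))
```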
  • 46. Replication = data + offsets metadata
  • 47. Replication tools
    ○ Mirror Maker. Pros: open source; relatively easy to use. Cons: not a real mirror.
    ○ uReplicator. Pros: creates a real mirror; scalable; open source. Cons: maintained mainly/only by Uber.
    ○ Kafka Connect Replicator. Pros: supports smart replication; creates a real mirror; based on the Kafka Connect ecosystem. Cons: paid solution.
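Replicating "data + offsets metadata" is hard because a copied record usually lands at a different offset on the target cluster, which is why a plain copy is "not a real mirror": a consumer group's committed offsets cannot be reused as-is. One common remedy is to record (source offset, target offset) checkpoints while replicating and translate through them. A minimal sketch for a single partition, assuming such checkpoints exist and that replication preserves record order within the partition; `translate_offset` is a hypothetical helper, not the API of any tool listed above:

```python
import bisect
from typing import List, Optional, Tuple


def translate_offset(checkpoints: List[Tuple[int, int]],
                     source_offset: int) -> Optional[int]:
    """Map a committed source offset to a safe resume offset on the target.

    checkpoints: (source_offset, target_offset) pairs sorted by source offset,
    each marking the same record on both clusters. Resuming at the target
    offset of the latest checkpoint at or before source_offset never skips a
    record but may re-deliver a few (at-least-once)."""
    sources = [src for src, _ in checkpoints]
    i = bisect.bisect_right(sources, source_offset) - 1
    if i < 0:
        return None  # no checkpoint early enough; caller must pick a fallback
    return checkpoints[i][1]


if __name__ == "__main__":
    cps = [(0, 0), (100, 40), (250, 190)]
    print(translate_offset(cps, 120))
```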
  • 48. Blue-green as a failover solution [Diagram: the active cluster (labelled "Old cluster") serving producers and consumers, and a passive cluster kept in sync by a replication tool, with its own producers and consumers.]
  • 49. Blue-green as a failover solution [Diagram: same as slide 48.]
  • 50. Kafka Pro Tips + Be part of the community ○ Join the Confluent Slack team ○ Follow or suggest new KIPs (Kafka Improvement Proposals) ○ Contribute fixes to Kafka and its ecosystem + Use smart metrics (like a health check) for better visibility + Try chaos engineering at home: use our GitHub repo + Don't stay behind ○ Use up-to-date Kafka consumers/producers ○ Update your cluster regularly + Follow Confluent white papers + Kafka health check repo
  • 51. Q & A
  • 52. Thank you https://www.linkedin.com/in/yaniv-ranen-284b003/ https://www.linkedin.com/in/shlomihassan/