Using Chaos Engineering to Level up Apache Kafka Skills

1
ZipRecruiter, Inc. Proprietary and Confidential.
Copyright © 2018 ZipRecruiter, Inc. All Rights Reserved.
Kafka Chaos Engineering
Shlomi Hassan & Yaniv Ranen

2
It’s a lovely day, let’s upgrade the production cluster

3
We didn’t even get a chance to say goodbye

4
Where did we go wrong?

5
Chaos engineering
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

6
Hello!
Yaniv Ranen
Shlomi Hassan

7
ZipRecruiter’s mission
We actively connect people to their next great opportunity.

8
Logging infrastructure scheme

9
Logging Kafka cluster
Cluster spec
1) Kafka
   a) 8 EC2 instances - m4.4xlarge
   b) EBS - io1
2) ZooKeeper
   a) 5 EC2 instances - m4.large
   b) EBS - io1
3) Cluster volume - 3 TB/day
4) Broker data spread across multiple AZs
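
To put the 3 TB/day figure in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal terabytes, a perfectly even spread over the 8 brokers, and that the figure is producer traffic before replication; none of these assumptions come from the slides.

```python
# Rough average ingest rate for the cluster spec above.
# Assumptions (not from the slides): decimal terabytes, load spread evenly
# across brokers, figure measured before replication.
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_mb_per_s(tb_per_day: float, brokers: int = 1) -> float:
    """Average traffic in MB/s, split evenly across `brokers`."""
    bytes_per_second = tb_per_day * TB / SECONDS_PER_DAY
    return bytes_per_second / brokers / 10**6


cluster_rate = average_ingest_mb_per_s(3)
per_broker_rate = average_ingest_mb_per_s(3, brokers=8)
print(f"cluster: {cluster_rate:.1f} MB/s, per broker: {per_broker_rate:.1f} MB/s")
```

Under these assumptions the cluster averages roughly 35 MB/s, or a little over 4 MB/s per broker. Peaks and replication traffic would push the real numbers higher.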

10
Basic Kafka terminology

11
Basic Kafka terminology
[Diagram: a Kafka cluster of three brokers (broker1, broker2, broker3) with ZooKeeper, a producer and a consumer. The brokers hold replicas R1 and R2 of partitions P0-P3.]
Legend: "Px Ry" means replica y of partition x. The leader replica is in bold. The active controller is in bold.

12

13
Chaos engineering method
＋ Ask a question
＋ Write it down as a scenario in a collaborative document
＋ Check the behaviour in a controlled environment
＋ Document the exact commands and output
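
One way to make the last step cheap is to run every experiment command through a small wrapper that appends the command line and its output to a scenario log. The sketch below is only an illustration: the function name `run_and_record` and the log format are our own assumptions, not something from the talk.

```python
import datetime
import subprocess


def run_and_record(command, log_path):
    """Run `command` (a list of arguments), append the command line and its
    output to `log_path`, and return the captured standard output."""
    result = subprocess.run(command, capture_output=True, text=True)
    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"[{timestamp}] $ {' '.join(command)}\n")
        log.write(result.stdout)
        if result.stderr:
            log.write(result.stderr)
        log.write(f"(exit code {result.returncode})\n\n")
    return result.stdout
```

For example, `run_and_record(["uptime"], "scenario-1.log")` would append the command and its output to a hypothetical `scenario-1.log`, which can then be pasted into the collaborative document.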

14
Scenario #1
＋ What happens if we kill one broker?
＋ Will Kafka self-heal?

15
Scenario #1: Stopping a broker
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer. The brokers hold replicas R1 and R2 of partitions P0-P3 on EBS volumes (Px Ry = replica y of partition x). Cluster health: 0 under-replicated partitions.]
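
The "cluster health" counters in these diagrams can be read as counts over partition metadata. The sketch below is a simplified model, not the actual health check from the talk: it assumes a partition is offline when it has no leader, and under-replicated when it has a leader but fewer in-sync replicas than assigned replicas. The `PartitionState` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionState:
    topic: str
    partition: int
    leader: int          # broker ID of the leader, -1 when there is none
    replicas: List[int]  # broker IDs assigned to hold a replica
    isr: List[int]       # broker IDs whose replica is currently in sync


def is_offline(p: PartitionState) -> bool:
    """A partition without a leader cannot serve producers or consumers."""
    return p.leader == -1


def is_under_replicated(p: PartitionState) -> bool:
    """A led partition with fewer in-sync replicas than assigned replicas."""
    return not is_offline(p) and len(p.isr) < len(p.replicas)


def cluster_health(partitions: List[PartitionState]) -> dict:
    """Count offline and under-replicated partitions separately."""
    return {
        "under_replicated": sum(is_under_replicated(p) for p in partitions),
        "offline": sum(is_offline(p) for p in partitions),
    }


# Hypothetical state after broker 3 has been stopped.
sample = [
    PartitionState("logs", 0, leader=1, replicas=[1, 3], isr=[1]),
    PartitionState("logs", 1, leader=2, replicas=[2, 1], isr=[2, 1]),
    PartitionState("logs", 2, leader=-1, replicas=[3], isr=[]),
]
print(cluster_health(sample))
```

Counting offline partitions separately from under-replicated ones mirrors how the slides report the two numbers.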

16
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. Cluster health: 54 under-replicated partitions.]

17
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. No cluster health figure is shown.]

18
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. No cluster health figure is shown.]

19
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes, labelled "Reassign partitions". Cluster health: 0 under-replicated partitions.]

20
Demonstration

21
＋ Kafka is not self-healing
＋ Manual partition reassignment is needed
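
Manual reassignment is typically driven by a JSON plan handed to Kafka's partition reassignment tool through its `--reassignment-json-file` option. The sketch below builds such a plan by swapping a stopped broker for a live one in every affected replica list. The helper name `replace_broker`, the topic name and the broker IDs are hypothetical, and the selection logic is deliberately naive.

```python
import json


def replace_broker(assignment, dead_broker, candidates):
    """Build a reassignment plan that swaps `dead_broker` for a live broker.

    `assignment` maps (topic, partition) to the current list of replica
    broker IDs. Partitions that do not involve `dead_broker` are left out.
    """
    plan = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if dead_broker not in replicas:
            continue
        # A broker may hold at most one replica of a given partition.
        usable = [b for b in candidates if b not in replicas]
        if not usable:
            raise ValueError(f"no spare broker for {topic}-{partition}")
        # Rotate through the usable brokers to spread the extra load.
        substitute = usable[len(plan) % len(usable)]
        new_replicas = [substitute if b == dead_broker else b for b in replicas]
        plan.append({"topic": topic, "partition": partition, "replicas": new_replicas})
    return {"version": 1, "partitions": plan}


# Hypothetical layout: broker 3 has been stopped and will not come back.
current = {("logs", 0): [1, 3], ("logs", 1): [2, 3], ("logs", 2): [1, 2]}
plan = replace_broker(current, dead_broker=3, candidates=[1, 2])
print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and reviewed before it is passed to the reassignment tool. A production plan would also weigh disk usage and availability zones, which this sketch ignores.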

22
Scenario #2
＋ What can we do to revive an offline partition?

23
Scenario #2: Reviving an offline partition
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer. Only a single replica (R1) of partitions P0-P3 is shown, on EBS volumes. Cluster health: 0 offline partitions.]

24
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, a single replica (R1) of partitions P0-P3 on EBS volumes. Cluster health: 4 offline partitions.]

25
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, a single replica (R1) of partitions P0-P3 on EBS volumes, labelled "Reassign partitions". Cluster health: 4 offline partitions.]

26
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, a single replica (R1) of partitions P0-P3 on EBS volumes, labelled "Reassign partitions". Cluster health: still 4 offline partitions.]

27
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, a single replica (R1) of partitions P0-P3 on EBS volumes. Cluster health: 4 offline partitions.]

28
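[Diagram: Kafka cluster showing only Broker 2 and Broker 3, with ZooKeeper, a producer and a consumer, and EBS volumes. Broker 1 and the partition replicas are no longer shown. Cluster health: 4 offline partitions.]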

29
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, and EBS volumes. Broker 1 is shown again but no partition replicas are shown. Cluster health: 4 offline partitions.]

30
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, a single replica (R1) of partitions P0-P3 on EBS volumes, labelled "New topic" and "Data loss". Cluster health: 0 offline partitions.]

31
Scenario #2: Reviving an offline partition
＋ Restarting the broker revives the offline partition
＋ Partition reassignment doesn’t work while the partition is offline
＋ Recreating the broker leads to
○ Data loss
○ Inconsistent state

32
Scenario #3
＋ How can we recover data from an offline partition?

33
Scenario #3: Recovering offline partitions without data loss
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. Cluster health: 0 offline partitions.]

34
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. Cluster health: 4 offline partitions.]

35
[Diagram: Kafka cluster with Broker 1, Broker 2 and Broker 3, ZooKeeper, a producer and a consumer, replicas R1 and R2 of partitions P0-P3 on EBS volumes. Cluster health: 4 offline partitions.]

36
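[Diagram: Kafka cluster with Broker 1, Broker 2 and a replacement node labelled "Broker 1" (in quotes), ZooKeeper, a producer and a consumer. Replicas R1 and R2 of partitions P0-P3 sit on EBS volumes, and a further set of P0-P3 slots is shown empty ("-"). Cluster health: 4 offline partitions.]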

37
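[Diagram: Kafka cluster with Broker 1, Broker 2 and a replacement node labelled "Broker 1" (in quotes), ZooKeeper, a producer and a consumer. Replicas R1 and R2 of partitions P0-P3 sit on EBS volumes, and the previously empty P0-P3 slots now show R2. Cluster health: 4 under-replicated partitions.]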

38
Scenario #3: How can we recover from a lost AZ?
＋ Spawn a replacement node using the same EBS volumes
＋ Keep the same broker ID
＋ Consumer group offsets are kept
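
Keeping the same broker ID matters because, in ZooKeeper-based Kafka as far as we know, each log directory carries a `meta.properties` file recording the `broker.id` that wrote it. A pre-flight check on the replacement node could compare that stored ID with the ID the new broker is about to use. The sketch below assumes the file is a plain `key=value` properties file; the helper names and sample contents are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def broker_id_matches(meta_properties_text, configured_broker_id):
    """True when the ID stored on the volume equals the configured one."""
    stored = parse_properties(meta_properties_text).get("broker.id")
    return stored is not None and int(stored) == configured_broker_id


# Hypothetical contents of meta.properties on the re-attached EBS volume.
sample_meta = "#\n#Generated by the broker\nversion=0\nbroker.id=1\n"
print(broker_id_matches(sample_meta, 1))
```

Running a check like this before starting the broker process gives an early, explicit failure instead of a broker that refuses to start or comes up empty.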

39
Our Conclusions
＋ Chaos engineering helps in gaining knowledge
＋ Kafka is not self-healing
＋ Offline partitions can be brought back using EBS volumes
＋ Problems with the health check
＋ Using different versions of Kafka might introduce lag
○ Consider freezing the old protocol version: inter.broker.protocol.version, log.message.format.version
＋ Upgrade active controller last
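
The last two conclusions can be turned into a small planning helper: pin the two version settings to the old release while brokers are being replaced, and visit the active controller last. This is only a sketch under our own assumptions. The function names are hypothetical and the version string "1.0" is a placeholder, not the version used in the talk.

```python
def rolling_upgrade_order(broker_ids, controller_id):
    """Return the broker IDs in upgrade order, active controller last."""
    if controller_id not in broker_ids:
        raise ValueError("controller must be one of the brokers")
    others = sorted(b for b in broker_ids if b != controller_id)
    return others + [controller_id]


def pinned_version_overrides(old_version):
    """Broker settings that keep the old protocol and message format."""
    return {
        "inter.broker.protocol.version": old_version,
        "log.message.format.version": old_version,
    }


def render_properties(overrides):
    """Render the overrides as lines for a broker properties file."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


order = rolling_upgrade_order([1, 2, 3], controller_id=2)
fragment = render_properties(pinned_version_overrides("1.0"))  # placeholder version
print(order)
print(fragment)
```

Once every broker runs the new binaries, the pinned settings can be raised in a second rolling restart.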

40
Deployment strategies
＋ In-place deployment might prove risky

41
Blue green deployment
[Diagram: the old cluster (Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3) with the producers and consumers connected to it.]

42
Disconnecting producers and consumers
[Diagram: the old cluster and a new cluster side by side, each with Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3. The producers and consumers are shown next to the old cluster.]

43
Draining the old cluster into the new one
[Diagram: the old cluster and the new cluster side by side, each with Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3. A replication tool sits between the two clusters; the producers and consumers are shown next to the old cluster.]
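
One way to decide that the drain has finished is to watch the lag of the replication tool's consumer group on the old cluster: per partition, the log end offset minus the group's committed offset. The sketch below works on plain dictionaries and assumes that a partition with no committed offset has not been read at all; how the offsets are fetched is left out, and all names are hypothetical.

```python
def lag_by_partition(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset.

    Both arguments map (topic, partition) to an offset. A partition with
    no committed offset is treated as unread from offset 0 (an assumption).
    """
    return {
        tp: max(end - committed_offsets.get(tp, 0), 0)
        for tp, end in end_offsets.items()
    }


def drain_complete(end_offsets, committed_offsets):
    """True when the replication tool has caught up on every partition."""
    return all(lag == 0 for lag in lag_by_partition(end_offsets, committed_offsets).values())


# Hypothetical offsets on the old cluster after producers were disconnected.
end = {("logs", 0): 100, ("logs", 1): 50}
committed = {("logs", 0): 100, ("logs", 1): 40}
print(lag_by_partition(end, committed), drain_complete(end, committed))
```

Because the producers are disconnected first, the end offsets stop moving, so the lag can only shrink and a zero reading is stable.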

44
Reconnecting the producers and consumers
[Diagram: the old cluster and the new cluster side by side, each with Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3, with the replication tool between them. Producers and consumers are shown next to both the old and the new cluster.]

45
The new cluster becomes the production one
[Diagram: only the new cluster remains (Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3), with the producers and consumers connected to it.]

46
Replication = data + offsets metadata

47
Replication tools
Tool name                  | Pros                                                                                    | Cons
MirrorMaker                | Open source; relatively easy to use                                                     | Not a real mirror
UReplicator                | Creates a real mirror; scalable; open source                                            | Maintained mainly/only by Uber
Kafka Connect - Replicator | Supports smart replication; creates a real mirror; based on the Kafka Connect ecosystem | Paid solution

48
Blue green as a failover solution
[Diagram: the old cluster and a passive cluster side by side, each with Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3, with the replication tool between them. Producers and consumers are shown next to both clusters.]

49
Blue green as a failover solution
[Diagram: the old cluster and the passive cluster side by side, each with Broker 1, Broker 2, Broker 3 and ZooKeeper, holding replicas R1 and R2 of partitions P0-P3, with the replication tool between them. Producers and consumers are shown next to both clusters.]

50
Kafka Pro Tips
＋ Be part of the community
○ Join the Confluent Slack team
○ Follow or suggest new KIPs (Kafka Improvement Proposals)
○ Contribute fixes to Kafka and its ecosystem
＋ Use smart metrics (like a health check) for better visibility
＋ Try chaos engineering at home: use our GitHub repo
＋ Don’t stay behind
○ Use updated Kafka consumers/producers
○ Update your cluster regularly
＋ Follow Confluent white papers
＋ Kafka health check repo

51
Q & A

52
Thank you
https://www.linkedin.com/in/yaniv-ranen-284b003/
https://www.linkedin.com/in/shlomihassan/
