Kafka at half the price with JBOD setup

©2017 LinkedIn Corporation. All Rights Reserved.
Kafka at half the price
Dong Lin
Streams Infrastructure

©2017 LinkedIn Corporation. All Rights Reserved. 2
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference

RAID-10 setup with RF=2
producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C

producer
Broker 1 Broker 2
A
B
C
A
B
C
A
B
C
A
B
C
- Tolerate only one broker failure

producer
Broker 1 Broker 3
A
B
C
A
B
C
A
B
C
A
B
C
Broker 2
A
B
C
A
B
C
- Tolerate up to two broker failures
- 50% more storage cost

JBOD setup with RF=2
producer
Broker 1
A
B
C
Broker 2
A
B
C
- Tolerate only one broker failure
- 50% less storage cost

JBOD setup with RF=3
producer
Broker 1
A
B
C
Broker 3
A
B
C
Broker 2
A
B
C
- Tolerate up to two broker failures
- 25% less storage cost

RAID vs. JBOD
Setup Replication Storage cost
Broker failure
tolerance
Disk failure
tolerance
RAID-10
2 (baseline) 4X 1 (too small) 3
3 6X (50% up) 2 5

RAID vs. JBOD
Broker failure
tolerance
Disk failure
tolerance
RAID-10
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)

RAID vs. JBOD
Broker failure
tolerance
Disk failure
tolerance
RAID-10
3 6X (50% up) 2 5
JBOD
2 2X (50% down) 1 (too small) 1 (too small)
3 (future) 3X (25% down) 2 (100% up) 2 (33% down)
4 4X 3 (300% up) 3

Agenda
▪ Motivation
▪ Design
▪ Alternatives
▪ Evaluation
▪ Future work
▪ Reference

Problem 1: All replicas become offline if any log directory fails
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C

Solution: Only replicas on the failed disk become offline
Broker
Disk A
IOException
when accessing disk B
Disk B
Disk C
Broker
Disk A
Disk B
Disk C

Problem 2: Controller does not recognize disk failure
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 1: I am online
X No further
leader election
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline

Solution: Broker notifies and provides partition list to controller
Zookeeper
Controller
Broker 1
Partition 1
Partition 2
STEP 2: Broker 1 has new disk failureSTEP 1: Notify disk failure
X
STEP 3:
Become leader for
partitions 1 and 2
STEP 4:
partition 2 is offline
STEP 5: Elect
another broker as
leader for partition 2

Problem 3: Broker always creates log for partition if not exist
Zookeeper
Controller
STEP 3:
Become follower for partition 2
Create partition 2 if non-existent
Broker 1
Partition 1
Partition 2
X
STEP 2:
STEP 1: I am online

Zookeeper
Controller
STEP 3:
STEP 4:
Created partition 2
(problematic)
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
STEP 1: I am online

Zookeeper
Controller
STEP 3:
Broker 1
Partition 1
Partition 2
Partition 2
X
STEP 2:
Good disk may
become overloaded STEP 1: I am online
STEP 4:
Created partition 2
(problematic)

Solution: Controller specifies whether to create log for partition
Zookeeper
Controller
STEP 3:
This is NOT a new partition
STEP 4:
Partition 2 is not available
and there is offline log dir
Broker 1
Partition 1
Partition 2
X
STEP 2:
- Broker -> is new partition?
STEP 5:
Exclude broker 1 from
leader election
for partition 2
STEP 1: I am online

Problem 4: No mechanism to move replicas between disks
Broker 1
P1 P2 P3
P5P4 P6
P7
Disk 1 Disk 2

Example workflow to move replicas between disks
Broker
Client
STEP 1: DescribeDirRequest
STEP 2: DescribeDirResponse
Partition list and size
STEP 3: ChangeDirRequest
Disk 1 Disk 2
STEP 4: create p1.move
STEP 5: ChangeDirResponse
(Inprogress)
STEP 6: copy data from
p1.log to p1.move
STEP 7: delete p1.log and
rename p1.move to p1.log
STEP 8: Verify new assignment
via DescribeDirRequest

Agenda
▪ Motivation
▪ Design
▪ Alternatives
▪ Evaluation
▪ Future work
▪ Reference

Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk

one-broker-per-machine vs. one-broker-per-disk
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1
Physical Machine
Disk 1 Disk 2 Disk 3
Broker 1 Broker 2 Broker 3
V.S.
One-broker-per-machine One-broker-per-disk

one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disk per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput

Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
IO threads Network threads Replica-fetcher threads
One-broker-per-machine 160 120 140
One-broker-per-disk 16 12 14
▪ Producers deployed on 15 machines
acks threads sync retries retry backoff message size batch size request timeout
all 50 true MAX_INT 60 sec 100 KB 1 MB MAX_INT
▪ Topic configuration
partition replication factor min-insync-replicas
512 3 3

One-broker-per-machine throughput
Average throughput is 2.3 GBps

One-broker-per-disk throughput
Average throughput is 2 GBps

Agenda
▪ Motivation
▪ Design
▪ Alternatives
▪ Evaluation
▪ Future work
▪ Reference

Changes in operational procedure
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric

Future work
▪ Use more intelligent solution to select log directory for new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. disk with degraded performance.

References
▪ KIP-112: Handle disk failure for JBOD (link)
▪ KIP-113: Support replicas movement between log directories (link)

Kafka at half the price with JBOD setup

More Related Content

What's hot (20)

Similar to Kafka at half the price with JBOD setup (20)

More from Dong Lin (6)

Recently uploaded (20)

Kafka at half the price with JBOD setup