©2015 LinkedIn Corporation. All Rights Reserved.
Aditya Auradkar & Dong Lin
Motivation: Why is this important?
● Shared resources in a multi-tenant environment
● Bad clients can hurt others
– Bootstrapping consumers
– Buggy clients
● Better QOS for well-behaved clients
● Preserve throughput and latency for everyone else
● API Limits/Billing
Clients and Client-Ids
● Quotas are enforced per client-id
● Why client-id?
● No quotas per topic
● No quotas per topic * client-id combination
● Blanket produce and fetch quota for all clients
Quota Overrides
● Certain clients justify higher quotas
● Rolling bounces take too long and require too much effort
● Store overrides in ZooKeeper
● Brokers parse config change notifications
● Apply new quota immediately
Quota Overrides
{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
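As a rough illustration of how a tool might read such an override, the sketch below parses the payload shown on the slide and converts the string-encoded byte rates to integers. The helper name `parse_quota_override` and the version check are assumptions made for this example; they are not taken from the Kafka code base.

```python
import json

# Payload copied from the slide; byte rates are encoded as strings.
OVERRIDE_JSON = """
{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
"""

def parse_quota_override(payload: str) -> dict:
    """Return the override byte rates as integers (hypothetical helper)."""
    doc = json.loads(payload)
    # Assumption: only version 1 of the payload format is understood here.
    if doc.get("version") != 1:
        raise ValueError(f"unsupported override version: {doc.get('version')!r}")
    return {key: int(value) for key, value in doc["config"].items()}

overrides = parse_quota_override(OVERRIDE_JSON)
# 1048576 bytes/sec is exactly 1 MiB/s.
print(overrides["producer_byte_rate"] / (1024 * 1024), "MiB/s")
```

Keeping the rates as integers after parsing makes later comparisons against observed byte rates straightforward.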
Broker Metrics
● Metrics created for each client
● Clients can come and go
● Don’t need to retain client metrics forever
● GC metrics if inactive for longer than 1 hr
● Recreate if client reconnects
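The lifecycle described on this slide (create on first use, expire after an hour of inactivity, recreate on reconnect) can be sketched as a small registry. This is an illustrative model only: the class names, the injected clock, and the `expire_inactive` method are assumptions for the example, not the broker's actual metrics classes.

```python
import time

INACTIVITY_LIMIT_SEC = 60 * 60  # 1 hour, the limit quoted on the slide

class ClientMetrics:
    """Hypothetical per-client record: bytes seen and time of last activity."""
    def __init__(self, now: float):
        self.total_bytes = 0
        self.last_seen = now

class ClientMetricsRegistry:
    """Hypothetical registry that creates metrics lazily and expires idle ones."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._metrics = {}

    def record(self, client_id: str, num_bytes: int) -> None:
        now = self._clock()
        # Created on first use; a reconnecting client simply gets a fresh entry.
        entry = self._metrics.setdefault(client_id, ClientMetrics(now))
        entry.total_bytes += num_bytes
        entry.last_seen = now

    def expire_inactive(self) -> list:
        """Drop entries idle for longer than the limit and return their ids."""
        now = self._clock()
        stale = [cid for cid, m in self._metrics.items()
                 if now - m.last_seen > INACTIVITY_LIMIT_SEC]
        for cid in stale:
            del self._metrics[cid]
        return stale

    def __contains__(self, client_id: str) -> bool:
        return client_id in self._metrics
```

Injecting the clock keeps the expiry logic testable without waiting an hour; a periodic background task would call `expire_inactive` in a real system.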
Enforcement
● Reduce client throughput to desired rate
● Compute delay based on current throughput
● Small violations result in small delays
● Use smaller measurement windows to avoid long pauses
● Client side metrics available to detect throttling
Delay Calculation
● Delay = W * (μ - Q) / μ
● W = window size, μ = observed rate, Q = desired rate
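A worked version of the formula above may help. The sketch below implements Delay = W * (μ - Q) / μ exactly as written on the slide; the function name and the sample rates are assumptions chosen for illustration. One reading of the formula, checked in the code, is that a client sending at μ for W - Delay seconds and then pausing for Delay seconds averages Q over the window.

```python
def throttle_delay(window_sec: float, observed_rate: float, quota_rate: float) -> float:
    """Delay = W * (mu - Q) / mu; zero when the client is within its quota."""
    if observed_rate <= quota_rate:
        return 0.0
    return window_sec * (observed_rate - quota_rate) / observed_rate

Q = 1048576.0  # desired rate: 1 MiB/s (illustrative value)

small = throttle_delay(1.0, 1.1 * Q, Q)          # 10% over quota, 1 s window
large = throttle_delay(1.0, 2.0 * Q, Q)          # 100% over quota, 1 s window
long_window = throttle_delay(300.0, 2.0 * Q, Q)  # same violation, 5 min window

# Sending at 2Q for (W - delay) seconds, then idling for the delay,
# averages out to Q over the whole window.
effective_rate = (2.0 * Q) * (1.0 - large) / 1.0

print(f"small violation: {small:.3f} s")
print(f"large violation: {large:.3f} s")
print(f"5 minute window: {long_window:.1f} s")
```

The last case shows why smaller measurement windows are preferred: the same violation that costs half a second with a 1-second window costs 150 seconds with a 5-minute window.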
Enforcement (producer)
[Diagram: producer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. append
4. record metric
5. delay
6. dequeue
7. response
Enforcement (consumer)
[Diagram: consumer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. fetch offsets
4. record metric
5. delay
6. dequeue
7. response (zero copy)
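The "record metric, delay, dequeue, response" steps in the two enforcement diagrams can be modeled with a time-ordered queue. The sketch below is a simplified stand-in, assuming a heap-based `DelayQueue` and a `handle_request` helper that are invented for this example; it reuses the delay formula from the Delay Calculation slide and is not the broker's implementation.

```python
import heapq
import itertools

class DelayQueue:
    """Hypothetical stand-in for the delay queue: holds responses until due."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so responses are never compared

    def put(self, response, now: float, delay_sec: float) -> None:
        heapq.heappush(self._heap, (now + delay_sec, next(self._seq), response))

    def pop_due(self, now: float) -> list:
        """Return every response whose delay has elapsed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due

def handle_request(queue, client_id, observed_rate, quota_rate, window_sec, now):
    """Steps 4 and 5: given the recorded rate, compute the delay and park the response."""
    delay = 0.0
    if observed_rate > quota_rate:
        delay = window_sec * (observed_rate - quota_rate) / observed_rate
    queue.put(f"response:{client_id}", now, delay)
    return delay

q = DelayQueue()
handle_request(q, "fast", 2_000_000, 1_000_000, 1.0, now=0.0)  # over quota: held 0.5 s
handle_request(q, "ok", 500_000, 1_000_000, 1.0, now=0.0)      # within quota: no delay
print(q.pop_due(0.0))  # ['response:ok']
print(q.pop_due(0.5))  # ['response:fast']
```

Because the response is held rather than replaced with an error, the client needs no special handling: it simply observes a slower broker.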
Slowdown vs Error
● Error handling is hard
● Tricky to implement backoff and retries
● All client implementations need to handle quota errors
● Need something easier
Getting Started
● Important Broker configs
– quota.producer.default (in bytes/sec)
– quota.consumer.default (in bytes/sec)
● Apply overrides
./bin/kafka-configs.sh --alter \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' \
  --entity-type clients \
  --entity-name TestTopic \
  --zookeeper localhost:2181
● Read overrides
./bin/kafka-configs.sh --describe \
  --entity-type clients \
  --entity-name TestTopic \
  --zookeeper localhost:2181
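When overrides are applied for many clients, it can be convenient to generate the command line rather than type it. The sketch below only assembles the argument list using the flags shown on this slide; the helper name `quota_override_command` is an assumption, and the example does not execute anything.

```python
import shlex

def quota_override_command(client_id, producer_byte_rate, consumer_byte_rate,
                           zookeeper="localhost:2181"):
    """Build the kafka-configs.sh --alter invocation shown on the slide (hypothetical helper)."""
    config = (f"producer_byte_rate={producer_byte_rate},"
              f"consumer_byte_rate={consumer_byte_rate}")
    return ["./bin/kafka-configs.sh", "--alter",
            "--add-config", config,
            "--entity-type", "clients",
            "--entity-name", client_id,
            "--zookeeper", zookeeper]

# "TestTopic" is the entity name used on the slide.
argv = quota_override_command("TestTopic", 1048576, 1048576)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` avoids shell quoting issues, such as the curly quote that can creep in when commands are copied from slides.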
Monitoring
● Producer metrics
– throttle-time avg and max
● Consumer metrics
– throttle-time avg and max
● Broker metrics
– byte-rate and avg throttle-time per client-id
– byte-rate is used for enforcement
● ZookeeperConsumerConnector and SimpleConsumer metrics also available
Rollout Strategy
● Deploy without enforcement
● Monitor metrics to track throughput for all clients
● Identify candidates for overrides
● Start with high thresholds
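The "identify candidates for overrides" step can be sketched as a simple filter over observed per-client byte rates. Everything specific here is an assumption for illustration: the client names, the 10 MB/s default, and the 1.5x headroom factor are hypothetical, and the observed rates would come from the broker's per-client byte-rate metrics.

```python
def override_candidates(observed_rates, default_quota, headroom=1.5):
    """Return {client_id: suggested_quota} for clients whose rate exceeds the default."""
    return {
        client_id: int(rate * headroom)
        for client_id, rate in sorted(observed_rates.items())
        if rate > default_quota
    }

# Hypothetical byte rates (bytes/sec) gathered while enforcement is still off.
observed = {
    "mirror-maker": 40_000_000,
    "web-frontend": 2_000_000,
    "etl-batch": 15_000_000,
}

candidates = override_candidates(observed, default_quota=10_000_000)
print(candidates)  # {'etl-batch': 22500000, 'mirror-maker': 60000000}
```

Suggesting a quota above the observed rate follows the "start with high thresholds" advice: clients that were healthy before enforcement should not be throttled on day one.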
Evaluation
● Validate quota functionality
– broker-throughput <= sum(quota_of_clientId)
– sum(client-throughput) <= quota_of_clientId
● Evaluate performance improvement for clients
– Throughput and latency
– Clients with different throughput demand
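The two validation inequalities can be expressed as checks over measured throughput. This is a minimal sketch with assumed names and sample numbers; the 5% tolerance is an assumption added because measured rates fluctuate around the quota.

```python
def quota_violations(instance_throughput, quotas, tolerance=0.05):
    """Check sum(client-throughput) <= quota_of_clientId for each client-id.

    instance_throughput: list of (client_id, bytes_per_sec), one entry per client instance.
    Returns {client_id: total_rate} for client-ids above quota plus tolerance.
    """
    totals = {}
    for client_id, rate in instance_throughput:
        totals[client_id] = totals.get(client_id, 0.0) + rate
    return {cid: total for cid, total in totals.items()
            if total > quotas[cid] * (1 + tolerance)}

def broker_within_bound(broker_throughput, quotas, tolerance=0.05):
    """Check broker-throughput <= sum(quota_of_clientId)."""
    return broker_throughput <= sum(quotas.values()) * (1 + tolerance)

# Hypothetical measurements: client-id "a" runs two instances.
measurements = [("a", 6e6), ("a", 5e6), ("b", 4e6)]
quotas = {"a": 10e6, "b": 10e6}

violations = quota_violations(measurements, quotas)
broker_ok = broker_within_bound(15e6, quotas)
print(violations, broker_ok)
```

Summing per client-id before comparing matters because the quota applies to the client-id as a whole, not to each instance.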
Evaluation – Validate Quota Functionality
● Unlimited quota
[Figure: producer and consumer charts]
● quota.producer.default = quota.consumer.default = 50 MBps
[Figure: producer and consumer charts]
● quota.producer.default = quota.consumer.default = 10 MBps
[Figure: producer and consumer charts]
Evaluation – Client Performance Improvement
[Figure panels: small client running alone; clients join together; clients join in presence of quota; comparison]
Evaluation – Producer Performance Improvement
● Producer runs at 2 MBps alone (alone)
● Producer runs with other producers without quota (together)
● Producer runs with other producers with 10 MBps quota (quota)
[Figure: latency (ms) vs. time (sec), 0 to 600 sec, for the alone, together, and quota runs]
Latency (ms): alone 1.5, together 23.6, quota 2.5
Evaluation – Consumer Performance Improvement
● Consumer runs alone (alone)
● Consumer runs alone with 50 MBps quota (alone-quota)
● Consumer runs with other consumers without quota (together)
● Consumer runs with other consumers with 50 MBps quota (quota)
[Figure: throughput (MBps) vs. time (sec), 0 to 600 sec, for the alone, alone-quota, together, and quota runs]
Throughput (MBps): alone 87, alone-quota 45, together 31, quota 40
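To put the reported numbers in perspective, the short calculation below derives relative changes from the latency and throughput figures given on the producer and consumer evaluation slides. The variable names are chosen for this example; the input values are the ones reported in the talk.

```python
# Figures reported on the evaluation slides.
producer_latency_ms = {"alone": 1.5, "together": 23.6, "quota": 2.5}
consumer_throughput_mbps = {"alone": 87, "alone-quota": 45, "together": 31, "quota": 40}

# Producer latency relative to running alone.
latency_without_quota = producer_latency_ms["together"] / producer_latency_ms["alone"]
latency_with_quota = producer_latency_ms["quota"] / producer_latency_ms["alone"]

# Consumer throughput gain from enabling quotas while other consumers are present.
throughput_gain = (consumer_throughput_mbps["quota"]
                   / consumer_throughput_mbps["together"]) - 1

print(f"latency without quota: {latency_without_quota:.1f}x of alone")  # 15.7x
print(f"latency with quota:    {latency_with_quota:.1f}x of alone")     # 1.7x
print(f"consumer throughput gain with quota: {throughput_gain:.0%}")    # 29%
```

In these runs, quotas bring producer latency back to within about 1.7x of the uncontended case, and the consumer recovers most of the throughput it achieves alone under the same 50 MBps quota (40 vs. 45 MBps).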
Evaluation - Summary
● Quota functionality is enforced
● Quotas improve performance for existing clients when large clients join
Future Work
● Throttle replica traffic (e.g. during bootstrap)
● Throttle more request types (OffsetCommitRequest etc.)
● Client-id authentication for use in multi-tenant environments
Acknowledgements
● LinkedIn Kafka Engineering team
● Confluent Inc
● John McClean (formerly at LI)

Kafka Quotas Talk at LinkedIn

Editor's Notes

  • #2: Good evening, and welcome to LinkedIn. We work on the Kafka engineering team at LinkedIn and are here to talk about a brand-new feature in 0.9: quotas, the ability to define throughput thresholds for a client.
  • #3: When Kafka is run as a service, all resources are shared: CPU, disk, network, etc. A single bad client (for example, a buggy one) can degrade the experience for others. In some cases the client isn't even bad, e.g. bootstrapping consumers. We need a way to offer better QOS for well-behaved clients.
  • #4: What is the entity we throttle? The client-id. A client-id logically identifies an application, hence we chose it. Topics inherently don't have a notion of ownership, and a significant number of topics are public data. It's hard to add quotas per topic because everyone using that topic would get throttled, which is not desirable. For example, a well-behaved consumer shouldn't be throttled because of a different bootstrapping consumer, and a well-behaved producer instance shouldn't get throttled because of a buggy client. Quotas per topic * client-id combination are also tricky to get right. For example, a wildcard consumer would receive infinite quota, and a producer producing to 1000 topics could also bypass the quota system. Instead, have a reasonable threshold for everyone.
  • #5: Many clients can justify larger quotas; the default doesn't work for everybody. Quota changes can happen frequently, and SREs would hate having to bounce clusters to change quotas for custom clients. Similar to topic configs, we store overrides in ZK.
  • #7: In order to track quotas, we have metrics for each client that has connected. This number can be significant. Short-lived clients include console consumers, console producers, etc.
  • #8: As mentioned, we have metrics to track the per-client byte rate. The goal is to reduce client throughput to the desired rate. The delay is computed based on the current throughput: if the throughput violation is small, small delays are added to the responses. We use small measurement windows to detect violations early. For example, with a 5-minute window we would have a long pause towards the end. The window size is configurable. Metrics are available on the client side. No error is returned.
  • #9: After the delay, the measurement window should have throughput equal to Q.
  • #10: The request is sent to the request channel, then to the replica manager, and appended to the log. The number of bytes appended is recorded and the metric is updated. The delay is computed and the response is inserted into a queue. A reaper thread sends the response asynchronously.
  • #12: There are a number of nuances to client-side error handling, and we cannot trust client implementations to do the right thing. Why not just send errors to clients? Dozens of client implementations would need to build backoffs and retries on error. We wanted something that just works.
  • #13: Let's talk specifics. Tooling is available to change quotas per client-id.
  • #14: Consistent with other metrics available on each of these clients.
  • #15: Observe traffic patterns. Monitor metrics to track throughput for all clients; this lets you pick a reasonable threshold. Start high: you don't want to configure too low a quota and have most people end up getting throttled on a stable cluster.