Let’s Make Your CFO Happy;
A Practical Guide for Kafka Cost Reduction
The Elephant
In The Room
What are we paying for?
Running a self-hosted Kafka deployment
Where and how can we cut costs?
Tips, tricks, and KIPs
Develop an economic mindset
It’s part of our role
Elad Leev
Data Engineer at Riskified
@eladleev
#DistributedSystems #DataStreams
#Scalability
#Kafka #ConfluentCommunityCatalyst
Riskified
What We Do
Riskified by the Numbers
750+ Global team, nearly 50% in engineering & analytics
180+ Countries across the globe
$89B Online volume (GMV) reviewed in 2021
50+ Publicly held companies among our clients
98%+ Client retention* for the past 2 years
*Annual dollar retention
As of March 2022
Let’s dive in!
According to
Gartner Forecasts
The worldwide end-user
spending on public cloud
services is forecast to grow
by 20% in 2022 to a total of
$397 billion.
Source: Gartner (April 2021)
Data Never Sleeps
According to Statista, the
total amount of data
consumed globally in 2021
was 79 Zettabytes.
More than 80% of all Fortune 100 companies
trust and use Kafka.
What are we
paying for?
Kafka Deployment
[Diagram, built up over several slides: a Kafka cluster in one AWS region, with brokers 1 to N spread across AZ A, AZ B, ... AZ N behind a load balancer. Each slide adds another cost item ($):]
EC2 Machines
EKS
EBS Drives
ZooKeepers
LB
Without consuming or producing
a single byte
Traffic (producers and consumers in every AZ)
Management: Metrics, Logs
Tools: Kafka Connect, Schema Registry, Cruise Control, Kafka UI
● Multi Region
● Persist to S3
● Mirror Maker
● Domain Separation
Let’s Make Your CFO Happy; A Practical Guide for Kafka Cost Reduction with Elad Leev | Kafka Summit London 2022
Storage
Network
Machines
Cost Factors
What can
we do?
Instance type
Compression
Fetch from replica
Cluster balance
Tiered storage
Client configuration fine tune
01
02
03
04
05
06
Are you using the
right instance type?
A quick Google search will
suggest R5, D2, or C5 instances,
combined with gp2/gp3 or
io2 storage, as the broad
recommendation.
Are you using the
right instance type?
Everything is a trade-off.
Choose wisely based on your
needs: time to recover,
storage-to-dollar ratio, network
throughput, and the drawbacks of EBS.
Source: Liz Fong-Jones, Honeycomb.io
i3 and i3en for
High-performance
Kafka Deployments
Despite the operational
overhead of ephemeral drives.
i3en provides great
storage-to-dollar ratio
Suitable for I/O intensive
deployments in which the
limiting factor is storage
capacity.
Source: ScyllaDB
Are you using the right instance types?
Is your cluster saturated?
On what conditions?
What is the
limiting factor?
Are EBS drives the right
decision?
Have the cluster's needs
changed over time?
Change compression
type
You can choose between
GZIP, LZ4, Snappy and, since
KIP-110, ZSTD.
Zstandard
Compression algorithm
by Facebook (now Meta)
It aims for smaller and
faster data compression.
Real world examples
Using Zstd, Shopify was able
to get a 4.28x compression
ratio compared to Snappy’s
2.5x.
[Chart: average message size, in bytes]
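To make the ratios above concrete, here is a rough back-of-the-envelope sketch. The two ratios come from the slide; the 1,000 GB daily volume is a made-up figure used only for illustration.

```python
def compressed_gb(raw_gb: float, ratio: float) -> float:
    """Size after compression, given an uncompressed size and a compression ratio."""
    return raw_gb / ratio


RAW_GB = 1000.0  # hypothetical daily volume of uncompressed data

snappy_gb = compressed_gb(RAW_GB, 2.5)   # Snappy ratio quoted on the slide
zstd_gb = compressed_gb(RAW_GB, 4.28)    # Zstd ratio quoted on the slide
saving = 1 - zstd_gb / snappy_gb         # fraction of bytes no longer stored or moved

print(f"Snappy: {snappy_gb:.1f} GB, Zstd: {zstd_gb:.1f} GB, saving: {saving:.1%}")
```

Under these assumptions, moving from Snappy to Zstd cuts the bytes stored and transferred by roughly 40%, regardless of the actual volume.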
“After switching to Zstandard,
our bandwidth usage
decreased by 3x!
This saves us tens of
thousands of dollars per
month in data transfer costs
just from the processing
pipeline alone.
With these great results we
immediately began deploying
Zstandard to other systems
in our architecture for further
cost savings.”
[Chart: bandwidth usage]
KIP-390: Support
Compression Level
Zstd Level 1 produces
32.7% more messages per
second than Zstd Level 3.
Gzip Level 1 produces
56.4% more messages per second than Gzip Level 6.
Test data: 29,218 JSON files with an average size of 55.25 KB
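The level trade-off can be explored outside Kafka before touching producer settings. The sketch below uses Python's built-in zlib (the DEFLATE algorithm behind Gzip) as a stand-in for the producer's compressor; the payload is synthetic and hypothetical, so the numbers it prints say nothing about your own topics.

```python
import json
import zlib

# Synthetic, hypothetical payload: 2,000 small JSON records.
records = [
    {"id": i, "user": f"user-{i % 50}", "status": "approved"}
    for i in range(2000)
]
raw = json.dumps(records).encode("utf-8")


def deflate_size(data: bytes, level: int) -> int:
    """Return the compressed size in bytes at the given DEFLATE level (1-9)."""
    return len(zlib.compress(data, level))


size_l1 = deflate_size(raw, 1)  # fastest level, usually the largest output
size_l6 = deflate_size(raw, 6)  # zlib's usual default trade-off

print(f"raw={len(raw)} B, level 1={size_l1} B, level 6={size_l6} B")
```

Running the same comparison on a sample of real messages shows how much ratio a lower level gives up in exchange for the throughput gain quoted above.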
[Diagram, shown twice: producers and consumers in AZ A, AZ B, ... AZ N of one AWS region, connected to Kafka brokers 1 to N behind a load balancer]
KIP-392: Fetch from
closest replica
Leverage locality in order to
reduce expensive cross-AZ /
cross-DC traffic costs.
KIP-392: Fetch from
closest replica
Rack awareness needs to be
set on the brokers. The consumer
must support it as well (Kafka 2.4 or later).
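A minimal sketch of the settings involved, written as Python helpers that render properties lines. The property names (broker.rack, replica.selector.class, client.rack) and the RackAwareReplicaSelector class come from KIP-392; the helper functions and the AZ value "use1-az1" are illustrative assumptions, not part of the talk.

```python
RACK_AWARE_SELECTOR = "org.apache.kafka.common.replica.RackAwareReplicaSelector"


def broker_overrides(az: str) -> dict:
    """Broker-side settings: tag the broker with its AZ and enable rack-aware fetching."""
    return {
        "broker.rack": az,
        "replica.selector.class": RACK_AWARE_SELECTOR,
    }


def consumer_overrides(az: str) -> dict:
    """Consumer-side setting: tell the broker which AZ the client runs in."""
    return {"client.rack": az}


def to_properties(config: dict) -> str:
    """Render a config dict as lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


# Example AZ identifier (hypothetical); broker and consumer values must match.
print(to_properties(broker_overrides("use1-az1")))
print(to_properties(consumer_overrides("use1-az1")))
```

The consumer only fetches from a nearby replica when its client.rack matches the broker.rack of a broker that hosts an in-sync replica; otherwise it falls back to the leader.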
Cluster imbalance
An imbalanced cluster can hurt
performance and lead to broker
skew, resource saturation and,
as a consequence,
unnecessary capacity
additions.
Cluster imbalance
Overloaded brokers (more
partitions, more segments) will
suffer from higher MTTR.
Source: plutora.com
Cluster imbalance
Long startup time
exposes you to a higher
risk of data loss in case
of cascading failures.
[Diagram: the cluster layout from the earlier slides, with two warning marks]
Cluster imbalance
Rebalancing is a relatively easy
task with tools such as
Cruise Control, CMAK,
Kafka-Kit, and more.
Cruise-control is the first of its kind to fully automate
the dynamic workload rebalance and self-healing of a
Kafka cluster. It provides great value to Kafka users by
simplifying the operation of Kafka clusters.
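Before reaching for a rebalancing tool, it can help to put a number on the skew. The sketch below is one simple way to do that, assuming you already have a per-broker load figure (partition count, bytes in, disk usage); the broker names and counts are hypothetical.

```python
from statistics import mean


def imbalance_ratio(load_by_broker: dict) -> float:
    """Ratio of the busiest broker's load to the average load (1.0 means balanced)."""
    loads = list(load_by_broker.values())
    return max(loads) / mean(loads)


# Hypothetical partition counts per broker.
partitions = {"broker-1": 420, "broker-2": 130, "broker-3": 110}

ratio = imbalance_ratio(partitions)
print(f"busiest broker carries {ratio:.2f}x the average load")
```

A ratio well above 1.0 suggests the cluster may be saturating one broker while others sit idle, which is the situation where adding capacity is the wrong fix.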
Client Configurations
A deep dive into the client configuration knobs can dramatically affect the way
your cluster works
Client
Configurations
Kafka just works out of the
box. However, to unlock its full
potential, invest time in
learning it.
Client
Configurations
For example, changing your
clients' batch.size and
linger.ms configurations can
reduce your cluster's resource
utilization.
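A deliberately simplified model of why these two settings matter: fewer, larger produce requests mean less per-request work for the brokers. It assumes a single partition, a steady message rate, and that a batch is sent when it is full or when linger.ms expires, whichever comes first. The real producer also batches while earlier requests are in flight, so treat the output as an upper bound. The message rate and size are hypothetical.

```python
def estimated_requests_per_sec(msg_rate: int, msg_bytes: int,
                               batch_size: int, linger_ms: int) -> float:
    """Estimate produce requests per second under the simplified model above."""
    fits_in_batch = max(1, batch_size // msg_bytes)            # cap set by batch.size
    arrive_in_linger = max(1, int(msg_rate * linger_ms / 1000))  # cap set by linger.ms
    msgs_per_request = min(fits_in_batch, arrive_in_linger)
    return msg_rate / msgs_per_request


# Hypothetical workload: 10,000 messages/s of 200 bytes each.
# 16384 is Kafka's documented default for batch.size.
no_linger = estimated_requests_per_sec(10_000, 200, 16_384, linger_ms=0)
short_linger = estimated_requests_per_sec(10_000, 200, 16_384, linger_ms=5)

print(f"linger.ms=0: ~{no_linger:.0f} req/s, linger.ms=5: ~{short_linger:.0f} req/s")
```

In this model a 5 ms linger trades a few milliseconds of latency for a large drop in request count; larger batches also tend to compress better.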
Message Conversions
Message conversion
introduces a processing
overhead.
Upgrading clients can free up
resources that were used
unnecessarily for this task.
The Cluster Storage
Kafka has become the main
entry point for all of the data
in many organizations, allowing
clients to consume not only
recent events, but also
older data, depending on the
topic retention.
Upstream Clusters
A common pattern is to split
your cluster into smaller clusters
based on business domains.
Using this method you can set a
higher retention period on the
upstream cluster for data
recovery in case of a failure.
[Diagram labels: Upstream Cluster (Raw Events), Application A, Application B, Cluster Domain A, Cluster Domain B, DBZ Cluster, Logging Cluster, Downstream Cluster, Application C, S3, Amazon DynamoDB]
Capacity addition
Both cases require you to
add disk capacity to
support growth.
High retention, justified by
business needs, may lead
to needless resource
addition (memory and
CPUs) to support storage
capacity.
KIP-405:
Kafka Tiered Storage
Using Tiered Storage, Kafka
clusters are configured with
two tiers of storage: local
and remote.
The new remote tier uses
external object storage
layers such as AWS S3 or
HDFS.
Kafka Tiered Storage
01 Storage cost saving: ~$0.08 per GB-month on gp3 vs. ~$0.023 on S3
02 Accurate capacity planning: mostly based on computing power, not storage
03 Scaling storage: independently from memory and CPUs
04 Remove unnecessary capacity: remove capacity that was added to support long-term retention
05 Faster broker startup: loading local (and fewer) segments on broker startup
Tiered Storage is already available on Confluent Platform 6.0.0. At the time of this talk it was
expected in the Kafka 3.2 release; it later shipped as early access in Apache Kafka 3.6.
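A rough sketch of the storage arithmetic behind the cost-saving item above. The two per-GB prices are the ones quoted on the slide, read as monthly prices. The retained volume, the replication factor of 3, and the assumption that the remote tier holds a single copy are illustrative; S3 request and data transfer charges are ignored.

```python
GP3_PER_GB_MONTH = 0.08   # price quoted on the slide
S3_PER_GB_MONTH = 0.023   # price quoted on the slide


def monthly_storage_cost(local_gb: float, remote_gb: float,
                         replication_factor: int = 3) -> float:
    """Monthly cost when local data is replicated on EBS and remote data sits once in S3."""
    ebs_cost = local_gb * replication_factor * GP3_PER_GB_MONTH
    s3_cost = remote_gb * S3_PER_GB_MONTH
    return ebs_cost + s3_cost


# Hypothetical topic data: 10,000 GB retained before replication.
all_local = monthly_storage_cost(local_gb=10_000, remote_gb=0)
tiered = monthly_storage_cost(local_gb=1_000, remote_gb=9_000)

print(f"all on gp3: ${all_local:,.0f}/month, tiered: ${tiered:,.0f}/month")
```

Under these assumptions most of the saving comes from not paying for replicated block storage on cold data, which is why the local retention window is the main knob to tune.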
The
TL;DL
(Too Long ; Didn't Listen)
Cost Reduction Areas
Machine Network Storage
Instance Type
Compression
Fetch from Close Replica
Cluster Balance
Consumer Tune
Producer Tune
Compression Levels
Tiered Storage
https://leevs.dev
@eladleev
linkedin.com/in/elad-leev
Thank You
For Your Time!
