Let’s Make Your CFO Happy; A Practical Guide for Kafka Cost Reduction with Elad Leev | Kafka Summit London 2022

Let’s Make Your CFO Happy;
A Practical Guide for
Kafka Cost Reduction

The Elephant
In The Room
What are we paying for?
Running a self-hosted Kafka deployment
Where and how can we cut costs?
Tips, tricks, and KIPs
Develop an economic mindset
It’s part of our role

Elad Leev
Data Engineer at Riskified
@eladleev
#DistributedSystems #DataStreams
#Scalability
#Kafka #ConfluentCommunityCatalyst

Riskified by the Numbers
750+ Global team, nearly 50% in engineering & analytics
180+ Countries across the globe
$89B Online volume (GMV) reviewed in 2021
50+ Publicly held companies among our clients
98%+ Client retention* for the past 2 years
*Annual dollar retention
As of March 2022

According to
Gartner Forecasts
The worldwide end-user
spending on public cloud
services is forecast to grow
by 20% in 2022 to a total of
$397 billion.
Source: Gartner (April 2021)

Data Never Sleeps
According to Statista, the
total amount of data
consumed globally in 2021
was 79 Zettabytes.

More than 80% of all Fortune 100 companies
trust and use Kafka.

Kafka Deployment
[Diagram: a Kafka cluster in one AWS region, behind a load balancer, with brokers 1 to N spread across availability zones A to N. Dollar signs mark each billable component.]

Kafka Deployment
[Cluster diagram, cost component: EC2 machines]

Kafka Deployment
[Cluster diagram, cost components: EC2 machines, EKS]

Kafka Deployment
[Cluster diagram, cost component: EBS drives]

Kafka Deployment
[Cluster diagram, cost component: ZooKeepers]

Kafka Deployment
[Cluster diagram, cost component: load balancer (LB)]

Kafka Deployment
[Cluster diagram with every component so far marked as a cost]
Without consuming or producing
a single byte

Kafka Deployment
[Cluster diagram, cost component: traffic]

Kafka Deployment
[Cluster diagram with producers and consumers in each AZ, cost component: traffic to and from the brokers]

Kafka Deployment
[Cluster diagram with a management layer, cost components: metrics, logs]

Kafka Deployment
[Cluster diagram with management and tools layers, cost components: Kafka Connect, Schema Registry]

Kafka Deployment
[Cluster diagram with management and tools layers, cost components: Cruise Control, Kafka UI]

Kafka Deployment
[Cluster diagram with management (metrics, logs) and tools (Connect, Schema Registry, UI, Cruise Control) layers, every component marked as a cost. Further cost drivers:]
● Multi Region
● Persist to S3
● Mirror Maker
● Domain Separation


Cost Factors
Storage
Network
Machines

01 Instance type
02 Compression
03 Fetch from replica
04 Cluster balance
05 Tiered storage
06 Client configuration fine-tuning

Are you using the
right instance type?
A quick Google search will
suggest R5, D2, or C5 instances
combined with gp2/gp3 or
io2 EBS storage as the general
recommendation.

Are you using the
right instance type?
Everything is a trade-off.
Choose wisely based on your
needs: time to recover,
storage-to-dollar ratio, network
throughput, and EBS drawbacks.
Source: Liz Fong-Jones, Honeycomb.io

i3 and i3en for
High-performance
Kafka Deployments
Despite the operational
overhead of ephemeral drives.

i3en provides a great
storage-to-dollar ratio
Suitable for I/O intensive
deployments in which the
limiting factor is storage
capacity.
Source: ScyllaDB

Are you using the right instance types?
Is your cluster saturated?
Under what conditions?
What is the limiting factor?
Are EBS drives the right decision?
Have the cluster's needs changed over time?

Change compression
type
You can choose between
GZIP, LZ4, Snappy, and, since
KIP-110, Zstandard (zstd).
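As a rough sketch, the codec is selected on the producer through the `compression.type` setting. The snippet below only renders a properties fragment; the helper name `producer_properties` and the bootstrap address are placeholders assumed for illustration, not part of the talk.

```python
# Codecs accepted by the Kafka producer's `compression.type` setting.
VALID_CODECS = {"none", "gzip", "snappy", "lz4", "zstd"}


def producer_properties(codec: str, bootstrap: str = "localhost:9092") -> str:
    """Render a minimal producer .properties fragment (hypothetical helper)."""
    if codec not in VALID_CODECS:
        raise ValueError(f"unsupported compression.type: {codec!r}")
    settings = {
        "bootstrap.servers": bootstrap,  # placeholder address, replace with your own
        "compression.type": codec,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(producer_properties("zstd"))
```

Because compression happens per batch on the client, the saving shows up both on the wire and on broker disks, as long as the broker keeps the producer's codec.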

Zstandard
Compression algorithm
by Facebook (now Meta)
It aims for smaller and
faster data compression.

Real world examples
Using Zstd, Shopify was able
to get a 4.28x compression
ratio compared to Snappy’s
2.5x.
[Chart: average message size, in bytes]

“After switching to Zstandard, our bandwidth usage decreased by 3x! This saves us tens of thousands of dollars per month in data transfer costs just from the processing pipeline alone. With these great results we immediately began deploying Zstandard to other systems in our architecture for further cost savings.”
[Chart: bandwidth usage]

KIP-390: Support
Compression Level
Zstd Level 1 produces
32.7% more messages per
second than Zstd Level 3.
Gzip Level 1 produces
56.4% more than Gzip Level 6.
Benchmark data: 29,218 JSON files with an average size of 55.25 KB
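The speed versus size trade-off behind compression levels can be tried locally. The sketch below uses Python's standard `zlib` module (the DEFLATE algorithm that gzip is built on) as a stand-in, with a synthetic JSON payload that is an assumption of this example; it does not reproduce the benchmark above, and results depend heavily on your own data. In Kafka releases that include KIP-390, the level is, to the best of my knowledge, set through producer settings such as `compression.gzip.level` and `compression.zstd.level`; check the documentation for your client version.

```python
import json
import time
import zlib

# Synthetic JSON payload (made up for this sketch).
records = [{"id": i, "status": "approved", "amount": i * 1.5} for i in range(5000)]
payload = json.dumps(records).encode("utf-8")


def measure(level: int) -> tuple[int, float]:
    """Return (compressed size in bytes, elapsed seconds) for one DEFLATE level."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: the compressed bytes must restore the original payload.
    assert zlib.decompress(compressed) == payload
    return len(compressed), elapsed


for level in (1, 6):
    size, elapsed = measure(level)
    print(f"level={level} size={size} bytes time={elapsed * 1000:.2f} ms")
```

Lower levels usually trade a somewhat larger output for less CPU time per batch, which is why throughput can rise when the level is reduced.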

[Diagram: Kafka cluster behind a load balancer, spanning AZ A, AZ B, and AZ N in one AWS region]

KIP-392: Fetch from
closest replica
Leverage locality in order to
reduce expensive cross-AZ /
cross-DC traffic costs.

KIP-392: Fetch from
closest replica
Rack awareness needs to be
set. Support from the client is
needed as well.
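A minimal sketch of the settings involved, assuming brokers and clients on a version that supports KIP-392 (Apache Kafka 2.4 or later). The helper name `rack_aware_settings` and the zone ID are hypothetical; the setting names are the ones I understand Apache Kafka to use.

```python
def rack_aware_settings(az: str) -> dict[str, dict[str, str]]:
    """Build broker and consumer settings for one availability zone (hypothetical helper)."""
    return {
        "broker": {
            # Tells the cluster which "rack" (here: AZ) this broker lives in.
            "broker.rack": az,
            # Lets the broker point consumers at a replica in their own rack.
            "replica.selector.class":
                "org.apache.kafka.common.replica.RackAwareReplicaSelector",
        },
        "consumer": {
            # Must match the broker.rack value of the brokers in the same AZ.
            "client.rack": az,
        },
    }


settings = rack_aware_settings("use1-az1")  # made-up zone ID
for role, props in settings.items():
    for key, value in props.items():
        print(f"[{role}] {key}={value}")
```

Consumers without `client.rack` keep fetching from the partition leader, so the saving only appears once both sides are configured.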

Cluster imbalance
It can hurt cluster performance
and lead to broker skew,
resource saturation, and, as a
consequence, unnecessary
capacity additions.

Cluster imbalance
Overloaded brokers (more
partitions, more segments) will
suffer from higher MTTR.
Source: plutora.com

Cluster imbalance
Long startup time
exposes you to a higher
risk of data loss in case
of cascading failures.
[Diagram: Kafka cluster behind a load balancer across AZ A, AZ B, and AZ N, with two warning marks]

Cluster imbalance
Rebalancing is a relatively easy task
with tools such as
Cruise Control, CMAK,
Kafka-Kit, and more.
Cruise-control is the first of its kind to fully automate
the dynamic workload rebalance and self-healing of a
Kafka cluster. It provides great value to Kafka users by
simplifying the operation of Kafka clusters.
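To get a first feel for how skewed a cluster is, one simple indicator is the busiest broker's partition count divided by the average. This is only an illustrative sketch with made-up numbers; tools such as Cruise Control weigh several resources (disk, network, CPU), not just partition counts, and the function name `partition_skew` is an assumption of this example.

```python
from statistics import mean


def partition_skew(partitions_per_broker: dict[int, int]) -> float:
    """Return the max partition count divided by the mean; 1.0 means perfectly even."""
    counts = list(partitions_per_broker.values())
    return max(counts) / mean(counts)


# Broker ID -> number of partition replicas hosted (made-up numbers).
cluster = {1: 420, 2: 395, 3: 910}
print(f"skew={partition_skew(cluster):.2f}")  # prints skew=1.58
```

A value well above 1.0 suggests that one broker will saturate long before the others, which is the situation that tempts teams into adding capacity they do not need.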

Client Configurations
Client configuration knobs can dramatically affect the way your cluster works,
so take a deep dive into them

Client
Configurations
Kafka just works out of the
box. However, to unlock its full
potential, invest time in
learning.

Client
Configurations
For example, changing your
clients' batch.size and
linger.ms configurations can
reduce your cluster's resource
utilization.
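Why these two settings matter can be sketched with a deliberately simplified model: a batch is sent when it is full (`batch.size`) or when `linger.ms` has elapsed, whichever comes first. The function below is a back-of-the-envelope estimate for a single partition under that assumption; real producers also batch under back-pressure, and all numbers are illustrative.

```python
def produce_requests_per_second(
    msgs_per_sec: float, msg_bytes: int, batch_size: int, linger_ms: float
) -> float:
    """Estimate batches sent per second to one partition (simplified model)."""
    # Messages that fit into one batch by size (at least one).
    fit_by_size = max(1, batch_size // msg_bytes)
    # Messages that arrive while the producer waits for linger.ms (at least one).
    arrive_in_linger = max(1.0, msgs_per_sec * linger_ms / 1000.0)
    msgs_per_batch = min(fit_by_size, arrive_in_linger)
    return msgs_per_sec / msgs_per_batch


# 2,000 messages/s of 500 bytes each (made-up workload).
print(produce_requests_per_second(2000, 500, batch_size=16384, linger_ms=0))   # prints 2000.0
print(produce_requests_per_second(2000, 500, batch_size=65536, linger_ms=20))  # prints 50.0
```

Fewer, larger batches mean fewer requests for the brokers to handle and better compression ratios, at the price of a few milliseconds of added latency.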

Message Conversions
Message conversion
introduces a processing
overhead.
Upgrading clients can free up
resources that were used
unnecessarily for this task.

The Cluster Storage
Kafka has become the main
entry point for all data in
many organizations, allowing
clients to consume not only
recent events but also
older data, depending on the
topic retention.

Upstream Clusters
A common pattern is to split
your cluster into smaller clusters
based on business domains.
Using this method you can set a
higher retention period on the
upstream cluster for data
recovery in case of a failure.
[Diagram: an upstream cluster receiving raw events; domain clusters A and B; DBZ, logging, and downstream clusters; applications A, B, and C; S3 and Amazon DynamoDB]

Capacity addition
Both cases require you to
add disk capacity to
support growth.
High retention, even when
justified by business needs, may
force you to add resources you
do not need (memory and
CPUs) just to gain storage
capacity.

KIP-405:
Kafka Tiered Storage
Using Tiered Storage, Kafka
clusters are configured with
two tiers of storage: local
and remote.
The new remote tier uses
external object storage
layers such as AWS S3 or
HDFS.
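A minimal sketch of what a tiered topic configuration might look like, assuming the Apache Kafka setting names introduced by KIP-405 as I understand them (`remote.storage.enable`, `local.retention.ms`, and the broker-level `remote.log.storage.system.enable`). Confluent Platform uses its own setting names, and the helper `tiered_topic_config` is hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000


def tiered_topic_config(total_days: int, local_days: int) -> dict[str, str]:
    """Build topic-level settings: short local retention, long overall retention."""
    if not 0 < local_days <= total_days:
        raise ValueError("local retention must be positive and within total retention")
    return {
        # Requires remote.log.storage.system.enable=true on the brokers.
        "remote.storage.enable": "true",
        # How long data stays readable overall (local plus remote tier).
        "retention.ms": str(total_days * DAY_MS),
        # How long data stays on the brokers' local disks.
        "local.retention.ms": str(local_days * DAY_MS),
    }


for key, value in tiered_topic_config(total_days=30, local_days=1).items():
    print(f"{key}={value}")
```

The gap between the two retention values is the data that moves off broker disks and into the remote tier.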

Kafka Tiered Storage
01 Storage cost saving
~$0.08 per GB on gp3 vs.
~$0.023 on S3
02 Accurate capacity planning
Mostly based on computing
power, and not storage
03 Scaling storage
Independently from
memory and CPUs
04 Remove unnecessary capacity
Remove capacity that has been
added to support long-term
retention
05 Faster broker startup
Loading local (and fewer)
segments on broker startup
Tiered Storage is already available on Confluent Platform 6.0.0, and will be added
to the Kafka 3.2 release.
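The storage saving can be estimated with simple arithmetic using the per-GB prices quoted above. This sketch assumes monthly prices, a replication factor of 3 on local disks, and a single copy in the remote tier; it ignores request and transfer charges, so treat it as a rough estimate rather than a bill.

```python
GP3_PER_GB_MONTH = 0.08   # price quoted in the slide
S3_PER_GB_MONTH = 0.023   # price quoted in the slide


def monthly_storage_cost(
    total_gb: float, local_gb: float, replication_factor: int = 3
) -> tuple[float, float]:
    """Return (all-on-gp3 cost, tiered cost) in dollars per month."""
    # Without tiering, every byte is stored once per replica on EBS.
    all_local = total_gb * replication_factor * GP3_PER_GB_MONTH
    # With tiering, only the recent data is replicated on EBS; the rest sits in S3.
    remote_gb = total_gb - local_gb
    tiered = (
        local_gb * replication_factor * GP3_PER_GB_MONTH
        + remote_gb * S3_PER_GB_MONTH
    )
    return all_local, tiered


all_local, tiered = monthly_storage_cost(total_gb=10_000, local_gb=1_000)
print(f"gp3 only: ${all_local:,.0f}/month, tiered: ${tiered:,.0f}/month")
```

With these made-up volumes (10 TB retained, 1 TB kept local), the estimate drops from about $2,400 to about $447 per month.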

The
TL;DL
(Too Long; Didn't Listen)

Cost Reduction Areas
Machine Network Storage
Instance Type
Compression
Fetch from Close Replica
Cluster Balance
Consumer Tune
Producer Tune
Compression Levels
Tiered Storage

https://leevs.dev
@eladleev
linkedin.com/in/elad-leev
Thank You
For Your Time!
