Now, It's Kafka's Turn: Separating Compute and Storage - Part 1
If you’ve been in data engineering for a while, you’ve probably seen some major shifts, one of the biggest being the move from tightly coupled storage and compute (like with Hadoop) to a decoupled model. This change made systems more scalable, flexible, and cost-efficient.
In the world of real-time data, Apache Kafka has become the go-to tool. But there's often one big concern: cost. Kafka typically stores data on local disks attached to each broker, meaning storage and compute remain tightly linked, making it harder and more expensive to scale.
Recently, a proposal called KIP-1150 introduced a game-changing idea: diskless Kafka, where data can be stored directly in cloud object storage like AWS S3. Of course, it also means higher latency and a greater reliance on the cloud. It's especially useful for workloads that involve lots of logs and batch processing, but might not be ideal if you need ultra-low latency.
“While this post is made possible through a partnership with Aiven, it reflects my own attempt to understand and analyze KIP-1150. It is not a definitive answer or a comprehensive guide; I’m just sharing my thoughts and would love to hear your opinions too.”
I have co-authored this article with my friend Sai Vineel, a Senior Data Engineer.
I know a lot of my readers may not be familiar with Apache Kafka. So in this blog, I will walk through each core concept and then share my thoughts on Diskless Kafka, which was proposed by Aiven and is being actively discussed in the Kafka community.
We will learn about and discuss the following concepts in this blog:
Introduction to Kafka
Kafka Architecture and Key Components
How Kafka Works (Step-by-Step Example)
The Case for Diskless Kafka (Aiven’s Proposal)
Conclusion
Introduction to Apache Kafka
What Is Apache Kafka, Really?
Think of Apache Kafka as a supercharged pipeline for real-time data. It’s more than just a message queue — it’s the backbone behind modern data platforms, powering everything from real-time analytics to event-driven applications.
So what makes Kafka stand out?
1. Producers and Consumers Don’t Need to Know Each Other
Kafka uses a publish-subscribe model, which means the sender (producer) and receiver (consumer) are completely decoupled.
Producers write data to Kafka topics.
Consumers read from those topics at their own pace.
This design allows multiple services to process the same stream of data in parallel, without slowing each other down — perfect for microservices and scalable architectures.
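To make this concrete, here is a minimal producer sketch using the confluent-kafka Python client. The broker address and topic name are just placeholders; the point is that the producer only knows about the topic, never about who will read from it.

```python
# Minimal sketch using the confluent-kafka Python client.
# The broker address and topic name are illustrative; the producer
# never needs to know anything about its consumers.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"user_id": "u-123", "plan": "free"}
producer.produce(
    "user_signups",              # topic
    key="u-123",                 # optional message key
    value=json.dumps(event),     # payload
)
producer.flush()                 # block until delivery completes
```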
2. Data Isn’t Gone After It’s Read
Unlike traditional queues, Kafka stores messages for a set time (by default, 7 days — or longer if you want).
If a consumer goes down, it can catch up later.
New consumers can also "rewind" and replay past events.
This feature unlocks event reprocessing, system recovery, and building reliable, fault-tolerant pipelines, which is why Kafka is a favorite for large-scale data systems.
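Here is a rough sketch of what replaying looks like in practice: a consumer with a fresh group id and auto.offset.reset set to earliest starts from the oldest message still retained. Again, the broker address and topic name are illustrative.

```python
# Sketch of a brand-new consumer "rewinding" to the start of the retained log.
# A group with no committed offsets and auto.offset.reset=earliest begins
# reading from the oldest message still within the retention window.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-demo",          # a fresh group id, so no offsets exist yet
    "auto.offset.reset": "earliest",    # start from the beginning of retention
})
consumer.subscribe(["user_signups"])

try:
    while True:
        msg = consumer.poll(1.0)        # wait up to 1s for a record
        if msg is None:
            continue
        if msg.error():
            print("error:", msg.error())
            continue
        print(msg.topic(), msg.partition(), msg.offset(), msg.value())
finally:
    consumer.close()
```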
Kafka Architecture and Key Components
At its core, Kafka is a distributed, append-only log, a system that stores events in the order they happen and makes them available to many consumers. But what makes it powerful is how its core components fit together:
Broker
A broker is a Kafka server. A cluster is just a group of brokers working together. Each broker handles part of the data and replicates some from others to ensure fault tolerance.
Topic
A topic is like a category or feed name (e.g. user_signups, orders). Producers write to topics; consumers read from them. Think of it as a named stream of events.
Partition
Topics are split into partitions: ordered, append-only logs that enable parallelism and scalability. Each message in a partition has an offset (its position). Kafka guarantees order within a partition, and by using a message key, related messages (like those from the same user) can be sent to the same partition.
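The routing idea can be summed up in a couple of lines of Python. Note that Kafka's real default partitioner hashes the key bytes with murmur2; this simplified sketch only illustrates the "same key, same partition" behavior.

```python
# Simplified illustration of key-based partitioning (NOT Kafka's actual
# partitioner, which uses a murmur2 hash of the key bytes).
def pick_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions   # same key -> same partition -> order preserved

# All events for driver "d-42" land in the same partition, so they stay in order.
print(pick_partition("d-42", 6))
print(pick_partition("d-42", 6))  # identical result every time (within one process)
```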
Producer
A producer is any app or service that sends data to Kafka. It picks the topic (and optionally the partition) and Kafka handles the rest. Examples: web apps sending user actions, IoT devices streaming sensor data.
Consumer
Consumers read from topics. They can work independently or in consumer groups, where Kafka distributes partitions among them. This lets multiple consumers share the load and keeps processing balanced. If one consumer fails, Kafka automatically reassigns its partition(s).
Consumer Groups
If consumers are in the same group, each gets a unique slice of the topic. If they’re in different groups, they all get the full stream, letting different teams or systems reuse the same data independently.
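In code, the difference between "share the load" and "get the full stream" is literally just the group.id setting. A hedged sketch, with made-up group ids and topic name:

```python
# Sketch: consumers in the SAME group split the partitions between them;
# a consumer in a DIFFERENT group independently reads every message.
from confluent_kafka import Consumer

common = {"bootstrap.servers": "localhost:9092", "auto.offset.reset": "earliest"}

# Two consumers in the same group: Kafka divides the topic's partitions between them.
worker_1 = Consumer({**common, "group.id": "order-processing"})
worker_2 = Consumer({**common, "group.id": "order-processing"})

# A consumer in a different group: it gets the full stream on its own.
analytics = Consumer({**common, "group.id": "analytics"})

for c in (worker_1, worker_2, analytics):
    c.subscribe(["orders"])
```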
Replication & Leaders
Kafka keeps data safe by replicating partitions across brokers. One broker acts as the leader, handling reads/writes; others are followers. If a leader fails, a follower takes over with no data loss and no downtime.
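Replication is configured when the topic is created. Here is a small sketch using the confluent-kafka AdminClient; the topic name and sizing are illustrative, and replication_factor=3 gives each partition one leader and two followers on different brokers.

```python
# Sketch of creating a replicated topic with the confluent-kafka AdminClient.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=6, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()                  # raises if creation failed
    print(f"created topic {topic}")
```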
ZooKeeper & KRaft
Kafka used to rely on ZooKeeper for coordination. Now, it’s moving to KRaft (Kafka Raft), an internal system that simplifies cluster management.
All of this works together to make Kafka fast, scalable, and reliable. Data is written to disk efficiently, consumers process at their own pace, and the system can handle millions of messages per second, even if parts of the system fail.
How Kafka Works (Step-by-Step Example)
Let’s walk through how data flows through Kafka using a ridesharing app like Uber or Lyft.
Producers: Sending Driver Locations
Each driver’s app acts as a producer, sending frequent location updates (latitude, longitude, timestamp, driver ID) to a Kafka topic called "driver_locations".
Topics & Partitions: Organizing the Stream
The topic is split into partitions (e.g., by city region or driver ID hash). Kafka ensures that messages from the same driver go to the same partition, keeping their data in order. These partitions are stored across a Kafka cluster of brokers, with replication to ensure durability.
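Putting these first two steps together, the driver app might look roughly like this. Field names and the broker address are assumptions; the important part is that the message key is the driver ID, so each driver's updates stay in order.

```python
# Sketch of the driver app as a producer: each location update is keyed by
# driver_id so every update for a given driver lands in the same partition.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

update = {
    "driver_id": "d-42",
    "lat": 12.9716,
    "lon": 77.5946,
    "ts": int(time.time() * 1000),
}
producer.produce(
    "driver_locations",
    key=update["driver_id"],    # same driver -> same partition -> in-order updates
    value=json.dumps(update),
)
producer.flush()
```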
Kafka Brokers: Storing the Data
Messages are sent to a broker, written to disk (append-only log), and assigned an offset. Kafka then replicates the message to other brokers to maintain fault tolerance.
Consumers: Processing the Stream
Multiple consumers can read from "driver_locations":
A dispatch service might use it to match drivers and riders.
An analytics engine might process data for heatmaps or traffic trends.
Kafka tracks each consumer’s last-read offset, allowing them to resume where they left off, even after a failure, as long as the messages are still within Kafka’s retention window (e.g., 7 days).
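A hedged sketch of the dispatch service as a consumer: offsets are committed as it goes, so a restarted instance picks up where the previous one stopped. The topic, group id, and processing logic are placeholders.

```python
# Sketch of the dispatch service consuming driver locations as part of a group.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dispatch-service",
    "enable.auto.commit": True,       # periodically commit progress to Kafka
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["driver_locations"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        update = json.loads(msg.value())
        # ... match the driver to nearby ride requests here ...
        print(f"driver {update['driver_id']} at ({update['lat']}, {update['lon']})")
finally:
    consumer.close()
```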
Slow Consumers? No Problem
If a service processes data slowly, Kafka buffers messages. Producers keep writing, and the consumer catches up later. This decoupled model ensures no blocking or data loss (within the retention period). If lag grows too large, it’s a signal to scale the consumers.
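If you want to watch that lag yourself, one way is to compare a partition's latest offset (the high watermark) with the group's committed offset. A rough sketch with illustrative names:

```python
# Sketch of measuring consumer lag for a single partition:
# lag = latest offset in the partition - the group's committed offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dispatch-service",
})

tp = TopicPartition("driver_locations", 0)
low, high = consumer.get_watermark_offsets(tp)    # oldest and newest offsets
committed = consumer.committed([tp])[0].offset    # group's last committed offset

lag = high - committed if committed >= 0 else high - low
print(f"partition 0 lag: {lag} messages")
consumer.close()
```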
The Case for Diskless Kafka (Aiven’s Proposal)
Traditionally, Kafka brokers store topic data on local disks (SSD/HDD), ensuring durability and fast access. But as clusters grow, managing disks and replicating data across brokers becomes complex and costly.
To address this, Aiven proposed "Diskless Kafka", formalized in KIP-1150, where topic data is written directly to cloud object storage (e.g., Amazon S3) instead of broker disks. Brokers become mostly stateless and fetch data from the shared storage.
Why Go Diskless?
Simplified Scaling & Rebalancing: In traditional Kafka, adding or removing brokers requires moving large volumes of data between them, which is slow and resource-heavy. With diskless Kafka, data stays in cloud object storage (e.g., S3), so brokers can scale up/down instantly without rebalancing.
Lower Storage Costs: Kafka typically stores 3 copies of each message on expensive SSDs across zones, increasing cloud costs. Diskless mode writes a single durable copy to cloud storage, reducing storage and network costs by up to 80%.
Operational Simplicity: No more managing disks, monitoring usage, handling failures, or worrying about IOPS limits. Broker upgrades, replacements, and maintenance become seamless since data is not tied to the broker itself.
Faster Disaster Recovery & Global Flexibility: Data in multi-AZ or multi-region cloud storage remains safe even if brokers or zones go down. You can spin up brokers anywhere, making disaster recovery and geo-replication simple, with no need for MirrorMaker or cluster linking.
Efficient Backlog Handling: In disk-based Kafka, large backlogs stress disk I/O. Diskless Kafka stores older data in the cloud, so brokers only cache recent data, keeping performance consistent.
How It Works
Messages are batched and stored in cloud object storage.
A Batch Coordinator assigns offsets and tracks where batches are stored.
Consumers fetch metadata from brokers (as usual) and retrieve the actual data from the object store.
Kafka clients remain unchanged — no code changes are needed; a rough sketch of what this might look like follows below.
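Because KIP-1150 is still a proposal under discussion, there is no final configuration to show yet. The sketch below is purely illustrative: the diskless.enable topic config is a placeholder name, not an established setting, but it captures the spirit of the design, where a topic is marked as diskless at creation time and producers/consumers keep using the standard client API.

```python
# Hedged sketch only: KIP-1150 is a proposal, so the config knob shown here
# ("diskless.enable") is a placeholder, not an established Kafka setting.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics([
    NewTopic(
        "driver_locations_diskless",
        num_partitions=6,
        replication_factor=1,                # illustrative: durability would come from object storage
        config={"diskless.enable": "true"},  # placeholder name; the final config may differ
    )
])
for future in futures.values():
    future.result()                          # wait for topic creation

# The client side stays exactly the same as with a classic topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("driver_locations_diskless", value=b"hello, object storage")
producer.flush()
```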
Conclusion
Apache Kafka has transformed data engineering with its high-speed, distributed event log that decouples services and enables scalable, reliable streaming. While its architecture may seem complex, at its core Kafka is about writing sequential events and letting multiple consumers read at their own pace — a simple yet powerful idea.
Now, with innovations like diskless Kafka proposed by Aiven, the platform evolves further. By shifting storage to the cloud, it simplifies operations, cuts costs, and retains Kafka’s core guarantees: ordered, durable, scalable streams.
While there are proprietary offerings that already provide these capabilities, having this in open source is a huge plus for the community.
We will discuss this proposal in depth in an upcoming article and also compare the various solutions available in the market from Bufstream, WarpStream, and Confluent Freight. The goal of this article is to make you aware of recent developments in the Kafka ecosystem; in the next article, we will go deeper into Diskless Kafka and look at how other Kafka vendors are actually implementing it.
Till then, thank you for your precious time as a reader :) Keep learning, keep exploring.