Now, It's Kafka's Turn: Separating Compute and Storage - Part 1
If you’ve been in data engineering for a while, you’ve probably seen some major shifts, one of the biggest being the move from tightly coupled storage and compute (like with Hadoop) to a decoupled model. This change made systems more scalable, flexible, and cost-efficient.
In the world of real-time data, Apache Kafka has become the go-to tool. But there's often one big concern: cost. Kafka typically stores data on local disks attached to each broker, meaning storage and compute remain tightly linked, making it harder and more expensive to scale.
Recently, a proposal called KIP-1150 introduced a game-changing idea: diskless Kafka, where data can be stored directly in cloud object storage like AWS S3. Of course, it also means higher latency and a greater reliance on the cloud. It's especially useful for workloads that involve lots of logs and batch processing, but might not be ideal if you need ultra-low latency.
“While this post is made possible through a partnership with Aiven, it reflects my own attempt to understand and analyze KIP-1150. It is not a definitive answer or a comprehensive guide; I’m just sharing my thoughts and would love to hear your opinions too.”
I have co-authored this article with my friend Sai Vineel, a Senior Data Engineer.
I know a lot of my readers may not be familiar with Apache Kafka. So in this blog, I will walk through each core concept and then share my thoughts on Diskless Kafka, which was proposed by Aiven and is being actively discussed in the Kafka community.
We will learn about and discuss the following concepts in this blog:
Introduction to Kafka
Kafka Architecture and Key Components
How Kafka Works (Step-by-Step Example)
The Case for Diskless Kafka (Aiven’s Proposal)
Conclusion
Introduction to Apache Kafka
What Is Apache Kafka, Really?
Think of Apache Kafka as a supercharged pipeline for real-time data. It’s more than just a message queue — it’s the backbone behind modern data platforms, powering everything from real-time analytics to event-driven applications.
So what makes Kafka stand out?
1. Producers and Consumers Don’t Need to Know Each Other
Kafka uses a publish-subscribe model, which means the sender (producer) and receiver (consumer) are completely decoupled.
Producers write data to Kafka topics.
Consumers read from those topics at their own pace.
This design allows multiple services to process the same stream of data in parallel, without slowing each other down — perfect for microservices and scalable architectures.
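To make this concrete, here is a minimal producer sketch using the confluent-kafka Python client. The broker address and topic name are just placeholders; the point is that the producer only knows about the topic, never about who will read from it.

```python
# Minimal sketch using the confluent-kafka Python client.
# The broker address and topic name are illustrative; the producer
# never needs to know anything about its consumers.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"user_id": "u-123", "plan": "free"}
producer.produce(
    "user_signups",              # topic
    key="u-123",                 # optional message key
    value=json.dumps(event),     # payload
)
producer.flush()                 # block until delivery completes
```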
2. Data Isn’t Gone After It’s Read
Unlike traditional queues, Kafka stores messages for a set time (by default, 7 days — or longer if you want).
If a consumer goes down, it can catch up later.
New consumers can also "rewind" and replay past events.
This feature unlocks event reprocessing, system recovery, and building reliable, fault-tolerant pipelines, which is why Kafka is a favorite for large-scale data systems.
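Here is a rough sketch of what replaying looks like in practice: a consumer with a fresh group id and auto.offset.reset set to earliest starts from the oldest message still retained. Again, the broker address and topic name are illustrative.

```python
# Sketch of a brand-new consumer "rewinding" to the start of the retained log.
# A group with no committed offsets and auto.offset.reset=earliest begins
# reading from the oldest message still within the retention window.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-demo",          # a fresh group id, so no offsets exist yet
    "auto.offset.reset": "earliest",    # start from the beginning of retention
})
consumer.subscribe(["user_signups"])

try:
    while True:
        msg = consumer.poll(1.0)        # wait up to 1s for a record
        if msg is None:
            continue
        if msg.error():
            print("error:", msg.error())
            continue
        print(msg.topic(), msg.partition(), msg.offset(), msg.value())
finally:
    consumer.close()
```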
Kafka Architecture and Key Components
At its core, Kafka is a distributed, append-only log, a system that stores events in the order they happen and makes them available to many consumers. But what makes it powerful is how its core components fit together:
Broker
A broker is a Kafka server. A cluster is just a group of brokers working together. Each broker handles part of the data and replicates some from others to ensure fault tolerance.
Topic
A topic is like a category or feed name (e.g. user_signups, orders). Producers write to topics; consumers read from them. Think of it as a named stream of events.
Partition
Topics are split into partitions: ordered, append-only logs that enable parallelism and scalability. Each message in a partition has an offset (its position). Kafka guarantees order within a partition, and by using a message key, related messages (like those from the same user) can be sent to the same partition.
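The routing idea can be summed up in a couple of lines of Python. Note that Kafka's real default partitioner hashes the key bytes with murmur2; this simplified sketch only illustrates the "same key, same partition" behavior.

```python
# Simplified illustration of key-based partitioning (NOT Kafka's actual
# partitioner, which uses a murmur2 hash of the key bytes).
def pick_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions   # same key -> same partition -> order preserved

# All events for driver "d-42" land in the same partition, so they stay in order.
print(pick_partition("d-42", 6))
print(pick_partition("d-42", 6))  # identical result every time (within one process)
```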
Producer
A producer is any app or service that sends data to Kafka. It picks the topic (and optionally the partition) and Kafka handles the rest. Examples: web apps sending user actions, IoT devices streaming sensor data.
Consumer
Consumers read from topics. They can work independently or in consumer groups, where Kafka distributes partitions among them. This lets multiple consumers share the load and keeps processing balanced. If one consumer fails, Kafka automatically reassigns its partition(s).
Consumer Groups
If consumers are in the same group, each gets a unique slice of the topic. If they’re in different groups, they all get the full stream, letting different teams or systems reuse the same data independently.
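In code, the difference between "share the load" and "get the full stream" is literally just the group.id setting. A hedged sketch, with made-up group ids and topic name:

```python
# Sketch: consumers in the SAME group split the partitions between them;
# a consumer in a DIFFERENT group independently reads every message.
from confluent_kafka import Consumer

common = {"bootstrap.servers": "localhost:9092", "auto.offset.reset": "earliest"}

# Two consumers in the same group: Kafka divides the topic's partitions between them.
worker_1 = Consumer({**common, "group.id": "order-processing"})
worker_2 = Consumer({**common, "group.id": "order-processing"})

# A consumer in a different group: it gets the full stream on its own.
analytics = Consumer({**common, "group.id": "analytics"})

for c in (worker_1, worker_2, analytics):
    c.subscribe(["orders"])
```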
Replication & Leaders
Kafka keeps data safe by replicating partitions across brokers. One broker acts as the leader, handling reads/writes; others are followers. If a leader fails, a follower takes over with no data loss and no downtime.
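Replication is configured when the topic is created. Here is a small sketch using the confluent-kafka AdminClient; the topic name and sizing are illustrative, and replication_factor=3 gives each partition one leader and two followers on different brokers.

```python
# Sketch of creating a replicated topic with the confluent-kafka AdminClient.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=6, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()                  # raises if creation failed
    print(f"created topic {topic}")
```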
ZooKeeper & KRaft
Kafka used to rely on ZooKeeper for coordination. Now, it’s moving to KRaft (Kafka Raft), an internal system that simplifies cluster management.
All of this works together to make Kafka fast, scalable, and reliable. Data is written to disk efficiently, consumers process at their own pace, and the system can handle millions of messages per second, even if parts of the system fail.
How Kafka Works (Step-by-Step Example)
Let’s walk through how data flows through Kafka using a ridesharing app like Uber or Lyft.
Producers: Sending Driver Locations
Each driver’s app acts as a producer, sending frequent location updates (latitude, longitude, timestamp, driver ID) to a Kafka topic called "driver_locations".
Topics & Partitions: Organizing the Stream
The topic is split into partitions (e.g., by city region or driver ID hash). Kafka ensures that messages from the same driver go to the same partition, keeping their data in order. These partitions are stored across a Kafka cluster of brokers, with replication to ensure durability.
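Putting these first two steps together, the driver app might look roughly like this. Field names and the broker address are assumptions; the important part is that the message key is the driver ID, so each driver's updates stay in order.

```python
# Sketch of the driver app as a producer: each location update is keyed by
# driver_id so every update for a given driver lands in the same partition.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

update = {
    "driver_id": "d-42",
    "lat": 12.9716,
    "lon": 77.5946,
    "ts": int(time.time() * 1000),
}
producer.produce(
    "driver_locations",
    key=update["driver_id"],    # same driver -> same partition -> in-order updates
    value=json.dumps(update),
)
producer.flush()
```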
Kafka Brokers: Storing the Data
Messages are sent to a broker, written to disk (append-only log), and assigned an offset. Kafka then replicates the message to other brokers to maintain fault tolerance.
Consumers: Processing the Stream
Multiple consumers can read from "driver_locations":
A dispatch service might use it to match drivers and riders.
An analytics engine might process data for heatmaps or traffic trends.
Kafka tracks each consumer’s last-read offset, allowing them to resume where they left off, even after a failure, as long as the messages are still within Kafka’s retention window (e.g., 7 days).
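A hedged sketch of the dispatch service as a consumer: offsets are committed as it goes, so a restarted instance picks up where the previous one stopped. The topic, group id, and processing logic are placeholders.

```python
# Sketch of the dispatch service consuming driver locations as part of a group.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dispatch-service",
    "enable.auto.commit": True,       # periodically commit progress to Kafka
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["driver_locations"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        update = json.loads(msg.value())
        # ... match the driver to nearby ride requests here ...
        print(f"driver {update['driver_id']} at ({update['lat']}, {update['lon']})")
finally:
    consumer.close()
```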
Slow Consumers? No Problem
If a service processes data slowly, Kafka buffers messages. Producers keep writing, and the consumer catches up later. This decoupled model ensures no blocking or data loss (within the retention period). If lag grows too large, it’s a signal to scale the consumers.
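If you want to watch that lag yourself, one way is to compare a partition's latest offset (the high watermark) with the group's committed offset. A rough sketch with illustrative names:

```python
# Sketch of measuring consumer lag for a single partition:
# lag = latest offset in the partition - the group's committed offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dispatch-service",
})

tp = TopicPartition("driver_locations", 0)
low, high = consumer.get_watermark_offsets(tp)    # oldest and newest offsets
committed = consumer.committed([tp])[0].offset    # group's last committed offset

lag = high - committed if committed >= 0 else high - low
print(f"partition 0 lag: {lag} messages")
consumer.close()
```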
The Case for Diskless Kafka (Aiven’s Proposal)
Traditionally, Kafka brokers store topic data on local disks (SSD/HDD), ensuring durability and fast access. But as clusters grow, managing disks and replicating data across brokers becomes complex and costly.
To address this, Aiven proposed "Diskless Kafka", formalized in KIP-1150, where topic data is written directly to cloud object storage (e.g., Amazon S3) instead of broker disks. Brokers become mostly stateless and fetch data from the shared storage.
Why Go Diskless?
Simplified Scaling & Rebalancing: In traditional Kafka, adding or removing brokers requires moving large volumes of data between them, which is slow and resource-heavy. With diskless Kafka, data stays in cloud object storage (e.g., S3), so brokers can scale up/down instantly without rebalancing.
Lower Storage Costs: Kafka typically stores 3 copies of each message on expensive SSDs across zones, increasing cloud costs. Diskless mode writes a single durable copy to cloud storage, reducing storage and network costs by up to 80%.
Operational Simplicity: No more managing disks, monitoring usage, handling failures, or worrying about IOPS limits. Broker upgrades, replacements, and maintenance become seamless since data is not tied to the broker itself.
Faster Disaster Recovery & Global Flexibility: Data in multi-AZ or multi-region cloud storage remains safe even if brokers or zones go down. You can spin up brokers anywhere, making disaster recovery and geo-replication simple, with no need for MirrorMaker or cluster linking.
Efficient Backlog Handling: In disk-based Kafka, large backlogs stress disk I/O. Diskless Kafka stores older data in the cloud, so brokers only cache recent data, keeping performance consistent.
How It Works
Messages are batched and stored in cloud object storage.
A Batch Coordinator assigns offsets and tracks where batches are stored.
Consumers fetch metadata from brokers (as usual) and retrieve the actual data from the object store.
Kafka clients remain unchanged — no code changes are needed; a rough sketch of what this might look like follows below.
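Because KIP-1150 is still a proposal under discussion, there is no final configuration to show yet. The sketch below is purely illustrative: the diskless.enable topic config is a placeholder name, not an established setting, but it captures the spirit of the design, where a topic is marked as diskless at creation time and producers/consumers keep using the standard client API.

```python
# Hedged sketch only: KIP-1150 is a proposal, so the config knob shown here
# ("diskless.enable") is a placeholder, not an established Kafka setting.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
futures = admin.create_topics([
    NewTopic(
        "driver_locations_diskless",
        num_partitions=6,
        replication_factor=1,                # illustrative: durability would come from object storage
        config={"diskless.enable": "true"},  # placeholder name; the final config may differ
    )
])
for future in futures.values():
    future.result()                          # wait for topic creation

# The client side stays exactly the same as with a classic topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("driver_locations_diskless", value=b"hello, object storage")
producer.flush()
```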
Conclusion
Apache Kafka has transformed data engineering with its high-speed, distributed event log that decouples services and enables scalable, reliable streaming. While its architecture may seem complex, at its core Kafka is about writing sequential events and letting multiple consumers read at their own pace — a simple yet powerful idea.
Now, with innovations like diskless Kafka proposed by Aiven, the platform evolves further. By shifting storage to the cloud, it simplifies operations, cuts costs, and retains Kafka’s core guarantees: ordered, durable, scalable streams.
While there are proprietary offerings that already provide these capabilities, having this in open source is a huge plus for the community.
We will discuss this proposal in depth in an upcoming article and also compare the various solutions available in the market from Bufstream, WarpStream, and Confluent Freight. The goal of this article is to make you aware of recent developments in the Kafka ecosystem; in the next article, we will go deeper into Diskless Kafka and look at how other Kafka vendors are actually implementing it.
Till then, thank you for your precious time as a reader :) Keep learning, keep exploring.