Kafka for Data Engineers

Kafka is a prominent queuing system and one of the most widely used technologies in streaming solutions. In most of my streaming use cases at Walmart, I use Kafka either to consume records from it or to produce records to it.

But what makes Kafka so widely used, and how is it different from a traditional queueing system?

In my working experience, Kafka has two big advantages:

  1. Its ability to remove any coupling between the producer and consumer of events.
  2. Its ability to retain the records.

Using Kafka is a good way to eliminate impedance mismatches between producers and consumers. Each one can proceed at its own pace without impacting other parts of the ecosystem, which makes Kafka robust enterprise infrastructure software. Interactions via Kafka are not point-to-point, allowing true decoupling of systems. Kafka follows the publisher-subscriber model instead of a simple producer-consumer model.

Here the producer can keep producing data at its own pace, and different consumers can keep consuming by subscribing to the topic.

We can retain the data in the Kafka cluster for a certain period of time.

Data or records in the Kafka cluster will not be lost even after some consumers have consumed those records.

We can always add new consumers or remove old consumers.

So this ability to retain the data and to decouple the producer from the consumer was a major breakthrough in the big data industry.
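To make the decoupling concrete, here is a minimal sketch using the kafka-python client, assuming a local broker at localhost:9092 and a topic named orders (both are placeholders). Notice that the consumer only knows the topic name, never the producer, and a brand-new consumer can still replay everything the topic has retained.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer side: writes at its own pace, unaware of any consumer.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "amount": 250})
producer.flush()  # block until the broker has acknowledged the record

# Consumer side: subscribes to the topic, unaware of any producer.
# auto_offset_reset="earliest" lets a brand-new consumer replay
# everything the topic has retained.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```

Because the records stay in the topic until retention expires, we can add a second consumer group tomorrow and it will read the same data from the beginning.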


But Ankur, before understanding the advantages, shouldn't we first understand what Kafka is? Am I right😅?


Yup, you are right. Let me try to break it down for you.


According to Confluent (Kafka's biggest service provider):

Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. Originally created to handle real-time data feeds at LinkedIn in 2011, Kafka quickly evolved from a messaging queue to a full-fledged event streaming platform capable of handling over 1 million messages per second, or trillions of messages per day.


Ohh again Ankur, too many technical terms😁. How does a Data Engineer use Kafka? Will you please explain it in plainer terms?

Ohh ok. I suppose you have read my previous articles on Spark Streaming, where I was using a console or terminal to produce the streaming data for demo purposes. If you have not read them, I highly recommend doing so.

Please find the links below.

  1. Spark Streaming Article Link
  2. YouTube Video Link

So in our previous example, we were trying to read data from a socket using a terminal for learning purposes. That approach has two problems:

  • The socket is not a reliable source.
  • We can't buffer the data, and hence we can't replay or reproduce it.

Buffering capability here means that if a producer is producing data faster than a consumer can consume it, something in the middle should be able to hold, or buffer, the data.

Kafka, as a queuing system, provides exactly this buffering capability: it can retain the data for a certain period of time, with a default retention of (typically) 7 days.
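Retention is a per-topic setting. As a rough sketch (the topic name, partition count, and broker address are placeholders), a topic with an explicit 7-day retention could be created with kafka-python's admin client like this:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# retention.ms is in milliseconds: 7 days = 7 * 24 * 60 * 60 * 1000.
admin.create_topics([
    NewTopic(
        name="orders",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
```

Once retention.ms elapses, the broker deletes old log segments regardless of whether anyone has consumed them.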

Kafka helps decouple the producer and consumer, meaning the two can work independently. They do not need to be in sync at all times, but it is good practice to keep consumers caught up with the producer; otherwise the use case builds up consumer lag.
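Lag is just the gap between the newest offset written to a partition and the offset the consumer group has committed. A minimal sketch of measuring it for one partition, reusing the placeholder topic and group names from above:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    enable_auto_commit=False,
)
tp = TopicPartition("orders", 0)

latest = consumer.end_offsets([tp])[tp]    # newest offset in the partition
committed = consumer.committed(tp) or 0    # last offset this group committed
print(f"lag on partition 0: {latest - committed}")
```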

Kafka also has the capability of processing stream data itself, but we will not talk about that right now. For now, we are using Spark Streaming or Spark Structured Streaming for all our processing work.
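To connect this back to the earlier Spark Streaming article, here is a minimal Spark Structured Streaming sketch that swaps the unreliable socket source for Kafka. The topic name and broker address are placeholders, and the spark-sql-kafka connector package must be available to the Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

# Read from Kafka instead of a socket: buffered, replayable, fault tolerant.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key and value as binary columns, so cast before use.
query = (
    df.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```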



So, we have now understood that Kafka is a publisher-subscriber system. Let me draw a diagram for you.

[Diagram: multiple producers writing to a partitioned Kafka topic inside a Kafka cluster of several brokers, with multiple consumers subscribing to that topic]

You can see from the above diagram that:

  • We have multiple producers that can produce messages to the same Kafka topic, and that topic can have multiple partitions.
  • These messages are sent to the Kafka cluster.
  • A Kafka cluster consists of several machines, or brokers. It is a distributed system.
  • Multiple consumers can subscribe to the same topic in the Kafka cluster and read the messages from it.



We have read a lot about Kafka's definition and its architecture. Now let's look at some important Kafka terms that a Data Engineer should understand.

  1. Producers: The Kafka producer allows applications to send streams of data to the Kafka cluster. For example, a stream of data from Twitter or any other social media platform can be used to produce streams of messages to the Kafka cluster. We can also use Spark Streaming to produce messages to certain Kafka topics.
  2. Consumer: The Kafka consumer allows applications to read streams of data from the cluster. We also use Spark Streaming or Spark Structured Streaming to consume records from Kafka topics.
  3. Brokers: A broker is a Kafka server running as part of a Kafka cluster; multiple brokers form a cluster. "The brokers" is sometimes used loosely to refer to the cluster, or Kafka, as a whole.
  4. Kafka Clusters: A Kafka cluster is made up of multiple Kafka brokers.
  5. Topic: In layman's terms, a Kafka topic is a unique name that holds a particular kind of data, much like a table in a database. We can have multiple topics in a single Kafka cluster.
  6. Partitions: A topic can be divided into multiple partitions, and each partition can be stored on a different machine. This makes the system distributed and scalable. Whenever we create a topic in the cluster, we pass the number of partitions. That number can be increased later, but never decreased, so it is still worth treating it as a design-time decision.
  7. Partition Offset: Inside each partition, the messages are stored in sequence, and each message gets a sequential ID. The offset is simply that sequence ID of a message within its partition.
  8. Consumer Group: A consumer group is a set of consumers that cooperate to consume data from one or more topics. The partitions of those topics are divided among the consumers in the group, as the sketch after this list shows.
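Here is a minimal sketch tying the last three terms together, reusing the placeholder topic, broker, and group names from above. It first raises the topic's partition count (increase only), then reads one batch of records and prints each record's partition and offset:

```python
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewPartitions

# Partition counts can only grow: this assumes the topic currently
# has fewer than 6 partitions, otherwise the broker returns an error.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_partitions({"orders": NewPartitions(total_count=6)})

# A consumer in a named group. Run this script twice with the same
# group_id and Kafka divides the topic's partitions between the two
# running consumers automatically.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="analytics-team",
    auto_offset_reset="earliest",
)

# Each record carries the partition it lives in and its offset, the
# sequence ID of the message within that partition.
for _, records in consumer.poll(timeout_ms=5000).items():
    for record in records:
        print(f"partition={record.partition} offset={record.offset}")
```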


I hope you are now clear about the basic fundamentals of Kafka. In the next section, we will try to cover the following things:

  1. How does Kafka guarantee fault tolerance and scalability?
  2. Storage architecture of Apache Kafka.

I have already written an article on managing disasters for Kafka. Please go through this link if you are planning to implement Disaster Recovery for your Kafka-based system and push its uptime above 99%.

Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.

More so, thank you for the most precious gift you can give me as a writer, i.e. your time.





