Kafka for Data Engineers

Kafka is a prominent queuing system and one of the most widely used technologies in streaming solutions. In most of my streaming use cases at Walmart, I use Kafka either to consume records from it or to produce records to it.

But what makes Kafka so widely used, and how is it different from a traditional queueing system?

In my working experience, Kafka has two big advantages:

  1. Its ability to remove any coupling between the producer and consumer of events.
  2. Its ability to retain the records.

Using Kafka is a good way to eliminate impedance mismatches between producers and consumers. Each one can proceed at its own pace without impacting other parts of the ecosystem, which makes Kafka robust enterprise infrastructure software. Interactions via Kafka are not point-to-point, allowing true decoupling of systems. Kafka follows the publisher-subscriber model instead of a simple producer-consumer model.

Here the producer can keep producing data at its own pace, and different consumers can keep consuming by subscribing to the topic.

We can retain the data in the Kafka cluster for a certain period of time.

Data or records in the Kafka cluster will not be lost even after some consumers have consumed those records.

We can always add new consumers or remove old consumers.

So this ability to retain the data and to decouple the producer from the consumer was a major breakthrough in the big data industry.
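To make the decoupling concrete, here is a minimal sketch using the kafka-python client, assuming a local broker at localhost:9092 and a topic named orders (both are placeholders). Notice that the consumer only knows the topic name, never the producer, and a brand-new consumer can still replay everything the topic has retained.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer side: writes at its own pace, unaware of any consumer.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "amount": 250})
producer.flush()  # block until the broker has acknowledged the record

# Consumer side: subscribes to the topic, unaware of any producer.
# auto_offset_reset="earliest" lets a brand-new consumer replay
# everything the topic has retained.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```

Because the records stay in the topic until retention expires, we can add a second consumer group tomorrow and it will read the same data from the beginning.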


But Ankur, before understanding the advantages, shouldn't we first understand what Kafka is? Am I right😅?


Yup, you are right. Let me try to break it down for you.


According to Confluent (Kafka's biggest service provider):

Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. Originally created to handle real-time data feeds at LinkedIn in 2011, Kafka quickly evolved from a messaging queue to a full-fledged event streaming platform capable of handling over 1 million messages per second, or trillions of messages per day.


Ohh again Ankur, too many technical terms😁. How does a Data Engineer use Kafka? Will you please explain it in plainer terms?

Ohh ok. I suppose you have read my previous articles on Spark Streaming, where I was using a console or terminal to produce the streaming data for demo purposes. If you have not read them, I highly recommend doing so.

Please find the links below.

  1. Spark Streaming Article Link
  2. YouTube Video Link

So in our previous example, we were trying to read data from a socket using a terminal for learning purposes. That approach has two problems:

  • The socket is not a reliable source.
  • We can't buffer the data, and hence we can't replay or reproduce it.

Buffering capability here means that if a producer is producing data faster than a consumer can consume it, something in the middle should be able to hold, or buffer, the data.

Kafka, as a queuing system, provides exactly this buffering capability: it can retain the data for a certain period of time, with a default retention of (typically) 7 days.
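Retention is a per-topic setting. As a rough sketch (the topic name, partition count, and broker address are placeholders), a topic with an explicit 7-day retention could be created with kafka-python's admin client like this:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# retention.ms is in milliseconds: 7 days = 7 * 24 * 60 * 60 * 1000.
admin.create_topics([
    NewTopic(
        name="orders",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
```

Once retention.ms elapses, the broker deletes old log segments regardless of whether anyone has consumed them.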

Kafka helps decouple the producer and consumer, meaning the two can work independently. They do not need to be in sync at all times, but it is good practice to keep consumers caught up with the producer; otherwise the use case builds up consumer lag.
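Lag is just the gap between the newest offset written to a partition and the offset the consumer group has committed. A minimal sketch of measuring it for one partition, reusing the placeholder topic and group names from above:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    enable_auto_commit=False,
)
tp = TopicPartition("orders", 0)

latest = consumer.end_offsets([tp])[tp]    # newest offset in the partition
committed = consumer.committed(tp) or 0    # last offset this group committed
print(f"lag on partition 0: {latest - committed}")
```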

Kafka also has the capability of processing stream data itself, but we will not talk about that right now. For now, we are using Spark Streaming or Spark Structured Streaming for all our processing work.
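To connect this back to the earlier Spark Streaming article, here is a minimal Spark Structured Streaming sketch that swaps the unreliable socket source for Kafka. The topic name and broker address are placeholders, and the spark-sql-kafka connector package must be available to the Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

# Read from Kafka instead of a socket: buffered, replayable, fault tolerant.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key and value as binary columns, so cast before use.
query = (
    df.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```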



So, we have now understood that Kafka is a publisher-subscriber system. Let me draw a diagram for you.

[Diagram: multiple producers writing to a partitioned Kafka topic inside a Kafka cluster of several brokers, with multiple consumers subscribing to that topic]

You can see from the above diagram that:

  • We have multiple producers that can produce messages to the same Kafka topic, and that topic can have multiple partitions.
  • These messages are sent to the Kafka cluster.
  • A Kafka cluster consists of several machines, or brokers. It is a distributed system.
  • Multiple consumers can subscribe to the same topic in the Kafka cluster and read the messages from it.



We have read a lot about Kafka's definition and its architecture. Now let's look at some important Kafka terms that a Data Engineer should understand.

  1. Producers: The Kafka producer allows applications to send streams of data to the Kafka cluster. For example, a stream of data from Twitter or any other social media platform can be used to produce streams of messages to the Kafka cluster. We can also use Spark Streaming to produce messages to certain Kafka topics.
  2. Consumer: The Kafka consumer allows applications to read streams of data from the cluster. We also use Spark Streaming or Spark Structured Streaming to consume records from Kafka topics.
  3. Brokers: A broker is a Kafka server running as part of a Kafka cluster; multiple brokers form a cluster. "The brokers" is sometimes used loosely to refer to the cluster, or Kafka, as a whole.
  4. Kafka Clusters: A Kafka cluster is made up of multiple Kafka brokers.
  5. Topic: In layman's terms, a Kafka topic is a unique name that holds a particular kind of data, much like a table in a database. We can have multiple topics in a single Kafka cluster.
  6. Partitions: A topic can be divided into multiple partitions, and each partition can be stored on a different machine. This makes the system distributed and scalable. Whenever we create a topic in the cluster, we pass the number of partitions. That number can be increased later, but never decreased, so it is still worth treating it as a design-time decision.
  7. Partition Offset: Inside each partition, the messages are stored in sequence, and each message gets a sequential ID. The offset is simply that sequence ID of a message within its partition.
  8. Consumer Group: A consumer group is a set of consumers that cooperate to consume data from one or more topics. The partitions of those topics are divided among the consumers in the group, as the sketch after this list shows.
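Here is a minimal sketch tying the last three terms together, reusing the placeholder topic, broker, and group names from above. It first raises the topic's partition count (increase only), then reads one batch of records and prints each record's partition and offset:

```python
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewPartitions

# Partition counts can only grow: this assumes the topic currently
# has fewer than 6 partitions, otherwise the broker returns an error.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_partitions({"orders": NewPartitions(total_count=6)})

# A consumer in a named group. Run this script twice with the same
# group_id and Kafka divides the topic's partitions between the two
# running consumers automatically.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="analytics-team",
    auto_offset_reset="earliest",
)

# Each record carries the partition it lives in and its offset, the
# sequence ID of the message within that partition.
for _, records in consumer.poll(timeout_ms=5000).items():
    for record in records:
        print(f"partition={record.partition} offset={record.offset}")
```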


I hope you are now clear about the basic fundamentals of Kafka. In the next section, we will try to cover the following things:

  1. How does Kafka guarantee fault tolerance and scalability?
  2. Storage architecture of Apache Kafka.

I have already written an article on managing disasters for Kafka. Please go through this link if you are planning to implement Disaster Recovery for your Kafka-based system and push its uptime above 99%.

Feel free to subscribe to my YouTube channel, The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.

More so, thank you for the most precious gift you can give me as a writer, i.e. your time.





