Part 1: Setting Up Apache Kafka and Simulating Real-Time Data for Stream Processing
This article is the first part of a four-part series where we will build a complete real-time data engineering solution using Apache Kafka, Apache Spark Structured Streaming, PostgreSQL, and Streamlit (or Grafana) for visualization. The objective is to build a data pipeline that processes e-commerce order events in real time and provides live insights to support business decision-making.
Series Breakdown:
- Part 1 (this article): Setting up Apache Kafka with Docker and simulating real-time e-commerce order events
- Parts 2–4: Processing the stream with Apache Spark Structured Streaming, persisting the results to PostgreSQL, and building a live dashboard with Streamlit or Grafana
Introduction to Real-Time Data Processing
In a traditional data pipeline, data is collected, processed, and loaded at scheduled intervals, often in batches. However, modern use cases such as fraud detection, inventory alerts, personalized recommendations, and real-time dashboards require continuous data ingestion and near real-time processing. These demands are addressed by real-time or streaming data pipelines.
Apache Kafka is one of the most widely adopted platforms for building real-time data streaming solutions. It acts as a highly scalable, distributed publish-subscribe messaging system that enables decoupling between producers (event emitters) and consumers (event processors).
In this first article, we will:
- Introduce Apache Kafka and its core concepts
- Set up Kafka and Zookeeper locally using Docker and Docker Compose
- Create a Kafka topic named orders-stream
- Build a Python producer that simulates real-time e-commerce order events
- Verify the event stream using the Kafka console consumer
Understanding Apache Kafka
Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming applications. It is designed for high-throughput, low-latency data processing.
Key Components of Kafka:
- Producer: an application that publishes (writes) events to Kafka topics
- Consumer: an application that subscribes to topics and reads the events
- Broker: a Kafka server that stores the data and serves producer and consumer requests
- Topic: a named stream of events to which producers write and from which consumers read
- Partition: a subdivision of a topic that allows data to be distributed across brokers and consumed in parallel
- Zookeeper: coordinates the brokers and tracks cluster metadata (used by the setup in this article)
Kafka can handle millions of messages per second and is well suited to decoupling applications, building event-driven architectures, and powering real-time analytics.
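To make these components concrete, here is a minimal sketch using the kafka-python library (which we install later in this article). It subscribes to a hypothetical topic named demo-events, a placeholder used only for illustration, and prints the topic, partition, and offset metadata that Kafka attaches to every record.

from kafka import KafkaConsumer

# Minimal sketch: "demo-events" and the group id are hypothetical placeholders;
# the broker address matches the local setup built later in this article.
consumer = KafkaConsumer(
    "demo-events",                       # topic to subscribe to
    bootstrap_servers="localhost:9092",  # broker address
    group_id="demo-group",               # consumers in the same group share partitions
    auto_offset_reset="earliest",        # start from the oldest available offset
)

for record in consumer:
    # Each record carries the core Kafka coordinates: topic, partition, offset.
    print(record.topic, record.partition, record.offset, record.value)

The consumer never talks to the producer directly; both sides only know the broker, which is what makes the decoupling possible.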
Setting Up Kafka with Docker
To simplify Kafka setup, especially for development and testing, Docker and Docker Compose can be used to create isolated containers for both Kafka and Zookeeper.
Step 1: Install Docker
Ensure Docker is installed on your system. You can download Docker Desktop for Windows or macOS from https://www.docker.com/products/docker-desktop. On Linux, Docker can be installed via the package manager.
Step 2: Create a Docker Compose Configuration
Create a file named docker-compose.yml with the following content:
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
This configuration sets up two containers:
- zookeeper: runs Zookeeper on port 2181, which the Kafka broker uses for coordination
- kafka: runs a single Kafka broker, exposed to the host on port 9092, with a replication factor of 1 (suitable for local development only)
Step 3: Start Kafka and Zookeeper
Navigate to the folder containing the docker-compose.yml file and run the following command:
docker-compose up -d
To verify that the containers are running, use:
docker ps
This should show two containers: one for Kafka and one for Zookeeper.
Creating a Kafka Topic
After Kafka is up and running, you need to create a topic where the producer will publish messages.
Step 1: Access the Kafka Container
First, find the Kafka container ID or name:
docker ps
Then access the container:
docker exec -it <kafka_container_id_or_name> bash
Step 2: Create a Topic Named orders-stream
Inside the container, run the following command:
kafka-topics --create \
--topic orders-stream \
--bootstrap-server localhost:9092 \
--partitions 1 \
--replication-factor 1
To verify that the topic was created successfully, use:
kafka-topics --list --bootstrap-server localhost:9092
You should see orders-stream listed among the topics.
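If you prefer to manage topics from Python rather than the CLI, the kafka-python library (installed in the next section) includes an admin client. The following is an optional sketch, not part of the required setup, that creates and lists the same topic programmatically against the local broker:

from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

# Connect to the broker exposed on the host
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

try:
    # Same settings as the CLI command above: 1 partition, replication factor 1
    admin.create_topics([NewTopic(name="orders-stream", num_partitions=1, replication_factor=1)])
except TopicAlreadyExistsError:
    pass  # topic was already created via kafka-topics --create

print(admin.list_topics())  # should include 'orders-stream'
admin.close()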
Simulating Real-Time Order Events Using a Python Kafka Producer
To simulate real-time data, we will use Python along with the kafka-python and faker libraries to generate synthetic e-commerce order data.
Step 1: Install Python Packages
Use pip to install the required packages:
pip install kafka-python faker
Step 2: Create a Python Script
Create a file named kafka_producer.py with the following content:
from kafka import KafkaProducer
from faker import Faker
import json
import time
import random

fake = Faker()

# Initialize Kafka producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Function to generate fake order event
def generate_order_event():
    return {
        "order_id": fake.uuid4(),
        "customer_id": fake.random_int(min=1000, max=9999),
        "product": fake.word(),
        "quantity": random.randint(1, 5),
        "price": round(random.uniform(10.0, 200.0), 2),
        "timestamp": fake.iso8601()
    }

# Send events continuously
if __name__ == "__main__":
    while True:
        order = generate_order_event()
        producer.send("orders-stream", order)
        print(f"Sent: {order}")
        time.sleep(2)  # Simulate delay between events
Step 3: Run the Producer
Simply run the script using:
python kafka_producer.py
This script will continuously generate random order events and send them to the Kafka topic orders-stream every two seconds.
You will see output like:
Sent: {'order_id': 'abc123', 'customer_id': 4890, 'product': 'keyboard', 'quantity': 2, 'price': 79.99, 'timestamp': '2025-06-02T12:10:59'}
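As an optional refinement, you can attach a message key so that Kafka routes related events consistently. The sketch below, which assumes the same local broker and reuses the sample order shown above, keys each message by customer_id; with more than one partition, this keeps all orders for a given customer in the same partition and therefore in order.

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),       # customer_id becomes the message key
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Example order taken from the sample output above
order = {
    "order_id": "abc123", "customer_id": 4890, "product": "keyboard",
    "quantity": 2, "price": 79.99, "timestamp": "2025-06-02T12:10:59",
}

# Kafka hashes the key to choose a partition, so orders from the same customer
# always land in the same partition (relevant once the topic has more than one).
producer.send("orders-stream", key=order["customer_id"], value=order)
producer.flush()  # block until the queued message is actually delivered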
Verifying Events in Kafka
To verify that messages are being received in Kafka, you can run a simple consumer within the Kafka container:
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic orders-stream \
--from-beginning
You should see the stream of JSON messages being printed to the console.
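The same check can be done from Python with kafka-python, which also turns the raw bytes back into dictionaries. This is a small sketch that assumes the producer from the previous section is still running:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    "orders-stream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read the topic from the beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # each value is an order dict sent by kafka_producer.py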
Conclusion
In this article, we introduced the concept of real-time data streaming and the role of Apache Kafka in building streaming pipelines. We installed Kafka using Docker, created a Kafka topic, and developed a Python script to simulate and stream real-time order events.
This setup provides the foundation for building a real-time data processing pipeline. The generated data stream will serve as the input for our Spark Structured Streaming job, which we will cover in the next part of the series.
Next Steps
In Part 2 of this series, we will focus on:
- Setting up Apache Spark Structured Streaming
- Reading the orders-stream topic from Kafka and parsing the JSON order events
- Transforming and aggregating the order data in real time
By the end of Part 2, you will have a fully functional real-time ETL pipeline running on Spark.