Part 1: Setting Up Apache Kafka and Simulating Real-Time Data for Stream Processing

This article is the first part of a four-part series where we will build a complete real-time data engineering solution using Apache Kafka, Apache Spark Structured Streaming, PostgreSQL, and Streamlit (or Grafana) for visualization. The objective is to build a data pipeline that processes e-commerce order events in real time and provides live insights to support business decision-making.

Series Breakdown:

  • Part 1: Kafka Setup and Real-Time Data Simulation
  • Part 2: Real-Time Data Transformation using Spark Structured Streaming
  • Part 3: Writing Processed Data into PostgreSQL and Querying
  • Part 4: Building a Real-Time Analytics Dashboard


Introduction to Real-Time Data Processing

In a traditional data pipeline, data is collected, processed, and loaded at scheduled intervals, often in batches. However, modern use cases such as fraud detection, inventory alerts, personalized recommendations, and real-time dashboards require continuous data ingestion and near real-time processing. These demands are addressed by real-time or streaming data pipelines.

Apache Kafka is one of the most widely adopted platforms for building real-time data streaming solutions. It acts as a highly scalable, distributed publish-subscribe messaging system that enables decoupling between producers (event emitters) and consumers (event processors).

In this first article, we will:

  • Set up Apache Kafka using Docker
  • Create a Kafka topic for e-commerce order events
  • Develop a Python-based Kafka producer to simulate and stream fake order data
  • Understand Kafka architecture and key components


Understanding Apache Kafka

Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming applications. It is designed for high-throughput, low-latency data processing.

Key Components of Kafka:

  • Producer: A producer is an application that sends records (events/messages) to a Kafka topic.
  • Consumer: A consumer subscribes to topics and processes the records sent by producers.
  • Topic: A category or feed name to which records are published. Topics are partitioned for parallelism.
  • Broker: A Kafka server that stores data and serves client requests.
  • Zookeeper: A centralized service for maintaining configuration and coordination. Kafka uses Zookeeper for managing brokers and topics.

Kafka can handle millions of messages per second and is suitable for decoupling applications, building event-driven architectures, and real-time analytics.
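
To make these components concrete, here is a minimal sketch of a producer publishing a keyed message (assuming a broker reachable at localhost:9092 and the kafka-python client we install later in this article). Messages that share a key always land in the same partition, which is how Kafka preserves per-key ordering while still allowing parallel consumption:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Keying by customer ID (an illustrative choice) keeps all events for one
# customer in the same partition, and therefore in order
producer.send("orders-stream", key=b"customer-1001", value=b'{"order_id": "demo-1"}')
producer.flush()  # Block until the message has actually been delivered to the broker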


Setting Up Kafka with Docker

To simplify Kafka setup, especially for development and testing, Docker and Docker Compose can be used to create isolated containers for both Kafka and Zookeeper.

Step 1: Install Docker

Ensure Docker is installed on your system. You can download Docker Desktop for Windows or macOS from https://www.docker.com/products/docker-desktop. On Linux, Docker can be installed via the package manager.
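
You can confirm the installation from a terminal:

docker --version
docker compose version   # or: docker-compose --version on older installations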

Step 2: Create a Docker Compose Configuration

Create a file named docker-compose.yml with the following content:

version: '2'
services:
  zookeeper:
    # A ZooKeeper-based release is pinned instead of "latest" for reproducibility;
    # any recent 7.x tag should behave the same way
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

This configuration sets up two containers:

  • zookeeper: for managing Kafka coordination
  • kafka: for message brokering

Note that KAFKA_ADVERTISED_LISTENERS points at localhost:9092, so clients are expected to connect from the host machine (as the Python producer we build later will). Containers on the same Docker network would need an additional listener advertised under the kafka service name.

Step 3: Start Kafka and Zookeeper

Navigate to the folder containing the docker-compose.yml file and run the following command (on newer Docker installations, the equivalent is docker compose up -d):

docker-compose up -d
        

To verify that the containers are running, use:

docker ps
        

This should show two containers: one for Kafka and one for Zookeeper.
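
If both containers are listed but you want to confirm that the broker has finished starting, you can tail its logs (using the kafka service name from the compose file above):

docker-compose logs -f kafka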


Creating a Kafka Topic

After Kafka is up and running, you need to create a topic where the producer will publish messages.

Step 1: Access the Kafka Container

First, find the Kafka container ID or name:

docker ps
        

Then access the container:

docker exec -it <kafka_container_id_or_name> bash
        

Step 2: Create a Topic Named orders-stream

Inside the container, run the following command:

kafka-topics --create \
  --topic orders-stream \
  --bootstrap-server localhost:9092 \
  --partitions 1 \
  --replication-factor 1
        

To verify that the topic was created successfully, use:

kafka-topics --list --bootstrap-server localhost:9092
        

You should see orders-stream listed among the topics.
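
You can also inspect the partition count and replication settings of the new topic:

kafka-topics --describe --topic orders-stream --bootstrap-server localhost:9092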


Simulating Real-Time Order Events Using a Python Kafka Producer

To simulate real-time data, we will use Python along with the kafka-python and faker libraries to generate synthetic e-commerce order data.

Step 1: Install Python Packages

Use pip to install the required packages:

pip install kafka-python faker
        

Step 2: Create a Python Script

Create a file named kafka_producer.py with the following content:

from kafka import KafkaProducer
from faker import Faker
import json
import time
import random
from datetime import datetime

fake = Faker()

# Initialize Kafka producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Function to generate a fake order event
def generate_order_event():
    return {
        "order_id": fake.uuid4(),
        "customer_id": fake.random_int(min=1000, max=9999),
        "product": fake.word(),
        "quantity": random.randint(1, 5),
        "price": round(random.uniform(10.0, 200.0), 2),
        # Use the current time rather than a random Faker date so the stream behaves like live data
        "timestamp": datetime.now().isoformat(timespec="seconds")
    }

# Send events continuously until interrupted (Ctrl+C)
if __name__ == "__main__":
    try:
        while True:
            order = generate_order_event()
            producer.send("orders-stream", order)
            print(f"Sent: {order}")
            time.sleep(2)  # Simulate delay between events
    except KeyboardInterrupt:
        producer.flush()  # Deliver any buffered messages before exiting
        producer.close()
        

Step 3: Run the Producer

Simply run the script using:

python kafka_producer.py
        

This script will continuously generate random order events and send them to the Kafka topic orders-stream every two seconds.

You will see output like:

Sent: {'order_id': 'abc123', 'customer_id': 4890, 'product': 'keyboard', 'quantity': 2, 'price': 79.99, 'timestamp': '2025-06-02T12:10:59'}
        

Verifying Events in Kafka

To verify that messages are being received in Kafka, you can run a simple consumer within the Kafka container:

kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic orders-stream \
  --from-beginning
        

You should see the stream of JSON messages being printed to the console.
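
If you prefer to verify the stream from Python instead of from inside the container, a minimal consumer sketch using kafka-python (the same client library as our producer) looks like this:

from kafka import KafkaConsumer
import json

# Subscribe to the orders-stream topic and read from the earliest offset
consumer = KafkaConsumer(
    "orders-stream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8"))
)

for message in consumer:
    print(message.value)  # Each value is one order event as a Python dict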


Conclusion

In this article, we introduced the concept of real-time data streaming and the role of Apache Kafka in building streaming pipelines. We installed Kafka using Docker, created a Kafka topic, and developed a Python script to simulate and stream real-time order events.

This setup provides the foundation for building a real-time data processing pipeline. The generated data stream will serve as the input for our Spark Structured Streaming job, which we will cover in the next part of the series.


Next Steps

In Part 2 of this series, we will focus on:

  • Connecting Apache Spark to Kafka
  • Parsing and transforming the JSON order events in real time
  • Writing the cleaned and enriched data to a structured storage system like PostgreSQL

By the end of Part 2, you will have a fully functional real-time ETL pipeline running on Spark.
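
As a small preview of Part 2, reading the topic from PySpark looks roughly like this (a sketch only; it assumes the spark-sql-kafka-0-10 connector package is available on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-stream-preview").getOrCreate()

# Subscribe to the Kafka topic; each record's payload arrives in the binary "value" column
raw_orders = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders-stream")
    .load()
)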

