Part 1: Setting Up Apache Kafka and Simulating Real-Time Data for Stream Processing
This article is the first part of a four-part series where we will build a complete real-time data engineering solution using Apache Kafka, Apache Spark Structured Streaming, PostgreSQL, and Streamlit (or Grafana) for visualization. The objective is to build a data pipeline that processes e-commerce order events in real time and provides live insights to support business decision-making.
Series Breakdown:
- Part 1 (this article): Setting up Apache Kafka with Docker and simulating real-time e-commerce order events
- Parts 2–4: Processing the stream with Apache Spark Structured Streaming, persisting the results to PostgreSQL, and building a live dashboard with Streamlit or Grafana
Introduction to Real-Time Data Processing
In a traditional data pipeline, data is collected, processed, and loaded at scheduled intervals, often in batches. However, modern use cases such as fraud detection, inventory alerts, personalized recommendations, and real-time dashboards require continuous data ingestion and near real-time processing. These demands are addressed by real-time or streaming data pipelines.
Apache Kafka is one of the most widely adopted platforms for building real-time data streaming solutions. It acts as a highly scalable, distributed publish-subscribe messaging system that enables decoupling between producers (event emitters) and consumers (event processors).
In this first article, we will:
- Introduce Apache Kafka and its core concepts
- Set up Kafka and Zookeeper locally using Docker and Docker Compose
- Create a Kafka topic named orders-stream
- Build a Python producer that simulates real-time e-commerce order events
- Verify the event stream using the Kafka console consumer
Understanding Apache Kafka
Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming applications. It is designed for high-throughput, low-latency data processing.
Key Components of Kafka:
- Producer: an application that publishes (writes) events to Kafka topics
- Consumer: an application that subscribes to topics and reads the events
- Broker: a Kafka server that stores the data and serves producer and consumer requests
- Topic: a named stream of events to which producers write and from which consumers read
- Partition: a subdivision of a topic that allows data to be distributed across brokers and consumed in parallel
- Zookeeper: coordinates the brokers and tracks cluster metadata (used by the setup in this article)
Kafka can handle millions of messages per second and is well suited to decoupling applications, building event-driven architectures, and powering real-time analytics.
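To make these components concrete, here is a minimal sketch using the kafka-python library (which we install later in this article). It subscribes to a hypothetical topic named demo-events, a placeholder used only for illustration, and prints the topic, partition, and offset metadata that Kafka attaches to every record.

from kafka import KafkaConsumer

# Minimal sketch: "demo-events" and the group id are hypothetical placeholders;
# the broker address matches the local setup built later in this article.
consumer = KafkaConsumer(
    "demo-events",                       # topic to subscribe to
    bootstrap_servers="localhost:9092",  # broker address
    group_id="demo-group",               # consumers in the same group share partitions
    auto_offset_reset="earliest",        # start from the oldest available offset
)

for record in consumer:
    # Each record carries the core Kafka coordinates: topic, partition, offset.
    print(record.topic, record.partition, record.offset, record.value)

The consumer never talks to the producer directly; both sides only know the broker, which is what makes the decoupling possible.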
Setting Up Kafka with Docker
To simplify Kafka setup, especially for development and testing, Docker and Docker Compose can be used to create isolated containers for both Kafka and Zookeeper.
Step 1: Install Docker
Ensure Docker is installed on your system. You can download Docker Desktop for Windows or macOS from https://www.docker.com/products/docker-desktop. On Linux, Docker can be installed via the package manager.
Step 2: Create a Docker Compose Configuration
Create a file named docker-compose.yml with the following content:
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
This configuration sets up two containers:
- zookeeper: runs Zookeeper on port 2181, which the Kafka broker uses for coordination
- kafka: runs a single Kafka broker, exposed to the host on port 9092, with a replication factor of 1 (suitable for local development only)
Step 3: Start Kafka and Zookeeper
Navigate to the folder containing the docker-compose.yml file and run the following command:
docker-compose up -d
To verify that the containers are running, use:
docker ps
This should show two containers: one for Kafka and one for Zookeeper.
Creating a Kafka Topic
After Kafka is up and running, you need to create a topic where the producer will publish messages.
Step 1: Access the Kafka Container
First, find the Kafka container ID or name:
docker ps
Then access the container:
docker exec -it <kafka_container_id_or_name> bash
Step 2: Create a Topic Named orders-stream
Inside the container, run the following command:
kafka-topics --create \
--topic orders-stream \
--bootstrap-server localhost:9092 \
--partitions 1 \
--replication-factor 1
To verify that the topic was created successfully, use:
kafka-topics --list --bootstrap-server localhost:9092
You should see orders-stream listed among the topics.
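If you prefer to manage topics from Python rather than the CLI, the kafka-python library (installed in the next section) includes an admin client. The following is an optional sketch, not part of the required setup, that creates and lists the same topic programmatically against the local broker:

from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

# Connect to the broker exposed on the host
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

try:
    # Same settings as the CLI command above: 1 partition, replication factor 1
    admin.create_topics([NewTopic(name="orders-stream", num_partitions=1, replication_factor=1)])
except TopicAlreadyExistsError:
    pass  # topic was already created via kafka-topics --create

print(admin.list_topics())  # should include 'orders-stream'
admin.close()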
Simulating Real-Time Order Events Using a Python Kafka Producer
To simulate real-time data, we will use Python along with the kafka-python and faker libraries to generate synthetic e-commerce order data.
Step 1: Install Python Packages
Use pip to install the required packages:
pip install kafka-python faker
Step 2: Create a Python Script
Create a file named kafka_producer.py with the following content:
from kafka import KafkaProducer
from faker import Faker
import json
import time
import random

fake = Faker()

# Initialize Kafka producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Function to generate fake order event
def generate_order_event():
    return {
        "order_id": fake.uuid4(),
        "customer_id": fake.random_int(min=1000, max=9999),
        "product": fake.word(),
        "quantity": random.randint(1, 5),
        "price": round(random.uniform(10.0, 200.0), 2),
        "timestamp": fake.iso8601()
    }

# Send events continuously
if __name__ == "__main__":
    while True:
        order = generate_order_event()
        producer.send("orders-stream", order)
        print(f"Sent: {order}")
        time.sleep(2)  # Simulate delay between events
Step 3: Run the Producer
Simply run the script using:
python kafka_producer.py
This script will continuously generate random order events and send them to the Kafka topic orders-stream every two seconds.
You will see output like:
Sent: {'order_id': 'abc123', 'customer_id': 4890, 'product': 'keyboard', 'quantity': 2, 'price': 79.99, 'timestamp': '2025-06-02T12:10:59'}
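As an optional refinement, you can attach a message key so that Kafka routes related events consistently. The sketch below, which assumes the same local broker and reuses the sample order shown above, keys each message by customer_id; with more than one partition, this keeps all orders for a given customer in the same partition and therefore in order.

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),       # customer_id becomes the message key
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Example order taken from the sample output above
order = {
    "order_id": "abc123", "customer_id": 4890, "product": "keyboard",
    "quantity": 2, "price": 79.99, "timestamp": "2025-06-02T12:10:59",
}

# Kafka hashes the key to choose a partition, so orders from the same customer
# always land in the same partition (relevant once the topic has more than one).
producer.send("orders-stream", key=order["customer_id"], value=order)
producer.flush()  # block until the queued message is actually delivered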
Verifying Events in Kafka
To verify that messages are being received in Kafka, you can run a simple consumer within the Kafka container:
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic orders-stream \
--from-beginning
You should see the stream of JSON messages being printed to the console.
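The same check can be done from Python with kafka-python, which also turns the raw bytes back into dictionaries. This is a small sketch that assumes the producer from the previous section is still running:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    "orders-stream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read the topic from the beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # each value is an order dict sent by kafka_producer.py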
Conclusion
In this article, we introduced the concept of real-time data streaming and the role of Apache Kafka in building streaming pipelines. We installed Kafka using Docker, created a Kafka topic, and developed a Python script to simulate and stream real-time order events.
This setup provides the foundation for building a real-time data processing pipeline. The generated data stream will serve as the input for our Spark Structured Streaming job, which we will cover in the next part of the series.
Next Steps
In Part 2 of this series, we will focus on:
- Setting up Apache Spark Structured Streaming
- Reading the orders-stream topic from Kafka and parsing the JSON order events
- Transforming and aggregating the order data in real time
By the end of Part 2, you will have a fully functional real-time ETL pipeline running on Spark.