Introduction to Kafka
Instructor: Ekpe Okorafor
1. Big Data Academy - Accenture
2. Computer Science - African University of Science & Technology
Agenda
• Introduction - Messaging Basics
• Kafka – Architecture
• Kafka – Partitioning & Topics
• Summary
Introduction
When used in the right way and for the right use case, Kafka has unique
attributes that make it a highly attractive option for data integration.
• Data Integration is the combination of technical and business processes
used to combine data from disparate sources into meaningful and
valuable information.
• A complete data integration solution encompasses discovery, cleansing,
monitoring, transformation and delivery of data from a variety of sources.
• Messaging is a key data integration strategy employed in many
distributed environments such as the cloud.
• Messaging supports asynchronous operations, enabling you to decouple
a process that consumes a service from the process that implements the
service.
[Figure: Data Sources (Producers) feed Data Integration, which feeds Data Consumers (Subscribers).]
Messaging Architectures: What is Messaging?
• Application-to-application communication
• Supports asynchronous operations.
• Message:
– A message is a self-contained package of data and network routing headers.
• Broker:
– Intermediary program that translates messages from the formal messaging
protocol of the publisher to the formal messaging protocol of the receiver.
[Figure: Producer, Broker, Subscriber]
Steps to Messaging
• Messaging connects multiple applications in an exchange of data.
• Messaging uses an encapsulated asynchronous approach to exchange
data through a network.
• A traditional messaging system has two models of abstraction:
• Queue – a message channel where a single message is received by exactly
one consumer in a point-to-point message-queue pattern. If no consumers
are available, the message is retained until a consumer processes it.
• Topic - a message feed that implements the publish-subscribe pattern and
broadcasts messages to consumers that subscribe to that topic.
• A single message is transmitted in five steps:
• Create
• Send
• Deliver
• Receive
• Process
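The queue and topic models described above can be contrasted with a minimal in-memory sketch. The class names `PointToPointQueue` and `PubSubTopic` are hypothetical and chosen only for illustration; they do not come from any messaging library.

```python
from collections import deque


class PointToPointQueue:
    """Queue model: each message is received by exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # Messages are retained until some consumer takes them.
        self._messages.append(message)

    def receive(self):
        # Removing the message guarantees no other consumer sees it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Topic model: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()   # one consumer takes the message
second = queue.receive()  # nothing is left for a second consumer

inbox_a, inbox_b = [], []
topic = PubSubTopic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("price-update")  # both subscribers receive it
```

In the queue, the second `receive()` finds nothing because the first consumer already took the message; in the topic, both inboxes end up with a copy.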
Messaging Basics
[Figure: Steps to Send a Message. Labels: Sending Application, Message Source, Message with Data, Channel, Message Storage, Message Destination, Receiving Application. Steps: 1. Create, 2. Send, 3. Deliver, 4. Receive, 5. Process.]
Reference: Enterprise Integration Patterns - Gregor Hohpe and Bobby Woolf
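The five steps (create, send, deliver, receive, process) can be traced in a short sketch. This is an illustration only: a standard-library `queue.Queue` stands in for the channel and its storage, JSON is an assumed message format, and the function names and the "inventory" destination are hypothetical.

```python
import json
import queue

channel = queue.Queue()  # stands in for the message channel and its storage


def create(data):
    # 1. Create: wrap the data in a self-contained message with headers.
    return json.dumps({"headers": {"destination": "inventory"}, "body": data})


def send(message):
    # 2. Send: hand the message to the channel; the sender does not wait
    #    for the receiver (asynchronous).
    channel.put(message)


def receive():
    # 3. Deliver and 4. Receive: the channel makes the message available
    #    and the receiving application takes it off the channel.
    return channel.get_nowait()


def process(message):
    # 5. Process: extract the data from the message.
    return json.loads(message)["body"]


send(create({"sku": "A-100", "qty": 3}))
result = process(receive())
```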
Messaging Architectures: Messaging Models
1. Point to Point
2. Publish and Subscribe
Kafka is an example of the publish-and-subscribe messaging model.
Kafka Overview
• Kafka is a distributed publish-subscribe messaging system written in
Scala and Java. It runs on the Java Virtual Machine (JVM) and has client
libraries for many languages.
• In the releases this deck covers, Kafka relies on a separate service,
ZooKeeper (a distributed coordination system), to function. Newer
releases can run in KRaft mode without ZooKeeper.
• Kafka has high throughput and is built to scale out in a distributed
model across multiple servers.
• Kafka persists messages on disk and can be used for batched
consumption as well as real-time applications.
Key Terminology
• Kafka maintains feeds of messages in categories
called topics.
• Processes that publish messages to a Kafka topic are
called producers.
• Processes that subscribe to topics and process the
feed of published messages are called consumers.
• Kafka runs as a cluster of one or more servers, each of which is
called a broker.
• Communication between all components uses a simple, high-performance
binary protocol over TCP.
Kafka Architecture
[Figure: producers publish to a Kafka cluster of brokers, consumers read from the cluster, and Zookeeper coordinates it.]
Understanding Kafka
• Kafka is based on a simple storage abstraction called a log: an
append-only, totally ordered sequence of records, ordered by time.
• Records are appended to the end of the log, and reads proceed from
left to right (oldest to newest) in the log (or topic).
• Each entry is assigned a unique sequential log-entry number (an offset).
• The log entry number is a convenient property that correlates to the
notion of a “timestamp” entry but is decoupled from any clock due to the
distributed nature of Kafka.
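The log abstraction described above can be sketched in a few lines. The `Log` class below is a hypothetical in-memory illustration, not Kafka's storage code; it only shows how an offset falls out of append order without reference to any clock.

```python
class Log:
    """Append-only sequence of records; each record gets the next offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The offset is just the position in the log, so it needs no clock.
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, offset, max_records=10):
        # Reads proceed in offset order, starting wherever the reader asks.
        return self._records[offset:offset + max_records]


log = Log()
offsets = [log.append(r) for r in ("a", "b", "c")]
```

Here `offsets` is `[0, 1, 2]`, and `log.read(1)` returns the records from offset 1 onward.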
Kafka Key Design Concepts
• A log is analogous to a file or table in which records are
appended and kept in time order.
• Conceptually, the log is a natural data-structure for handling
data-flow between systems.
• Kafka is designed for centralizing an organization’s data into an
enterprise log (message bus) for real-time subscription by other
subscribers or application consumers.
Kafka Conceptual Design
• Each logical data source can be modeled as a log corresponding to a
topic or data feed in Kafka.
• Each subscribing application should read from each topic as quickly as
it can, persist the records it reads into its own data store, and
advance its offset to the next message entry to be read.
• Subscribers can be any type of data system or middleware system like a
cache, Hadoop, a streaming system like Spark or Storm, a search
system, a web services provisioning system, a data warehouse, etc.
• In Kafka, partitioning is a concept applied to the log/topic in order to
allow horizontal scaling.
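The read, persist, advance cycle described above might look like the following sketch. A plain Python list stands in for a topic, and the `consume` function and the two subscriber names are hypothetical; the point is only that each subscriber keeps its own offset and the log itself is never modified.

```python
def consume(log, offset, store, batch_size=2):
    """Read one batch starting at `offset`, persist it, return the new offset."""
    batch = log[offset:offset + batch_size]
    store.extend(batch)          # persist into the consumer's own data store
    return offset + len(batch)   # advance to the next unread entry


topic_log = ["r0", "r1", "r2", "r3", "r4"]  # a plain list stands in for a topic

cache_store, cache_offset = [], 0
warehouse_store, warehouse_offset = [], 0

# Each subscriber advances at its own pace; the log itself is never modified.
cache_offset = consume(topic_log, cache_offset, cache_store)
while warehouse_offset < len(topic_log):
    warehouse_offset = consume(topic_log, warehouse_offset, warehouse_store)
```

After this runs, the cache subscriber has read two records and the warehouse subscriber has caught up to the end of the log, independently of each other.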
Kafka Logical Design
• Each partition is a totally ordered log within a topic, and there is
no global ordering between partitions.
• Assignment of messages to partitions is controlled by the publisher:
a message can be routed by an identification key, so that messages with
the same key go to the same partition, or messages can be spread
across partitions at random.
• Partitioning allows throughput to scale with the size of the Kafka
cluster, close to linearly when load is spread evenly across partitions.
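Publisher-side partition selection can be sketched as follows. This is a simplified illustration under stated assumptions: `zlib.crc32` is a stand-in hash (real Kafka clients use their own hash function), round-robin replaces random assignment for unkeyed messages so the result is easy to follow, and `choose_partition` and `NUM_PARTITIONS` are hypothetical names.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.count()


def choose_partition(key=None):
    if key is None:
        # No key: spread messages evenly across partitions.
        return next(_round_robin) % NUM_PARTITIONS
    # Same key -> same partition, so per-key order is preserved.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
for key, value in events:
    partitions[choose_partition(key)].append((key, value))
```

Both `user-1` events land in the same partition in publish order, which is the ordering guarantee a key provides; no ordering is implied between events in different partitions.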
Kafka Topics
• Kafka topics should have a small number of consumer groups assigned
with each one representing a “logical subscriber”.
• Kafka topic consumption can be scaled by increasing the number of
consumer instances within the same group, which automatically
load-balances message consumption.
• Kafka partitions each topic to allow parallel consumption.
• Partitions in a topic are assigned to the consumers within a consumer
group.
• No more consumer instances in a consumer group can be active than
there are partitions in the topic; any extra instances sit idle.
• If the total order in which messages are published is important in the
consumption, then a single partition for the topic is the solution which
will mean only one consumer process in the consumer group.
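The idea of a consumer group as a "logical subscriber" can be illustrated with a small simulation. The `deliver` function, the group names and the modulo assignment are all assumptions made for this sketch; Kafka's own assignment strategies differ in detail.

```python
def deliver(partitioned_messages, groups):
    """Return {group: {consumer: [messages]}} for a partitioned topic.

    partitioned_messages: one list of messages per partition.
    groups: {group_name: [consumer_name, ...]}.
    """
    received = {}
    for group, consumers in groups.items():
        inboxes = {consumer: [] for consumer in consumers}
        for partition, messages in enumerate(partitioned_messages):
            # Within a group, each partition is owned by exactly one consumer.
            owner = consumers[partition % len(consumers)]
            inboxes[owner].extend(messages)
        received[group] = inboxes
    return received


topic = [["m0", "m3"], ["m1", "m4"], ["m2", "m5"]]  # 3 partitions
groups = {"billing": ["b1"], "analytics": ["a1", "a2", "a3"]}
received = deliver(topic, groups)
```

Every group sees all six messages (publish-subscribe across groups), while inside the `analytics` group each message goes to only one of the three consumers (load balancing within a group). The single `billing` consumer reads each partition in order but does not see the original global publish order, which is why a single partition is needed when total order matters.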
Kafka Topic Partitions
• A topic consists of partitions.
• Partition: an ordered, immutable sequence of messages that is
continually appended to.
• The number of partitions of a topic is configurable.
• The number of partitions determines the maximum consumer (group)
parallelism.
– Cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
– Example: consumer group A, with 2 consumers, reads from a 4-partition
topic, so each consumer reads 2 partitions.
– Consumer group B, with 4 consumers, reads from the same topic, so each
consumer reads 1 partition.
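The two consumer groups in this example, plus a third group with one consumer too many, can be worked through with a short sketch. The modulo-based `assign_partitions` function is a hypothetical stand-in for Kafka's real assignment logic and is meant only to show why the partition count caps parallelism.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers; returns {consumer: [partition, ...]}."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment


group_a = assign_partitions(4, ["a1", "a2"])
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
group_c = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])  # one too many
```

Group A's consumers each own two partitions, group B's each own one, and in group C the fifth consumer is assigned nothing and sits idle.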
Kafka Consumer Groups
• Kafka assigns the partitions in a topic to the consumer instances in a
consumer group to provide ordering guarantees and load balancing over
a pool of consumer processes. Note that no more consumer instances per
group can be active than the total partition count; extras sit idle.
Kafka Environment Properties
• Ensure you have access to downloading libraries from the web.
• Have at least 15 GB of free hard disk space on your local machine.
• Have at least 8GB (preferably 16GB) of RAM on your local machine.
• Have a JRE of version 1.7 or later installed on the local machine.
• Download and install Eclipse Mars (or the current release) on your local
machine.
• Download and install VMware Player for Windows on the local machine.
• Download and install Git from https://git-scm.com/
• Download and install Maven from https://maven.apache.org/download.cgi
• Download the latest stable version of Gradle from
http://gradle.org/gradle-download/
• Download Scala (use the Scala version that matches your Kafka
download; this document uses Scala 2.10).
• Make sure the command paths for Git, Maven, Gradle, etc. are on the
Windows PATH environment variable.
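One way to confirm that the required tools are on the PATH is a short script like the one below. It is a sketch, not part of the original setup instructions: the command names (`java`, `git`, `mvn`, `gradle`, `scala`) are assumptions about how the tools are installed, and `find_tools` is a hypothetical helper built on the standard-library `shutil.which`.

```python
import shutil


def find_tools(names):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Command names are assumptions; adjust them to match your installation.
status = find_tools(["java", "git", "mvn", "gradle", "scala"])
for name, path in status.items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```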
Kafka Environment Setup
• The Kafka environment can be set up on a local machine in
Windows, Linux or in a virtual environment on the local machine.
• Go to the Kafka download URL:
https://kafka.apache.org/downloads.html
• The Kafka download site lists the current and previous releases of
Kafka, each with binary downloads for its corresponding Scala versions.
• The release downloads have the suffix .tgz, which means the binaries
are packaged as gzip-compressed tar archives (tarballs), as is usual on
Linux.
• To get the Windows binaries, the source code needs to be
downloaded and compiled on Windows.
Summary
• When used in the right way and for the right use case,
Kafka has unique attributes that make it a highly
attractive option for data integration.
• Kafka is a distributed publish-subscribe messaging system
written in Scala and Java. It runs on the Java Virtual
Machine (JVM) and has client libraries for many languages.
