Fundamentals and Architecture of
Apache Kafka®
Angelo Cesaro
Who am I?
• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky
• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview
• A distributed streaming platform for building real-time data
pipelines and mission-critical streaming applications, with the
following characteristics:
1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka’s benefits over traditional
message queues
There are a few key differences between Kafka and other
traditional message queues
• Durability and availability
1. Cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. Even a small Kafka cluster can process a large volume of messages
• Support for real-time and batch consumption
1. Kafka was born for real-time processing of data, but it can also handle
batch-oriented jobs, for example feeding data to Hadoop or a data
warehouse
High-level view of a Kafka cluster
• Producers send data to the Kafka cluster
• Consumers read data from the Kafka cluster
• Brokers are the main storage and messaging components of the
Kafka cluster
Note: the components above can be physical machines, VMs or Docker containers; Kafka works the
same on any of those platforms.
Messages
• The basic unit of data in Kafka is a message; messages are
the atomic unit of data sent by producers
• A message is a key-value pair:
• All data is stored in Kafka as byte arrays (very
important!)
• The producer provides serializers to convert the key and value
to byte arrays (see the sketch below)
• Key and value can be any data type
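As a sketch, this is how serializers are wired in with the Java client (the broker address and the choice of StringSerializer are assumptions for the example):

  // minimal producer configuration: serializers turn key and value into byte arrays
  Properties props = new Properties();
  props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");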
Topic
• Kafka keeps streams of messages in topics, which categorize
messages into groups
• Developers can decide which topics should exist; by
default, Kafka auto-creates topics when they are first used
• Kafka has no limit on the number of topics that can be used
• Topics are logical representations that span across brokers
Note: by analogy, we can think of topics as tables in a DBMS; just as
we separate data in a database into different tables, we do the same with
topics
Data partitioning
• Producers shard data over a group of partitions; this allows
parallel access to the topic for increased throughput
• Each partition contains a subset of the messages, and messages
within a partition are ordered and immutable
• Usually the message key is used to control which partition a
message is assigned to, as sketched below
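A rough sketch of what key-based assignment looks like (this is an illustration, not the Java client's actual default partitioner, which uses murmur2 hashing; the idea is the same):

  // hypothetical illustration: messages with the same key land in the same partition
  int partition = (key.hashCode() & 0x7fffffff) % numPartitions;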
Kafka components
• A Kafka system has 4 key components:
• Brokers
• Producers
• Consumers
• Zookeeper
Kafka broker
• Brokers receive and store data sent by the producers
• Brokers are server-class systems that provide messages to the
consumers when requested
• Messages are spread across multiple partitions on different brokers
• Kafka provides a configurable retention policy for messages (example
below), and each message is identified by its offset number
• The commit log is an append-only data structure; recent writes live in
RAM (the page cache) for fast access and are flushed to disk periodically
• Producers send requests to the brokers, which append messages to the
end of the log
• A consumer consumes from a specific offset (usually the lowest
available) and reads all messages sequentially
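For example, retention is driven by broker settings like the following (a server.properties excerpt; the values shown are the usual defaults):

  log.retention.hours=168        # keep messages for 7 days
  log.segment.bytes=1073741824   # roll a new log segment at ~1 GiB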
Kafka producers
• Each producer writes data as messages to the Kafka cluster
• Producers can be written in any language
• Kafka provides a command-line tool to send messages to the cluster
• Confluent develops a REST (Representational State Transfer) proxy
which can be used by clients written in any language
• Confluent Enterprise includes an MQTT (Message Queuing Telemetry
Transport) proxy that allows direct ingestion of IoT data
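A minimal send with the Java client might look like this (topic name, key and broker address are placeholders):

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class ProducerDemo {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker1:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // the key ("user-42") drives partition assignment; the value is the payload
              producer.send(new ProducerRecord<>("my-topic", "user-42", "hello kafka"));
          }
      }
  }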
Kafka consumers
• Each consumer pulls events from topics as they are written
• The latest offsets read are tracked in the special '__consumer_offsets'
topic
• If necessary, a consumer can be reset to start reading from a
specific offset (a configuration parameter controls the
default behavior; see the sketch below)
Note: other, similar solutions tend to push events instead
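The configuration parameter in question is auto.offset.reset; a consumer can also be repositioned explicitly with seek(). A sketch, assuming props and consumer objects set up as in the Java client examples in these slides:

  import org.apache.kafka.common.TopicPartition;

  // default position when no committed offset exists: "earliest" or "latest"
  props.put("auto.offset.reset", "earliest");
  // explicit reset of an assigned partition to a chosen offset (topic and offset are examples)
  consumer.seek(new TopicPartition("my-topic", 0), 42L);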
Distributed consumption
• Kafka scales consumption by combining multiple consumers
into consumer groups (see the sketch below)
• Each consumer in a group is assigned a subset of the
partitions for consumption
It's important to know that traditional systems tend to be point-to-
point: a message is gone once it has been consumed and can't be
read again. Kafka was designed to work differently, allowing the
same data to be read multiple times
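A minimal Java consumer joining a group could look like this (group id, topic and broker address are placeholders; poll(Duration) assumes a 2.0+ client):

  import java.time.Duration;
  import java.util.Arrays;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class ConsumerDemo {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker1:9092");
          props.put("group.id", "my-group");  // consumers sharing this id split the partitions
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(Arrays.asList("my-topic"));
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                  for (ConsumerRecord<String, String> record : records)
                      System.out.printf("partition=%d offset=%d value=%s%n",
                                        record.partition(), record.offset(), record.value());
              }
          }
      }
  }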
Zookeeper
• Zookeeper is a centralized, distributed service that enables
highly reliable distributed coordination
• It maintains configuration information (in this context, the Kafka
cluster configuration)
• It provides distributed synchronization
• It runs as a cluster and provides resiliency against failures
Kafka & Zookeeper
Kafka uses Zookeeper for various important features:
• Cluster management
• Storage of ACLs and passwords
• Failure detection and recovery
Note:
1. Kafka can't run without Zookeeper
2. In previous Kafka releases (<0.11) the clients also had to access
Zookeeper; from 0.11 on, only the brokers need that access, so
the cluster is isolated from the clients for better security and
performance
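On the broker side, this link is a single setting in server.properties (hostnames are placeholders):

  zookeeper.connect=zk1:2181,zk2:2181,zk3:2181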
Advantages of a pull architecture
• Ability to add more consumers to the system without
reconfiguring the cluster
• Ability for a consumer to go offline and come back later,
resuming from where it left off
• Consumers won't get overwhelmed by data: each consumer decides
at what speed to get data, and slow consumers won't affect fast
producers
Speeding up data transfer
Kafka is fast, but why?
• Kafka uses the system page cache (a Linux kernel feature) for
producing and consuming messages
• The use of the page cache enables zero-copy: data is transferred
directly from the local file channel to a remote socket.
That saves CPU cycles and memory bandwidth.
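In Java terms, zero-copy corresponds to FileChannel.transferTo(), which lets the kernel move bytes straight from the page cache to the socket (sendfile on Linux). A sketch, where socketChannel is assumed to be an already-open connection and the file name is a placeholder:

  import java.io.FileInputStream;
  import java.nio.channels.FileChannel;

  // bytes flow disk -> page cache -> NIC without copies into user space
  FileChannel log = new FileInputStream("segment.log").getChannel();
  log.transferTo(0, log.size(), socketChannel);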
Kafka metrics
• Kafka metrics can be exposed via JMX and displayed through JMX clients
• The types of metrics exposed are:
1. Gauge: an instantaneous measurement of one value
2. Meter: a measurement of ticks in a time range, e.g. one-minute rate, 10-minute
rate, etc.
3. Histogram: a measurement of the distribution of a value, e.g. 50th percentile, 98th percentile,
etc.
4. Timer: a measurement of timings; a meter plus a histogram
Kafka uses Yammer Metrics on the broker and in the older (<0.9) clients.
Newer clients use a new internal metrics package. Confluent plans to
consolidate the JMX metrics packages in the future.
Why Replication?
• Each partition is stored on a broker
• Without replication, if a broker goes offline, the partitions stored
on that broker become unavailable and permanent data loss
can occur
• Without redundancy, partitions are unavailable for reads and
writes while the server is offline, and if the server has a fatal crash
the data is gone permanently
Kafka uses replication for durability and availability
Replica
• Each partition can have replicas
• Each replica is placed on a different broker
• Replicas are spread evenly across brokers for load balancing
We specify the replication factor at topic creation time
Rack awareness of replicas
• Rack awareness places each replica on brokers in
different racks. That helps improve fault tolerance and
availability.
• Each broker can be configured with a broker.rack property, e.g.
rack-1 or us-east-1a
• It's useful if we need to deploy Kafka on AWS across availability
zones
• Rack awareness was introduced in Confluent 3.0
Replica configurations
• Increase the replication factor for better durability
• Auto-created topics get replication factor 1 by default;
configure this accordingly in server.properties
kafka-topics --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 3 --topic mytopic
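The corresponding server.properties settings for auto-created topics would be, for example:

  default.replication.factor=3   # replicas for auto-created topics
  num.partitions=1               # partitions for auto-created topics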
How brokers are involved in
replication
• Brokers ensure strongly consistent replicas
• One replica is on the leader broker
• All messages produced go to the leader
• The leader propagates those messages to the follower brokers
• All consumers read messages from the leader
Note: the points above are very important to understand when
troubleshooting ;)
Leaders and followers
• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
• Provides fault tolerance
• Keeps up with the leader
• There is a special thread running in the cluster (the controller) that manages
the current list of leaders and followers for every partition. It's a complex and mission-
critical task; for this reason this information is replicated in
Zookeeper and then cached on every broker for faster access
Partition leaders
• Leaders have to be evenly distributed across all brokers, for 2
main reasons:
• Leaders can change in case of failure
• Leaders do more work, as discussed in the previous slides
Preferred replica
• When we create a topic the preferred replica is set automatically.
• It’s the first replica in the list of assigned replicas
• kafka-topics --zookeeper zookeeper:2181 --describe --topic my-topic
Topic: my-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
Topic: my-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
In Sync Replica (ISR)
• The in-sync replica set is the list of replicas – leader + followers –
that are caught up with the leader
• A message is committed once it has been received by every replica in
the list
Note for troubleshooting: where is the ISR list kept? In
the leader
What does committed mean?
• Committed means, in this context, that the message has been
received and written to its log by every in-sync replica
• Data is not available for consuming until it has been
committed
• Who decides when to commit a message? The leader has
this responsibility (see the acks sketch below)
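On the producer side, the acks setting chooses how much of this commit process to wait for; a sketch of the two common choices:

  // "1": wait only for the leader to write the message
  props.put("acks", "1");
  // "all": wait until every in-sync replica has the message (strongest durability)
  props.put("acks", "all");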
Using Kafka command line tools
#create topic with replication factor 1 and partition 1
• kafka-topics.sh --create --zookeeper localhost:2181 --replication-
factor 1 --partitions 1 --topic test
#delete topic with name test
• kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
#list info regarding topic
• kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
#list topics
• kafka-topics.sh --list --zookeeper localhost:2181
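Two more everyday tools, for producing and consuming from the console (flag names as in the 0.x/1.x releases):
#produce messages to topic test from stdin
• kafka-console-producer.sh --broker-list localhost:9092 --topic test
#consume messages from topic test, starting from the beginning
• kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning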
Links!
• https://kafka.apache.org
• https://www.confluent.io
• https://www.cesaro.io