APACHE KAFKA
ALEX PONGPECH
Data integration and stream processing
● Streaming data includes a wide variety of data, such as
e-commerce purchases, information from social networks, or
geospatial services from mobile devices.
● Streaming data needs to be processed sequentially and
incrementally in order to, for instance, integrate the
different applications that consume the data, or to store
and process the data to update metrics, reports, and
summary statistics in response to each arriving record.
Data integration and stream processing
Stream processing is better suited for real-time monitoring
and response functions such as:
● Classifying a banking transaction as fraudulent based on
an analytical model, then automatically blocking the
transaction
● Sending push notifications to users based on models of
their behavior
● Adjusting the parameters of a machine based on the results
of real-time analysis of its sensor data
Data integration and stream processing
The problem, therefore, is how to build an infrastructure
that is:
● Decoupled
● Evolvable
● Operationally transparent
● Resilient to traffic spikes
● Highly available
● Distributed
STREAM PROCESSING
A streaming platform has three key capabilities:
● Publish and subscribe to streams of records, similar to a
message queue. This allows multiple applications to
subscribe to the same or different data sources, which
produce data to one or more topics.
● Store streams of records in a fault-tolerant, durable way.
This means different clients can access all the events
(or a fraction of them) at any time, at their own pace.
● Process streams of records as they occur. This allows
filtering, analysing, aggregating, or transforming data.
APACHE KAFKA
Kafka is a publish/subscribe messaging system designed to
solve the problem of managing continuous data flows. It is
typically used for:
● Building real-time streaming data pipelines that reliably
move data between systems or applications
● Building real-time streaming applications that transform
or react to the streams of data
Components & Key concepts
Kafka exposes four different Application Programming
Interfaces (APIs); to understand how applications are built
with them, a few key components and concepts come first:
1. Producers
2. Consumers
3. Brokers
4. Clustering and Leadership Election
PRODUCER
● The Producer API allows an
application to publish a
stream of records to one or
more topics.
● In Kafka, the data records
are known as messages and
they are categorized into
topics.
● Think of a topic as a
database table and messages
as its rows.
PRODUCER
A message is made up of two components:
● 1. Key : The key of the message, which determines the
partition the message is sent to. Careful thought has to
be given when deciding on a key for a message. Since a
key is mapped to a single partition, an application that
pushes millions of messages with one particular key and
only a fraction with other keys would result in an uneven
distribution of load on the Kafka cluster. If the key is
set to null, the producer distributes messages across
partitions in a round-robin fashion, as sketched below.
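As an illustration, here is a minimal sketch using the third-party kafka-python client; the broker address (localhost:9092), topic name, and keys are assumptions made for the example, not values from the slides:

```python
# Minimal sketch using the kafka-python client (pip install kafka-python).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key always land on the same partition,
# so all events for "user-42" stay ordered within that partition.
producer.send("transactions", key=b"user-42", value=b"purchase:99.90")

# With key=None the client spreads messages across partitions
# instead of pinning them to one, avoiding hot partitions.
producer.send("transactions", key=None, value=b"anonymous-event")

producer.flush()  # block until buffered messages are actually sent
```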
PRODUCER
● 2. Body : Each message has a body. Since Kafka provides
APIs for multiple programming languages, the message has
to be serialized in a way that the subscribers on the
other end can understand it. Developers can serialize
messages with encoding protocols such as JSON, Protocol
Buffers, Apache Avro, or Apache Thrift, to name a few;
JSON is sketched below.
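For example, JSON serialization can be wired in at the client level; a sketch with kafka-python, where the serializers and topic name are illustrative choices:

```python
import json
from kafka import KafkaProducer

# value_serializer turns Python objects into bytes before sending;
# consumers must apply the matching deserializer on the other end.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)
producer.send("transactions", key="user-42",
              value={"item": "book", "price": 12.5})
producer.flush()
```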
CONSUMERS
● The Consumer API, on the other hand, allows an
application to subscribe to one or more topics and to
store, process, or react to the stream of records
produced to them.
● To accomplish this, Kafka adds to each message a unique
integer value. This integer is incremented by one for
every message that is appended to a partition.
CONSUMERS
● This value is known as the offset of the message. By
storing the offset of the last consumed message for each
partition, a consumer can stop and restart without losing
its place.
● This is why Kafka allows different types of applications
to integrate with a single source of data: the data can
be processed at different rates by each consumer, as in
the sketch below.
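A minimal kafka-python consumer sketch showing offset and group tracking; broker address, topic, and group id are assumptions for the example:

```python
from kafka import KafkaConsumer

# Joining with a group_id makes Kafka track this group's committed
# offsets, so the consumer can stop and resume where it left off.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",  # where to start if no offset stored yet
)
for message in consumer:
    # Each record carries its partition and offset alongside the payload.
    print(message.partition, message.offset, message.value)
```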
CONSUMERS
● It is also important to know that Kafka allows the
existence of consumer groups , which are nothing more
than consumers working together to process a topic.
● The concept of consumer groups makes it possible to scale
the processing of data in Kafka; this is reviewed in the
next section.
● Each consumer group identifies itself by a Group Id. This
value has to be unique amongst all consumer groups.
CONSUMERS
● In a consumer group, each consumer is assigned one or
more partitions. If there are more consumers than
partitions, some consumers will sit idle.
● For instance, if the number of partitions for a topic is
5 and the consumer group has 6 consumers, one of those
consumers will not get any data.
● On the other hand, if the number of consumers in this
case is 3, then Kafka will assign a combination of
partitions to each consumer.
CONSUMERS
● When a consumer in the consumer group is removed, Kafka
will reassign the partitions that belonged to this
consumer to other consumers in the group.
● If a consumer is added to a consumer group, Kafka will
reassign some partitions from the existing consumers to
the new consumer.
BROKERS
● Within a Kafka cluster, a single Kafka server is called
a broker. The broker receives messages from producers,
assigns offsets to them, and commits the messages to
storage on disk.
● It also services consumers, responding to fetch requests
for partitions with the messages that have been
committed to disk.
● Depending on the specific hardware and its performance
characteristics, a single broker can easily handle
thousands of partitions and millions of messages per
second.
CLUSTERING AND LEADERSHIP ELECTION
● A Kafka cluster is made up of brokers. Each broker is
allocated a set of partitions.
● For each partition, a broker is either the leader or a
replica.
● Each partition also has a replication factor that goes
with it.
● For instance, imagine that we have 3 brokers, namely b1,
b2, and b3, and a producer pushes a message to the Kafka
cluster.
CLUSTERING AND LEADERSHIP ELECTION
● Using the key of the message, the producer will decide
which partition to send the message to.
● Let’s say the message goes to partition p1, whose leader
resides on broker b1; with a replication factor of 2,
the replica of p1 resides on b2. A sketch of creating
such a topic layout follows below.
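As a hedged sketch, a topic with this kind of layout could be created with kafka-python's admin client; the topic name and counts are just the b1/b2 scenario restated:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 3 partitions spread across the brokers; each partition gets one
# leader and one follower replica (replication_factor=2), as in the
# b1/b2 example above.
admin.create_topics([
    NewTopic(name="transactions", num_partitions=3, replication_factor=2)
])
```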
CLUSTERING AND LEADERSHIP ELECTION
● When the producer sends a message to partition p1, the
producer API will send the message to broker b1, the
leader. If the “acks” property has been configured to
“all”, the leader b1 will wait until the message has
been replicated to broker b2. If “acks” has been set to
“one”, b1 will write the message to its local log and
the request will be considered complete (see the sketch
below).
● In the event of the leader b1 going down, Apache
Zookeeper initiates a leadership election.
● In this particular case, since only broker b2 is
available for this partition, it becomes the leader for
p1.
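A minimal sketch of the acks setting with the kafka-python producer; the broker address and topic are assumptions:

```python
from kafka import KafkaProducer

# acks="all": the send is only acknowledged once the leader has
# replicated the message to the in-sync replicas (b2 in the example).
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

future = producer.send("transactions", b"important-event")
record_metadata = future.get(timeout=10)  # raises if never acknowledged
print(record_metadata.partition, record_metadata.offset)
```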
CONNECTOR API
● The Connector API allows building and running reusable
producers or consumers that connect Kafka topics to
existing applications or data systems.
● For example, a connector to a relational database might
capture every change to a table. In terms of an ETL
system, connectors handle the Extract and Load of data,
as sketched below.
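As a sketch, connectors are typically registered by POSTing JSON to a Kafka Connect worker's REST API. The example below assumes a Connect worker running at localhost:8083 and uses the FileStreamSource connector that ships with Apache Kafka; the file path and topic name are made up:

```python
import requests  # pip install requests

# FileStreamSource tails a file and publishes each line to a topic
# (Extract + Load, with no Transform step).
connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "file-lines",
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```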
STREAMS API
● The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more
topics and producing an output stream to one or more
output topics, effectively transforming the input
streams into output streams.
● Basically, the Streams API helps to organize the data
pipeline. Again, in terms of an ETL system, Streams
handle the data Transformation, as in the sketch below.
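The Streams API itself is a Java library; what follows is not the Streams API but a rough Python sketch of the same consume-transform-produce idea, using the kafka-python client with made-up topic names and an assumed JSON payload carrying a price field:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("transactions",
                         bootstrap_servers="localhost:9092",
                         group_id="enricher",
                         value_deserializer=lambda b: json.loads(b))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda o: json.dumps(o).encode())

# Consume from the input topic, transform each event, and produce
# the enriched event to an output topic.
for msg in consumer:
    event = msg.value
    event["price_with_tax"] = round(event["price"] * 1.21, 2)
    producer.send("transactions-enriched", event)
```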
APACHE ZOOKEEPER
● Apache Zookeeper is a distributed,
open-source configuration and
synchronization service that
provides multiple features for
distributed applications.
● Kafka uses it to manage the
cluster, storing, for instance,
shared information about consumers
and brokers.
● In older Kafka versions it also
kept track of consumer group
offsets; newer versions store
offsets in an internal Kafka topic.
WHY KAFKA: MESSAGING - PUBLISHERS/SUBSCRIBERS
● Read and write streams of data like a messaging system.
● As mentioned before, producers create new messages or
events.
○ A message will be produced to a specific topic. In the same way,
consumers (also known as subscribers) read messages.
○ The consumer subscribes to one or more topics and reads the messages
in the order in which they were produced.
○ The consumer keeps track of which messages it has already consumed by
keeping track of the offset of messages.
● An important concept to explore: a message queue allows
you to scale processing of data over multiple consumer
instances.
WHY KAFKA: MESSAGING - PUBLISHERS/SUBSCRIBERS
● Unfortunately, once a message is consumed from the queue,
it is no longer available to other consumers that may be
interested in the same message.
● Publish/subscribe, in contrast, allows you to broadcast
each message to a list of consumers or subscribers, but
by itself does not scale processing.
● Kafka offers a mix of those two messaging models: Kafka
publishes messages in topics that broadcast all the
messages to different consumer groups, and each consumer
group acts as a message queue that divides up processing
over all the members of the group, as sketched below.
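To make the mixed model concrete, here is a small kafka-python sketch; broker address, topic, and group ids are assumptions. Two consumers with different group ids each receive every message (broadcast), while consumers sharing one group id would instead split the partitions between them (queue-style scaling):

```python
from kafka import KafkaConsumer

# Different group ids: each group independently receives all messages.
billing = KafkaConsumer("transactions",
                        bootstrap_servers="localhost:9092",
                        group_id="billing-service")
fraud = KafkaConsumer("transactions",
                      bootstrap_servers="localhost:9092",
                      group_id="fraud-detection")

print(billing.poll(timeout_ms=1000))
print(fraud.poll(timeout_ms=1000))  # both see the same records
```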
WHY KAFKA: STORE
● Store streams of data safely in a distributed,
replicated, fault-tolerant cluster.
● A file system or database commit log is designed to
provide a durable record of all transactions so that they
can be replayed to consistently build the state of a
system.
● Similarly, data within Kafka is stored durably, in order,
and can be read deterministically (Narkhede et al. 2017)
WHY KAFKA: STORE
● This last point is important: in a message queue, once a
message is consumed it disappears. A log, in contrast,
retains messages (or events) for a configurable time
period; this is known as retention.
● Thanks to the retention feature, Kafka allows different
applications to consume and process messages at their
own pace, depending, for instance, on their processing
capacity or business purpose.
WHY KAFKA: STORE
● In the same way, if a consumer goes down, it can continue
to process the messages in the log once it has recovered.
This enables event sourcing: whenever we make a change to
the state of a system, we record that state change as an
event, and we can confidently rebuild the system state by
reprocessing the events at any time in the future (Fowler
2017).
● In addition, since the data is distributed within the
system, it provides additional protection against
failures, as well as significant opportunities for
scaling performance (Helland 2015).
WHY KAFKA: STORE
● Kafka is therefore a kind of special-purpose distributed
file system dedicated to high-performance, low-latency
commit log storage, replication, and propagation.
● This doesn’t mean Kafka’s purpose is to replace storage
systems, but it is helpful for keeping data consistent
across a distributed set of applications.
WHY KAFKA: PROCESS
● Write scalable stream processing applications that react
to events in real-time.
● As previously described, a stream represents data moving
from the producers to the consumers through the brokers.
Nevertheless, it is not enough to just read, write, and
store streams of data; the purpose is to enable real-time
processing of streams.
● Kafka allows building very low-latency pipelines with
facilities to transform data as it arrives, including
windowed operations, joins, aggregations, etc.
WHY KAFKA: FEATURES
WHY KAFKA: FEATURES - MULTIPLE PRODUCERS
● Kafka is able to seamlessly handle multiple producers
that help to aggregate data from many data sources in a
consistent way.
● A single producer can send messages to one or more
topics. Some important properties that can be set at the
producer’s end are as follows:
○ acks : The number of acknowledgements that the producer requires from
the leader of the partition after sending a message. If set to 0, the
producer will not wait for any reply from the leader. If set to 1,
the producer will wait until the leader of the partition writes the
message to its local log. If set to all, the producer will wait for
the leader to replicate the message to the replicas of the partition.
WHY KAFKA: FEATURES - MULTIPLE PRODUCERS
● Some important properties that can be set at the
producer’s end are as follows:
○ batch.size : The producer will batch messages together as set by this
parameter in bytes. This improves performance as now a single TCP
connection will send batches of data instead of sending one message
at a time
○ compression.type : The Producer API provides mechanisms to compress a
message before pushing it to the Kafka cluster. The default is no
compression, but the API provides gzip, snappy, and lz4 (see the
sketch below)
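Assuming kafka-python is the client in use, these producer properties map to constructor keyword arguments; the values below are arbitrary examples, not recommendations:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=32 * 1024,     # accumulate up to 32 KB per partition batch
    linger_ms=10,             # wait up to 10 ms to fill a batch first
    compression_type="gzip",  # or "snappy" / "lz4"; default is none
)
```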
WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● In the same way, the publish/subscribe architecture
allows multiple consumers to process the data.
● Each message is broadcast by the Kafka cluster to all
the subscribed consumers. This makes it possible to
connect multiple applications to the same or different
data sources, enabling the business to connect new
services as they emerge.
WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● Furthermore, remember that for scaling up processing,
Kafka provides consumer groups. Whenever a new consumer
group comes online, identified by its group id, it has a
choice of reading messages from a topic from the first
offset or from the latest offset.
● Kafka also provides a mechanism to control the influx of
messages into a Kafka cluster.
WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● Some interesting properties that can be set at the
consumer’s end are discussed below (a sketch follows):
○ auto.offset.reset : One can configure the consumer library of choice
to start reading messages from the earliest offset or the latest
offset. “earliest” is useful if the consumer group is new and wants
to process from the very beginning
○ enable.auto.commit : If set to true, the library will commit the
offsets periodically in the background, as specified by the
auto.commit.interval.ms interval
○ max.poll.records : The maximum number of records returned in a batch
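A sketch of the same consumer properties as kafka-python constructor arguments; all values are illustrative:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    group_id="reporting",
    auto_offset_reset="earliest",  # new group starts from first offset
    enable_auto_commit=True,       # commit offsets in the background...
    auto_commit_interval_ms=5000,  # ...every 5 seconds
    max_poll_records=100,          # cap records returned per poll()
)
```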
WHY KAFKA: FEATURES - DISK-BASED RETENTION
● log.retention.bytes : The maximum size a log for a
partition can grow to before being deleted
● log.retention.hours : The number of hours to keep a log
file before deleting it, tertiary to the
log.retention.ms property
● log.retention.minutes : The number of minutes to keep a
log file before deleting it, secondary to the
log.retention.ms property. If not set, the value in
log.retention.hours is used
● log.retention.ms : The number of milliseconds to keep a
log file; takes precedence over the minutes and hours
settings (see the sketch below)
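The properties above are broker-level defaults. Retention can also be overridden per topic, where the equivalent keys drop the "log." prefix (retention.ms, retention.bytes). A hedged sketch using kafka-python's admin client; the topic name and limits are arbitrary:

```python
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep "transactions" for 7 days or 1 GiB per partition, whichever
# limit is hit first.
admin.alter_configs([
    ConfigResource(ConfigResourceType.TOPIC, "transactions",
                   configs={"retention.ms": str(7 * 24 * 3600 * 1000),
                            "retention.bytes": str(1024 ** 3)}),
])
```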
WHY KAFKA: FEATURES - HIGH PERFORMANCE
● Kafka’s flexible design allows adding multiple brokers
to scale horizontally.
● Expansions can be performed while the cluster is online,
with no impact on the availability of the system as a
whole.
● All these features converge to improve Kafka’s
performance.
Typical architecture in real scenarios
REFERENCES
https://cs.ulb.ac.be/public/_media/teaching/infoh415/student_projects/2019/kafka.pdf
