2. Course Outline
Introduction
- Key concepts
- Architecture
- Decoupling applications
- Example use cases
- Technical terms, e.g. Topic / Partition / Producer / Consumer
- Components
- Retention policy
- FAQs and limitations
Setting up
- Set up a Docker Compose test-bed Kafka cluster
- Use a GUI to connect to Kafka and get familiar with the tools and concepts
Hands-on
- Implement Kafka producer and consumer applications
- Testing and debugging
Monitoring (if possible)
- Monitoring tools
3. Kafka's origin story
- Initial project 2010, initial release 2011
- Developed at LinkedIn
- Written in Java/Scala
- Named after Franz Kafka, "a system optimized for writing"
- Originally shipped page views, page events, and aggregated logs to Apache Hadoop
- Needed a distributed architecture
- Founders: Jay Kreps (CEO, Confluent), Neha Narkhede (CTO, Confluent), Jun Rao (Co-founder, Confluent)
5. Tech giant use cases
- As of 2019: 100+ clusters | 4,000 brokers | 100,000 topics | > 4.5 trillion messages per day
- As of 2015: 20 Gbps
- As of 2015: 5 billion sessions per day | stream-processing use cases (Apache Storm, Apache Hadoop, and AWS Elastic MapReduce)
- As of 2016: 700 billion events | stream processing with Keystone | > 1 trillion messages per day
references:
[1] Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
[2] Deep Learning Infrastructure for Extreme Scale with the Apache Kafka Open Source Ecosystem - DataScienceCentral.com
19. Kafka key features (1)
reference: Top 10 Kafka Features | Why Apache Kafka Is So Popular - DataFlair (data-flair.training)
a. Scalability
Apache Kafka scales in all four dimensions: event producers, event processors, event consumers, and event connectors. In other words, Kafka scales easily without downtime.
b. High Volume
Kafka works easily with huge volumes of data streams.
c. Data Transformations
Kafka offers provisions for deriving new data streams from the data streams it receives from producers.
d. Fault Tolerance
The Kafka cluster tolerates node failures: partitions are replicated, and a replica takes over if the leader fails.
e. Reliability
Since Kafka is distributed, partitioned, replicated, and fault-tolerant, it is very reliable.
20. Kafka key features (2)
reference: Top 10 Kafka Features | Why Apache Kafka Is So Popular - DataFlair (data-flair.training)
f. Durability
Kafka is durable because it uses a distributed commit log, meaning messages are persisted to disk as quickly as possible.
g. Performance
Kafka has high throughput for both publishing and subscribing, and it maintains stable performance even when many terabytes of messages are stored.
h. Zero Downtime
Kafka is very fast and guarantees zero downtime and zero data loss.
i. Extensibility
There are many ways for applications to plug in and make use of Kafka, and it also offers ways to write new connectors as needed.
j. Replication
Using ingest pipelines, Kafka can replicate events.
22. Kafka architecture overview
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
23. Broker node overview
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
> A Java process ☕
> Adding brokers increases I/O capacity, availability, and durability 🚀
> Storage is an append-only log
> Consumers read from the broker node that leads the partition
> A single node is elected as the cluster controller
24. ZooKeeper overview
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
> Elects the controller broker node
> Ensures there is at most one controller broker node
> Stores some configuration
> Maintains cluster metadata
> Handles housekeeping items
> The number of ZooKeeper nodes must be odd!
> Security is weak 🔐
26. Producer overview
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
> A Kafka client that produces (publishes) messages
> Connects to the Kafka cluster over TCP
> Cannot modify the message log once written
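A minimal producer sketch in Java (assuming a local broker at localhost:9092 and an illustrative "demo" topic):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // entry point into the cluster
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send one record; the key determines which partition it lands on.
            producer.send(new ProducerRecord<>("demo", "key-1", "hello kafka"));
            producer.flush(); // block until the record is acknowledged
        }
    }
}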
27. Consumer overview
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
> A Kafka client that consumes messages
> Parallelism is at most the number of partitions
> A lost consumer can cause a rebalance
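A matching minimal consumer sketch (same assumed broker and "demo" topic; the group id is illustrative):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group"); // consumers in one group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}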
28. Topic
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java ,
Koutanov, Emil, eBook - Amazon.com
37. Watermark
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java ,
Koutanov, Emil, eBook - Amazon.com
38. Consumer processing
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java, Koutanov, Emil, eBook - Amazon.com
> max.poll.interval.ms: how much time the consumer is permitted to take to finish processing a batch; increasing it also increases how long a rebalance can take
> session.timeout.ms: the timeout used to decide whether the consumer is still alive and sending heartbeats
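An illustrative way to tune these timeouts, extending the consumer configuration sketched in the consumer overview (values are examples, not recommendations):

props.put("max.poll.interval.ms", "300000"); // allow up to 5 minutes between poll() calls
props.put("session.timeout.ms", "45000");    // declared dead after 45 s without heartbeats
props.put("heartbeat.interval.ms", "3000");  // heartbeat cadence, well below the session timeout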
40. Component relation
reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java ,
Koutanov, Emil, eBook - Amazon.com
48. KRaft, KIP-500 (2)
reference: The Apache Kafka Control Plane ZooKeeper vs. KRaft - YouTube
49. Before Avro: CSV
+ Easy to read
+ Easy to parse
+ Easy to understand
===========
- No data types attached
- Breaks if a value contains a ","
- name,,ID : needs a way to distinguish empty from null
50. Before Avro: Relational Database
+ Has data types
+ Data is in table form
===========
- Data types may differ if the DB engines differ
- Must view the data as rows of a table
54. Delivery semantics for consumers (1)
> At most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message is lost (it won't be read again).
reference: Delivery Semantics for Kafka Consumers | Learn Apache Kafka (conduktor.io)
enable.auto.commit=true
auto.commit.interval.ms=5000 (5 s)
55. Delivery semantics for consumers (2)
> At least once: offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing, so make sure your processing is idempotent (i.e. processing the same message again won't impact your systems).
reference: Delivery Semantics for Kafka Consumers | Learn Apache Kafka (conduktor.io)
enable.auto.commit=true
auto.commit.interval.ms=5000 (5 s)
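A common at-least-once pattern, sketched here as an alternative to auto-commit (reusing the consumer setup from the consumer overview; the "demo" topic and process() handler are illustrative), is to commit offsets manually only after processing succeeds:

props.put("enable.auto.commit", "false"); // take manual control of offset commits

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("demo"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // must be idempotent: a crash before commitSync() replays records
        }
        consumer.commitSync(); // commit only after the whole batch is processed
    }
}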
56. Delivery semantics for consumers (3)
> Exactly once: every message is guaranteed to be persisted in Kafka exactly once, without duplicates or data loss, even when there is a broker failure or producer retry.
reference: Exactly-Once Processing in Kafka explained | by sudan | Medium
Producer
enable.idempotence=true
transactional.id=<id> *
Consumer
enable.auto.commit=false
isolation.level=read_committed
(* is optional)
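A minimal transactional-producer sketch building on the earlier producer setup (topic names and the transactional.id are illustrative; error handling is simplified):

props.put("enable.idempotence", "true");
props.put("transactional.id", "demo-tx-1"); // must stay stable across producer restarts

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("demo", "key-1", "value-1"));
    producer.send(new ProducerRecord<>("demo", "key-2", "value-2"));
    producer.commitTransaction(); // both records become visible atomically
} catch (Exception e) {
    producer.abortTransaction(); // read_committed consumers skip aborted records
}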
62. Segment
reference: Log Compacted Topics in Apache Kafka | by Seyed Morteza Mousavi | Towards Data Science
[Figure: a partition's log is a chain of closed segments followed by one active segment]
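Segment rolling is controlled per topic; an illustrative configuration (values are examples, not recommendations):

segment.bytes=1073741824    # roll the active segment after ~1 GiB
segment.ms=604800000        # or after 7 days, whichever comes first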
63. Broker config for production
> Recommended Java 8 or 11
> At least 8 GiB of memory
> Change log.dirs (the default is a temp dir)
> At least 3 brokers (replication factor 3)
> auto.create.topics.enable=false
> Set num.partitions to match the expected consumer parallelism
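A sketch of the corresponding server.properties entries (paths and counts are illustrative assumptions):

log.dirs=/var/lib/kafka/data        # move logs off the temp dir
auto.create.topics.enable=false     # create topics explicitly instead
num.partitions=6                    # default partition count for new topics
default.replication.factor=3        # requires at least 3 brokers
min.insync.replicas=2               # a common companion setting for durability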
64. Log compaction (changelog topic)
reference: Log Compacted Topics in Apache Kafka | by Seyed Morteza Mousavi | Towards Data Science
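A hedged sketch of creating a compacted (changelog) topic with the Java AdminClient (the topic name and sizing are illustrative):

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateChangelogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic changelog = new NewTopic("user-profile-changelog", 3, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact")); // keep only the latest value per key
            admin.createTopics(Set.of(changelog)).all().get();
        }
    }
}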
69. Kafka Streams benefits
reference: Kafka Streams Overview | Confluent Documentation
> Makes your applications highly scalable, elastic, distributed, and fault-tolerant
> Supports exactly-once processing semantics
> Stateful and stateless processing
> Event-time processing with windowing, joins, and aggregations
> Supports Kafka Streams Interactive Queries to unify the worlds of streams and databases
> Choose between a declarative, functional API and a lower-level imperative API for maximum control and flexibility
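A minimal Kafka Streams topology sketch using the declarative API (topic names and the application id are illustrative):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic")                        // read a stream of records
               .mapValues(v -> v.toString().toUpperCase())   // stateless transformation
               .to("output-topic");                          // write the derived stream back to Kafka

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}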
73. Security
In transit:
> GSSAPI: Kerberos authentication
> PLAIN: username/password authentication
> SCRAM-SHA-256 and SCRAM-SHA-512: username/password authentication
> OAUTHBEARER: authentication using OAuth
At rest:
> No out-of-the-box solution.
reference: Encrypting Kafka messages at rest to secure applications | Kafka Summit Europe 2021 - Confluent
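An illustrative client configuration for in-transit security with SASL/SCRAM over TLS (addresses, credentials, and paths are placeholders):

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="app-user" \
  password="app-secret";
ssl.truststore.location=/etc/kafka/secrets/truststore.jks
ssl.truststore.password=changeit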