Kafka for developer

for developer #1
ภูริเดช สุดสี (เอาะ)
โปรแกรมเมอรนอรแมด

Course Outline
Introduction
- Key concept
- Architecture
- Decoupling application
- Example use case
- Technical term e.g. Topic / Partition
/ Producer / Consumer
- Component
- Retention policy
- FAQ and Limitation
Setting up
- Setup Docker Compose for test bed
kafka cluster
- Use GUI to connect Kafa to familiar
with tool and concept
Hand on
- Implement Kafka producer and
consumer application
- Testing and debugging
Monitoring (If possible)
- Monitoring tool
2

Kafka’s origin story
- Initial project 2010, Initial release 2011
- Develop at Linkedin
- Written in Java/Scala
- Named after Franz Kafka, "a system
optimized for writing"
- Written in Java/Scala
- Page view, page event, aggregated logs to
Apache Hadoop.
- Need distributed architecture
Jay Kreps
CEO, Confluent
Neha Narkhede
CTO, Confluent
Jun Rao
Co-founder, Confluent
3

references:
[1]Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java ,
Koutanov, Emil, eBook - Amazon.com
[2] Deep Learning Infrastructure for Extreme Scale with the Apache Kafka Open Source Ecosystem - DataScienceCentral.com
As of 2019 | 100+ cluster | 4,000 brokers | 100,000 topics |
> 4.5 Trillion message per day
As of 2015 | 20 Gbps
As of 2015 | 5 Billion session per day | Stream processing
use case (Apache Storm,Apache Hadoop and AWS
Elastic MapReduce)
As of 2016 | 700 Billion event | Stream processing with
Keystone | > 1 Trillion message per day
Tech giant use case
5

Use cases (1) Application activity tracking
- Kafka มีจุดเริ่มตนมาจาก Linkedin อยากรูการใช
งานของผูใช
- ทําเครื่องมือสําหรับเก็บการกด Click, เวลาที่ใชแตละ
Page, การลงทะเบียนและขอมูลตาง ๆ ของผูใช
- ทํา User personalize
- จัดขอมูลใหผูอานแตละคนโดยเฉพาะได
- การเก็บขอมูลจํานวนมาก ทําใหรูความตองการ หรือ
กระทั้งปลูกฝงความตองการ
- การเห็นขอมูลความตองการทําใหเตรียม Stock
สินคา หรือบริการได
6

Use cases (2) Messaging system
- ชวยสงผานขอมูลในรูปของ Message
- วิธีคิดแบบขอความชวยใหเราสามารถแบงการประมวลผลเปน Process ยอย ๆ ไดดี
- Message ทําใหการแบงการประมวลผลเปนแบบขนานทําไดดีขึ้น
- บางงานสามารถใชเปน Buffer คําขอได
7

Use cases (3) Decoupling service
- Microservice ทําให Context ของ Application อยูใน Domain ของตัวเอง
- Service ทําสําคัญจะตรวจสอบไดงาย ทําใหดูแลงาย
- หลังจากแยก Service ก็ยังตองการติดตอกัน
- สามารถติดตอผานทาง Kafka โดย Consumer จะทํางานตามความเร็วของตัว
เอง ทําให Service สามารถแยกกันได และไดความเสถียรมากขึ้น
8

Use cases (4) Application logs gathering and monitoring
- ใชเก็บ Log แบบ Near real-time
(AWS/GCP มี Service แตอาจจะชากวาและ
ตองจายเพิ่มในบางกรณี)
- บางกรณีตองการเอา Application log ไปเก็บ
ที่อื่น
- Reduce log message
- ตรวจความผิดปกติของ Application
- ชวยทํา Alert
- เก็บ Metric ตาง ๆ ของระบบ (Log-base
metric)
9

Use cases (5) Real-time data processing
- มี Internal tool สําหรับ Process data แบบ
Real-time (KStream)
- เชื่อมตอกับ External tool ได (Hadoop, Spark)
- ประมวลผลขอมูลเชน Recommend สินคาไดทัน
ความตองการ
- ทําพวก Fraud detection / Security Intrusion
Detection
- ลดการประมวลผลขอมูลทีละมาก ๆ (Resources
cost)
10

Use cases (5) Tracking log
- เก็บขอมูล IOT
- เก็บขอมูล Sensor
- GPS
- Application / User positioning
11

Messaging Queue vs Pub/Sub
reference: Pub-Sub vs. Message Queues | Baeldung
Exactly once approach At least 1 approach
12

Pub/Sub messaging pattern (1)
reference: Kafka: The Definitive Guide 2, Shapira, Gwen, Palino, Todd, Sivaram, Rajini, Petty, Krit, eBook - Amazon.com
13

14

15

reference: Apache Kafka Core Concepts - Learning Journal
16

17

Kafka Pub/Sub messaging pattern
18

Kafka key feature (1)
reference: Top 10 Kafka Features | Why Apache Kafka Is So Popular - DataFlair (data-flair.training)
a. Scalability
Apache Kafka can handle scalability in all the four
dimensions, i.e. event producers, event processors,
event consumers and event connectors. In other
words, Kafka scales easily without downtime.
b. High-Volume
Kafka can work with the huge volume of data streams,
easily.
c. Data Transformations
Kafka offers provision for deriving new data streams
using the data streams from producers.
d. Fault Tolerance
The Kafka cluster can handle
failures with the masters and
databases.
e. Reliability
Since Kafka is distributed,
partitioned, replicated and fault
tolerant, it is very Reliable.
19

Kafka key feature (2)
reference: Top 10 Kafka Features | Why Apache Kafka Is So Popular - DataFlair (data-flair.training)
f. Durability
It is durable because Kafka uses Distributed commit
log, that means messages persists on disk as fast as
possible.
g. Performance
For both publishing and subscribing messages, Kafka
has high throughput. Even if many TB of messages is
stored, it maintains stable performance.
h. Zero Downtime
Kafka is very fast and guarantees zero downtime and
zero data loss.
i. Extensibility
There are as many ways by which
applications can plug in and make
use of Kafka. In addition, offers
ways by which to write new
connectors as needed.
j. Replication
By using ingest pipelines, it can
replicate the events.
So, this was all about Apache Kafka
Features. Hope you like our
explanation.
20

Kafka architecture overview
reference: Kafka Enterprise Architecture - Learning Journal
21

reference: Effective Kafka: A Hands-On Guide to Building Robust and Scalable Event-Driven Applications with Code Examples in Java ,
Kafka architecture overview
22

Broker node overview
> Java process ☕
> Increase broker then IO,
Availability and Durability
🚀
> Append only log
> Consumer will be reading
from broker node (leader)
> Single node is elected as
cluster controller
23

Zookeeper overview
> Elected controller broker
node
> Ensure at most one
controller broker node
> Store some configuration
> Maintain cluster metadata
> Housekeeping item
> Number of ZK node must
be odds!
> Security weak 🔐
24

Placement concern
> เลี่ยงการวาง Broker ไวในที่เดียวกัน ถาใช Cloud ใหกําหนด
Placement ใหกระจาย Zone หรือ Rack
> Zookeeper กับ Broker วางที่เดียวกันไดแตระวังเรื่อง Network
bandwidth
> Setup broker.rack=[rack-id]
25

Producer overview
> Kafka client to produce
message
> TCP connect to Kafka
cluster
> Unable to modify
message log
26

Consumer overview
> Kafka client to consume
message
> Parallel at most as
number of partition
> Lost consumer can cause
rebalance
27

Topic
28

reference: Understanding Kafka Topic Partitions | by Dunith Dhanushka | Event-driven Utopia | Medium
Partition
29

reference: Why you should know Apache Kafka | b-nova
Partition
30

Partition leader and follower (Replication factor)
RF <= No of broker
31

Message, Record, Log
> ที่ Offset เดียวกัน แตคนละ Partition
ขอความเปนคนละขอความ
> Partial order in partition
> Default max = 1MB
Timestamp: เซ็ตจาก Producer หรือใชอัตโนมัติได
Header: Set ของ Key/Value สําหรับใชเปน Metadata
Partition number: 0-base index partition number
Offset: 64 signed integer ตําแหนงของ message ใน partition
Key: Byte array เปนอะไรก็ไดไมจําเปนตอง Unique
Value: Byte array
32

Partition awareness
> same key go to same partition (hash function)
> Number of partition can increase but cannot decrease
33

reference: Kafka Streams, co-partitioning requirements illustrated | by Loïc DIVAD | Publicis Sapient Engineering | Medium
Partition hash function
34

Partition and consumer group
35

Advertise listener
reference: Apache Kafka Series - Kafka Cluster Setup & Administration | Udemy
./config/server.properties
36

Watermark
37

Consumer processing
38
> max.poll.interval.ms how much time
permit to complete processing by
consumer, increased time,increases the
rebalancing time
> session.timeout.ms is the timeout used
to identify if the consumer is still alive and
sending a heartbeat

Rebalance
39

Component relation
40

Architecture wrap-up
reference: Apache Kafka® Architecture: A Complete Guide - Instaclustr
41

Retention policy
> Time-base (7 days) default: log.retention.hours=168
> Size-base (1 GiB) default: log.retention.bytes=1073741824
> per broker
> per topic
42

reference: A Deep Dive into Apache Kafka This is Event Streaming by Andrew Dunnings & Katherine Stanley - YouTube
Offset
43

reference: Kafka Topic Configuration: Min In-Sync Replicas | Learn Apache Kafka (conduktor.io)
Sync & Async Producer
Sync
- ACK=0 (fire and forgot)
- ACK=1 (only leader)
- ACK=all (all min.insync.replicas)
Async
- Support callback (specify
number of request
max.in.flight.requests.per.connec
tion,and retries flag)
44

Commit sync & async consumer
commitSync
Commit message แลวจะ
Block I/O รอผลจากการ
commit จะจัดการ Error ได
งายกวา
commitAsync
Commit message แลวจะไม
Block I/O ทําใหจัดการ Error
ไดยากกวา หากตองการจัดการ
Error ใชผาน callback
function ได
45

reference: Kafka Producer Batching | Learn Apache Kafka with Conduktor
Batch.size and linger.ms
⬆ batch.size
⬆ Thoughtput
⬆ Latency
46

KRaft, KIP-500 (1)
reference: The Apache Kafka Control Plane ZooKeeper vs. KRaft - YouTube
47

reference: The Apache Kafka Control Plane ZooKeeper vs. KRaft - YouTube
KRaft, KIP-500 (2)
48

Before Avro, CSV
+ อานงาย
+ Parse งาย
+ เขาใจงาย
===========
- ไมมี Data type กํากับ
- ถาขอมูลมี , อยูงานงอก
- name,,ID ตองหาวิธีแยก Empty กับ Null
49

Before Avro, Relational Database
+ มี Data type
+ อยูในรูป Table
===========
- Data type อาจจะไมเหมือนกันถา DB
engine ตางกัน
- ตองมองขอมูลในรูป Row ของ Table
50

Before Avro, JSON
+ ขอมูลอยูในรูป nested JSON ซอน
กันได
+ อยูในรูป Array ได
+ สามารถใชไดกับเกือบทุกภาษา
+ เปนที่นิยม
+ easily to share over network
===========
- ไมมี Schema ที่ชัดเจน (อาจทําให
ขอมูลเดิมพัง)
- ตองใช Key ทุกครั้ง ทําใหเสีย
Bandwidth / Storage เยอะ
51

+ กําหนด Schema ดวย JSON
+ มี Data type
+ Auto compression
+ อานขามภาษาได
+ Schema เปลี่ยนแปลงได (verify backward
compatibility ได)
+ เหมาะกับการเขียน (มากกวา Parquet)
+ Support Hive
===========
- อาจจะไมมี Support ในภาษาบางภาษา
- เนื่องจากมัน compressed อยู ตองใชเครื่องมืออาน
Avro
52

Avro compatibility
reference: Schema Evolution and
Compatibility | Confluent Documentation 53

Delivery semantic for consumer (1)
> At most once: offsets are committed as soon as the message is
received. If the processing goes wrong, the message will be lost
(it won’t be read again).
reference: Delivery Semantics for Kafka Consumers | Learn Apache Kafka (conduktor.io)
enable.auto.commit=true
auto.commit.interval.ms=5000 (5s)
54

> At least once: offsets are committed after the message is processed. If the processing goes
wrong, the message will be read again. This can result in duplicate processing of messages. Make
sure your processing is idempotent (i.e. processing again the messages won’t impact your
systems)
reference: Delivery Semantics for Kafka Consumers | Learn Apache Kafka (conduktor.io)
enable.auto.commit=true
auto.commit.interval.ms=5000 (5s)
55

> Exactly once: Every message is
guaranteed to be persisted in Kafka
exactly once without any duplicates and
data loss even where there is a broker
failure or producer retry.
reference: Exactly-Once Processing in Kafka explained | by sudan | Medium
Producer
enable.idempotence=true
transactional.id=<id> *
Consumer
enable.auto.commit=false
isolation.level=read_committed
(* is optional)
56

Topic design
> อยาใช Topic เดียวทําทุกอยาง
> Consumer อานขอมูลจาก Partition leader ถา leader
อยูเครื่องเดียวกันเยอะ อาจมีปญหา Performance
> ควรใช Data schema แบบเดิม / ไม break
backward-compat
> ตองการ Order ควรให Message ตก Partition เดียวกัน
> ออกแบบ Key ดีไดทั้ง Ordering และ Performance
> พิจารณาแยก Topic ในกรณีที่ตองการ Throughput สูงๆ
57

Software architecture design
reference: Disaster Recovery for Multi-Region Kafka at Uber - Uber Engineering Blog
58

59

60

Software architecture design (Chatting)
61

Segment
reference: Log Compacted Topics in Apache Kafka | by Seyed Morteza Mousavi | Towards Data Science
Segment Active
Segment
62

Broker config for production
> Recommended Java 8, 11
> Memory at least 8 GiB
> Change log.dirs (default is temp dir)
> At least 3 broker (Replication Factor 3)
> auto.create.topics.enable=false
> change num.partitions to expect consumer
63

Log compaction (changelog topic)
reference: Log Compacted Topics in Apache Kafka | by Seyed Morteza Mousavi | Towards Data Science
64

Kafdrop
java --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
-jar target/kafdrop-3.30.0.jar.jar
--kafka.brokerConnect=localhost:9092
67

KSteam overview
reference: Kafka Streams Overview | Confluent Documentation
68

KSteam benefit
> Makes your applications highly scalable, elastic, distributed,
fault-tolerant
> Supports exactly-once processing semantics
> Stateful and stateless processing
> Event-time processing with windowing, joins, aggregations
> Supports Kafka Streams Interactive Queries to unify the worlds of
streams and databases
> Choose between a declarative, functional API and a lower-level
imperative API for maximum control and flexibility
69

KTable
reference: Streams Concepts | Confluent Documentation
> KSteam is a KTable duality
70

Kafka connect
> Data Centric Pipeline
Connect uses meaningful
data abstractions to pull
or push data.
> Flexibility
> Scalability
> Reusability
> Extensibility
=> File sync
=> Database Sync
71

Security
In-transit At rest
> GSSAPI
Kerberos authentication
> PLAIN
Username/password authentication
> SCRAM-SHA-256 and SCRAM-SHA-512
Username/password authentication
> OAUTHBEARER
Authentication using OAuth
No out of the box solution.
reference: Encrypting Kafka messages at rest to secure applications | Kafka Summit Europe 2021 - Confluent
73

SSL
reference: Apache Kafka Security 101 | Confluent
74

Kafka for developer

More Related Content

Similar to Kafka for developer (20)

More from Bhuridech Sudsee (20)

Kafka for developer