Building Data Streaming
Platform with Apache Kafka
Serhii Kalinets
System Architect
History of Kafka
Created at LinkedIn
Creators then founded Confluent
Why the name Kafka? Jay Kreps (Confluent CEO): I thought that since
Kafka was a system optimized for writing, using a writer’s name
would make sense. I had taken a lot of lit classes in college and liked
Franz Kafka. Plus the name sounded cool for an open source project.
Kafka use cases
Message Broker
Logs
Commit log
Streaming
What is Kafka
A publish/subscribe messaging system that has an
interface typical of messaging systems
but a storage layer more like a log-aggregation system
Messaging System
Messages
Topics
Partitions
Producers
Consumers
Messages
Key / value pair; both can be null
Kafka treats both just as bytes
Serialization / deserialization happens on clients
Confluent broker can validate messages against a schema
https://kafka.apache.org/intro
How many partitions?
What is the throughput you expect to achieve for the topic?
What is the maximum throughput you expect to achieve when
consuming from a single partition?
Producer throughput can usually be ignored (producers are typically much faster than consumers)
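As a rough sizing sketch for the questions above: divide the throughput expected for the topic by what one consumer can read from a single partition, and round up. The function name and the sample numbers are illustrative assumptions, not figures from the talk.

```python
import math


def estimate_partitions(target_mb_per_s: float, consumer_mb_per_s: float) -> int:
    """Lower bound on the partition count so that consumers can keep up.

    target_mb_per_s   -- throughput expected for the whole topic
    consumer_mb_per_s -- max throughput of one consumer reading one partition
    """
    if target_mb_per_s <= 0 or consumer_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / consumer_mb_per_s)


# Hypothetical numbers: 1000 MB/s for the topic, 50 MB/s per consumer.
print(estimate_partitions(1000, 50))  # 20
```

The result is only a starting point; broker disk, network and memory limits still cap how many partitions are practical.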
How many partitions?
Adding partitions later can be very challenging
Consider the number of partitions you will place on each broker and
available disk space and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other
resources on the broker and will increase the time for leader elections.
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apache Kafka
Producers
Can specify the partition explicitly or implicitly (via partitioners)
The decision is made on the producer side
Different SDKs might have different default partitioners
Adding new partitions can change which partition a given key maps to
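A minimal sketch of why adding partitions reshuffles keys, assuming a partitioner of the form hash(key) mod partition count. `zlib.crc32` is only a stand-in hash here; real clients use their own hash functions (the Java client uses murmur2), so the concrete partition numbers will differ.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy key partitioner: stable hash of the key modulo the partition count."""
    return zlib.crc32(key) % num_partitions


# Hypothetical keys; count how many land on a different partition
# after the topic grows from 6 to 8 partitions.
keys = [f"user-{i}".encode() for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 8))
print(f"{moved} of {len(keys)} keys changed partition")
```

Any consumer that relies on all messages for one key living in one partition sees that assumption break for the moved keys.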
Producer guarantees
Kafka guarantees ordering within a partition for each producer
Ordering can be broken by retries if max.in.flight.requests.per.connection > 1
Idempotent producers (retries will not cause duplicates)
Transactions (messages sent within transactions will be available for
consumers only after transaction completes)
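The retry caveat above can be illustrated with a toy model (pure Python, no Kafka client involved): batches are sent in windows of `max_in_flight`, and a batch that fails its first attempt is retried in the next window. All names are hypothetical.

```python
def send_batches(batches, max_in_flight, fail_first_attempt):
    """Return the order in which batches end up in the partition log."""
    log = []
    pending = list(batches)
    failed_once = set()
    while pending:
        window = pending[:max_in_flight]      # batches sent concurrently
        retry = []
        for batch in window:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)        # first attempt fails, retry later
                retry.append(batch)
            else:
                log.append(batch)             # broker appends the batch
        pending = retry + pending[len(window):]
    return log


print(send_batches(["b1", "b2", "b3"], 2, {"b1"}))  # ['b2', 'b1', 'b3']
print(send_batches(["b1", "b2", "b3"], 1, {"b1"}))  # ['b1', 'b2', 'b3']
```

With two requests in flight, b2 is written before the retried b1; with one request in flight the original order survives the retry.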
Consumer Groups
Common group.id
One broker acts as the group coordinator; one consumer is the group leader
Poll loop
Simple for developer: while (true) { consumer.poll(); processMessages(); }
Complicated implementation: coordination, rebalancing, heartbeats etc.
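To make the rebalancing part concrete, here is a simplified sketch of spreading partitions over group members in round-robin fashion and recomputing the assignment when a member leaves. It is an illustration only; the assignors shipped with real clients are more involved, and the consumer names are made up.

```python
def assign(partitions, consumers):
    """Round-robin assignment of partitions to the members of one group."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


print(assign(range(6), ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# c3 leaves the group: a rebalance hands its partitions to the others.
print(assign(range(6), ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```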
Commits and offsets
Consumers commit their last offsets to Kafka
Automatic / manual commits
Sync / async commits
auto.offset.reset: where to start reading when there is no committed offset (earliest or latest)
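A small simulation of why committing after processing gives at-least-once delivery: if the consumer crashes between processing a record and committing it, the record is read again after restart. The function and its parameters are hypothetical and do not use a Kafka client.

```python
def run_consumer(log, committed, crash_before_commit_at=None):
    """Process records starting at the committed offset.

    Returns (processed_records, committed_offset). The committed offset is
    the offset of the next record to read, as in Kafka.
    """
    processed = []
    offset = committed
    while offset < len(log):
        processed.append(log[offset])           # 1. process the record
        if offset == crash_before_commit_at:    # crash: the commit never happens
            return processed, committed
        committed = offset + 1                  # 2. commit the next offset
        offset += 1
    return processed, committed


log = ["a", "b", "c"]
first, committed = run_consumer(log, 0, crash_before_commit_at=1)
print(first, committed)    # ['a', 'b'] 1
second, committed = run_consumer(log, committed)
print(second, committed)   # ['b', 'c'] 3
```

Record "b" is processed twice, so processing has to be idempotent (or use transactions) when duplicates matter.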
Datastore
Partitions
Replicas
Segments
Compaction
Replication
Recommended topic configuration (stock Kafka defaults: replication factor 1, min.insync.replicas 1)
Replication factor = 3
min.insync.replicas = 2
In producers: acks = all
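How these three settings interact can be sketched as a simple rule: with acks = all, a write is accepted only while the number of in-sync replicas is at least min.insync.replicas. The helper below is a toy model with hypothetical names, not broker code.

```python
def can_write(in_sync_replicas: int, min_insync_replicas: int, acks: str) -> bool:
    """Would a produce request be accepted for this partition?"""
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all
        return in_sync_replicas >= min_insync_replicas
    return in_sync_replicas >= 1  # otherwise only a live leader is required


# Replication factor 3, min.insync.replicas 2:
print(can_write(3, 2, "all"))  # True  (all replicas healthy)
print(can_write(2, 2, "all"))  # True  (one broker down, still writable)
print(can_write(1, 2, "all"))  # False (two brokers down, writes rejected)
```

With this combination the partition tolerates the loss of one broker without losing acknowledged data or write availability.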
Segments
Physical files with raw data
Kafka keeps open file handles to all segments, including inactive ones
Writes go only to the active segment
Retention and compaction are applied only to inactive segments
Retention
Kafka does not wait until all consumers read data
log.retention.ms -- retention by time
log.retention.bytes -- retention by size (per partition)
log.segment.bytes -- size at which the active segment is closed
log.roll.ms -- age at which the active segment is closed (topic level: segment.ms)
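A toy model of time-based retention working on whole segments: closed segments whose newest record is older than the retention period are dropped, and the active segment is never touched. The `Segment` class and the timestamps are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    newest_record_ms: int  # timestamp of the newest record in the segment
    active: bool = False   # True only for the segment currently being written


def apply_time_retention(segments, now_ms, retention_ms):
    """Keep the active segment and every closed segment within retention_ms."""
    return [
        s for s in segments
        if s.active or now_ms - s.newest_record_ms <= retention_ms
    ]


segments = [
    Segment(base_offset=0, newest_record_ms=1_000),
    Segment(base_offset=100, newest_record_ms=5_000),
    Segment(base_offset=200, newest_record_ms=9_000, active=True),
]
kept = apply_time_retention(segments, now_ms=10_000, retention_ms=6_000)
print([s.base_offset for s in kept])  # [100, 200]
```

Because deletion happens per segment, data can outlive the configured retention until its segment is closed and fully expired.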
Compaction: removes older records, keeping the latest value for each key
Compaction
min.compaction.lag.ms: minimum time a message stays uncompacted
To delete a key, send a new message with that key and a null value
(tombstone)
delete.retention.ms: how long a tombstone is kept before it can be deleted (the default is 24
hours)
Compaction process is configurable (# of threads, resource
consumption, frequency etc.)
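The effect of compaction and tombstones can be sketched on an in-memory list of (key, value) pairs. This is a simplified model: it ignores segments and timing, and the `drop_tombstones` flag stands in for delete.retention.ms having elapsed.

```python
def compact(records, drop_tombstones=False):
    """Keep only the newest record per key, preserving offset order.

    records is a list of (key, value) pairs; value None marks a tombstone.
    """
    newest = {}
    for offset, (key, value) in enumerate(records):
        newest[key] = (offset, value)  # later offsets overwrite earlier ones
    ordered = sorted(newest.items(), key=lambda item: item[1][0])
    return [
        (key, value) for key, (_, value) in ordered
        if not (drop_tombstones and value is None)
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k2", None), ("k3", "d")]
print(compact(log))                        # [('k1', 'c'), ('k2', None), ('k3', 'd')]
print(compact(log, drop_tombstones=True))  # [('k1', 'c'), ('k3', 'd')]
```

The tombstone for k2 survives the first pass so that consumers still see the delete; only later is it removed as well.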
Brokers
The cluster uses ZooKeeper to handle membership
One broker is the controller; it is responsible for partition
leader election
There are plans to remove the ZooKeeper dependency (KIP-500)
Kafka guarantees
Durability and high availability
Message ordering in partition
At least once / exactly once
Transactions
Kafka Streams
High-level DSL for working with Kafka topics as streams
Currently JVM only (Java / Scala)
DSL is rather simple (kind of map / join / reduce)
Supports joins, filters, aggregations
Streams and tables
Handles all low level stuff
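Kafka Streams itself is JVM-only, so the following is just a plain-Python analogy of the kind of pipeline the DSL expresses: a stateless filter and map followed by a stateful count per key that emits an update for every input event. The event fields are made up for the example.

```python
from collections import Counter


def filter_map_count(events):
    """Yield (page, running_count) for every non-bot page view."""
    counts = Counter()                   # state, like a table backing the stream
    for event in events:
        if event["bot"]:                 # filter: drop bot traffic
            continue
        page = event["page"].lower()     # map: normalise the key
        counts[page] += 1                # aggregate: count per key
        yield page, counts[page]         # emit the updated count


events = [
    {"page": "Home", "bot": False},
    {"page": "home", "bot": True},
    {"page": "HOME", "bot": False},
    {"page": "docs", "bot": False},
]
print(list(filter_map_count(events)))  # [('home', 1), ('home', 2), ('docs', 1)]
```

The emitted sequence is a stream of updates, while `counts` plays the role of the table holding the latest value per key.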
Kafka Streams
Kafka Connect
Is a framework for connecting Kafka with external systems such as
databases, key-value stores, search indexes, and file systems
Built on the Kafka producer and consumer APIs
Deploys as a cluster via operators / Helm charts
Configurable via REST endpoint
Add a MySQL connector
echo '{"name":"mysql-login-connector",
"config":{"connector.class":"JdbcSourceConnector",
"connection.url":"jdbc:mysql://127.0.0.1:3306/test?user=root",
"mode":"timestamp","table.whitelist":"login","validate.non.null":false,
"timestamp.column.name":"login_time","topic.prefix":"mysql."}}' |
curl -X POST -d @- http://localhost:8083/connectors \
--header "Content-Type: application/json"
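The same REST call can be prepared from a script. The sketch below only builds the request with the standard library and does not send it; the helper name is hypothetical, while the URL and connector settings are copied from the command above.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (but do not send) a POST request that registers a connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_connector_request(
    "http://localhost:8083",
    "mysql-login-connector",
    {
        "connector.class": "JdbcSourceConnector",
        "connection.url": "jdbc:mysql://127.0.0.1:3306/test?user=root",
        "mode": "timestamp",
        "table.whitelist": "login",
        "validate.non.null": False,
        "timestamp.column.name": "login_time",
        "topic.prefix": "mysql.",
    },
)
# urllib.request.urlopen(request) would submit it to a running Connect worker.
print(request.get_method(), request.full_url)
```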
https://kafka.apache.org/intro
ksqlDB
is an event streaming database
SQL on top of Kafka Streams + materialized views
ksqlDB Components
Streams: immutable sequences of events
Tables: mutable collections of events (latest value per key)
Stream processing: transform, filter, aggregate and join
Push queries let you subscribe to a query's result as it changes in
real-time.
Pull queries allow you to fetch the current state of a materialized
view.
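The stream/table and push/pull distinctions can be mimicked in a few lines of plain Python. This is only an analogy with made-up data: the list is the immutable stream, the dict is the table materialized from it, a dict lookup plays the pull query and the generator plays the push query.

```python
def materialize(stream):
    """Fold a stream of (key, value) events into a table of latest values."""
    table = {}
    for key, value in stream:
        table[key] = value  # newer events overwrite the row for the key
    return table


def push_query(stream, key):
    """Yield every change for one key as it appears in the stream."""
    for event_key, value in stream:
        if event_key == key:
            yield value


events = [
    ("car-1", (50.45, 30.52)),
    ("car-2", (49.84, 24.03)),
    ("car-1", (50.46, 30.53)),
]
table = materialize(events)
print(table["car-1"])                     # pull: (50.46, 30.53)
print(list(push_query(events, "car-1")))  # push: [(50.45, 30.52), (50.46, 30.53)]
```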
Creating tables
CREATE TABLE currentCarLocations (
    vehicleId VARCHAR,
    latitude DOUBLE,
    longitude DOUBLE
) WITH (
    kafka_topic = 'locations',
    partitions = 3,
    key = 'vehicleId',
    value_format = 'json'
);
Queries
SELECT vehicleId,
       latitude,
       longitude
FROM currentCarLocations
WHERE ROWKEY = '6fd0fcdb'
EMIT CHANGES;
Advantages
Non-developers can write their own queries
Read from and write to many data sources
Much less code, so fewer bugs
Data exploration
Our Roadmap
Consumer / producer API
Kafka Streams / Connect ← we are here
ksqlDB
Thanks!
serhii.kalinets@pm.bet
@skalinets
