Data Pipeline with Kafka
Dr. Mole T.Y. WONG @ HK OSCON 2018
2018 / 06 / 16 - 17
whoami
HK01 Data Team | About
Why: Understand our users in depth and provide actionable insights.
How: Use data to steer our product direction.
What: Data definition, ingress, processing, and insight.
Data-Driven Product Development
[Diagram: Browser Page A lists a Fashion Article and a Car Article; Browser Page B lists a Fashion Article and a Stock Article. A click event on either page leads to the same Fashion Article.]
Traffic Source Analysis
Click-Through Rate VS Pageview
Machine Learning Products
Collaborative filtering
Image source: wikipedia
User Reading History
NLP Content-based
Clustering
Personalized Recommendation Feed
Outline
● Data pipeline - what is it?
● Kafka - roles in a data pipeline
● Other use cases of Kafka
Typical Data Pipeline Setup
[Diagram: WEB, APP, Tracker, API Gateway, Kinesis, S3, AWS EMR, Apache Airflow, Redshift Spectrum, Metabase]
Different Aspects of a Data Pipeline
[Diagram: WEB, APP, Tracker]
Data Ingress
JS Library (WEB)
Native Library (APP)
Google Analytics
Mixpanel
Matomo (Piwik)
Data Tracker
● Nature
○ Lightweight
○ Programmable
● Capability
○ Page view / Screen view
○ Custom events
○ Device identification
○ Session management
Infrastructure
- AWS Kinesis
- Google Pub/Sub
- Apache Kafka
[Diagram: WEB, APP, Tracker, API Gateway, Kinesis]
Data Infrastructure
● Main Roles
○ Buffering
○ Routing
○ Writing
● Characteristics
○ Multiple producers
○ Multiple consumers
○ Batch / Real-time
[Diagram: Pre-processing (Cleansing, Transformation, Data Warehousing): S3, AWS EMR, Apache Airflow, Redshift Spectrum]
Pre-processing
● Main Roles
○ Avoid querying raw data directly
○ Cleansing
○ ETL - Extract, Transform, Load
○ Scheduling
● Characteristics
○ Defining data sets
○ Time-frame-based queries
[Diagram: Application (Dashboard, Reporting, Recommendation Engine, etc.): Redshift Spectrum, Metabase]
Application
● Main Roles
○ KPI VS Exploration
○ Operators VS Data Scientists
○ Planned VS Ad-hoc queries
● Characteristics
○ Production-grade data
○ Fast is a must
What is Kafka? https://kafka.apache.org/ Main Contributor: Gene NG
Data Pipeline with Kafka
[Diagram: WEB, APP, Tracker, API Gateway, Kafka Connect API, Kafka, Kafka Connect API, Metabase. Optional: data persists in S3]
What is Kafka - terminology
Basics: Producer-Consumer Model
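Producer
while (true) {
  var e = produce_event()   // build the next event
  producer.produce(e)       // push it to Kafka
}
Consumer
while (true) {
  var m = consumer.poll()   // pull the next batch of messages
  consume_msgs(m)           // process the batch
}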
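The producer-consumer loop above can be tried without a broker. The following is a minimal sketch only: `InMemoryTopic` and `PollingConsumer` are hypothetical stand-ins invented for illustration, not part of any Kafka client library. It shows the two ideas the slide relies on: producers append, and consumers pull from their own position.

```python
class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic: an append-only message log."""

    def __init__(self):
        self.log = []

    def produce(self, message):
        # Producers only ever append to the end of the log.
        self.log.append(message)


class PollingConsumer:
    """Hypothetical consumer that pulls messages and remembers its position."""

    def __init__(self, topic):
        self.topic = topic
        self.position = 0  # index of the next unread message

    def poll(self, max_messages=10):
        # Return up to max_messages unread messages and advance the position.
        batch = self.topic.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch


topic = InMemoryTopic()
for i in range(3):
    topic.produce({"event": "click", "id": i})

consumer = PollingConsumer(topic)
first_batch = consumer.poll()   # the three events produced above
second_batch = consumer.poll()  # empty: nothing new has been produced
```

A real deployment would replace both classes with a Kafka client; the loop structure stays the same.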
[Diagram: Data Source and Data Sink, each connected to Kafka through a Kafka Connect API]
Connect API
- For database / data source
- Wrapped consumer & producer code
- Nice thing: config file only!
Connect API - common connectors
- JDBC (MySQL, PgSQL)
- S3
- HDFS
- ElasticSearch
Kafka Connect
[Diagram: Data Source 1 → Topic 1, Data Source 2 → Topic 2, Data Source 3 → Topic 3]
Data Topic Model
● One-to-one (most common)
Features
● Autonomous
○ Loads data from sources whenever changes occur
● Storage
○ Writes data to the hosted HDD
○ Optional: sync data to S3
Kafka Connect - Source Property File
Source: https://github.com/confluentinc/kafka-connect-jdbc/blob/master/config/source-quickstart-sqlite.properties
22
name=test-source-sqlite-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlite:test.db
mode=incrementing
incrementing.column.name=id
topic.prefix=test-sqlite-jdbc-
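To make `mode=incrementing` and `incrementing.column.name=id` more concrete, here is a rough sketch of the idea: remember the largest `id` seen so far and fetch only rows above it. This is a simplification for illustration, not the connector's actual implementation; the `accounts` table and the `fetch_new_rows` helper are assumptions.

```python
import sqlite3


def fetch_new_rows(conn, table, column, last_seen):
    """Return rows whose incrementing column is greater than last_seen."""
    # Table and column names come from trusted config here, not user input.
    query = f"SELECT * FROM {table} WHERE {column} > ? ORDER BY {column}"
    return conn.execute(query, (last_seen,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO accounts (name) VALUES (?)", [("alice",), ("bob",)])

last_seen = 0
first_poll = fetch_new_rows(conn, "accounts", "id", last_seen)
last_seen = first_poll[-1][0]  # remember the highest id read so far

conn.execute("INSERT INTO accounts (name) VALUES ('carol')")
second_poll = fetch_new_rows(conn, "accounts", "id", last_seen)  # only the new row
```

Note that this approach only notices inserted rows; updates to existing rows do not change `id` and would be missed.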
Topic naming convention
- Prefix, and
- DB table name
How it works:
- Each table implies one topic.
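The naming rule can be sketched in a few lines. The snippet below parses the quickstart properties shown on the slide and joins `topic.prefix` with a table name. The table names `accounts` and `orders` and the helper functions are assumptions made for illustration.

```python
CONFIG = """
name=test-source-sqlite-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlite:test.db
mode=incrementing
incrementing.column.name=id
topic.prefix=test-sqlite-jdbc-
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def topic_for(table, props):
    """One table maps to one topic: prefix followed by the table name."""
    return props["topic.prefix"] + table


props = parse_properties(CONFIG)
topics = [topic_for(table, props) for table in ("accounts", "orders")]
```

With this configuration, a table named `accounts` would be published to the topic `test-sqlite-jdbc-accounts`.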
Kafka Connect
[Diagram: App traffic, Web traffic and Data Source X feed a single topic (All traffic / Topic X)]
Data Topic Model
● One-to-one (most common)
● Many-to-one
Kafka Connect
[Diagram: App traffic and Web traffic → All traffic]
Schema-less
● In practice, you can write any type of data to the topic
● The most common choice is Avro
Avro is an open-source library for schema specification and data serialization.
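For readers who have not seen one, an Avro schema is a JSON document. The sketch below builds a hypothetical record schema for a click event as a Python dict and serializes it to JSON. The record name, namespace and every field are assumptions for illustration, not the schema used in the talk, and no Avro library is involved.

```python
import json

# Hypothetical record schema for a click event, in Avro's JSON schema format.
CLICK_EVENT_SCHEMA = {
    "type": "record",
    "name": "ClickEvent",
    "namespace": "example.tracking",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "article_id", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
        # A union with "null" makes the field optional.
        {"name": "source", "type": ["null", "string"], "default": None},
    ],
}

schema_json = json.dumps(CLICK_EVENT_SCHEMA, indent=2)
field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
```

Agreeing on such a schema is what keeps a many-to-one topic readable when several producers write to it.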
Kafka Connect
[Diagram: Data Source X, Data Source Y, Topic 1, Topic 2, Topic 3]
Data Topic Model
● One-to-one (most common)
● Many-to-one
● One-to-many (rarest)
Kafka Connect
[Diagram: Data Source X → Topic 1 → Consumer A, Consumer B]
A more practical approach
● Use the same source of truth / data
● Consumed by multiple consumers
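A small sketch of why one topic can serve several consumers: the data is stored once, and each consumer only keeps its own read position. The `Reader` class is a hypothetical illustration, not a Kafka API.

```python
# One shared log, standing in for Topic 1.
topic_1 = [{"article": "fashion"}, {"article": "car"}, {"article": "stock"}]


class Reader:
    """Hypothetical consumer holding nothing but its own read position."""

    def __init__(self, name):
        self.name = name
        self.position = 0

    def read(self, log, count):
        # Reading advances this reader only; the log itself is untouched.
        batch = log[self.position:self.position + count]
        self.position += len(batch)
        return batch


consumer_a = Reader("Consumer A")
consumer_b = Reader("Consumer B")

a_seen = consumer_a.read(topic_1, 3)  # A reads everything
b_seen = consumer_b.read(topic_1, 1)  # B is slower and reads one message
```

Neither reader affects the other, so no second copy of the data (and no second topic) is needed.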
Takeaway Messages
● Producers and consumers are actors
○ Push data to or pull data from Kafka
● Connect API automates the above actions
○ Works nicely with databases
Data Pipeline Use Cases
Kafka as a data pipeline - data resiliency
[Diagram: Kafka, Kafka Connect API, Data Sink]
Kafka Internal - consumer’s state
Consumer | Topic | Current topic position | Your last-read position | Lag behind by
hello_world | foobar | 1080 | 1000 | 80
Kafka keeps track of each consumer's state:
- A consumer can always resume work in progress
- A new consumer can start fresh!
Source:
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_command_line.html
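The lag figure in the table is simple arithmetic: the topic's current position minus the consumer's last-read position. A minimal sketch, with a hypothetical helper name:

```python
def consumer_lag(current_topic_position, last_read_position):
    """Number of messages the consumer has not read yet."""
    if last_read_position > current_topic_position:
        raise ValueError("a consumer cannot be ahead of the topic")
    return current_topic_position - last_read_position


# Values from the table: consumer hello_world on topic foobar.
lag = consumer_lag(1080, 1000)
```

Because the last-read position is stored by Kafka rather than by the consumer process, a restarted consumer can pick up at position 1000 and work through the remaining 80 messages.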
Kafka as a data pipeline - Replace ETL
Function | Use Case
filter() | Cleansing
map() | Transformation
reduce() | Aggregation
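The mapping in the table can be illustrated with Python's built-in `filter`, `map` and `functools.reduce` on a small in-memory list. This is not Kafka Streams code (which is written in Java or Scala); the event fields are hypothetical and only the three-step shape carries over.

```python
from functools import reduce

# Hypothetical raw tracking events; one has no user and one has stray whitespace.
raw_events = [
    {"user": "u1", "page": "/fashion/1 "},
    {"user": None, "page": "/car/2"},
    {"user": "u2", "page": "/fashion/3"},
    {"user": "u1", "page": "/stock/4"},
]

# filter(): cleansing, drop events without a user.
cleansed = filter(lambda e: e["user"] is not None, raw_events)

# map(): transformation, keep only the section name from the page path.
sections = map(lambda e: e["page"].strip().split("/")[1], cleansed)


def count(acc, section):
    """reduce(): aggregation, count page views per section."""
    acc[section] = acc.get(section, 0) + 1
    return acc


views_per_section = reduce(count, sections, {})
```

In a streaming setting the same three steps run continuously on each incoming message instead of once over a finished list.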
[Diagram: Data Source → Kafka Connect API (E) → Kafka Stream API (T: .filter(), .map(), .reduce(), written in Java or Scala) → Kafka Connect API (L) → Data Sink]
Source: https://i.redd.it/yf7rw3pjiapx.jpg
Kafka - Streaming Example Code
Source: https://kafka.apache.org/11/documentation/streams/tutorial
A New Topic is Created!
Data Pipeline with Kafka v2
Kafka - Replacing ETL
[Diagram: WEB, APP, Tracker, API Gateway, Kafka Connect API, Kafka Stream API, Kafka Connect API, Metabase. Optional: data persists in S3]
Experimenting Kafka in HK01
[Diagram: External Services, Apache Airflow, Kafka Connect API, Kafka Stream API, Kafka Connect API, Metabase]
1. Fetch data from an external service every hour.
2. When data arrives at S3, Kafka takes it in.
3. Stream API counts the number of new users using certain services.
4. Connect API automatically updates the MySQL table. Metabase can display the updates.
A live dashboard will be displayed during the talk.
Other Use Cases
Message Queue | Source: https://www.confluent.io/blog/stream-data-platform-1/
Highly coupled: application & storage
As a message queue (MQ):
- Pub/Sub
- Transformation
- Clear roles: it is explicit which systems are the sources and which are the sinks
Other Use Cases | Source: https://kafka.apache.org/uses
Things that we didn’t explore
● Log aggregation
● Database log compaction
● Event sourcing
Key Takeaways
Pros
1. Kafka simplifies your ETL tasks.
2. Kafka unifies your data storage.
3. Kafka gives you other possibilities.
Cons
1. Ops problems - scalability, HA, Zookeeper, etc.
2. Learning curve is *STEEP*.
We Love to Share
Mole Wong: Data Pipeline with Apache Kafka (Day 1 17:40, Conference Hall 4-5)
Ivan Ha: React Async Rendering - Paradigm Shift After React Fiber (Day 2 15:10, Conference Hall 6)
Sunday Ku: Video.js with HLS (Day 2 12:30, Conference Hall 4-5)
https://goo.gl/j74Ztt
