Streaming process with
Kafka Connect & Kafka Streams
鄭紹志@亦思科技
vito@is-land.com.tw
2017/09/30
About me
● 鄭紹志 Vito
● 亦思科技, R&D Director
● Big data research and development
● Enjoy Java / Scala development
High-level architecture
[Diagram: Data Source (Database, Filesystem, ...) → Kafka Connect → Kafka (Broker) → Kafka Connect → Data Sink (Database, Filesystem, ...). Producers, Consumers, and Kafka Streams also connect to the Kafka brokers.]
Kafka Connect
Kafka Connect use case: ETL
● Move data from X (Source) into Kafka
○ Storage systems, e.g. FileSystem, RDB, Cassandra, S3, ...
○ External applications, e.g. Twitter, GitHub
● Move data from Kafka into Y (Sink)
○ Storage systems, e.g. FileSystem, RDB, Cassandra, S3, ...
○ Search, e.g. Elastic, Solr
Kafka Connect overview
● Apache Kafka 0.9+
● A common framework for Kafka connectors
● Standalone and distributed mode
● REST interface (distributed mode)
● Automatic offset management
● Distributed and scalable by default
● Lightweight transformations
https://kafka.apache.org/documentation/#connect_overview
Source & Sink
[Diagram: Source side: Database, File, ? → connectors in Kafka Connect → Kafka (Broker). Sink side: Kafka (Broker) → connectors in Kafka Connect → Database, Elastic, ?]
Running Kafka Connect
● Standalone
$ bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties]...
● Distributed
$ bin/connect-distributed.sh config/connect-distributed.properties
Connector
● The connector framework lets you implement connectors for custom requirements
● Shipped with Apache Kafka
○ FileStreamSourceConnector / FileStreamSinkConnector
● More connectors: https://www.confluent.io/product/connectors/
Worker
● Worker: one running unit of Kafka Connect (a JVM process)
● Responsible for running connectors and tasks
● Two types: Standalone / Distributed
● Automatic load balancing and failover
Inside the worker
[Diagram: a Kafka Connect worker is one JVM process. It runs Conn-1 with Task 1 and Task 2, and Conn-2 with Task 1, Task 2 and Task 3. Each task runs in a thread and handles partitions (Partition 1, Partition 2, Partition 3, ...).]
Max task config (per connector): tasks.max
Distributed mode: Worker cluster
[Diagram: with one worker, Worker 1 runs Conn-1 and Conn-1 Tasks 1, 2 and 3. After Worker 2 joins, Worker 1 runs Conn-1 and Task 1, and Worker 2 runs Task 2 and Task 3.]
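The redistribution shown in the diagram can be sketched in plain Java. This is only an illustration under stated assumptions: the class `WorkerAssignmentSketch` and its round-robin policy are hypothetical and are not Kafka Connect's actual rebalance protocol, so the exact placement of tasks may differ from the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration only: not Kafka Connect's real rebalance protocol. */
final class WorkerAssignmentSketch {

    /** Spreads tasks over the live workers in round-robin order. */
    static Map<String, List<String>> assign(List<String> workers, List<String> tasks) {
        if (workers.isEmpty()) {
            throw new IllegalArgumentException("at least one worker is required");
        }
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String worker : workers) {
            assignment.put(worker, new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            String worker = workers.get(i % workers.size());
            assignment.get(worker).add(tasks.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("Conn-1, Task 1", "Conn-1, Task 2", "Conn-1, Task 3");
        // One worker carries every task.
        System.out.println(assign(List.of("Worker 1"), tasks));
        // When a second worker joins, the same tasks are spread over both.
        System.out.println(assign(List.of("Worker 1", "Worker 2"), tasks));
    }
}
```

Calling `assign` again with a shorter worker list models failover: the tasks of the missing worker are handed to the remaining ones.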
Kafka Streams
Overview & Concept
Streaming data
● Overloaded term
○ streaming data / data stream / event stream ...
○ event / message / log
● Common characteristics
○ Unbounded data (unlimited size): no defined end
○ Immutable: never changed once produced
○ Time ordered: records have a temporal order
○ Replayable: can be replayed
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Kafka Streams overview
● Included with Apache Kafka since v0.10 (May 2016)
○ Not compatible with older Kafka brokers
● Just a Java library, no dedicated cluster required
● Realtime
● Highly scalable, Fault-tolerant
● Stateful / Stateless transformation
Time
● Event time
● Ingestion time (log append time)
● Processing time
● Message timestamp in Kafka
○ Since 0.10, Kafka messages carry a timestamp (KIP-32)
○ Its meaning depends on configuration
■ Event time → Producer Time → CreateTime
■ Ingestion time → Broker Time → LogAppendTime
State & State stores
● A stateful transformation has to keep some state across records
● StateStore:
○ For caching: in memory (HashMap)
○ For persistence: RocksDB
https://stackoverflow.com/a/40114039/3155650
Stream Processing
Topology
http://kafka.apache.org/0110/documentation/streams/core-concepts#streams_topology
Building a topology:
● High level: DSL
● Low level: Processor API
[Diagram: one Kafka Streams application. Its instances act together as a cluster ("Cluster!!"), and each instance has a local state store.]
https://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_your_app
Quick Sample (DSL)
Question: count the number of airports in each state
Airport data for the US states (CSV), pushed to the 'airport' topic:
"iata","airport","city","state","country"
"L70","Agua Dulce Airpark", "Agua Dulce","CA","USA"
"TPA","Tampa International ","Tampa","FL","USA"
http://stat-computing.org/dataexpo/2009/
[Diagram: 'airport' topic → input message from 'airport' → get 'State' value (parse CSV message) → groupBy 'State' → count records → output message to 'airport-counts' → 'airport-counts' topic]
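// Kafka 0.11 DSL. csvParser is assumed to be a CSV parser created elsewhere (for example opencsv's CSVParser).
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
KStreamBuilder builder = new KStreamBuilder();
// Source stream: one CSV line per message from the 'airport' topic
KStream<String, String> textLines = builder.stream("airport");
KTable<String, Long> airportCounts = textLines
    // Extract the 'state' column (index 3); lines that fail to parse map to null
    .mapValues(textLine -> {
        try {
            return csvParser.parseLine(textLine)[3];
        } catch (Exception e) {
            return null;
        }
    })
    // Re-key by state (triggers a repartition), then count per key
    .groupBy((key, state) -> state)
    .count("counts");
// Write the running counts to the output topic
airportCounts.to(Serdes.String(), Serdes.Long(), "airport-counts");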
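To make clear what the topology computes, here is a minimal plain-Java sketch of the same group-and-count over a fixed list of lines. It does not use Kafka at all. The class `AirportCountSketch` is hypothetical, and the naive `split(",")` assumes no commas inside quoted fields, which a real CSV parser would handle.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Batch illustration of "count airports per state"; not a streaming implementation. */
final class AirportCountSketch {

    /** Returns the 'state' column (index 3), or null if the line has too few columns. */
    static String stateOf(String csvLine) {
        // Naive split: assumes no commas inside quoted fields.
        String[] columns = csvLine.split(",");
        if (columns.length < 4) {
            return null;
        }
        return columns[3].trim().replace("\"", "");
    }

    /** Groups the lines by state and counts them, skipping unparsable lines. */
    static Map<String, Long> countByState(List<String> csvLines) {
        return csvLines.stream()
                .map(AirportCountSketch::stateOf)
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(state -> state, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "\"L70\",\"Agua Dulce Airpark\", \"Agua Dulce\",\"CA\",\"USA\"",
                "\"TPA\",\"Tampa International \",\"Tampa\",\"FL\",\"USA\"");
        System.out.println(countByState(lines));
    }
}
```

The streaming version differs in one important way: the input never ends, so the count per state is a continuously updated value rather than a final result.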
Demo
[Diagram: the same pipeline annotated with DSL types. Create source stream from the 'airport' topic (KStream<String, String>) → transform: get 'State' value by parsing the CSV message (KStream<String, String>) → transform: groupBy 'State' (KGroupedStream<String, String>) → transform: count records (KTable<String, Long>) → write stream to Kafka, 'airport-counts' topic]
Result
$ bin/kafka-console-consumer.sh --topic airport-counts --from-beginning \
--property print.key=true \
--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
Key Value
AS 3
CT 15
VT 13
IN 65
MT 71
: :
Kafka Streams Application Reset
● Re-running a streaming computation requires resetting its state
● Local reset
○ Call KafkaStreams#cleanUp()
● Global reset
○ Run $ bin/kafka-streams-application-reset.sh
○ Resets offsets to zero for input topics
○ Deletes all internal (auto-created) topics of the application
■ {application.id}-xxxx-repartition
■ {application.id}-xxxx-changelog
Kafka Streams DSL
Kafka Streams DSL overview
● KStream, KTable, GlobalKTable
● Stateless transformation
● Stateful transformation
○ State
○ Aggregation
○ Join
○ Window
KStream vs KTable
Stream data (Person, City):
|jack| Taipei|
|vito|Hsinchu|
|jack|Hsinchu|
● KStream: every record is an independent event
○ time1: jack went to Taipei
○ time2: jack went to Taipei, then Hsinchu
● KTable: every record is an update for its key
○ time1: jack lives in Taipei
○ time2: jack lives in Hsinchu (replaces Taipei)
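The two interpretations of the same records can be sketched in plain Java. This is an illustration only, with a hypothetical class `StreamVsTableSketch`; it uses ordinary collections rather than the Kafka Streams `KStream` and `KTable` types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Reads the same (person, city) records as a stream of events and as a table of latest values. */
final class StreamVsTableSketch {

    /** Stream view: every record is kept as an independent event. */
    static List<String> asStream(List<String[]> records) {
        List<String> events = new ArrayList<>();
        for (String[] record : records) {
            events.add(record[0] + " went to " + record[1]);
        }
        return events;
    }

    /** Table view: each record overwrites the previous value for its key. */
    static Map<String, String> asTable(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : records) {
            table.put(record[0], record[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[] {"jack", "Taipei"},
                new String[] {"vito", "Hsinchu"},
                new String[] {"jack", "Hsinchu"});
        System.out.println(asStream(records)); // three events
        System.out.println(asTable(records));  // one current city per person
    }
}
```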
Converting between KStream and KTable
● KStream → KStream
● KTable → KTable
● KStream → KTable
● KTable → KStream
http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_duality
Stateless transformation
● filter() , filterNot()
● map(), mapValues()
● flatMap() , flatMapValues()
● foreach() , peek()
Changing the key triggers a re-partition!
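Why a key change forces a re-partition can be seen from how a record's partition is derived from its key. The sketch below is an assumption-laden stand-in: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas the hypothetical `RepartitionSketch` uses `String.hashCode()` purely for illustration.

```java
/** Illustrates that the target partition is a function of the key. */
final class RepartitionSketch {

    /** Stand-in partitioner: hashCode() replaces Kafka's murmur2 hash for illustration. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // A record keyed by person is re-keyed by city, for example in map() or groupBy().
        String oldKey = "jack";
        String newKey = "Taipei";
        System.out.println("partition for old key: " + partitionFor(oldKey, numPartitions));
        System.out.println("partition for new key: " + partitionFor(newKey, numPartitions));
    }
}
```

Records that share the new key may sit in different partitions under the old key, so they have to be written out again, partitioned by the new key, before a key-based operation can see them together.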
Stateful transformation
● Join
● Aggregation
● Window
Join operations
https://docs.confluent.io/3.3.0/streams/developer-guide.html#joining
● Key-based
● Require co-partitioning of the input data
Aggregation operations
● Key-based
● count()
● reduce()
● aggregate()
● Two types
○ Latest(rolling) aggregation
○ Windowed aggregation
Window
● Processes the records that fall within a time interval
● Tumbling window
● Hopping window
● Sliding window
● Session window
Tumbling Window
Window size: 3 mins
Window move (advance interval): 3 mins
| | | | |
0 3 6 9 12
// Tumbling window: only the size is given (here 5 minutes, in milliseconds)
stream.map( /* do something */ )
    .groupByKey()
    .count(TimeWindows.of(5*60*1000L), "store");
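Because tumbling windows do not overlap, each record belongs to exactly one window, and that window's start follows from simple arithmetic. A minimal sketch, assuming windows are aligned to timestamp 0 and timestamps are in milliseconds; the class `TumblingWindowSketch` is hypothetical and is not part of Kafka Streams.

```java
/** Computes which tumbling window a timestamp falls into. */
final class TumblingWindowSketch {

    /** Returns the inclusive start of the window [start, start + sizeMs) containing the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    public static void main(String[] args) {
        long threeMinutes = 3 * 60 * 1000L;
        long fourMinutes = 4 * 60 * 1000L;
        // A record at minute 4 falls into the window that starts at minute 3.
        System.out.println(windowStart(fourMinutes, threeMinutes) / 60_000L);
    }
}
```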
Hopping Window
Window size: 3 mins
Window move (advance interval): 2 mins
| | | | |
0 3 6 9 12
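With an advance interval smaller than the window size, windows overlap and one record can belong to several of them. The sketch below lists the start of every window that contains a given timestamp. It assumes non-negative timestamps in milliseconds, windows aligned to 0, and an advance no larger than the size; the class `HoppingWindowSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Lists the hopping windows that contain a timestamp. */
final class HoppingWindowSketch {

    /** Returns the starts of all windows [start, start + sizeMs) that contain the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest aligned window start that can still cover the timestamp.
        long start = (Math.max(0, timestampMs - sizeMs + advanceMs) / advanceMs) * advanceMs;
        while (start <= timestampMs) {
            starts.add(start);
            start += advanceMs;
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60 * 1000L;
        // Size 3 minutes, advance 2 minutes, record at minute 4.
        System.out.println(windowStarts(4 * minute, 3 * minute, 2 * minute));
    }
}
```

Setting the advance equal to the size gives the tumbling case, where the list always holds exactly one window.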
// Hopping window: 5-minute size, advancing every 1 minute (both in milliseconds)
stream.map( /* do something */ )
    .groupByKey()
    .count(
        TimeWindows.of(5*60*1000L)
            .advanceBy(60 * 1000L),
        "store");
Sliding window
● Moves on every record
● Used only for join operations
Session window
| | | | |
0 3 6 9 12
// SessionWindows.with() expects the inactivity gap in milliseconds
final long INACTIVITY_GAP = TimeUnit.MINUTES.toMillis(6);
stream.map( /* do something */ )
.groupByKey()
.count(SessionWindows.with(INACTIVITY_GAP), "store");
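How an inactivity gap splits records into sessions can be sketched in plain Java. This is a simplification under stated assumptions: timestamps arrive in ascending order and a new session starts only when the gap is strictly exceeded. Kafka Streams also merges sessions when out-of-order records arrive, which the hypothetical `SessionWindowSketch` does not model.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups ascending timestamps into sessions separated by an inactivity gap. */
final class SessionWindowSketch {

    /** A new session starts when the distance to the previous record exceeds the gap. */
    static List<List<Long>> sessions(List<Long> sortedTimestampsMs, long inactivityGapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long timestamp : sortedTimestampsMs) {
            if (current == null || timestamp - previous > inactivityGapMs) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(timestamp);
            previous = timestamp;
        }
        return result;
    }

    public static void main(String[] args) {
        long minute = 60 * 1000L;
        // Records at minutes 0, 3, 12 and 14 with a 6-minute inactivity gap.
        System.out.println(sessions(List.of(0L, 3 * minute, 12 * minute, 14 * minute), 6 * minute));
    }
}
```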
Parallelism Model
https://kafka.apache.org/documentation/streams/architecture
● Partition: Topic partitions / Stream partitions
● One thread can run multiple StreamTasks
● The number of partitions determines the number of StreamTasks
● A partition is assigned to exactly one StreamTask
● Each StreamTask runs one topology
● StreamsConfig: num.stream.threads
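The relationship between partitions, tasks and threads can be sketched as follows. This is an illustration only: it assumes one task per input partition and deals the tasks to threads round-robin, which is a hypothetical policy and not the actual Kafka Streams assignor. The class `StreamTaskSketch` is likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Shows how the partition count bounds the useful number of stream threads. */
final class StreamTaskSketch {

    /** Returns, for each thread, the ids of the tasks (one per partition) it runs. */
    static List<List<Integer>> tasksPerThread(int numPartitions, int numStreamThreads) {
        if (numStreamThreads < 1) {
            throw new IllegalArgumentException("at least one stream thread is required");
        }
        List<List<Integer>> threads = new ArrayList<>();
        for (int t = 0; t < numStreamThreads; t++) {
            threads.add(new ArrayList<>());
        }
        for (int task = 0; task < numPartitions; task++) {
            threads.get(task % numStreamThreads).add(task);
        }
        return threads;
    }

    public static void main(String[] args) {
        // Six partitions on two threads: each thread runs three tasks.
        System.out.println(tasksPerThread(6, 2));
        // Three partitions on four threads: one thread has nothing to do.
        System.out.println(tasksPerThread(3, 4));
    }
}
```

Raising num.stream.threads beyond the number of tasks therefore adds idle threads rather than more parallelism.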
Thank you !
