Workshop Series:
ksqlDB
2021-10-20
Presenters:
Jupil Hwang
Hyunsoo Kim
Time:
14:00 – 17:00
Agenda — ksqlDB
01 02:00 - 02:10 PM
02 Talk: Kafka, Kafka Streams, and ksqlDB, 02:10 - 02:30 PM
03 Lab: 02:30 - 02:45 PM
04 Lab: 02:45 - 03:00 PM
05 Lab: Hands on, 03:00 - 05:00 PM
Workshop Housekeeping
• Q&A
• If you have any questions, please submit them through the Q&A. The speakers will answer them directly after the presentation.
• Online survey
• Please share your valuable feedback on today's workshop; we will use it to prepare even better content in the future.
• The survey link will be (1) shared in the Zoom chat window, and (2) opened automatically in your web browser after the event ends.
Confluent Platform & Cloud:
[Diagram: legacy point-to-point architecture - apps, transactional and analytics databases, DWH, DB-to-DB links, MOM-to-MOM bridges, ETL jobs, and EAI/ESB middleware connected via pub/sub and point-to-point integrations]
[Diagram: the same architecture after adding NoSQL DBs and big data analytics - even more tangled point-to-point connections]
A streaming platform provides a single source of truth about data to every person and system in the organization.
[Diagram: apps, databases, DWH, NoSQL DBs, and big data analytics all connected through a central streaming platform]
80%+ of the Fortune 100 use Apache Kafka
Confluent was founded by the creators of Apache Kafka at LinkedIn
• Producers and consumers are decoupled
Event Streaming Platform
[Diagram: data sources (core systems, loans, credit cards, patient and lending data, device logs, data stores, logs, 3rd-party apps, Amazon S3, SaaS apps) flow through a data-in-motion pipeline into custom apps/microservices and data-in-motion applications: real-time inventory, real-time fraud detection, real-time customer 360, machine learning models, real-time data transformation, and more]
Streaming Platform: everything is an event - a sale, a shipment, a trade, a customer experience, and more
Event Stream Processing
What’s stream processing good for?
Materialized Cache/View
Streaming ETL Pipeline (source to sink)
Event-Driven Microservice
Confluent Platform Conceptual Architecture
[Diagram: OSS Apache Kafka® at the core; source and sink connectors move data between external data sources/sinks, while POJO apps/microservices, Streams apps, ksqlDB, and Schema Registry sit alongside]
OSS Apache Kafka® covers both messaging and data integration/ETL.
Confluent Platform Conceptual Architecture
[Diagram: Confluent Platform (Apache Kafka) surrounded by Enterprise Security, ksqlDB, Replicator, machine learning, Schema Registry, Control Center, source/sink connectors, Streams apps, microservices, mobile devices, car/IoT sensors, MQTT Proxy, and REST Proxy]
Confluent Platform adds Connect, Replicator, ksqlDB, and the REST/MQTT proxies around the Kafka cluster.
Confluent
Hall of Innovation
CTO Innovation
Award Winner
2019
Enterprise Technology
Innovation
AWARDS
Vision
● Kafka
● Event streaming
Category Leadership
● 80% of Kafka commits
● 1 Kafka
● 5000 Kafka
Value
● Risk
●
● TCO
● Time-to-market
Product
● Kafka
● Software
Cloud-Native Service
Confluent Enterprise Apache Kafka
• Runs anywhere - cloud, on-prem, hybrid, or multi-cloud
• Data integration - Connect
• Stream processing applications - KStreams, ksqlDB
Confluent
Open Source | Community licensed
Fully Managed Cloud Service
Self-managed Software
Training Partners
Enterprise
Support
Professional
Services
ARCHITECT
OPERATOR
DEVELOPER EXECUTIVE
Confluent Platform
Self-Balancing Clusters | Tiered Storage
DevOps
Operator | Ansible
GUI-based Management - Control Center | Proactive Support
ksqlDB
Pre-built
Connectors | Hub | Schema Registry
Non-Java Clients | REST Proxy
Admin REST APIs
Multi-Region Clusters | Replicator
Cluster Linking
Schema Registry | Schema Validation
RBAC | Secrets | Audit Logs
TCO / ROI
Revenue / Cost / Risk Impact
Complete Engagement Model
What is Apache Kafka?
Kafka is a distributed commit log
• Publish/Subscribe messaging.
• Durable, append-only storage.
• Transactions are supported.
[Diagram: a partition as a numbered log (offsets 1-8); producers perform append-only writes, reads are a single seek and scan; producer apps write to and consumer apps read from the Kafka cluster]
What are Kafka Connect and Kafka Streams?
Kafka Streams API
• Java library for stream processing
• Built on the Producer/Consumer APIs
Kafka Connect API
• Streams data into and out of Kafka
[Diagram: Orders and Customers topics feeding stream processing built on KStreams/KTables]
Multi-Language Development
[Diagram: the Confluent ecosystem around Kafka - 200+ pre-built connectors, and event stream processing with ksqlDB/KStreams]
Stream Processing by Analogy
Kafka Cluster
Connect API Stream Processing Connect API
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
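A rough ksqlDB equivalent of this Unix pipeline, as a sketch (the in_txt/out_txt streams and the line column are hypothetical, not from the deck):

CREATE STREAM out_txt AS
  SELECT UCASE(line) AS line   -- tr a-z A-Z
  FROM in_txt
  WHERE line LIKE '%ksql%'     -- grep "ksql"
  EMIT CHANGES;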
Three ways to do stream processing with Confluent: Kafka Clients, Kafka Streams, and ksqlDB
ConsumerRecords<String, String> records = consumer.poll(100);
Map<String, Integer> counts = new DefaultMap<String, Integer>();
for (ConsumerRecord<String, Integer> record : records) {
  String key = record.key();
  int c = counts.get(key);
  c += record.value();
  counts.put(key, c);
}
for (Map.Entry<String, Integer> entry : counts.entrySet()) {
  int stateCount;
  int attempts;
  while (attempts++ < MAX_RETRIES) {
    try {
      stateCount = stateStore.getValue(entry.getKey());
      stateStore.setValue(entry.getKey(), entry.getValue() + stateCount);
      break;
    } catch (StateStoreException e) {
      RetryUtils.backoff(attempts);
    }
  }
}
builder
.stream("input-stream",
Consumed.with(Serdes.String(), Serdes.String()))
.groupBy((key, value) -> value)
.count()
.toStream()
.to("counts", Produced.with(Serdes.String(), Serdes.Long()));
SELECT x, COUNT(*) FROM stream GROUP BY x EMIT CHANGES;
subscribe(), poll(), send(),
flush(), beginTransaction(), …
KStream, KTable, filter(), map(),
flatMap(), join(), aggregate(),
transform(), …
CREATE STREAM, CREATE TABLE,
SELECT, JOIN, GROUP BY, SUM, …
Stream Processing
KSQL UDFs
A typical streaming pipeline stitches together 3-5 separate systems
[Diagram: four numbered steps wiring up DBs, source/sink connectors, apps, and stream processing]
ksqlDB consolidates connectors, stream processing, and state stores, and serves both push and pull queries
[Diagram: DBs and apps on either side of ksqlDB, which combines CONNECTORS, STREAM PROCESSING, and STATE STORES; apps issue PUSH and PULL queries]
Serve lookups against
materialized views
Create
materialized views
Perform continuous
transformations
Capture data
CREATE STREAM purchases AS
SELECT viewtime, userid, pageid, TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd')
FROM pageviews;
CREATE TABLE orders_by_country AS
SELECT country, COUNT(*) AS order_count, SUM(order_total) AS order_total
FROM purchases
WINDOW TUMBLING (SIZE 5 MINUTES)
LEFT JOIN user_profiles ON purchases.customer_id = user_profiles.customer_id
GROUP BY country
EMIT CHANGES;
SELECT * FROM orders_by_country WHERE country='usa';
CREATE SOURCE CONNECTOR jdbcConnector WITH (
  'connector.class' = '...JdbcSourceConnector',
  'connection.url' = '...',
  …);
Connector
Stream
Table
Query
SQL
Filter messages to a separate topic in real-time
[Diagram: Topic "Blue and Red Widgets" (partitions 0-2) passes through a stream-processing filter into Topic "Blue Widgets Only" (partitions 0-2)]
Filters

CREATE STREAM high_readings AS
  SELECT sensor,
         reading
  FROM readings
  WHERE reading > 41
  EMIT CHANGES;
Easily merge and join topics to one another
[Diagram: Topic "Blue and Red Widgets" and Topic "Green and Yellow Widgets" (each with partitions 0-2) are joined by stream processing into Topic "Blue and Yellow Widgets"]
Joins

CREATE STREAM enriched_readings AS
  SELECT reading, area, brand_name
  FROM readings
  INNER JOIN brands b
    ON b.sensor = readings.sensor
  EMIT CHANGES;
Aggregate streams into tables and capture
summary statistics
[Diagram: Topic "Blue and Red Widgets" (partitions 0-2) is aggregated by stream processing into Table "Widget Count": Blue = 15, Red = 9]
Aggregate

CREATE TABLE avg_readings AS
  SELECT sensor,
         AVG(reading) AS avg_reading
  FROM readings
  GROUP BY sensor
  EMIT CHANGES;
Workshop
How the training works
• You will work with Zoom and a browser (the instructions, the ksqlDB console, and Confluent Control Center).
• If you have a question, you can post it via Zoom chat.
• Don't worry if you get stuck: use the "Raise hand" button in Zoom and a Confluent engineer will help you.
• Avoid jumping ahead with copy-and-paste: most people learn better when they actually type the code into the console, and you can learn from your mistakes.
Confluent Workshop Series: Building Streaming Apps with ksqlDB
•
•
•
• submit
• id -
/
.
Use Case -
• . / ,
.
• 9/12/19 12:55:05 GMT, 5313, {
"rating_id": 5313,
"user_id": 3,
"stars": 1,
"route_id": 6975,
"rating_time": 1519304105213,
"channel": "web",
"message": "why is it so difficult to keep the bathrooms clean?"
}
Use Case - Approach 1
Move the reviews into a data warehouse.
At the end of each month, process the reviews and forward them to the departments that received a significant number of comments.
This approach tells you what has already happened.
Use Case - Approach 2
Process reviews in real time and give the airport management team a dashboard.
The dashboard can sort reviews by topic, quickly surfacing issues related to cleanliness.
This approach tells you what is happening right now.
Use Case - Approach 3
Process reviews in real time.
Set up an alert for 3 bad reviews related to bathroom cleanliness within the last 10 minutes.
Automatically dispatch cleaning staff to deal with the problem.
This approach acts on what is happening.
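As a sketch, Approach 3 maps naturally onto a windowed ksqlDB query over the ratings stream shown earlier (the stars <= 2 threshold and the message filter are assumptions, not the lab's exact solution):

CREATE TABLE bathroom_alerts AS
  SELECT route_id, COUNT(*) AS bad_reviews
  FROM ratings
  WINDOW TUMBLING (SIZE 10 MINUTES)
  -- assumed definition of a "bad review" about cleanliness
  WHERE stars <= 2 AND message LIKE '%bathroom%'
  GROUP BY route_id
  HAVING COUNT(*) >= 3
  EMIT CHANGES;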
Hands on
3.
3.2.1
Cluster Architectural Overview
[Diagram: a website and a microservice write to MySQL; Kafka Connect runs a Datagen source connector and a MySQL CDC connector into Kafka; ksqlDB transforms, enriches, and queries the data]
ksqlDB
[Diagram: ksqlDB nodes run alongside the Kafka brokers and are reachable through the Confluent Control Center ksqlDB editor & data flow view, the ksqlDB CLI, and the ksqlDB RESTful API]
ksqlDB console
> show topics;
> show streams;
> print 'ratings';
Hands on
4. ksqlDB
4.2.2
Discussion - tables vs streams
> describe extended customers;
> select * from customers emit changes;
> select * from customers_flat emit changes;
Stream <-> Table duality
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
Streams and Tables
{ "event_ts": "2020-02-17T15:22:00Z",
"person" : "robin",
"location": "Leeds"
}
{ "event_ts": "2020-02-17T17:23:00Z",
"person" : "robin",
"location": "London"
}
{ "event_ts": "2020-02-17T22:23:00Z",
"person" : "robin",
"location": "Wakefield"
}
{ "event_ts": "2020-02-18T09:00:00Z",
"person" : "robin",
"location": "Leeds"
+--------------------+-------+---------+
|EVENT_TS |PERSON |LOCATION |
+--------------------+-------+---------+
|2020-02-17 15:22:00 |robin |Leeds |
|2020-02-17 17:23:00 |robin |London |
|2020-02-17 22:23:00 |robin |Wakefield|
|2020-02-18 09:00:00 |robin |Leeds |
+-------+---------+
|PERSON |LOCATION |
+-------+---------+
|robin |Leeds |
Kafka topic
+-------+---------+
|PERSON |LOCATION |
+-------+---------+
|robin |London |
+-------+---------+
|PERSON |LOCATION |
+-------+---------+
|robin |Wakefield|
+-------+---------+
|PERSON |LOCATION |
+-------+---------+
|robin |Leeds |
ksqlDB Table
ksqlDB Stream
Stream (append-only series of
events):
Topic + Schema
Table: state for
given key
Topic + Schema
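In ksqlDB, both views can be declared over the same topic; a minimal sketch, assuming a movements topic keyed by person with JSON values (LATEST_BY_OFFSET is the built-in aggregate that keeps the most recent value per key):

CREATE STREAM movements (person VARCHAR KEY, event_ts VARCHAR, location VARCHAR)
  WITH (KAFKA_TOPIC = 'movements', VALUE_FORMAT = 'JSON', PARTITIONS = 1);

CREATE TABLE latest_location AS
  SELECT person, LATEST_BY_OFFSET(location) AS location
  FROM movements
  GROUP BY person
  EMIT CHANGES;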
• Streams = INSERT only: immutable, append-only
• Tables = INSERT, UPDATE, DELETE: mutable, the row key (event.key) identifies which row
The key to mutability is … the event.key!
                                                 Stream    Table
Has unique key constraint?                       No        Yes
First event with key 'alice' arrives             INSERT    INSERT
Another event with key 'alice' arrives           INSERT    UPDATE
Event with key 'alice' and value == null arrives INSERT    DELETE
Event with key == null arrives                   INSERT    <ignored>
RDBMS analogy: A Stream is ~ a Table that has no unique key and is append-only.
Creating a table from a stream or topic
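The slide's screenshot is not recoverable, but here is a minimal sketch of declaring a table directly over an existing topic (the topic name and schema are assumptions):

CREATE TABLE users (user_id VARCHAR PRIMARY KEY, name VARCHAR)
  WITH (KAFKA_TOPIC = 'users', VALUE_FORMAT = 'JSON');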
Aggregating a stream (COUNT example)
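Again the screenshot is lost; a minimal COUNT sketch (stream and column names assumed):

CREATE TABLE pageviews_per_user AS
  SELECT user_id, COUNT(*) AS view_count
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;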
KSQL for Data Exploration
An easy way to inspect your data in Kafka
SHOW TOPICS;
SELECT page, user_id, status, bytes
FROM clickstream
WHERE user_agent LIKE 'Mozilla/5.0%';
PRINT 'my-topic' FROM BEGINNING;
KSQL for Data Transformation
Quickly make derivations of existing data in Kafka
CREATE STREAM clicks_by_user_id
WITH (PARTITIONS=6,
      TIMESTAMP='view_time',
      VALUE_FORMAT='JSON') AS
SELECT * FROM clickstream
PARTITION BY user_id;
1. Change number of partitions
2. Convert data to JSON
3. Repartition the data
Hands on
4.3
4.4 Query
• Kafka .
• Format .
• data streams join .
• Event Stream Query Query
.
• !
KSQL for Real-Time, Streaming ETL
Filter, cleanse, process data while it is in motion
CREATE STREAM clicks_from_vip_users AS
SELECT user_id, u.country, page, action
FROM clickstream c
LEFT JOIN users u ON c.user_id = u.user_id
WHERE u.level = 'Platinum';

1. Pick only VIP users
CDC — only after state
The JSON data shows what comes out of MySQL via Debezium CDC.
Notice that there is no "BEFORE" data here (it is null).
This means the record was just created, without any prior update - for example, a new customer added for the first time.
CDC — before and after
Now that the customer record has been updated, there is some "BEFORE" data as well.
KSQL for Anomaly Detection
Aggregate data to identify patterns and anomalies in real-time
CREATE TABLE possible_fraud AS
SELECT card_number, COUNT(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 30 SECONDS)
GROUP BY card_number
HAVING COUNT(*) > 3;
1. Aggregate data
2. … per 30-second windows
KSQL for Real-Time Monitoring
Derive insights from events (IoT, sensors, etc.) and turn them into actions
CREATE TABLE failing_vehicles AS
SELECT vehicle, COUNT(*)
FROM vehicle_monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE event_type = 'ERROR'
GROUP BY vehicle
HAVING COUNT(*) >= 5;

1. Now we know to alert, and whom
Confluent Control Center
C3 - Connector
ksqlDB - Cloud UI
Monitoring ksqlDB applications - Data flow
ksqlDB Internals

Partitions play a central role in Kafka
Topics are partitioned. Partitions enable scalability, elasticity, and fault tolerance.
[Diagram: across the storage layer (brokers) and the processing layer (ksqlDB, KStreams, etc.), data is stored in, replicated based on, ordered based on, read from and written to, processed based on, and joined based on partitions]
Topics vs. Streams and Tables
[Diagram: in the storage layer (brokers), a topic holds raw bytes (00100 11101 …); in the processing layer (KSQL, KStreams), a stream adds a schema via serdes (alice Paris, bob Sydney, alice Rome) and a table adds aggregation (alice 2, bob 1)]
Kafka Processing
Data is processed per-partition
[Diagram: a 'payments' topic with partitions P1-P4 is read over the network by two app instances of an application in consumer group 'my-app']
Kafka Processing
Data is processed per-partition
[Diagram: partitions P1-P4 are read over the network by stream tasks 1-4, each with its own state, spread across two application instances]
Streams and Tables are partitioned, too
[Diagram: partitions P1-P4 map to stream tasks 1-4; each task holds a shard of the KTable/TABLE state (2 GB, 3 GB, 5 GB, 2 GB) across two application instances]
Kafka Streams Architecture
Advanced Features
Windowing

e.g. "3 events within 10 minutes"
Windowed queries add time-based logic to ksqlDB.

Tumbling:
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY key

Hopping:
WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE)
GROUP BY key

Session:
WINDOW SESSION (60 SECONDS)
GROUP BY key
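Each WINDOW clause slots into a full aggregation; a minimal tumbling-window sketch over the readings stream used earlier:

CREATE TABLE readings_per_sensor AS
  SELECT sensor, COUNT(*) AS reading_count
  FROM readings
  WINDOW TUMBLING (SIZE 5 MINUTES)  -- one count per sensor per 5-minute window
  GROUP BY sensor
  EMIT CHANGES;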
UDF and machine learning
ksqlDB ships with many built-in functions that simplify stream processing. Examples include:
• GEODISTANCE: measures the distance between two latitude/longitude coordinates
• MASK: converts a string into a masked or obfuscated version
• JSON_ARRAY_CONTAINS: checks whether an array contains a search value
You can extend the functionality available in ksqlDB by developing user-defined functions. A common use case is implementing machine learning algorithms through ksqlDB so that these models can contribute to real-time data transformations.
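For instance, MASK can be applied inline; a minimal sketch over the workshop's ratings stream (column names assumed):

SELECT rating_id,
       MASK(message) AS masked_message  -- obfuscates the free-text field
FROM ratings
EMIT CHANGES;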
What can you do with ksqlDB?
Streaming ETL Anomaly detection
Real-time monitoring
and Analytics
Sensor data and IoT Customer 360-view
https://docs.ksqldb.io/en/latest/#what-can-i-do-with-ksqldb
Example: Streaming ETL pipeline
* Full example here
• Apache Kafka is a popular choice for powering data pipelines
• ksqlDB makes it simple to transform data within the pipeline,
preparing the messages for consumption by another system.
Example: Anomaly detection
• Identify patterns and spot anomalies in real-time data with millisecond latency, enabling you to properly surface out-of-the-ordinary events and to handle fraudulent activities separately.
* Full example here
Any questions?
one more …
Developer https://developer.confluent.io
Tutorials
• Step-by-step tutorials and recipes for Apache Kafka® and ksqlDB
https://kafka-tutorials.confluent.io/
Free eBooks
Kafka: The Definitive Guide
Neha Narkhede, Gwen Shapira, Todd
Palino
Making Sense of Stream Processing
Martin Kleppmann
I ❤ Logs
Jay Kreps
Designing Event-Driven Systems
Ben Stopford
http://cnfl.io/book-bundle
Confluent
Confluent Blog
cnfl.io/blog
Confluent Cloud
cnfl.io/confluent-cloud
Community
cnfl.io/meetups
Max processing parallelism = #input partitions
...
...
...
...
P1
P2
P3
P4
Topic Application Instance 1
Application Instance 2
Application Instance 3
Application Instance 4
Application Instance 5 *** idle ***
Application Instance 6 *** idle ***
→ Need higher parallelism? Increase the original topic’s partition count.
→ Higher parallelism for just one use case? Derive a new topic from the
original with higher partition count. Lower its retention to save storage.
How to increase # of partitions when needed
CREATE STREAM products_repartitioned
  WITH (PARTITIONS=30) AS
  SELECT * FROM products;
KSQL example: statement below creates a new stream with the desired number of partitions.
'Hot' partitions are a problem, often caused by:
1. Events not evenly distributed across partitions
2. Events evenly distributed, but certain events taking longer to process

Strategies to address hot partitions include:
1a. Ingress: find a better partitioning function ƒ(event.key) for producers
1b. Storage: re-partition data into a new topic if you can't change the original
2. Scale processing vertically, e.g. more powerful CPU instances
Joining Streams and Tables
Data must be 'co-partitioned'
[Diagram: a stream joined with a table produces a join output (stream)]
Joining Streams and Tables
Data must be 'co-partitioned'
[Diagram: a table with partitions P1-P3 holding keyed rows (bob male, alice female, alex male, zoie female, andrew male, mina female, natalie female, blake male); the stream event (alice, Paris) from the stream's P2 has a matching entry for alice (female) in the table's P2]
Joining Streams and Tables
Data is looked up in the same partition number
Scenario 2
[Diagram: key 'alice' exists in multiple table partitions, but the entry in P2 (female) is used because the stream-side event (alice, Paris) comes from the stream's partition P2]
Joining Streams and Tables
Data is looked up in the same partition number
Scenario 3
[Diagram: key 'alice' exists only in the table's P1 != P2, so the stream event (alice, Paris) from the stream's P2 finds no match; the join produces null]
Data co-partitioning requirements in detail
1. Same keying scheme for both input sides
2. Same number of partitions
3. Same partitioning function ƒ(event.key)

Further reading on joining streams and tables:
https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html
Why is that so? Because of how input data is mapped to stream tasks
[Diagram: a stream topic (P1-P3) and a table topic (P1-P3); Stream Task 2 reads the stream's P2 and the table's P2 over the network into its processing state]
How to re-partition your data when needed
CREATE STREAM products_repartitioned
WITH (PARTITIONS=42) AS
SELECT * FROM products
PARTITION BY product_id;
KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key, so that its data is now correctly co-partitioned for joining.