Taras Kloba "ETL is no longer relevant; long live streams with Apache Kafka"

ETL is Dead.
Long Live Streams with Apache Kafka.
Taras Kloba,
BI Team Lead/Data Architect,
Intellias

Agenda
• About me;
• One problem in data transfer;
• Ways to solve this problem;
• About Apache Kafka;
• Demo of reliable data sending;
• Questions?

Taras Kloba
• 7 years of experience with databases;
• Certified Data Engineer on Google Cloud;
• Certified Expert Microsoft SQL Server;
• Co-organizer of “SQL Saturday” in Lviv and Krakow;
• Trainer, speaker, consultant;
• Owner of the “SQL” trademark in Ukraine.
SQL.ua, CEO/Founder
Intellias, BI Team Lead/Data Architect
Quick facts
(Q62JCJRJGY77)(9DG5NZ4EVP7A) (M2HE6LPRJ6MV)

My current project:
One of the biggest B2B software solutions for the iGaming industry in the world.
300+ GB of new data every day.

Previous legacy system (consecutive 4-minute windows)
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00'
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:04:00' AND '2018-11-03 00:08:00'

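Previous legacy system (overlapping 4-minute windows)
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00'
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:02:00' AND '2018-11-03 00:06:00'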

Phantom reads (classical definition)

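Phantom reads (in our case)
Txn 1: begins, upd = 2018-11-03 12:00:00
Txn 2: begins, upd = 2018-11-03 12:01:00
Txn 2: commit at 12:03:00
Txn 1: commit at 12:05:00
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 11:58:00' AND '2018-11-03 12:04:00'
Trans_id  Upd
2         2018-11-03 12:01:00
The query runs before Txn 1 commits, so it returns only Txn 2. The row from Txn 1 (upd 12:00:00) belongs to a window that has already been read, so later windows never pick it up.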
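The missed row can be reproduced with a small simulation. This is only a sketch in plain Python, not the legacy system itself: the `upd` and commit times come from the slide, while the poll times (`run_at`) and the second window are assumptions made for illustration.

```python
from datetime import datetime

def ts(hhmm: str) -> datetime:
    """Parse a time on 2018-11-03, the date used on the slide."""
    return datetime.strptime(f"2018-11-03 {hhmm}", "%Y-%m-%d %H:%M")

# Each row carries the value written into `upd` and the moment its
# transaction committed (both taken from the slide).
rows = [
    {"trans_id": 1, "upd": ts("12:00"), "committed_at": ts("12:05")},
    {"trans_id": 2, "upd": ts("12:01"), "committed_at": ts("12:03")},
]

def poll(window_start, window_end, run_at):
    """Return ids of rows visible to a query executed at `run_at`.

    Only rows committed by `run_at` are visible, and only those whose
    `upd` falls inside the inclusive BETWEEN window.
    """
    return [
        r["trans_id"]
        for r in rows
        if r["committed_at"] <= run_at and window_start <= r["upd"] <= window_end
    ]

# Hypothetical poll times: each window is queried right after it closes.
first = poll(ts("11:58"), ts("12:04"), run_at=ts("12:04"))
second = poll(ts("12:04"), ts("12:10"), run_at=ts("12:10"))

seen = set(first) | set(second)
missed = [r["trans_id"] for r in rows if r["trans_id"] not in seen]
print(first, second, missed)
```

Transaction 1 ends up in `missed`: it was not yet committed when its window was read, and its `upd` value lies outside every later window.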

#1. Isolation levels: Serializable
With a lock-based concurrency control DBMS implementation, serializability requires read and write locks (acquired on selected data) to be released at the end of the transaction. Also, range locks must be acquired when a SELECT query uses a ranged WHERE clause, especially to avoid the phantom reads phenomenon.
Not the best option for high-load systems.

#2. Triggers
Traditionally, the most common technique used for capturing events was to use database or application-level triggers. This technique is still very widespread because of its simplicity and familiarity.
Not the best option for high-load systems.
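A minimal sketch of trigger-based capture, using Python's built-in `sqlite3` module so that it runs anywhere. The `amount` column, the `change_log` table and the trigger names are hypothetical. The project's real database is not SQLite, and trigger syntax differs between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_transactions (trans_id INTEGER PRIMARY KEY, amount REAL);

-- Hypothetical capture table: one row per change, ordered by seq.
CREATE TABLE change_log (
    seq      INTEGER PRIMARY KEY AUTOINCREMENT,
    trans_id INTEGER,
    op       TEXT
);

CREATE TRIGGER trg_fact_ins AFTER INSERT ON fact_transactions
BEGIN
    INSERT INTO change_log (trans_id, op) VALUES (NEW.trans_id, 'I');
END;

CREATE TRIGGER trg_fact_upd AFTER UPDATE ON fact_transactions
BEGIN
    INSERT INTO change_log (trans_id, op) VALUES (NEW.trans_id, 'U');
END;
""")

conn.execute("INSERT INTO fact_transactions VALUES (1, 10.0)")
conn.execute("INSERT INTO fact_transactions VALUES (2, 25.5)")
conn.execute("UPDATE fact_transactions SET amount = 12.0 WHERE trans_id = 1")
conn.commit()

events = conn.execute(
    "SELECT seq, trans_id, op FROM change_log ORDER BY seq"
).fetchall()
print(events)
```

Every write to `fact_transactions` now performs a second write inside the same transaction, which is where the overhead under high load comes from.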

#3. Change Data Capture
Change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. CDC is also an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources. (Wikipedia)
CDC solutions occur most often in data-warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse.
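The idea can be sketched as a consumer that reads changes in commit order by position in a log instead of by time window. This is a toy model in plain Python. `ChangeLog` and `Consumer` are hypothetical names, not part of any CDC product, and the transaction times reuse the example from the phantom-read slide.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeLog:
    """Append-only log; entries are written at commit time, in commit order."""
    entries: list = field(default_factory=list)

    def commit(self, trans_id: int, upd: str) -> int:
        self.entries.append({"trans_id": trans_id, "upd": upd})
        return len(self.entries) - 1  # offset of the new entry

@dataclass
class Consumer:
    """Reads the log by offset instead of filtering on the `upd` column."""
    next_offset: int = 0

    def poll(self, log: ChangeLog) -> list:
        batch = log.entries[self.next_offset:]
        self.next_offset = len(log.entries)
        return [entry["trans_id"] for entry in batch]

log = ChangeLog()
consumer = Consumer()

log.commit(2, "2018-11-03 12:01:00")  # Txn 2 commits at 12:03
first = consumer.poll(log)            # poll at 12:04
log.commit(1, "2018-11-03 12:00:00")  # Txn 1 commits at 12:05
second = consumer.poll(log)           # next poll
print(first, second)
```

Because the consumer tracks an offset, the late-committing Txn 1 is delivered in the second poll even though its `upd` value is older than that of Txn 2.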

Apache Kafka
Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Typical data flow in companies

Streaming platform to coordinate all data flows.

Kafka Connect API (E and L in Streaming ETL)
• Scalability: Leverages Kafka for scalability
• Fault tolerance: Builds on Kafka’s fault tolerance model
• Management and monitoring: One way of monitoring all connectors
• Schemas: Offers an option for preserving schemas from source to sink

Kafka Connect. Create new connector.
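The slide's screenshot did not survive, so the following is only a sketch of how a connector is typically registered through the Kafka Connect REST API (`POST /connectors`). It assumes Debezium's PostgreSQL source connector with its 2.x property names, which the talk does not name. Every host, credential, slot, table and connector name is a placeholder.

```python
import json
import urllib.request

# Placeholder values throughout. Property names assume Debezium's
# PostgreSQL connector (2.x naming), which is an assumption, not the
# configuration shown in the talk.
connector = {
    "name": "fact-transactions-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "slot.name": "fact_transactions_slot",
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "change-me",
        "database.dbname": "dwh",
        "topic.prefix": "dwh",
        "table.include.list": "public.fact_transactions",
    },
}

def build_request(base_url: str = "http://localhost:8083") -> urllib.request.Request:
    """Build (but do not send) the request that creates the connector."""
    return urllib.request.Request(
        url=f"{base_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request()
print(req.get_method(), req.full_url)
# To actually create the connector on a running Connect worker:
# urllib.request.urlopen(req)
```

The request is built but not sent, so the sketch runs without a Kafka Connect worker. Port 8083 is the usual default for the Connect REST interface.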

Kafka’s Streams API (The T in ETL)
• Easiest way to do stream processing using Kafka;
• True event-at-a-time stream processing; no microbatching;
• Dataflow-style windowing based on event-time; handles late-arriving data
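Event-time windowing with late-arriving data can be illustrated in a few lines. Kafka Streams itself is a Java library, and this sketch does not use its API. It is a plain Python model of the concept, with a hypothetical 60-second tumbling window and made-up events.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window size

def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains `event_time` (seconds)."""
    return event_time - event_time % WINDOW_SECONDS

def count_by_event_time(events):
    """Process events one at a time, in arrival order.

    Each event is (event_time_seconds, key). Counts are bucketed by the
    time the event happened, not by the time it arrived, so a late event
    still updates the window it belongs to.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        counts[(window_start(event_time), key)] += 1
    return dict(counts)

# The last event happened at t=50 but arrives after events from t=65 and t=70.
arrivals = [(5, "bet"), (70, "bet"), (65, "win"), (50, "bet")]
result = count_by_event_time(arrivals)
print(result)
```

The late `(50, "bet")` event is counted in the window starting at 0, alongside the event from t=5, rather than in the window that was current when it arrived.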

Kafka Streams API. Create a new processor.

Conclusion
• Apache Kafka is robust
• Triggers will keep your data in sync but can have significant performance overhead
• Utilizing a logical replication slot can eliminate trigger overhead and transfer the computation load elsewhere
• Not a panacea: still need to use good architectural patterns

Thank you!
Taras Klioba
+38 093 74 876 15
taras@klioba.com
