ETL is Dead.
Long Live Streams with Apache Kafka.
Taras Kloba,
BI Team Lead/Data Architect,
Intellias
Agenda
• About me;
• A problem in data transfer;
• Ways to solve this problem;
• About Apache Kafka;
• Demo of reliable data sending;
• Questions?
Taras Kloba
• 7 years of experience with databases;
• Certified Data Engineer on Google Cloud;
• Certified Expert Microsoft SQL Server;
• Co-organizer of “SQL Saturday” in Lviv and Krakow;
• Trainer, speaker, consultant;
• Owner of the “SQL” trademark in Ukraine.
SQL.ua,
CEO/Founder
Intellias,
BI Team Lead/Data Architect
Quick facts
(Q62JCJRJGY77)(9DG5NZ4EVP7A) (M2HE6LPRJ6MV)
My current project:
One of the biggest B2B software solutions
for the iGaming industry in the world.
300+ GB of new data every day
Previous legacy system
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00';
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:04:00' AND '2018-11-03 00:08:00';
?
Previous legacy system
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00';
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:02:00' AND '2018-11-03 00:06:00';
?
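The two window layouts above can be sketched in a few lines of Python. This is only an illustration: the helper names `polling_windows` and `in_between` are hypothetical and not part of the original system. It generates back-to-back and overlapping windows and shows that, because SQL BETWEEN is inclusive on both ends, a row stamped exactly 00:04:00 matches both back-to-back queries.

```python
from datetime import datetime, timedelta


def polling_windows(start, length, step, count):
    """Yield (lower, upper) bounds for `count` polling windows.

    step == length gives back-to-back windows; step < length gives
    overlapping windows that re-read part of the previous range.
    """
    for i in range(count):
        lower = start + i * step
        yield lower, lower + length


def in_between(ts, window):
    """Mimic SQL BETWEEN, which includes both bounds."""
    lower, upper = window
    return lower <= ts <= upper


start = datetime(2018, 11, 3, 0, 0, 0)
four_min = timedelta(minutes=4)

back_to_back = list(polling_windows(start, four_min, four_min, 2))
overlapping = list(polling_windows(start, four_min, timedelta(minutes=2), 2))

# A row whose upd falls exactly on the shared boundary.
boundary = datetime(2018, 11, 3, 0, 4, 0)
hits = [in_between(boundary, w) for w in back_to_back]
```

Overlapping windows re-read two minutes of data on every run, so the loader has to deduplicate whatever it extracts twice.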
Phantom reads (classical definition)
Phantom reads (in our case)
Tnx: 1  2018-11-03 12:00:00
Tnx: 2  2018-11-03 12:01:00
Tnx: 2  commit 12:03:00
Tnx: 1  commit 12:05:00
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 11:58:00' AND '2018-11-03 12:04:00';
Trans_id  Upd
2         2018-11-03 12:01:00
?
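The scenario on this slide can be replayed as a small Python simulation. It rests on two assumptions that the slide does not state: each extraction query runs at the upper bound of its window, and the next window is 12:04 to 12:10. The `poll` helper is hypothetical. A row is visible only if its transaction committed before the query ran, while the filter still uses the `upd` value written earlier.

```python
from datetime import datetime


def t(hour, minute):
    """Shorthand for a timestamp on 2018-11-03."""
    return datetime(2018, 11, 3, hour, minute, 0)


# (trans_id, upd, commit_time), taken from the slide.
rows = [
    (1, t(12, 0), t(12, 5)),
    (2, t(12, 1), t(12, 3)),
]


def poll(rows, lower, upper, run_at):
    """Return trans_ids seen by a BETWEEN query executed at `run_at`.

    Only rows committed by `run_at` are visible (assumed behaviour of a
    read-committed reader); the filter itself uses the upd column.
    """
    return [
        trans_id
        for trans_id, upd, committed in rows
        if committed <= run_at and lower <= upd <= upper
    ]


first = poll(rows, t(11, 58), t(12, 4), run_at=t(12, 4))
second = poll(rows, t(12, 4), t(12, 10), run_at=t(12, 10))  # hypothetical next window

extracted = set(first) | set(second)
missed = {trans_id for trans_id, _, _ in rows} - extracted
```

Under these assumptions Tnx 1 is never extracted: it was not yet committed when the first query ran, and its `upd` value is already outside the second window.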
#1. Isolation levels - Serializable
With a lock-based concurrency control
DBMS implementation, serializability
requires read and write locks (acquired on
selected data) to be released at the end of
the transaction. Also, range locks must be
acquired when a SELECT query uses a
ranged WHERE clause, especially to avoid
the phantom reads phenomenon.
Not the best solution for high-load
systems.
#2. Triggers
Traditionally, the most common technique
used for capturing events was to use
database or application-level triggers. This
technique is still very widespread because
of its simplicity and familiarity.
Not the best solution for high-load
systems.
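A minimal sketch of trigger-based capture, using Python's built-in sqlite3 module only so that it runs anywhere. The `amount` column, the `fact_transactions_changes` table and the trigger names are assumptions made for illustration; the production database in the talk is not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_transactions (
    trans_id INTEGER PRIMARY KEY,
    amount   REAL
);

-- Hypothetical change table filled by the triggers below.
CREATE TABLE fact_transactions_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    trans_id  INTEGER,
    operation TEXT
);

CREATE TRIGGER trg_fact_transactions_insert
AFTER INSERT ON fact_transactions
BEGIN
    INSERT INTO fact_transactions_changes (trans_id, operation)
    VALUES (NEW.trans_id, 'INSERT');
END;

CREATE TRIGGER trg_fact_transactions_update
AFTER UPDATE ON fact_transactions
BEGIN
    INSERT INTO fact_transactions_changes (trans_id, operation)
    VALUES (NEW.trans_id, 'UPDATE');
END;
""")

conn.execute("INSERT INTO fact_transactions (trans_id, amount) VALUES (1, 10.0)")
conn.execute("UPDATE fact_transactions SET amount = 12.5 WHERE trans_id = 1")
conn.commit()

changes = conn.execute(
    "SELECT change_id, trans_id, operation "
    "FROM fact_transactions_changes ORDER BY change_id"
).fetchall()
```

Every write to `fact_transactions` now performs a second write inside the same transaction, which is where the overhead under high load comes from.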
#3. Change Data Capture
Change data capture (CDC) is a set of
software design patterns used to
determine (and track) the data that has
changed so that action can be taken using
the changed data. CDC is also an approach
to data integration that is based on the
identification, capture and delivery of the
changes made to enterprise data sources.
(Wikipedia)
CDC solutions occur most often in
data-warehouse environments since capturing and
preserving the state of data across time is one of
the core functions of a data warehouse.
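One way to see why log-based CDC avoids the missed-row problem is to model the change log as an append-only list ordered by commit. This is a toy model, not a real database log: the `ChangeLog` class and its `position` field are hypothetical stand-ins for a log sequence number. The consumer remembers the last position it read instead of filtering on the `upd` timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    position: int  # hypothetical log position, assigned at commit time
    trans_id: int
    upd: str


class ChangeLog:
    """Append-only log; entries appear in commit order."""

    def __init__(self):
        self._entries = []

    def commit(self, trans_id, upd):
        change = Change(len(self._entries) + 1, trans_id, upd)
        self._entries.append(change)
        return change

    def read_after(self, position):
        """Return every change committed after the given position."""
        return [c for c in self._entries if c.position > position]


log = ChangeLog()
last_seen = 0

log.commit(2, "2018-11-03 12:01:00")  # Tnx 2 commits first (12:03)
batch1 = log.read_after(last_seen)
last_seen = batch1[-1].position

log.commit(1, "2018-11-03 12:00:00")  # Tnx 1 commits later (12:05)
batch2 = log.read_after(last_seen)
```

Tnx 1 has the older `upd` value but the later log position, so the second read still picks it up.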
Apache Kafka
Kafka® is used for building real-time data
pipelines and streaming apps. It is
horizontally scalable, fault-tolerant, wicked
fast, and runs in production in thousands of
companies.
Taras Kloba, "ETL Is No Longer Relevant: Long Live Streams with Apache Kafka"
Typical data flow in companies
Streaming platform to coordinate all data flows.
Kafka Connect API (E and L in Streaming ETL)
• Scalability: Leverages Kafka for
scalability
• Fault tolerance: Builds on Kafka’s
fault tolerance model
• Management and monitoring: One
way of monitoring all connectors
• Schemas: Offers an option for
preserving schemas from source to
sink
Kafka Connect. Create new connector.
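Connectors are typically created by POSTing a JSON document with `name` and `config` fields to the `/connectors` endpoint of the Kafka Connect REST API. The sketch below only builds that request. Everything in the config is an assumption for illustration: the slides do not say which connector the demo used, so the Debezium PostgreSQL connector class, the host, the credentials and the slot name are placeholders, and a real connector needs further version-specific properties that are omitted here.

```python
import json
from urllib import request


def build_create_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    payload = json.dumps({"name": name, "config": config}).encode("utf-8")
    return request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; none of these come from the original demo.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "igaming",
    "slot.name": "kafka_connect_slot",
}

req = build_create_connector_request(
    "http://localhost:8083", "fact-transactions-source", config
)
# To submit against a running Connect worker: request.urlopen(req)
```

Keeping the request construction separate from the network call makes the payload easy to inspect before it is sent.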
Kafka’s Streams API (The T in ETL)
• Easiest way to do stream
processing using Kafka;
• True event-at-a-time stream
processing; no microbatching;
• Dataflow-style windowing
based on event time; handles
late-arriving data
Kafka Streams API. Create a new processor.
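Event-time windowing can be illustrated without Kafka at all. The sketch below is plain Python, not the Kafka Streams API (which is a Java library), and the function name `tumbling_window_counts` is hypothetical. It processes one event at a time and assigns each event to a tumbling window by its own timestamp, so an event that arrives late still lands in the window it belongs to. Grace periods and window expiry are left out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window, keyed by event time.

    `events` is an iterable of (event_time_seconds, payload) pairs in
    arrival order; arrival order does not affect window assignment.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = event_time - (event_time % window_seconds)
        counts[window_start] += 1
    return dict(counts)


# The last event arrives after later ones but carries an early timestamp.
events = [(0, "a"), (70, "b"), (130, "c"), (20, "late")]
counts = tumbling_window_counts(events, 60)
```

The late event is counted in the first window even though it arrived last.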
Demo
Conclusion
• Apache Kafka is robust
• Triggers will keep your data in sync but can have significant performance
overhead
• Utilizing a logical replication slot can eliminate trigger overhead and transfer the
computation load elsewhere
• Not a panacea: still need to use good architectural patterns
Questions?
Thank you!
Taras Klioba
+38 093 74 876 15
taras@klioba.com
