Modern ETL Pipelines with Change Data Capture
Thiago Rigo and David Mariassy, GetYourGuide
#UnifiedDataAnalytics #SparkAISummit
Who are we?
● 5 years of experience in Business Intelligence and Data Engineering roles from the Berlin e-commerce scene. Data Engineer, Data Platform.
● Software engineer for the past 7 years, last 3 focused on data engineering. Senior Data Engineer, Data Platform.
Agenda
1 Intro to GetYourGuide
2 GYG’s Legacy ETL Pipeline
3 Rivulus ETL Pipeline
4 Conclusion
5 Questions
Intro to GetYourGuide
We make it simple to book and enjoy incredible experiences
Europe’s largest marketplace for travel experiences
● 50k+ products in 150+ countries
● 25M+ tickets sold
● $650M+ in VC funding
● 600+ strong global team
● 150+ traveler nationalities
GYG’s Legacy ETL Pipeline
Where we started
● Breaking schema changes upstream
● Requires special knowledge
● Long recovery times
● Difficult to test
● Bad SLAs
What we wanted
● Automatic handling of schema changes
● Familiar tooling (Scala, SQL)
● Maximum parallelism
● Built for testability
● Better SLAs
Rivulus ETL Pipeline
Overview
Extraction Layer
The pipeline
Debezium
● Open-source distributed platform for change data capture
● Can read several databases
○ MySQL, Postgres, Cassandra, Oracle, SQL Server, and MongoDB
● Runs as a connector within Kafka Connect
● Streams the changes recorded in the database's event log into Kafka
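To make the change stream concrete, here is a minimal sketch of how a consumer might replay Debezium-style change events into a replica keyed by primary key. It assumes a simplified event payload that carries only the `op`, `before`, `after` and `ts_ms` fields of the Debezium envelope; the `apply_change_event` helper, the `id` key column and the sample rows are hypothetical and not taken from the talk.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change_event(table: Dict[Any, Row], event: Dict[str, Any], pk: str = "id") -> None:
    """Apply one simplified Debezium-style change event to an in-memory replica."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read and update all carry the new row image in `after`
        row = event["after"]
        table[row[pk]] = row
    elif op == "d":
        # delete events only carry the old row image in `before`
        table.pop(event["before"][pk], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")


# Hypothetical sample events for a table with columns (id, score)
events = [
    {"op": "c", "before": None, "after": {"id": 1, "score": 7}, "ts_ms": 1000},
    {"op": "c", "before": None, "after": {"id": 2, "score": 5}, "ts_ms": 1001},
    {"op": "u", "before": {"id": 2, "score": 5}, "after": {"id": 2, "score": 9}, "ts_ms": 1002},
    {"op": "d", "before": {"id": 1, "score": 7}, "after": None, "ts_ms": 1003},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change_event(replica, event)
```

In the pipeline described here the events land in Kafka and are processed in batches downstream, so this loop only illustrates the semantics of the event types.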
Schema Service
● Scala library
● Keeps track of all schema changes applied to the tables
● Holds PK, timestamp and partition columns
● Prevents breaking changes from being introduced
○ Type changes
● Upcast types
● Schema Service works on column level
Automatic handling of schema changes
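A column-level compatibility check of the kind the Schema Service performs could look roughly like the sketch below. The actual service is a Scala library whose API is not shown in the slides, so the `WIDENINGS` table, the type names and the `resolve_type` function are illustrative assumptions only.

```python
# Hypothetical widening rules: a column of the key type may safely become the value type.
WIDENINGS = {
    "int": {"long"},
    "float": {"double"},
}


def resolve_type(registered: str, incoming: str) -> str:
    """Return the type to keep for a column, or raise if the change is breaking."""
    if registered == incoming:
        return registered
    if incoming in WIDENINGS.get(registered, set()):
        # The column was widened upstream: adopt the wider type.
        return incoming
    if registered in WIDENINGS.get(incoming, set()):
        # Incoming data is narrower than the tracked type: keep the tracked type
        # and upcast the values when they are written.
        return registered
    raise ValueError(f"breaking type change: {registered} -> {incoming}")
```

For example, `resolve_type("int", "long")` accepts the change and returns `"long"`, while `resolve_type("string", "int")` raises, which is how a breaking type change would be stopped before it reaches downstream tables.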
Avro Converter
● Regular Scala application
● Runs as part of Airflow DAG
● Reads raw Avro files from S3
● Communicates with Schema Service to handle schema changes automatically
● Writes out Parquet files
Automatic handling of schema changes
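The step between reading Avro and writing Parquet can be pictured as conforming each raw record to the schema tracked for its table. The sketch below leaves out all file I/O and shows only that conforming step; the `CASTS` mapping, the `conform` function and the sample schema are hypothetical and stand in for whatever the real converter does.

```python
from typing import Any, Dict

# Hypothetical mapping from tracked column types to Python casts
CASTS = {"int": int, "long": int, "float": float, "double": float, "string": str}


def conform(record: Dict[str, Any], schema: Dict[str, str]) -> Dict[str, Any]:
    """Shape a raw record to the tracked schema: fixed column set, upcast values."""
    row = {}
    for column, type_name in schema.items():
        value = record.get(column)  # columns missing from older files become None
        row[column] = None if value is None else CASTS[type_name](value)
    # Columns that are not in the tracked schema are dropped in this sketch;
    # a real converter would first register them with the schema service.
    return row


tracked_schema = {"id": "long", "score": "double", "comment": "string"}
conformed = conform({"id": 1, "score": 7}, tracked_schema)
```

Here a record written before the `comment` column existed still produces a row with all three columns, and the integer score is upcast to a floating-point value.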
Upsert
● Spark application
● Runs as part of Airflow DAG
● Reads in new Parquet files
● Communicates with Schema Service to get PK, timestamp and partition columns
● Compacts the data based on the table’s PK
● Creates a Hive table that contains a replica of the source DB table
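The compaction step keeps one row per primary key, namely the most recent one according to the timestamp column. The real job is a Spark application; the plain-Python sketch below only illustrates that logic, and the `compact` function, the column names and the sample rows are assumptions. Handling of delete events is deliberately left out.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compact(existing: Iterable[Row], new_rows: Iterable[Row], pk: str, ts: str) -> List[Row]:
    """Keep the latest version of each row, identified by `pk` and ordered by `ts`."""
    latest: Dict[Any, Row] = {}
    for row in [*existing, *new_rows]:
        key = row[pk]
        # On equal timestamps the row seen later wins, so new data beats old data.
        if key not in latest or row[ts] >= latest[key][ts]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r[pk])


current = [
    {"id": 1, "score": 7, "update_timestamp": 100},
    {"id": 2, "score": 5, "update_timestamp": 100},
]
changes = [
    {"id": 2, "score": 9, "update_timestamp": 200},
    {"id": 3, "score": 4, "update_timestamp": 150},
]
compacted = compact(current, changes, pk="id", ts="update_timestamp")
```

In Spark the same idea is commonly expressed with a window partitioned by the primary key and ordered by the timestamp, keeping the first row of each partition.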
Transformation Layer
The performance penalty of managing transformation dependencies inefficiently
The gradual forsaking of performance on the altar of dependency management
Humble beginnings
● Small set of transformations.
● Small team / single engineer.
● Simple one-to-one type dependencies.
● Defining an optimal dependency model by hand is possible.
Complexity on the horizon
● Growing set of transformations.
● Growing team.
● One-to-many / many-to-many type dependencies.
● Defining a dependency model by hand becomes cumbersome and error-prone.
The hard choice between performance and correctness
● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance.
● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.
The performance bottleneck strikes back
● Sequential execution results in long execution and long recovery times. In other words, poor SLAs.
● 💣🔥
Rivulus SQL for automated dependency inference
Maximum parallelism
Main components
● SQL transformations
○ A collection of Rivulus SQL files that make use of a set of custom template variables.
● Executor app
○ Spark app that executes a single transformation at a time.
● DGB (Dependency Graph Builder)
○ Parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars.
● Airflow
○ Executes the transformations on Databricks in the order specified by the DGB.
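Once the dependency graph exists, the execution order with maximum parallelism follows from a topological sort that groups transformations into waves, where everything in a wave can run concurrently. The sketch below uses Python's standard `graphlib` module; the `execution_waves` function and the `fact_booking` entry are hypothetical, and the slides do not say how the DGB or Airflow implement this internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def execution_waves(dependencies: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Group transformations into waves; each wave only depends on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# transformation -> transformations it depends on (fact_booking is a made-up example)
graph = {
    "dim_nps_feedback_stage": [],
    "dim_tour": [],
    "fact_nps_feedback": ["dim_nps_feedback_stage"],
    "fact_booking": ["dim_tour"],
}
waves = execution_waves(graph)
```

With this input both dimension tables form the first wave and both fact tables the second, so four transformations finish in two steps instead of four sequential ones.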
Rivulus SQL syntax
● {% reference:target 'dim_tour' %}
○ Declares a dependency between this transformation and the dim_tour transformation, which must be defined in the same SQL library.
● {% reference:source 'gyg__customer' %}
○ Declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded to Hive by an extraction job.
● {% load 'file.sql' %}
○ Loads a reusable subquery defined in file.sql into this transformation.
Familiar tooling (Scala, SQL)
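Dependency inference from these tags can be sketched with a regular expression that collects every `reference:source` and `reference:target` in a file. The output keys mirror the `source_dependencies` and `transformation_dependencies` fields of the JSON in the example slide; the regex, the `extract_dependencies` function and the shortened query are assumptions, and `{% load %}` handling is omitted.

```python
import re
from typing import Dict, List

# Matches {% reference:source 'name' %} and {% reference:target 'name' %}
TAG = re.compile(r"\{%\s*reference:(source|target)\s+'([^']+)'\s*%\}")


def extract_dependencies(name: str, sql: str) -> Dict[str, Dict[str, List[str]]]:
    """Collect the raw sources and upstream transformations referenced by one SQL file."""
    sources: List[str] = []
    targets: List[str] = []
    for kind, ref in TAG.findall(sql):
        bucket = sources if kind == "source" else targets
        if ref not in bucket:  # keep first-seen order, drop duplicates
            bucket.append(ref)
    return {name: {"source_dependencies": sources, "transformation_dependencies": targets}}


sql = """
SELECT nf.nps_feedback_id, nfs.nps_feedback_stage_id
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
  ON nfs.nps_feedback_stage_name = nf.stage
"""
deps = extract_dependencies("fact_nps_feedback", sql)
```

Because the dependencies are read from the SQL itself, adding a join to a new table updates the graph without anyone editing a separate dependency definition.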
Example
Rivulus SQL:
SELECT
  nps_feedback_id
  , nps_feedback_stage_id
  , booking_id
  , score
  , feedback
  , update_timestamp
  , source
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
  ON nfs.nps_feedback_stage_name = nf.stage
DGB output:
{
  "fact_nps_feedback": {
    "source_dependencies": [
      "gyg__nps_feedback"
    ],
    "transformation_dependencies": [
      "dim_nps_feedback_stage"
    ]
  }
}
Build time: Rivulus SQL → DGB → Airflow
Runtime: Airflow → Executor app invocations on DB
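At runtime the template tags have to be turned into real table names before the query is handed to Spark. The sketch below shows one way that substitution might work. The `gyg__x` to `gyg.x` mapping follows the `gyg__customer` example on the syntax slide, while the `dwh` target schema, the `render` function and the regex are assumptions made for illustration.

```python
import re

TAG = re.compile(r"\{%\s*reference:(source|target)\s+'([^']+)'\s*%\}")


def render(sql: str, target_schema: str = "dwh") -> str:
    """Replace Rivulus-style reference tags with table names (hypothetical naming rules)."""

    def resolve(match: "re.Match[str]") -> str:
        kind, ref = match.groups()
        if kind == "source":
            # gyg__nps_feedback -> gyg.nps_feedback (first double underscore only)
            return ref.replace("__", ".", 1)
        # Targets are assumed to live in a single output schema.
        return f"{target_schema}.{ref}"

    return TAG.sub(resolve, sql)


rendered = render(
    "FROM {% reference:source 'gyg__nps_feedback' %} AS nf "
    "LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs"
)
```

Making the output schema a parameter is also what lets the same SQL run against test inputs and outputs, which ties in with the configurable paths mentioned in the testing slide.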
A word on testing
● Maximum parallelism enhances testability
● Separation of config from code
○ Configurable input and output paths
Built for testability
Conclusion
Results
● Eliminated vulnerability to upstream schema changes
● Democratized our ETL by migrating all business logic to SQL
● Minimized recovery time by maximizing parallelism
● Designed for E2E testability
● Cut processing time by 70% (further reductions are possible)
Next Steps
● Intra-day micro-batches
● Database Replication as a Service
● Rivulus SQL is GYG’s standard tool for writing transformations
● Delta for Upsert
Questions?
We’re hiring!
https://careers.getyourguide.com
