Modern ETL Pipelines with Change Data Capture

Thiago Rigo and David Mariassy, GetYourGuide
#UnifiedDataAnalytics #SparkAISummit

Who are we?
● 5 years of experience in Business Intelligence and Data Engineering roles from the Berlin e-commerce scene. Data Engineer, Data Platform.
● Software engineer for the past 7 years, last 3 focused on data engineering. Senior Data Engineer, Data Platform.

Agenda
1 Intro to GetYourGuide
2 GYG’s Legacy ETL Pipeline
3 Rivulus ETL Pipeline
4 Conclusion
5 Questions

We make it simple to book and enjoy incredible experiences

Europe’s largest marketplace for travel experiences
● 50k+ products in 150+ countries
● 25M+ tickets sold
● $650M+ in VC funding
● 600+ strong global team
● 150+ traveler nationalities

Where we started
● Breaking schema changes upstream
● Requires special knowledge
● Long recovery times
● Difficult to test
● Bad SLAs

What we wanted
● Breaking schema changes upstream → Automatic handling of schema changes
● Requires special knowledge → Familiar tooling (Scala, SQL)
● Long recovery times → Maximum parallelism
● Difficult to test → Built for testability
● Bad SLAs → Better SLAs

Debezium
● Open source distributed platform for change data capture
● Captures changes from several databases
○ MySQL, Postgres, Cassandra, Oracle, SQL Server, and MongoDB
● Runs as a connector within Kafka Connect
● Reads the database's event log and streams the changes to Kafka
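To make the change stream concrete, here is a minimal sketch of how a consumer might replay Debezium-style change events into a replica. Debezium events carry `before`, `after`, `op` and `ts_ms` fields; the events below are simplified (no `source` block, no schema wrapper), and the `apply_change` helper, the `id` key and the sample rows are assumptions made for illustration, not part of the pipeline described in this talk.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], pk: str = "id") -> None:
    """Apply one simplified Debezium-style event to an in-memory replica keyed by PK."""
    op = event["op"]
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")
    if op in ("c", "r", "u"):
        # create, snapshot read, update: the "after" image is the new row state
        table[after[pk]] = after
    elif op == "d":
        # delete: only the "before" image is populated
        table.pop(before[pk], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical sample events, in log order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "score": 7}, "ts_ms": 1},
    {"op": "u", "before": {"id": 1, "score": 7}, "after": {"id": 1, "score": 9}, "ts_ms": 2},
    {"op": "c", "before": None, "after": {"id": 2, "score": 3}, "ts_ms": 3},
    {"op": "d", "before": {"id": 2, "score": 3}, "after": None, "ts_ms": 4},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change(replica, event)
```

After replaying the four events, the replica holds only row 1 with its updated score, which is the state the source table would be in.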

Schema Service
Goal: Automatic handling of schema changes
● Scala library
● Keeps track of all schema changes applied to the tables
● Holds PK, timestamp and partition columns
● Prevents breaking changes from being introduced
○ Type changes
● Upcast types
● Schema Service works on column level
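The column-level behaviour can be pictured with a small sketch. It assumes a schema is a mapping from column name to type name, and that a fixed set of widening conversions counts as a safe upcast while any other type change is rejected. The type names, the `ALLOWED_UPCASTS` set and both functions are hypothetical; the actual Schema Service is a Scala library whose API is not shown in the slides.

```python
from typing import Dict


class BreakingSchemaChange(Exception):
    """Raised when an incoming column type cannot be reconciled with the registered one."""


# Hypothetical set of safe widenings: (narrower type, wider type).
ALLOWED_UPCASTS = {
    ("int", "long"),
    ("int", "double"),
    ("long", "double"),
    ("float", "double"),
}


def merge_column(column: str, registered: str, incoming: str) -> str:
    """Return the type to keep for one column, upcasting where that is safe."""
    if registered == incoming:
        return registered
    if (registered, incoming) in ALLOWED_UPCASTS:
        return incoming  # widen the registered column
    if (incoming, registered) in ALLOWED_UPCASTS:
        return registered  # incoming data can be upcast to the registered type
    raise BreakingSchemaChange(f"{column}: {registered} -> {incoming}")


def merge_schema(registered: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Merge an incoming schema into the registered one, column by column."""
    merged = dict(registered)
    for column, incoming_type in incoming.items():
        if column in merged:
            merged[column] = merge_column(column, merged[column], incoming_type)
        else:
            merged[column] = incoming_type  # new columns are simply added
    return merged


merged = merge_schema(
    {"id": "int", "score": "int"},
    {"id": "long", "score": "int", "comment": "string"},
)
```

In this sketch a new column and a widened `id` are accepted, while a change such as `string` to `int` would raise `BreakingSchemaChange`.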

Avro Converter
Goal: Automatic handling of schema changes
● Regular Scala application
● Runs as part of Airflow DAG
● Reads raw Avro files from S3
● Communicates with Schema Service to handle schema changes automatically
● Writes out Parquet files
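One way to picture the "handle schema changes automatically" step is as conforming every raw record to the schema currently registered for its table before writing it out. The sketch below covers only that alignment step and leaves out Avro reading and Parquet writing entirely. The `CASTS` table, the `conform` function and the sample record are assumptions for illustration, not the converter's real code.

```python
from typing import Any, Dict

# Hypothetical mapping from registered type names to Python casts.
CASTS = {"int": int, "long": int, "double": float, "string": str}


def conform(record: Dict[str, Any], schema: Dict[str, str]) -> Dict[str, Any]:
    """Align one record with the registered schema.

    Columns missing from the record become None, present values are cast
    to the registered type, and columns unknown to the schema are dropped
    (a real pipeline would presumably register them first).
    """
    out: Dict[str, Any] = {}
    for column, type_name in schema.items():
        value = record.get(column)
        out[column] = None if value is None else CASTS[type_name](value)
    return out


# An old record written before "comment" existed and before "score" was widened.
conformed = conform(
    {"id": 1, "score": 7},
    {"id": "long", "score": "double", "comment": "string"},
)
```

Records written under older versions of a table's schema then come out with the same columns and types as new ones, which is what lets downstream jobs read the Parquet output uniformly.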

Upsert
● Spark application
● Runs as part of Airflow DAG
● Reads in new Parquet files
● Communicates with Schema Service to get PK, timestamp and partition columns
● Compacts the data based on table’s PK
● Creates Hive table which contains a replica of source DB
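The compaction step can be sketched in plain Python: for every primary key, keep only the row with the latest timestamp. This is a simplified stand-in for the Spark job, which would do the same thing over DataFrames. The `compact` function, the column names and the sample rows are assumptions, and delete handling is left out.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compact(existing: Iterable[Row], new_rows: Iterable[Row], pk: str, ts: str) -> List[Row]:
    """Keep the most recent version of each row, identified by primary key."""
    latest: Dict[Any, Row] = {}
    for row in list(existing) + list(new_rows):
        current = latest.get(row[pk])
        # On equal timestamps the later row in the input wins.
        if current is None or row[ts] >= current[ts]:
            latest[row[pk]] = row
    return sorted(latest.values(), key=lambda r: r[pk])


# Hypothetical table state and a new batch of changes.
existing = [
    {"booking_id": 1, "status": "pending", "update_timestamp": 100},
    {"booking_id": 2, "status": "pending", "update_timestamp": 100},
]
new_rows = [
    {"booking_id": 1, "status": "confirmed", "update_timestamp": 200},
    {"booking_id": 3, "status": "pending", "update_timestamp": 150},
]

table = compact(existing, new_rows, pk="booking_id", ts="update_timestamp")
```

The PK and timestamp column names are passed in as parameters here because, per the slide, they are looked up in the Schema Service rather than hard-coded per table.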

The performance penalty of managing transformation dependencies inefficiently

The gradual forsaking of performance on the altar of dependency management
Humble beginnings
● Small set of transformations.
● Small team / single engineer.
● Simple one-to-one type dependencies.
● Defining an optimal dependency model by hand is possible.

Complexity on the horizon
● Growing set of transformations.
● Growing team.
● One-to-many / many-to-many type dependencies.
● Defining a dependency model by hand becomes cumbersome and error-prone.
The hard choice between performance and correctness
● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance.
● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.

The performance bottleneck strikes back
● Sequential execution results in long execution and long recovery times. In other words, poor SLAs.
● 💣🔥
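The cost of sequential execution can be illustrated with a toy calculation. Run one after another, the total runtime is the sum of all transformation durations. Run with full parallelism (assuming unlimited workers), it is the length of the longest dependency chain. The durations and the dependency graph below are made-up numbers for illustration, not measurements from GYG's pipeline.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def sequential_runtime(durations: Dict[str, int]) -> int:
    """Total runtime when every transformation runs one after another."""
    return sum(durations.values())


def parallel_runtime(durations: Dict[str, int], deps: Dict[str, List[str]]) -> int:
    """Runtime with unlimited workers: the longest path through the graph."""
    finish: Dict[str, int] = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(node, ())), default=0)
        finish[node] = start + durations[node]
    return max(finish.values())


# Hypothetical durations in minutes and hypothetical dependencies.
durations = {"dim_tour": 10, "dim_customer": 20, "fact_booking": 30, "fact_review": 15}
deps = {
    "dim_tour": [],
    "dim_customer": [],
    "fact_booking": ["dim_tour", "dim_customer"],
    "fact_review": ["dim_tour"],
}

sequential = sequential_runtime(durations)    # 10 + 20 + 30 + 15
parallel = parallel_runtime(durations, deps)  # dim_customer -> fact_booking
```

The same gap applies to recovery: after a failure, a sequential model has to rerun everything downstream in order, while a precise dependency model only reruns the affected chain.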

Rivulus SQL for automated dependency inference
Goal: Maximum parallelism

Main components
● SQL transformations
○ A collection of Rivulus SQL files that make use of a set of custom template variables.
● Executor app
○ Spark app that executes a single transformation at a time.
● DGB (Dependency Graph Builder)
○ Parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars.
● Airflow
○ Executes the transformations on Databricks in the order specified by the DGB.
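As a rough picture of how a dependency graph turns into an execution order, the sketch below groups transformations into waves: everything within a wave has all of its dependencies satisfied and could run in parallel. It uses Python's standard `graphlib`. The `execution_waves` function and the sample graph are assumptions for illustration; in the pipeline described here the scheduling itself is done by Airflow.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def execution_waves(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group transformations into waves that can each run fully in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all nodes whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical graph: each transformation maps to the ones it depends on.
deps = {
    "dim_nps_feedback_stage": [],
    "dim_tour": [],
    "fact_nps_feedback": ["dim_nps_feedback_stage"],
}

waves = execution_waves(deps)
```

Here the two dimension tables form the first wave and `fact_nps_feedback` the second. A real scheduler would start each task as soon as its own dependencies finish rather than waiting for a whole wave.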

Rivulus SQL syntax
Goal: Familiar tooling (Scala, SQL)
● {% reference:target 'dim_tour' %}
○ Declares a dependency between this transformation and the dim_tour transformation that must be defined in the same SQL library
● {% reference:source 'gyg__customer' %}
○ Declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded to Hive by an extraction job
● {% load 'file.sql' %}
○ Loads a reusable subquery defined in file.sql into this transformation.
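A dependency extractor for these tags can be sketched with a regular expression. The sketch assumes tags are written with straight single quotes and produces one entry per transformation with `source_dependencies` and `transformation_dependencies` keys. The regex, the `extract_dependencies` function and the `dim_tour_example` name are assumptions; the real DGB is not shown in the slides, and `{% load %}` handling is left out.

```python
import re
from typing import Dict, List

# Matches {% reference:source 'name' %} and {% reference:target 'name' %}.
REFERENCE = re.compile(r"\{%\s*reference:(source|target)\s+'([^']+)'\s*%\}")


def extract_dependencies(name: str, sql: str) -> Dict[str, Dict[str, List[str]]]:
    """Collect the source and transformation dependencies declared in one SQL file."""
    sources: List[str] = []
    targets: List[str] = []
    for kind, ref in REFERENCE.findall(sql):
        bucket = sources if kind == "source" else targets
        if ref not in bucket:  # a table referenced twice is still one dependency
            bucket.append(ref)
    return {
        name: {
            "source_dependencies": sources,
            "transformation_dependencies": targets,
        }
    }


sql = """
SELECT c.customer_id, t.tour_id
FROM {% reference:source 'gyg__customer' %} AS c
LEFT JOIN {% reference:target 'dim_tour' %} AS t
  ON t.customer_id = c.customer_id
"""

graph_entry = extract_dependencies("dim_tour_example", sql)
```

Because the dependencies are declared in the SQL itself, nobody has to maintain a separate dependency model by hand.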

Example
Rivulus SQL:
SELECT
    nps_feedback_id
  , nps_feedback_stage_id
  , booking_id
  , score
  , feedback
  , update_timestamp
  , source
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
  ON nfs.nps_feedback_stage_name = nf.stage
DGB output:
{
  "fact_nps_feedback": {
    "source_dependencies": [
      "gyg__nps_feedback"
    ],
    "transformation_dependencies": [
      "dim_nps_feedback_stage"
    ]
  }
}
Flow: Rivulus SQL → DGB (build time) → Airflow (build time) → Executor app invocations on DB (runtime)
A word on testing
Goal: Built for testability
● Maximum parallelism enhances testability
● Separation of config from code
○ Configurable input and output paths

Results
● Eliminated vulnerability to upstream schema changes
● Democratized our ETL by migrating all business logic to SQL
● Minimized recovery time by maximizing parallelism
● Designed for E2E testability
● Cut processing time by 70% (further reductions are possible)

Next Steps
● Intra-day micro-batches
● Database Replication as a Service
● Rivulus SQL is GYG’s standard tool for writing transformations
● Delta for Upsert

Questions?
We’re hiring!
https://careers.getyourguide.com
