Airflow Summit 2024
Integrating dbt-core
with Airflow
Overcoming Performance Hurdles
Pankaj Koti, Senior Software Engineer
Tati Al-Chueyr, Staff Software Engineer
San Francisco, 11 September 2024
dbt in Airflow 2023 Airflow Survey
32.5% of the 2023 Apache Airflow Survey respondents use dbt
dbt in Airflow Airflow Summit 2023
dbt-core and Airflow was one of the most popular topics in 2023
● “Airflow at Monzo: Evolving our data platform as the bank scales” by Jonathan Rainer & Ed Sparkes
● “Using Dynamic Task Mapping to Orchestrate dbt” by Pádraic Slattery
● “Building an Airflow Pipeline with dbt and Snowflake” by Rishi Kar & George Yates
● “A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmos” by Luan Moreno Medeiros Maciel
● “Manifest destiny: Orchestrating dbt using Airflow” by Jonathan Talmi
https://airflowsummit.org/sessions/2023/
dbt in Airflow Airflow Summit 2024
dbt-core and Airflow remains a popular topic at this year's summit
● “Building on Cosmos: Making dbt on Airflow Easy” by Lewis Macdonald & Ethan Stone, 11:00 on Tuesday 10/09
● “dbt-Core & Airflow 101: Building Data Pipelines Demystified” by Luan Moreno Medeiros Maciel, 14:00 on Tuesday 10/09
● “Integrating dbt with Airflow: Overcoming Performance Hurdles” by Tatiana Al-Chueyr & Pankaj Koti
https://airflowsummit.org/sessions/2024/
intro why what metrics solutions
dbt in Airflow OSS Tools
PyPI downloads for popular OSS tools used to run dbt in Airflow
dbt in Airflow OSS Tools Adoption
dbt in Airflow OSS Tools Non-Adopters
53.4% of the dbt in Airflow survey respondents don't use any OSS tools
dbt in Airflow Performance is a challenge
Performance was the second most popular challenge, raised by 33.3% of the dbt in Airflow survey respondents. The most popular challenge was integrating dbt and Airflow from separate repositories (35.9%).
dbt in Airflow No solution fits all
dbt in Airflow Cosmos approach
$ pip install astronomer-cosmos
dbt in Airflow Cosmos approach
import os
from datetime import datetime
from pathlib import Path

from cosmos import DbtDag, ProjectConfig, ProfileConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# Location of the dbt projects; can be overridden with an environment variable.
DEFAULT_DBT_ROOT_PATH = Path(__file__).parent / "dbt"
DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH))

# Build the dbt profile from the Airflow connection "airflow_db".
profile_config = ProfileConfig(
    profile_name="jaffle_shop",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="airflow_db",
        profile_args={"schema": "public"},
    ),
)

# Render the jaffle_shop dbt project as an Airflow DAG.
basic_cosmos_dag = DbtDag(
    project_config=ProjectConfig(
        DBT_ROOT_PATH / "jaffle_shop",
    ),
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    dag_id="basic_cosmos_dag",
)
Cosmos Key Features
Translate a dbt-core workflow into an Airflow workflow
● Easily render a dbt-core project as an Airflow DAG or Task Group
● Automatically map Airflow connections into dbt profile files
● Dynamically create Airflow datasets for data-aware scheduling
● Only retry necessary dbt transformations
● Generate dbt docs and host them through the Airflow UI
● Growing active open-source community
https://github.com/astronomer/astronomer-cosmos
Apache 2.0 License
Cosmos Flexibility
The user is in control
How to parse the dbt project
⏺ dbt ls command
⏺ dbt manifest
⏺ dbt ls output
⏺ custom
Run dbt your way
Airflow worker
⏺ Local
⏺ Virtualenv
⏺ Docker
Remotely
⏺ Kubernetes
⏺ AWS EKS
⏺ Azure ACI
How the DAG is rendered
⏺ dbt selectors
⏺ test behaviour
⏺ customise the conversion
⏺ args
Where to declare DB credentials
⏺ user-defined profiles.yml
⏺ dynamically create profile from Airflow connection
Cosmos Adoption
● 44k downloads in a month (September 2023)
● 244 stars on GitHub (September 2023)
[Chart: PyPI downloads, April-Sept 2023]
https://pypistats.org/packages/astronomer-cosmos
Cosmos Adoption
● > 1.1 million downloads in a month (Aug-Sept 2024)
● 576 stars on GitHub (September 2024)
● > 60 Astronomer customers
[Chart: PyPI downloads, March-Sept 2024]
https://pypistats.org/packages/astronomer-cosmos
Cosmos Trade-off
Each tool has its pros and cons
● Dynamic DAG rendering increases the DAG Parsing time
○ Larger CPU and/or memory consumption
○ Higher DAG processor time
○ Longer task queueing time
● Running one dbt model per task is slower than running multiple dbt models per task
● Running a small dbt-core pipeline in a terminal is faster than running it on a distributed orchestration platform
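The last two trade-offs can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the function name, the fixed per-task overhead and the example numbers are assumptions, not measurements from the talk, and it ignores parallelism across workers.

```python
def estimate_sequential_run_seconds(
    n_models: int,
    model_seconds: float,
    per_task_overhead_seconds: float,
    models_per_task: int = 1,
) -> float:
    """Estimate wall-clock time when tasks run one after another.

    Every Airflow task pays a fixed overhead (queueing, parsing the DAG on
    the worker, starting dbt) on top of the time dbt spends on the models.
    """
    if n_models < 0 or models_per_task < 1:
        raise ValueError("n_models must be >= 0 and models_per_task >= 1")
    # Ceiling division: number of tasks needed to cover all models.
    n_tasks = -(-n_models // models_per_task)
    return n_tasks * per_task_overhead_seconds + n_models * model_seconds


# Hypothetical numbers: 100 models, 2 s of dbt work each, 10 s overhead per task.
one_model_per_task = estimate_sequential_run_seconds(100, 2.0, 10.0, models_per_task=1)
all_models_in_one_task = estimate_sequential_run_seconds(100, 2.0, 10.0, models_per_task=100)
print(one_model_per_task, all_models_in_one_task)
```

Under these assumed numbers the overhead dominates the one-model-per-task layout, which is why reducing per-task overhead matters so much for Cosmos-style DAGs.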
Apache Airflow When DAGs are parsed
[Diagram: Airflow components and the steps each one performs]
● DAG Processor (Scheduler): Fetch DAGs from the DAGs folder, Parse DAGs, Serialise DAG (per DAG reparse)
● Scheduler: Identify schedulable DAGs, Create DAG Run, Queue DAG Run, Identify schedulable tasks, Queue Task Runs
● Executor (Scheduler): Delegate Task Run, Queue Task Run
● Worker: Fetch DAG, Parse DAG, Run Task (per task run)
● Metadata Database
● DAGs folder
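Because the DAG file is parsed once per DAG reparse and again once per task run, the cost of a slow parse multiplies. A minimal sketch of that arithmetic follows; the function names and the example figures are assumptions for illustration.

```python
def count_dag_file_parses(processor_reparses: int, task_runs: int) -> int:
    """Count how often a DAG file is parsed over a period.

    Assumes one parse per DAG processor reparse plus one parse per task run
    on the worker, as annotated in the diagram.
    """
    return processor_reparses + task_runs


def total_parse_seconds(parse_seconds: float, processor_reparses: int, task_runs: int) -> float:
    """Total time spent parsing, assuming every parse costs the same."""
    return parse_seconds * count_dag_file_parses(processor_reparses, task_runs)


# Hypothetical example: an 8 s parse, 1 reparse and a DAG run with 20 tasks.
parses = count_dag_file_parses(processor_reparses=1, task_runs=20)
seconds = total_parse_seconds(8.0, processor_reparses=1, task_runs=20)
print(parses, seconds)
```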
Cosmos DAG Parsing Steps
1. (optional) Create profiles.yml
2. (optional) Run dbt deps
3. Parse the dbt project, using one of:
   1. dbt ls --output json
   2. manifest.json
   3. dbt_ls_output.json
   4. Cosmos parser
4. (optional) Select dbt nodes:
   a. Pre-computed
   b. Cosmos selector
   c. no selection
5. Build the Airflow TaskGroup or DAG
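To illustrate the "parse the dbt project" step, the sketch below turns the line-delimited JSON printed by `dbt ls --output json` into a dependency map that could later drive Airflow task dependencies. It is not Cosmos' own parser: the sample data is made up, and the keys used (`unique_id`, `resource_type`, `depends_on.nodes`) are assumed to be present in the output.

```python
import json

# Hypothetical sample of `dbt ls --output json` output: one JSON object per line.
SAMPLE_DBT_LS_OUTPUT = """\
{"unique_id": "seed.jaffle_shop.raw_orders", "resource_type": "seed", "depends_on": {"nodes": []}}
{"unique_id": "model.jaffle_shop.stg_orders", "resource_type": "model", "depends_on": {"nodes": ["seed.jaffle_shop.raw_orders"]}}
{"unique_id": "model.jaffle_shop.orders", "resource_type": "model", "depends_on": {"nodes": ["model.jaffle_shop.stg_orders"]}}
"""


def parse_dbt_ls_output(text: str) -> dict[str, list[str]]:
    """Map each dbt node id to the ids of the nodes it depends on."""
    graph: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip anything that is not a JSON object (for example log messages).
        if not line.startswith("{"):
            continue
        node = json.loads(line)
        graph[node["unique_id"]] = node.get("depends_on", {}).get("nodes", [])
    return graph


graph = parse_dbt_ls_output(SAMPLE_DBT_LS_OUTPUT)
for node_id, upstream_ids in graph.items():
    print(node_id, "<-", upstream_ids)
```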
Cosmos Task Run Steps (after DAG re-parsing)
-1. (pre-exec) Parse the DAG
1. (optional) Create profiles.yml
2. (optional) Create Python virtualenv (specific to ExecutionMode.VIRTUALENV)
3. (optional) Run dbt deps
4. Run dbt command (dependent upon execution mode & invocation method):
   1. dbtRunner (python)
   2. Subprocess (dbt cmd)
   3. K8s, etc.
5. (optional) Custom user callback
Cosmos Performance Disclaimer
Your choices on how to use Cosmos
directly affect how Cosmos-powered
DAGs will perform in your Airflow
deployment.
Cosmos Performance Improvements
Over the past months, several people have actively worked on improving Cosmos' performance using various strategies. This talk discusses some of them.
Cosmos Timeline
Milestones
● Dec 2022: Astronomer Hack Week
● Sept 2023: Airflow Summit ‘23
● Sept 2024: Airflow Summit ‘24
Releases
0.1 - 12.2022
0.2 - 01.2023
0.3 - 01.2023
0.4 - 02.2023
0.5 - 03.2023
0.6 - 04.2023
0.7 - 05.2023
1.0 - 07.2023
1.1 - 09.2023
1.2 - 10.2023
1.3 - 01.2024
1.4 - 05.2024
1.5 - 06.2024
1.6 - 08.2024
Some (non-performance) features
● dbt global flags, render DAG with LoadMode.DBT_LS without connection
● model versioning, Athena, custom node rendering, detach render/execution
● YAML selector support, Vertica, Snowflake encrypted key, DbtDocsGCSOperator
● dbt Docs in Airflow UI, Azure Container Instance, DbtBuild operators
● AWS EKS, Clickhouse
● LoadMode.DBT_MANIFEST from remote store, render Source Nodes, Teradata
Performance Hurdles
Cosmos 1.1 DAG Timeout issues (Jan 24)
https://github.com/astronomer/astronomer-cosmos/issues/520
Cosmos 1.1 DAG Timeout support (Jan 24)
https://github.com/astronomer/astronomer-cosmos/issues/520
Cosmos 1.2 Slowness report (Feb 24)
https://github.com/astronomer/astronomer-cosmos/issues/840
Cosmos 1.3 Performance degradation (Apr 24)
https://github.com/astronomer/astronomer-cosmos/issues/932
Cosmos 1.4 Large task queueing time (May 24)
https://github.com/astronomer/astronomer-cosmos/issues/990
Measuring Performance
Performance Metrics
DAG Parsing time
● (optional) Create profile
● (optional) Run dbt deps
● Parse the given dbt project
● (optional) Identify the selected dbt nodes
● Build the Airflow DAG or TaskGroup
Task Run time
● (optional) Create profile
● (optional) Run dbt deps
● (optional) Create virtualenv
● Setup/Run dbt command
● (optional) Callback
Task Queue time
● DAG Parsing time
DAG Run time
● A combination of the previous metrics
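One way to derive these metrics from timestamps is sketched below. The class and function names are hypothetical; in Airflow the three timestamps would typically come from a task instance's `queued_dttm`, `start_date` and `end_date`. Treating the DAG run time as the span from the first queued task to the last finished task is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TaskRun:
    task_id: str
    queued_at: datetime
    started_at: datetime
    ended_at: datetime

    @property
    def queue_time(self) -> timedelta:
        """Time spent waiting between being queued and starting to run."""
        return self.started_at - self.queued_at

    @property
    def run_time(self) -> timedelta:
        """Time spent executing the task."""
        return self.ended_at - self.started_at


def dag_run_time(task_runs: list[TaskRun]) -> timedelta:
    """Span from the first task being queued to the last task finishing."""
    return max(t.ended_at for t in task_runs) - min(t.queued_at for t in task_runs)


# Made-up timestamps: each task waits 9 s in the queue and runs for 9 s.
t0 = datetime(2024, 9, 11, 12, 0, 0)
runs = [
    TaskRun("stg_orders", t0, t0 + timedelta(seconds=9), t0 + timedelta(seconds=18)),
    TaskRun("orders", t0 + timedelta(seconds=18), t0 + timedelta(seconds=27), t0 + timedelta(seconds=36)),
]
for run in runs:
    print(run.task_id, run.queue_time, run.run_time)
print(dag_run_time(runs))
```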
Airflow Deployment
Astro Runtime 11.10.0 (Airflow 2.9.3)
Execution
● Executor: Celery
● Worker type: A5 (1 vCPU and 2 GiB RAM)
● Concurrency: 1
● Storage: 10 GiB
● Min # Workers: 4
● Max # Workers: 4
Scheduler
● High Availability: on
● Medium
○ Scheduler: 1 vCPU and 2 GiB
○ DAG Processor: 1 vCPU and 2 GiB
Benchmark Project
https://github.com/astronomer/airflow-summit-2024-cosmos/
dbt project: (old) Jaffle Shop
Baseline
Cosmos 1.2.5 (LoadMode.DBT_LS & ProfileMapping)
DAG Parsing time    00:00:08
Task Run time       00:00:09
Task Queue time     00:00:09
DAG Run time        00:01:29
Overcoming Performance Hurdles
https://drive.google.com/file/d/1R-v3fIgj5mnJWoqLe-OE0OirybdqRPAY/view?usp=sharing
Performance Improvements
● 1.2
○ Baseline
● 1.3
○ Introduction of LoadMode.DBT_LS_FILE
● 1.4
○ Script to evaluate performance
○ Introduce InvocationMode.DBT_RUNNER
○ Use & cache dbt partial parsing
○ Only run dbt deps when there is a packages.yml
● 1.5
○ Cache LoadMode.DBT_LS using Airflow variables
○ Cache ProfileMapping
● 1.6
○ Cache package-lock.yml
○ Persist ExecutionMode.VIRTUALENV directory
○ Cache LoadMode.DBT_LS using remote store
1.3
Improvement in DAG Parsing
● Introduction of LoadMode.DBT_LS_FILE by @woogakoki
○ #733
○ Similar to LoadMode.DBT_MANIFEST
○ Users have to pre-compile the project (dbt ls --output json)
○ Cosmos understands this output file
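A configuration along these lines could enable this load mode. It is a sketch that requires `astronomer-cosmos` to be installed; the paths are hypothetical, and the `dbt_ls_path` argument of `RenderConfig` is written from memory of the Cosmos documentation, so check it against the version in use.

```python
from pathlib import Path

from cosmos import ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

# Hypothetical locations; adjust to where the dbt project and the file live.
DBT_PROJECT_PATH = Path("/usr/local/airflow/dags/dbt/jaffle_shop")
# File produced ahead of time, e.g. with: dbt ls --output json > dbt_ls_output.json
DBT_LS_OUTPUT_PATH = DBT_PROJECT_PATH / "dbt_ls_output.json"

project_config = ProjectConfig(DBT_PROJECT_PATH)

# Read the pre-computed file instead of running `dbt ls` on every DAG parse.
render_config = RenderConfig(
    load_method=LoadMode.DBT_LS_FILE,
    dbt_ls_path=DBT_LS_OUTPUT_PATH,
)
```

Both objects would then be passed to `DbtDag` or `DbtTaskGroup` through their `project_config` and `render_config` arguments. The file has to be regenerated whenever the dbt project changes, for example in CI.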
Cosmos 1.3 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
“This should increase performance compared to using dbt_ls.”
Cosmos 1.3 Performance Improvements
                    1.2.5      1.3 (DBT LS FILE)
DAG Parsing time    00:00:08   00:00:02
Task Run time       00:00:09   00:00:08
Task Queue time     00:00:09   00:00:04
DAG Run time        00:01:29   00:00:55
1.4
Improvement in DAG Parsing
a) Only run dbt deps when there is a packages.yml, by AlgirdasDubickas and @tatiana
○ #1030
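The idea behind this change can be sketched in a few lines: skip `dbt deps` entirely unless the project declares packages. The helper names below are hypothetical and this is not the Cosmos implementation.

```python
import tempfile
from pathlib import Path


def should_run_dbt_deps(project_dir: Path) -> bool:
    """Return True only when the project declares packages in packages.yml."""
    return (project_dir / "packages.yml").is_file()


def build_pre_run_commands(project_dir: Path) -> list[list[str]]:
    """List the dbt commands to run before the main dbt command."""
    commands: list[list[str]] = []
    if should_run_dbt_deps(project_dir):
        commands.append(["dbt", "deps", "--project-dir", str(project_dir)])
    return commands


# Demonstration with a temporary, empty "project" directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    without_packages = build_pre_run_commands(project)
    (project / "packages.yml").write_text("packages: []\n")
    with_packages = build_pre_run_commands(project)

print(without_packages)
print(with_packages[0][:2])
```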
Cosmos 1.4 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
Improvement in DAG Parsing
b) Use & cache dbt partial parsing by @dwreeves @tatiana
○ #800, #904
○ Improvement in how dbt commands run (LoadMode.DBT_LS)
○ Leverages dbt partial_parse.msgpack
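The caching idea can be sketched as copying `partial_parse.msgpack` out of one run's `target/` folder and into the next one, so dbt does not have to parse the whole project again. The function names and directory layout are assumptions for illustration, not Cosmos internals; dbt itself decides whether a restored file is still usable.

```python
import shutil
import tempfile
from pathlib import Path

PARTIAL_PARSE_FILE = "partial_parse.msgpack"


def save_partial_parse(project_dir: Path, cache_dir: Path) -> bool:
    """Copy target/partial_parse.msgpack into the cache, if dbt produced one."""
    source = project_dir / "target" / PARTIAL_PARSE_FILE
    if not source.is_file():
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, cache_dir / PARTIAL_PARSE_FILE)
    return True


def restore_partial_parse(cache_dir: Path, project_dir: Path) -> bool:
    """Copy a cached partial_parse.msgpack into the project's target/ folder."""
    cached = cache_dir / PARTIAL_PARSE_FILE
    if not cached.is_file():
        return False
    target_dir = project_dir / "target"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cached, target_dir / PARTIAL_PARSE_FILE)
    return True


with tempfile.TemporaryDirectory() as tmp:
    first_run, cache, second_run = (Path(tmp) / name for name in ("run1", "cache", "run2"))
    (first_run / "target").mkdir(parents=True)
    # Stand-in bytes for the file dbt would write after parsing the project.
    (first_run / "target" / PARTIAL_PARSE_FILE).write_bytes(b"fake-msgpack")
    saved = save_partial_parse(first_run, cache)
    restored = restore_partial_parse(cache, second_run)
    restored_bytes = (second_run / "target" / PARTIAL_PARSE_FILE).read_bytes()

print(saved, restored)
```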
Cosmos 1.4 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
“Improve the performance to run the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%”
Improvement in Task Run
a) Only run dbt deps when there is a packages.yml, by AlgirdasDubickas and @tatiana
○ #1030
Cosmos 1.4 Performance Improvements
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
Cosmos 1.4 Performance Improvements
Improvement in Task Run
b) Use & cache dbt partial parsing by @dwreeves @tatiana
○ #800, #904
○ Improvement in how dbt commands run (LoadMode.DBT_LS)
○ Leverages dbt partial_parse.msgpack
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
“Improve the performance to run the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%”
Improvement in Task Run
c) Introduction of InvocationMode.DBT_RUNNER by @jbandoro
○ #850
○ Avoids creating subprocesses to run dbt commands during task execution
○ Relies on dbt and Airflow being in the same Python virtualenv
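Opting into this invocation mode could look like the sketch below. It requires `astronomer-cosmos` 1.4 or later and dbt installed in the same Python environment as Airflow; the argument names follow the Cosmos documentation as recalled here, so treat them as assumptions to verify.

```python
from cosmos import ExecutionConfig
from cosmos.constants import ExecutionMode, InvocationMode

# Run dbt in-process through dbtRunner instead of spawning a `dbt` subprocess.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.LOCAL,
    invocation_mode=InvocationMode.DBT_RUNNER,
)
```

The object would be passed to `DbtDag` or `DbtTaskGroup` through the `execution_config` argument. If dbt and Airflow have conflicting dependencies and cannot share an environment, the subprocess-based mode remains the alternative.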
Cosmos 1.4 Performance Improvements
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
“Using InvocationMode.DBT_RUNNER is almost 3x faster, and can speed up dag runs if there are a lot of models that execute relatively quickly since there seems to be a 1-2s speed up per task.”
Cosmos 1.4 Performance Improvements
                    1.2.5      1.4
DAG Parsing time    00:00:08   00:00:07
Task Run time       00:00:09   00:00:06
Task Queue time     00:00:09   00:00:05
DAG Run time        00:01:29   00:01:18
1.5
Improvement in DAG Parsing
a) Cache ProfileMapping by @pankajastro
○ #1046
○ Similar to partial_parse.msgpack caching
Cosmos 1.5 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
“Enabling profile caching for 100 models DAG benchmark reduced the DAG run by 11%”
Improvement in DAG Parsing
b) Cache LoadMode.DBT_LS using Airflow variables by @tatiana
○ #992 #1014
○ Mechanism to cache output of dbt ls into Airflow variable
○ Automatic purge if dbt project changes
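The "automatic purge" can be pictured as storing a fingerprint of the dbt project next to the cached output and discarding the entry when the fingerprint no longer matches. The sketch below uses an in-memory dictionary as a stand-in for the Airflow variable; the class, function names and hashing scheme are assumptions, not the Cosmos implementation.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def project_fingerprint(project_dir: Path) -> str:
    """Hash the relative path and content of every file in the project."""
    digest = hashlib.sha256()
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(project_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


class DbtLsCache:
    """In-memory stand-in for the Airflow variable holding the dbt ls output."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, str]] = {}

    def put(self, key: str, fingerprint: str, dbt_ls_output: str) -> None:
        self._entries[key] = (fingerprint, dbt_ls_output)

    def get(self, key: str, fingerprint: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_fingerprint, dbt_ls_output = entry
        if cached_fingerprint != fingerprint:
            del self._entries[key]  # purge the stale entry
            return None
        return dbt_ls_output


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    model = project / "models" / "orders.sql"
    model.parent.mkdir()
    model.write_text("select 1 as id")

    cache = DbtLsCache()
    cache.put("basic_cosmos_dag", project_fingerprint(project), '{"unique_id": "model.jaffle_shop.orders"}')
    hit = cache.get("basic_cosmos_dag", project_fingerprint(project))

    model.write_text("select 2 as id")  # the dbt project changes
    miss = cache.get("basic_cosmos_dag", project_fingerprint(project))

print(hit is not None, miss is None)
```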
Cosmos 1.5 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
“The example DAGs tested reduced the task queueing time significantly (from 30s to 0.5s) and the total DAG run time for Jaffle Shop from 1 min 25s to 40s (by more than 50%).”
Cosmos 1.5 Performance Improvements
Improvement in Task Run
a) Cache ProfileMapping by @pankajastro
○ #1046
○ Similar to partial_parse.msgpack caching
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
“Enabling profile caching for 100 models DAG benchmark reduced the DAG run by 11%”
Cosmos 1.5 Performance Improvements
                    1.2.5      1.4        1.5
DAG Parsing time    00:00:08   00:00:07   00:00:02
Task Run time       00:00:09   00:00:06   00:00:05
Task Queue time     00:00:09   00:00:05   00:00:01
DAG Run time        00:01:29   00:01:18   00:00:43
1.6
Improvement in DAG Parsing
a) Cache package-lock.yml by @pankajastro
○ #1086
○ Similar to profiles.yml and partial_parse.msgpack
Cosmos 1.6 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
Improvement in Task Run
a) Cache package-lock.yml by @pankajastro
○ #1086
○ Similar to profiles.yml and partial_parse.msgpack
Cosmos 1.6 Performance Improvements
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
Cosmos 1.6 Performance Improvements
                    1.2.5      1.4        1.5        1.6        Improvement (1.2.5 to 1.6)
DAG Parsing time    00:00:08   00:00:07   00:00:02   00:00:02   75%
Task Run time       00:00:09   00:00:06   00:00:05   00:00:04   56%
Task Queue time     00:00:09   00:00:05   00:00:01   00:00:01   89%
DAG Run time        00:01:29   00:01:18   00:00:43   00:00:42   52%
Improvement in DAG Parsing
b) Cache LoadMode.DBT_LS using remote store by @pankajkoti
○ #1147
○ Alternative to the Airflow variable caching introduced in Cosmos 1.5
Cosmos 1.6 Performance Improvements
[Diagram: DAG parsing steps (1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG)]
“Users would observe a slight delay for the tasks being in queued state (approx 1-2 seconds queued duration vs the 0-1 seconds previously in the Variable approach) due to remote storage calls.”
Improvement in Task Run
c) Persist the ExecutionMode.VIRTUALENV directory, by LennartKloppenburg and @tatiana
○ #611 #1079
○ Avoid creating a new virtualenv for each task run
○ Persist the virtualenv per Airflow worker node
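The "create once, reuse afterwards" behaviour can be sketched with the standard library alone. This is not the Cosmos implementation: the function name is hypothetical, no packages are installed into the environment, and concurrent tasks racing to create the same directory are not handled.

```python
import tempfile
import venv
from pathlib import Path


def ensure_virtualenv(venv_dir: Path) -> bool:
    """Create the virtualenv only if it does not exist yet.

    Returns True when a new environment was created, False when an existing
    one was reused.
    """
    if (venv_dir / "pyvenv.cfg").is_file():
        return False
    # with_pip=False keeps the sketch fast; a real setup would install dbt here.
    venv.EnvBuilder(with_pip=False).create(venv_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    venv_dir = Path(tmp) / "dbt_venv"
    created_on_first_task = ensure_virtualenv(venv_dir)
    created_on_second_task = ensure_virtualenv(venv_dir)

print(created_on_first_task, created_on_second_task)
```

In a real deployment the directory would need to live somewhere that survives between task runs on the same worker node, rather than in a temporary directory.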
Cosmos 1.6 Performance Improvements
[Diagram: task run steps (1. Create profiles.yml, 2. Create Python virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User-defined callback)]
“The example_virtualenv DAG saw the DAG's runtime go down from 2m31s to just 32s. I'd [expect] this improvement to be even more noticeable with more complex graphs and more python requirements.”
Overview
Overview Performance Improvements
                    1.2.5      1.3 (DBT LS FILE)   1.4        1.5        1.6
DAG Parsing time    00:00:08   00:00:02            00:00:07   00:00:02   00:00:02
Task Run time       00:00:09   00:00:08            00:00:06   00:00:05   00:00:04
Task Queue time     00:00:09   00:00:04            00:00:05   00:00:01   00:00:01
DAG Run time        00:01:29   00:00:55            00:01:18   00:00:43   00:00:42
Overview Performance Improvements
Summary
● 7 DAG Parsing improvements
● 2 speed-up improvements on running dbt
● 4 improvements on Task Run performance
● DAG Run reduced to 25% of the original time
● Community: 8 developers contributed to making Cosmos faster
● Released in the last 8 months
Future
Performance Improvements Future
We could use help to bring these ideas to life!
● Introduce Airflow native (deferrable) Operators execution mode #1134
○ dbt-core is used to pre-compile SQL
○ Python native operators (e.g. DatabricksSubmitRunOperator) execute actual transformations
● Support representing dbt models as single Airflow task #881
● Support using dbtRunner when parsing dbt project with
LoadMode.DBT_LS #865
● Support caching on remote store #1177, #1178, #1179
● Leverage Airflow magic loop #918
Takeaways
Takeaways
Cosmos 1.6 is faster than previous versions
● Some Astronomer customers moved from dbt Cloud to Cosmos with confidence, while leveraging dynamic DAG building with LoadMode.DBT_LS
Understanding how Airflow works is critical
● DAG parsing happens for every task run
Having a systematic approach to measuring the progress
● Data-driven decisions
This work is an ongoing journey of collaboration
● Multiple people contributed to this work
● More work is planned
Credits
Cosmos Open Source Community
99 - and growing - contributors on GitHub (6 September 2024)
Cosmos Active maintainers
● Julian LaNeve, CTO @ Astronomer
● Tati Al-Chueyr, Lead/Staff Software Engineer @ Astronomer
● Justin Bandoro, Data Engineer @ Kevala Analytics
● Daniel Reeves, Data Architect @ Battery Ventures
● Pankaj Singh, Senior Software Engineer @ Astronomer
● Pankaj Koti, Senior Software Engineer @ Astronomer
Last but not least
dbt in Airflow survey
https://bit.ly/dbt-airflow-survey-2024
dbt in Airflow lunch table
Come and join us for lunch, 13:30 - 14:30
We’ll have a themed table for those interested in discussing how they are running dbt in Airflow
Thank you!
Any questions?
#airflow-dbt

More Related Content

PDF
Best Practices for Effectively Running dbt in Airflow
PDF
Upcoming features in Airflow 2
PDF
PyData London - Scaling AI workloads with Ray & Airflow.pdf
PDF
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
PPTX
Running Airflow Workflows as ETL Processes on Hadoop
PDF
Airflow techtonic template
PDF
Powering machine learning workflows with Apache Airflow and Python
PDF
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Best Practices for Effectively Running dbt in Airflow
Upcoming features in Airflow 2
PyData London - Scaling AI workloads with Ray & Airflow.pdf
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
Running Airflow Workflows as ETL Processes on Hadoop
Airflow techtonic template
Powering machine learning workflows with Apache Airflow and Python
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...

Similar to Integrating dbt with Airflow - Overcoming Performance Hurdles (20)

PDF
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
PDF
Data Pipelines with Apache Airflow 1st Edition Bas P Harenslak Julian Rutger ...
PDF
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
PDF
Introduction to Apache Airflow
PDF
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
PDF
Orchestrating workflows Apache Airflow on GCP & AWS
PDF
Managing transactions on Ethereum with Apache Airflow
PDF
Why Airflow? & What's new in Airflow 2.3?
PPTX
Apache Cassandra Lunch #58: Tools for Cassandra Titans
PDF
Apache Airflow
PDF
Apache Airflow
PDF
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
PDF
Airflow Intro-1.pdf
PDF
Airflow presentation
PDF
Building Better Data Pipelines using Apache Airflow
PDF
Apache Airflow® Best Practices: DAG Writing
PDF
Data Pipelines with Apache Airflow 1st Edition Bas P Harenslak Julian Rutger ...
PDF
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
PPSX
Introduce Airflow.ppsx
PDF
Building Automated Data Pipelines with Airflow.pdf
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
Data Pipelines with Apache Airflow 1st Edition Bas P Harenslak Julian Rutger ...
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Introduction to Apache Airflow
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
Orchestrating workflows Apache Airflow on GCP & AWS
Managing transactions on Ethereum with Apache Airflow
Why Airflow? & What's new in Airflow 2.3?
Apache Cassandra Lunch #58: Tools for Cassandra Titans
Apache Airflow
Apache Airflow
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Airflow Intro-1.pdf
Airflow presentation
Building Better Data Pipelines using Apache Airflow
Apache Airflow® Best Practices: DAG Writing
Data Pipelines with Apache Airflow 1st Edition Bas P Harenslak Julian Rutger ...
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Introduce Airflow.ppsx
Building Automated Data Pipelines with Airflow.pdf
Ad

More from Tatiana Al-Chueyr (20)

PDF
dbt no Airflow: Como melhorar o seu deploy (de forma correta)
PDF
Integrating ChatGPT with Apache Airflow
PDF
Contributing to Apache Airflow
PDF
From an idea to production: building a recommender for BBC Sounds
PDF
Precomputing recommendations with Apache Beam
PDF
Scaling machine learning to millions of users with Apache Beam
PDF
Clearing Airflow Obstructions
PPTX
Scaling machine learning workflows with Apache Beam
PDF
Responsible machine learning at the BBC
PPTX
Responsible Machine Learning at the BBC
PDF
PyConUK 2018 - Journey from HTTP to gRPC
PDF
Sprint cPython at Globo.com
PDF
PythonBrasil[8] - CPython for dummies
PDF
QCon SP - recommended for you
PDF
Crafting APIs
PDF
PyConUK 2016 - Writing English Right
PDF
InVesalius: 3D medical imaging software
PDF
Automatic English text correction
PDF
Python packaging and dependency resolution
PDF
Rio info 2013 - Linked Data at Globo.com
dbt no Airflow: Como melhorar o seu deploy (de forma correta)
Integrating ChatGPT with Apache Airflow
Contributing to Apache Airflow
From an idea to production: building a recommender for BBC Sounds
Precomputing recommendations with Apache Beam
Scaling machine learning to millions of users with Apache Beam
Clearing Airflow Obstructions
Scaling machine learning workflows with Apache Beam
Responsible machine learning at the BBC
Responsible Machine Learning at the BBC
PyConUK 2018 - Journey from HTTP to gRPC
Sprint cPython at Globo.com
PythonBrasil[8] - CPython for dummies
QCon SP - recommended for you
Crafting APIs
PyConUK 2016 - Writing English Right
InVesalius: 3D medical imaging software
Automatic English text correction
Python packaging and dependency resolution
Rio info 2013 - Linked Data at Globo.com
Ad

Recently uploaded (20)

PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
KodekX | Application Modernization Development
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPT
Teaching material agriculture food technology
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
MIND Revenue Release Quarter 2 2025 Press Release
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Encapsulation_ Review paper, used for researhc scholars
KodekX | Application Modernization Development
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Teaching material agriculture food technology
Diabetes mellitus diagnosis method based random forest with bat algorithm
NewMind AI Weekly Chronicles - August'25 Week I
Chapter 3 Spatial Domain Image Processing.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
sap open course for s4hana steps from ECC to s4
Digital-Transformation-Roadmap-for-Companies.pptx
The AUB Centre for AI in Media Proposal.docx
Spectral efficient network and resource selection model in 5G networks
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
MIND Revenue Release Quarter 2 2025 Press Release

Integrating dbt with Airflow - Overcoming Performance Hurdles

  • 1. Airflow Summit 2024 Integrating dbt-core with Airflow Overcoming Performance Hurdles Pankaj Koti Senior Software Engineer Tati Al-Chueyr Staff Software Engineer San Francisco, 11 September 2024
  • 2. dbt in Airflow 2023 Airflow Survey 32.5% of the 2023 Apache Airflow Survey respondents use dbt
  • 3. dbt in Airflow Airflow Summit 2023 dbt-core and Airflow was one of the most popular topics in 2023 ● “Airflow at Monzo: Evolving our data platform as the bank scalesˮ by Jonathan Rainer & Ed Sparkes ● “Using Dynamic Task Mapping to Orchestrate dbtˮ by Pádraic Slattery ● “Building an Airflow Pipeline with dbt and Snowflakeˮ by Rishi Kar & George Yates ● “A Single Pane of Glass on Airflow using Astro Python SDK, Snowflake, dbt, and Cosmosˮ by Luan Moreno Medeiros Maciel ● “Manifest destiny: Orchestrating dbt using Airflowˮ by Jonathan Talmi https://airflowsummit.org/sessions/2023/
  • 4. dbt in Airflow Airflow Summit 2024 dbt-core and Airflow remains a popular topic in this yearʼs summit ● “Building on Cosmos: Making dbt on Airflow Easyˮ by Lewis Macdonald & Ethan Stone 1100 on Tuesday 10/09 ● “dbt-Core & Airflow 101 Building Data Pipelines Demystifiedˮ by Luan Moreno Medeiros Maciel 1400 on Tuesday 10/09 ● “Integrating dbt with Airflow: Overcoming Performance Hurdlesˮ by Tatiana Al-Chueyr & Pankaj Koti https://airflowsummit.org/sessions/2024/
  • 5. dbt in Airflow Airflow Summit 2024 dbt-core and Airflow remains a popular topic in this yearʼs summit ● “Building on Cosmos: Making dbt on Airflow Easyˮ by Lewis Macdonald & Ethan Stone 1100 on Tuesday 10/09 ● “dbt-Core & Airflow 101 Building Data Pipelines Demystifiedˮ by Luan Moreno Medeiros Maciel 1400 on Tuesday 10/09 ● “Integrating dbt with Airflow: Overcoming Performance Hurdlesˮ by Tatiana Al-Chueyr & Pankaj Koti https://airflowsummit.org/sessions/2024/
  • 6. dbt in Airflow Airflow Summit 2024 dbt-core and Airflow remains a popular topic in this yearʼs summit ● “Building on Cosmos: Making dbt on Airflow Easyˮ by Lewis Macdonald & Ethan Stone 1100 on Tuesday 10/09 ● “dbt-Core & Airflow 101 Building Data Pipelines Demystifiedˮ by Luan Moreno Medeiros Maciel 1400 on Tuesday 10/09 ● “Integrating dbt with Airflow: Overcoming Performance Hurdlesˮ by Tatiana Al-Chueyr & Pankaj Koti https://airflowsummit.org/sessions/2024/ intro why what metrics solutions
  • 7. dbt in Airflow OSS Tools
  • 8. PyPI downloads for OSS popular tools used to run dbt in Airflow dbt in Airflow OSS Tools Adoption
  • 9. dbt in Airflow OSS Tools Non-Adopters 53.4% of the dbt in Airflow survey respondents donʼt use any OSS tools
  • 10. dbt in Airflow Performance is a challenge Performance was the second most popular challenge raised by 33.3% of the dbt in Airflow survey respondents. The most popular challenge was integrating dbt and Airflow from separate repositories 35.9%
  • 11. dbt in Airflow No solution fits all
  • 12. dbt in Airflow Cosmos approach $ pip install astronomer-cosmos
  • 13. dbt in Airflow Cosmos approach import os from datetime import datetime from pathlib import Path from cosmos import DbtDag, ProjectConfig, ProfileConfig from cosmos.profiles import PostgresUserPasswordProfileMapping DEFAULT_DBT_ROOT_PATH = Path(__file__).parent / "dbt" DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH)) profile_config = ProfileConfig( profile_name="jaffle_shop", target_name="dev", profile_mapping=PostgresUserPasswordProfileMapping( conn_id="airflow_db", profile_args={"schema": "public"}, ), ) basic_cosmos_dag = DbtDag( project_config=ProjectConfig( DBT_ROOT_PATH / "jaffle_shop", ), profile_config=profile_config, schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False, dag_id="basic_cosmos_dag", )
  • 14. Cosmos Key Features Translate a dbt-core workflow into an Airflow workflow ● Easily render a dbt-core project as an Airflow DAG or Task Group ● Automatically map Airflow connections into dbt profile files ● Dynamically create Airflow datasets for data-aware scheduling ● Only retry necessary dbt transformations ● Generate dbt docs and host them through the Airflow UI ● Growing active open-source community https://github.com/astronomer/astronomer-cosmos Apache 2.0 License
  • 15. Cosmos Flexibility The user is in control How to parse the dbt project ⏺ dbt ls command ⏺ dbt manifest ⏺ dbt ls output ⏺ custom Run dbt your way Airflow worker ⏺ Local ⏺ Virtualenv ⏺ Docker Remotely ⏺ Kubernetes ⏺ AWS EKS ⏺ Azure ACI How the DAG is rendered ⏺ dbt selectors ⏺ test behaviour ⏺ customise the conversion ⏺ args Where to declare DB credentials ⏺ user-defined profiles.yml ⏺ dynamically create profile from Airflow connection
  • 16. Cosmos Adoption ● 44k downloads in a month (September 2023) ● 244 stars on GitHub (September 2023) https://pypistats.org/packages/astronomer-cosmos (April-Sept 2023)
  • 17. Cosmos Adoption ● > 1.1 million downloads in a month (Aug-Sept 2024) ● 576 stars on GitHub (September 2024) ● > 60 Astronomer customers (March-Sept 2024) https://pypistats.org/packages/astronomer-cosmos
  • 18. Cosmos Trade-off Each tool has its pros and cons ● Dynamic DAG rendering increases the DAG parsing time ○ Larger CPU and/or memory consumption ○ Higher DAG processor time ○ Longer task queueing time ● Running one dbt model per task is slower than running multiple dbt models per task ● Running a small dbt-core pipeline in a terminal is faster than running it on a distributed orchestration platform
  • 19. Apache Airflow When DAGs are parsed [Diagram] Components: DAG Processor (Scheduler), Scheduler, Executor (Scheduler), Worker, Metadata Database, DAGs folder. Steps shown: Fetch DAGs, Parse DAGs, Serialise DAG, Create DAG Run, Identify schedulable tasks, Queue Task Runs, Queue DAG Run, Delegate Task Run, Queue Task Run, Parse DAG, Run Task, Fetch DAG, Identify schedulable DAGs.
  • 20. Apache Airflow When DAGs are parsed [Same diagram as slide 19, with two parsing steps annotated: (per task run) and (per DAG reparse)]
  • 21. Cosmos DAG Parsing Steps 1. (optional) Create profiles.yml 2. (optional) Run dbt deps 3. Parse the dbt project, using one of: 1. dbt ls --output json 2. manifest.json 3. dbt_ls_output.json 4. Cosmos parser 4. (optional) Select dbt nodes: a. Pre-computed b. Cosmos selector c. no selection 5. Build the Airflow TaskGroup or DAG
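The last two parsing steps (select dbt nodes, then build the Airflow DAG) can be illustrated with a minimal standard-library sketch. It is not Cosmos code: the node dictionary, the prefix-based `select_nodes` helper and the `execution_order` function are hypothetical stand-ins that only show how a dependency map turns into an ordering of tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified dbt nodes: unique_id -> upstream unique_ids.
dbt_nodes = {
    "model.jaffle_shop.stg_customers": [],
    "model.jaffle_shop.stg_orders": [],
    "model.jaffle_shop.customers": [
        "model.jaffle_shop.stg_customers",
        "model.jaffle_shop.stg_orders",
    ],
    "seed.jaffle_shop.raw_payments": [],
}


def select_nodes(nodes, prefix):
    """Keep nodes whose id starts with prefix; drop edges to unselected nodes."""
    selected = {uid for uid in nodes if uid.startswith(prefix)}
    return {
        uid: [upstream for upstream in nodes[uid] if upstream in selected]
        for uid in selected
    }


def execution_order(nodes):
    """Return one valid run order: every node appears after its upstream nodes."""
    return list(TopologicalSorter(nodes).static_order())


order = execution_order(select_nodes(dbt_nodes, "model."))
print(order)
```

In a real DAG each entry would become an Airflow task and each upstream edge a task dependency; the ordering here only demonstrates that the selected graph is consistent.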
  • 22. Cosmos Task Run Steps (after DAG re-parsing) -1. (pre-exec) Parse the DAG, which includes Run dbt deps and Parse the dbt project 1. (optional) Create profiles.yml 2. (optional) Create Python virtualenv (specific to ExecutionMode.VIRTUALENV) 3. (optional) Run dbt deps 4. Run dbt command (dependent upon execution mode & invocation method): 1. dbtRunner (python) 2. Subprocess (dbt cmd) 3. K8s, etc. 5. (optional) Custom user callback
  • 23. Cosmos Performance Disclaimer Your choices on how to use Cosmos directly affect how Cosmos-powered DAGs will perform in your Airflow deployment.
  • 24. Cosmos Performance Improvements Over the past months, several people have actively worked to improve performance using various strategies. This talk discusses some of them.
  • 25. Cosmos 1.4 Dec 2022 Astronomer Hack Week Sept 2024 Airflow Summit ‘24 0.1 - 12.2022 0.2 - 01.2023 0.3 - 01.2023 0.4 - 02.2023 0.5 - 03.2023 0.6 - 04.2023 0.7 - 05.2023 1.0 - 07.2023 1.1 - 09.2023 1.2 - 10.2023 1.3 - 01.2024 1.4 - 05.2024 1.5 - 06.2024 1.6 - 08.2024 Sept 2023 Airflow Summit ‘23 Cosmos Timeline
  • 26. 0.1 - 12.2022 0.2 - 01.2023 0.3 - 01.2023 0.4 - 02.2023 0.5 - 03.2023 0.6 - 04.2023 0.7 - 05.2023 1.0 - 07.2023 1.1 - 09.2023 1.2 - 10.2023 1.3 - 01.2024 1.4 - 05.2024 1.5 - 06.2024 1.6 - 08.2024 Cosmos Timeline Some (non-performance) features dbt global flags, render DAG with LoadMode.DBT_LS without connection model versioning, Athena, custom node rendering, detach render/execution YAML selector support, Vertica, Snowflake encrypted key, DbtDocsGCSOperator dbt Docs in Airflow UI, Azure Container Instance, DbtBuild operators AWS EKS, Clickhouse LoadMode.DBT_MANIFEST from remote store, render Source Nodes, Teradata
  • 28. Cosmos 1.1 DAG Timeout issues https://github.com/astronomer/astronomer-cosmos/issues/520 Jan 24
  • 29. Cosmos 1.1 DAG Timeout support https://github.com/astronomer/astronomer-cosmos/issues/520 Jan 24
  • 30. Cosmos 1.2 Slowness report https://github.com/astronomer/astronomer-cosmos/issues/840 Feb 24
  • 31. Cosmos 1.3 Performance degradation https://github.com/astronomer/astronomer-cosmos/issues/932 Apr 24
  • 32. Cosmos 1.4 Large task queueing time https://github.com/astronomer/astronomer-cosmos/issues/990 May 24
  • 35. Performance Metrics DAG Parsing time ● (optional) Create profile ● (optional) Run dbt deps ● Parse the given dbt project ● (optional) Identify the selected dbt nodes ● Build the Airflow DAG or TaskGroup Task Run time ● (optional) Create profile ● (optional) Run dbt deps ● (optional) Create virtualenv ● Setup/Run dbt command ● (optional) Callback
  • 36. Performance Metrics DAG Parsing time ● (optional) Create profile ● (optional) Run dbt deps ● Parse the given dbt project ● (optional) Identify the selected dbt nodes ● Build the Airflow DAG or TaskGroup Task Run time ● (optional) Create profile ● (optional) Run dbt deps ● (optional) Create virtualenv ● Setup/Run dbt command ● (optional) Callback Task Queue time ● DAG Parsing time DAG Run time ● A combination of the previous metrics
  • 37. Airflow Deployment Astro Runtime 11.10.0 Airflow 2.9.3 Execution ● Executor Celery ● Worker type A5 1 vCPU and 2GiB RAM ● Concurrency 1 ● Storage 10 GiB ● Min # Workers 4 ● Max # Workers 4 Scheduler ● High Availability on ● Medium ○ Scheduler 1 vCPU and 2 GiB ○ DAG Processor 1 vCPU and 2 GiB
  • 39. Baseline Cosmos 1.2.5 (LoadMode.DBT_LS & ProfileMapping)
        DAG Parsing time  00:00:08
        Task Run time     00:00:09
        Task Queue time   00:00:09
        DAG Run time      00:01:29
  • 42. Performance Improvements ● 1.2 ○ Baseline ● 1.3 ○ Introduction of LoadMode.DBT_LS_FILE ● 1.4 ○ Script to evaluate performance ○ Introduce InvocationMode.DBT_RUNNER ○ Use & cache dbt partial parsing ○ Only run dbt deps when there is packages.yml ● 1.5 ○ Cache LoadMode.DBT_LS using Airflow variables ○ Cache ProfileMapping
  • 43. Performance Improvements ● 1.6 ○ Cache package-lock.yml ○ Persist ExecutionMode.VIRTUALENV directory ○ Cache LoadMode.DBT_LS using remote store
  • 44. 1.3
  • 45. Cosmos 1.3 Performance Improvements Improvement in DAG Parsing ● Introduction of LoadMode.DBT_LS_FILE by @woogakoki ○ #733 ○ Similar to LoadMode.DBT_MANIFEST ○ Users have to pre-compile the project (dbt ls --output json) ○ Cosmos understands this output file “This should increase performance compared to using dbt_ls.” [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
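To make the idea of a pre-computed `dbt ls --output json` file concrete, here is a minimal sketch of reading such a file with the standard library. It assumes the output holds one JSON object per line with `unique_id`, `resource_type` and `depends_on.nodes` fields; the sample lines are invented for illustration, and the `parse_dbt_ls_output` helper is hypothetical rather than the Cosmos parser.

```python
import json


def parse_dbt_ls_output(text):
    """Map unique_id -> resource type and upstream node ids.

    Lines that are not JSON objects (for example log lines) are skipped.
    """
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        node = json.loads(line)
        nodes[node["unique_id"]] = {
            "resource_type": node.get("resource_type"),
            "depends_on": node.get("depends_on", {}).get("nodes", []),
        }
    return nodes


# Invented sample standing in for a saved dbt ls output file.
sample_output = "\n".join(
    [
        "12:00:00  Running with dbt",
        json.dumps(
            {
                "unique_id": "model.jaffle_shop.stg_orders",
                "resource_type": "model",
                "depends_on": {"nodes": []},
            }
        ),
        json.dumps(
            {
                "unique_id": "model.jaffle_shop.orders",
                "resource_type": "model",
                "depends_on": {"nodes": ["model.jaffle_shop.stg_orders"]},
            }
        ),
    ]
)

nodes = parse_dbt_ls_output(sample_output)
print(sorted(nodes))
```

Because the file is produced ahead of time, reading it this way involves no dbt invocation during DAG parsing, which is the source of the speed-up the slide describes.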
  • 46. Cosmos 1.3 Performance Improvements
        Metric            1.2.5      1.3 DBT LS FILE
        DAG Parsing time  00:00:08   00:00:02
        Task Run time     00:00:09   00:00:08
        Task Queue time   00:00:09   00:00:04
        DAG Run time      00:01:29   00:00:55
  • 47. 1.4
  • 48. Cosmos 1.4 Performance Improvements Improvement in DAG Parsing a) Only run dbt deps when there is packages.yml by AlgirdasDubickas and @tatiana ○ #1030 [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
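The check behind "only run dbt deps when there is packages.yml" can be sketched in a few lines. This is an illustration of the idea, not the Cosmos implementation; `needs_dbt_deps` is a hypothetical helper and the temporary directory stands in for a dbt project.

```python
import tempfile
from pathlib import Path


def needs_dbt_deps(project_dir):
    """Return True only when the project declares packages to install."""
    return (Path(project_dir) / "packages.yml").is_file()


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    # A project without packages.yml can skip the dbt deps step entirely.
    without_packages = needs_dbt_deps(project)
    (project / "packages.yml").write_text("packages: []\n")
    with_packages = needs_dbt_deps(project)

print(without_packages, with_packages)
```

Skipping the step avoids a dbt invocation on every DAG parse and every task run for projects that have no packages.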
  • 49. Cosmos 1.4 Performance Improvements Improvement in DAG Parsing b) Use & cache dbt partial parsing by @dwreeves and @tatiana ○ #800, #904 ○ Improvement in how dbt commands run (LoadMode.DBT_LS) ○ Leverages dbt partial_parse.msgpack “Improve the performance to run the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%” [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
  • 50. Cosmos 1.4 Performance Improvements Improvement in Task Run a) Only run dbt deps when there is packages.yml by AlgirdasDubickas and @tatiana ○ #1030 [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
  • 51. Cosmos 1.4 Performance Improvements Improvement in Task Run b) Use & cache dbt partial parsing by @dwreeves and @tatiana ○ #800, #904 ○ Improvement in how dbt commands run (LoadMode.DBT_LS) ○ Leverages dbt partial_parse.msgpack “Improve the performance to run the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%” [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
  • 52. Cosmos 1.4 Performance Improvements Improvement in Task Run c) Introduction of InvocationMode.DBT_RUNNER by @jbandoro ○ #850 ○ Avoids creating subprocesses to run dbt commands in task execution ○ Relies on dbt and Airflow being in the same Python virtualenv “Using InvocationMode.DBT_RUNNER is almost 3x faster, and can speed up dag runs if there are a lot of models that execute relatively quickly since there seems to be a 1-2s speed up per task.” [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
  • 53. Cosmos 1.4 Performance Improvements
        Metric            1.2.5      1.4
        DAG Parsing time  00:00:08   00:00:07
        Task Run time     00:00:09   00:00:06
        Task Queue time   00:00:09   00:00:05
        DAG Run time      00:01:29   00:01:18
  • 54. 1.5
  • 55. Cosmos 1.5 Performance Improvements Improvement in DAG Parsing a) Cache ProfileMapping by @pankajastro ○ #1046 ○ Similar to partial_parse.msgpack caching “Enabling profile caching for 100 models DAG benchmark reduced the DAG run by 11%” [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
  • 56. Cosmos 1.5 Performance Improvements Improvement in DAG Parsing b) Cache LoadMode.DBT_LS using Airflow variables by @tatiana ○ #992, #1014 ○ Mechanism to cache the output of dbt ls into an Airflow variable ○ Automatic purge if the dbt project changes “The example DAGs tested reduced the task queueing time significantly (from 30s to 0.5s) and the total DAG run time for Jaffle Shop from 1 min 25s to 40s (by more than 50%).” [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
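One way to picture "cache the dbt ls output and purge it when the dbt project changes" is a cache entry keyed by a fingerprint of the project files. The sketch below is an assumption-laden illustration, not how Cosmos implements it: `project_fingerprint` and `DbtLsCache` are hypothetical names, and an in-memory dictionary stands in for the Airflow variable.

```python
import hashlib
import tempfile
from pathlib import Path


def project_fingerprint(project_dir):
    """Hash the relative path and content of every file in the project."""
    root = Path(project_dir)
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


class DbtLsCache:
    """In-memory stand-in for a store such as an Airflow variable."""

    def __init__(self):
        self._store = {}

    def set(self, key, fingerprint, output):
        self._store[key] = {"fingerprint": fingerprint, "output": output}

    def get(self, key, fingerprint):
        entry = self._store.get(key)
        if entry is None:
            return None
        if entry["fingerprint"] != fingerprint:
            # The project changed since the entry was written: purge it.
            del self._store[key]
            return None
        return entry["output"]


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "models" / "customers.sql"
    model.parent.mkdir()
    model.write_text("select 1 as id")

    cache = DbtLsCache()
    cache.set("jaffle_shop", project_fingerprint(tmp), "cached dbt ls output")
    hit = cache.get("jaffle_shop", project_fingerprint(tmp))

    model.write_text("select 2 as id")  # simulate a change to the dbt project
    miss = cache.get("jaffle_shop", project_fingerprint(tmp))

print(hit, miss)
```

On a cache hit the DAG parse can skip running dbt ls altogether, which is consistent with the drop in task queueing time quoted on the slide.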
  • 57. Cosmos 1.5 Performance Improvements
  • 58. Cosmos 1.5 Performance Improvements Improvement in Task Run a) Cache ProfileMapping by @pankajastro ○ #1046 ○ Similar to partial_parse.msgpack caching “Enabling profile caching for 100 models DAG benchmark reduced the DAG run by 11%” [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
  • 59. Cosmos 1.5 Performance Improvements
        Metric            1.2.5      1.4        1.5
        DAG Parsing time  00:00:08   00:00:07   00:00:02
        Task Run time     00:00:09   00:00:06   00:00:05
        Task Queue time   00:00:09   00:00:05   00:00:01
        DAG Run time      00:01:29   00:01:18   00:00:43
  • 60. 1.6
  • 61. Cosmos 1.6 Performance Improvements Improvement in DAG Parsing a) Cache package-lock.yml by @pankajastro ○ #1086 ○ Similar to profiles.yml and partial_parse.msgpack [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
  • 62. Cosmos 1.6 Performance Improvements Improvement in Task Run a) Cache package-lock.yml by @pankajastro ○ #1086 ○ Similar to profiles.yml and partial_parse.msgpack [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
  • 63. Cosmos 1.6 Performance Improvements
        Metric            1.2.5      1.4        1.5        1.6        Reduction (1.2.5 to 1.6)
        DAG Parsing time  00:00:08   00:00:07   00:00:02   00:00:02   75%
        Task Run time     00:00:09   00:00:06   00:00:05   00:00:04   56%
        Task Queue time   00:00:09   00:00:05   00:00:01   00:00:01   89%
        DAG Run time      00:01:29   00:01:18   00:00:43   00:00:42   52%
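The percentages on this slide can be checked with a short calculation. The sketch treats each timing as hours:minutes:seconds, so the 1.2.5 DAG run is 1 min 29 s and the 1.6 DAG run is 42 s; that reading is an assumption about the slide's timing format, and the helper names are illustrative.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def reduction_percent(before, after):
    """Percentage by which the duration shrank between two measurements."""
    before_s, after_s = to_seconds(before), to_seconds(after)
    return 100 * (before_s - after_s) / before_s


# (Cosmos 1.2.5, Cosmos 1.6) timings for each metric.
timings = {
    "DAG Parsing time": ("00:00:08", "00:00:02"),
    "Task Run time": ("00:00:09", "00:00:04"),
    "Task Queue time": ("00:00:09", "00:00:01"),
    "DAG Run time": ("00:01:29", "00:00:42"),
}

reductions = {name: reduction_percent(*pair) for name, pair in timings.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% shorter")
```

The computed values (75.0, about 55.6, about 88.9 and about 52.8) agree with the slide's 75%, 56%, 89% and 52% up to rounding.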
  • 64. Cosmos 1.6 Performance Improvements Improvement in DAG Parsing b) Cache LoadMode.DBT_LS using remote store by @pankajkoti ○ #1147 ○ Alternative to Airflow variable caching “Users would observe a slight delay for the tasks being in queued state (approx 1-2 seconds queued duration vs the 0-1 seconds previously in the Variable approach) due to remote storage calls.” [Diagram: DAG parsing steps: 1. Create profiles.yml, 2. Run dbt deps, 3. Parse the dbt project, 4. Select nodes, 5. Build the Airflow DAG]
  • 65. Cosmos 1.6 Performance Improvements Improvement in Task Run c) Persist ExecutionMode.VIRTUALENV directory by LennartKloppenburg and @tatiana ○ #611, #1079 ○ Avoid creating a new virtualenv for each task run ○ Persist the virtualenv per Airflow worker node “The example_virtualenv DAG saw the DAG's runtime go down from 2m31s to just 32s. I'd expect this improvement to be even more noticeable with more complex graphs and more python requirements.” [Diagram: task run steps: 1. Create profiles.yml, 2. Create py virtualenv, 3. Run dbt deps, 4. Run dbt command, 5. User def callback]
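The "create once, reuse afterwards" idea behind persisting the virtualenv can be sketched as follows. This is not the Cosmos implementation: `ensure_virtualenv` is a hypothetical helper, it detects an existing environment by the `pyvenv.cfg` file that the standard `venv` module writes, and the demo swaps in a fake creator so it runs instantly.

```python
import tempfile
import venv
from pathlib import Path


def ensure_virtualenv(venv_dir, create=venv.create):
    """Create the virtualenv only if it is missing; return True when created."""
    venv_dir = Path(venv_dir)
    if (venv_dir / "pyvenv.cfg").is_file():
        return False  # an earlier task on this worker already built it
    create(venv_dir)
    return True


def fake_create(path):
    """Cheap stand-in for venv.create, used only for this demonstration."""
    Path(path).mkdir(parents=True, exist_ok=True)
    (Path(path) / "pyvenv.cfg").write_text("home = /usr/bin\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cosmos_venv"
    created_first_time = ensure_virtualenv(target, create=fake_create)
    created_second_time = ensure_virtualenv(target, create=fake_create)

print(created_first_time, created_second_time)
```

With the default `create=venv.create`, the first task on a worker pays the creation cost and later tasks reuse the directory, which matches the per-worker persistence described on the slide.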
  • 67. Overview Performance Improvements
        Metric            1.2.5      1.3 DBT LS FILE   1.4        1.5        1.6
        DAG Parsing time  00:00:08   00:00:02          00:00:07   00:00:02   00:00:02
        Task Run time     00:00:09   00:00:08          00:00:06   00:00:05   00:00:04
        Task Queue time   00:00:09   00:00:04          00:00:05   00:00:01   00:00:01
        DAG Run time      00:01:29   00:00:55          00:01:18   00:00:43   00:00:42
  • 68. Overview Performance Improvements Summary 7 DAG Parsing improvements 2 Speed up improvements on running dbt 4 Improvements on Task run performance DAG Run Reduced to 25% of the original time Community 8 developers contributed to making Cosmos faster Released in the last 8 months
  • 70. Performance Improvements Future We could use help to bring these ideas to life! ● Introduce Airflow native (deferrable) Operators execution mode #1134 ○ dbt-core is used to pre-compile SQL ○ Python native operators (e.g. DatabricksSubmitRunOperator) execute actual transformations ● Support representing dbt models as a single Airflow task #881 ● Support using dbtRunner when parsing dbt project with LoadMode.DBT_LS #865 ● Support caching on remote store #1177, #1178, #1179 ● Leverage Airflow magic loop #918
  • 72. Takeaways Cosmos 1.6 is faster than previous versions ● Some Astronomer customers moved from dbt Cloud to Cosmos with confidence, while leveraging dynamic DAG building with LoadMode.DBT_LS Understanding how Airflow works is critical ● DAG parsing happens for every task run Having a systematic approach to measuring progress ● Data-driven decisions This work is an ongoing journey of collaboration ● Multiple people contributed to this work ● More work is planned
  • 74. Cosmos Open Source Community 99 (and growing) contributors on GitHub, 6 September 2024
  • 75. Julian LaNeve CTO @ Astronomer Tati Al-Chueyr Lead/ Staff Software Engineer @ Astronomer Justin Bandoro Data Engineer @ Kevala Analytics Daniel Reeves Data Architect @ Battery Ventures Pankaj Singh Senior Software Engineer @ Astronomer Pankaj Koti Senior Software Engineer @ Astronomer Cosmos Active maintainers
  • 76. Last but not least
  • 77. dbt in Airflow survey https://bit.ly/dbt-airflow-survey-2024
  • 78. dbt in Airflow lunch table Come and join us for lunch 13:30 - 14:30 We’ll have a themed table for those interested in discussing how they are running dbt in Airflow