From Zero to Airflow
bootstrapping a ML platform
About us
Bluevine
● Fintech startup based in
Redwood City, CA and Tel Aviv,
Israel
● Provides working capital
(loans) to small & medium sized
businesses
● Over $2 BN funded to date
● Over $3.5 BN delivered through the
Paycheck Protection Program
Me
● Noam Elfanbaum (@noamelf), Data
Engineering team lead @
BlueVine
● Live in Tel-Aviv with my wife,
kid and dog.
● My colleague Ido Shlomo created
the original presentation for
the OSDC 2019 conference.
Case study
Building an ML analytics platform in production using
Apache Airflow at Bluevine. This includes:
● Migrating our ML workload to Airflow
● Hacking at Airflow to provide a semi-streaming solution
● Monitoring business-sensitive processes
Part 1:
Migrating to
Airflow
What was in place?
● Lots (and lots) of cron-jobs on
a single server!
● Every piece of logic ran as an
independent cron job
● Every cron job figured out
its own triggering mechanism
● Every cron job figured out
its own dependencies
● No communication between jobs
Goals
Existing
● Scope defined by # of clients
in data batch
● Over 15 minutes to reach a decision
● Hidden and distributed
dependencies
● Hard and confusing monitoring
● Impractical to scale
● “All or nothing” error recovery
Desired
● Ability to process one client
end-to-end
● Decision within a few minutes
● Map and centrally control
dependencies
● Easy and simple monitoring
● Easy to scale
● Efficient error recovery
Airflow brief intro
● Core component is the scheduler
/ executor
● Uses dedicated metadata DB to
figure out current status of
tasks
● Uses workers to execute new
ones
● Web server allows live
interaction and monitoring
What is a DAG?
DAG: Directed Acyclic Graph
● Basically a map of tasks run in
a certain dependency structure
● Each DAG has a run frequency
(e.g. every 10 seconds)
● Both DAGs and tasks can run
concurrently
Infrastructure setup
● We run on AWS - and prefer
managed services
● Celery is the executor
● Flower proved very useful for
monitoring worker state
● No-frills setup!
Isolated environments
● Isolation between Airflow
environment and our scripts
● BashOperator executes each
script under the correct
virtual environment
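A minimal sketch of how a task's command might pin a script to its own virtual environment. The `venv_command` helper, the paths, and the script name are hypothetical; the string it builds is the kind of value that would be passed as `bash_command` to Airflow's `BashOperator`.

```python
import shlex


def venv_command(venv_dir: str, script: str, *args: str) -> str:
    """Build a shell command that runs `script` with the interpreter of
    the virtual environment at `venv_dir` (hypothetical directory layout)."""
    python_bin = f"{venv_dir.rstrip('/')}/bin/python"
    parts = [python_bin, script, *args]
    # Quote every part so paths containing spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


command = venv_command(
    "/opt/venvs/scoring", "/opt/jobs/score_clients.py", "--batch", "daily"
)
print(command)

# In a DAG file this string would be handed to Airflow, for example:
#   BashOperator(task_id="score_clients", bash_command=command)
```

Calling the environment's own interpreter directly avoids having to `source` an activate script, so Airflow's dependencies and the script's dependencies never mix.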
Phasing out cron jobs
● Spin up Airflow alongside
existing Data DBs, servers and
cron jobs.
● Translate every cron job into a
DAG with one task that points
to the same Python script
(BashOperator).
● For each cron job (200 of them):
○ Turn off the cron job
○ Turn on its “Singleton” DAG
● When all cron jobs are off → kill
the old servers
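The translation step above can be sketched as a small parser that splits a crontab entry into the pieces a one-task DAG needs. This is only an illustration: `SingletonDagSpec`, `cron_to_dag_spec`, the `singleton_` naming scheme, and the example crontab line are all hypothetical.

```python
import re
from typing import NamedTuple


class SingletonDagSpec(NamedTuple):
    dag_id: str
    schedule: str      # the five cron fields, reused as the DAG schedule
    bash_command: str  # the command the cron job used to run


def cron_to_dag_spec(crontab_line: str) -> SingletonDagSpec:
    """Split a standard five-field crontab entry into schedule and command.
    Shortcuts such as @daily are not handled in this sketch."""
    fields = crontab_line.split(None, 5)
    if len(fields) < 6:
        raise ValueError(f"not a five-field crontab entry: {crontab_line!r}")
    schedule = " ".join(fields[:5])
    command = fields[5].strip()
    # Derive a readable DAG id from the script file name (naming is an assumption).
    match = re.search(r"([\w-]+)\.py\b", command)
    name = match.group(1) if match else "job"
    return SingletonDagSpec(f"singleton_{name}", schedule, command)


spec = cron_to_dag_spec(
    "*/10 * * * * /opt/venvs/risk/bin/python /opt/jobs/update_risk.py"
)
print(spec)
```

Each resulting spec maps onto one DAG with a single BashOperator task, so a cron job can be switched off and its DAG switched on without touching the underlying script.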
Part 2:
Hacking a
streaming
solution
User onboarding
● Airflow is built for batch
processing
● We needed to support streaming
user processing
● Airflow is not a good fit for
that!
● Nevertheless, due to time
constraints and familiarity, we
chose to start with it
THE Onboarding DAG (sort of)
Onboarding “streaming” design
User signup
A “new user” event is
sent. As the user goes
through the
application forms, the
relevant events are
sent
Sensor poll on queue
The onboarding DAG polls
for the events using
the SensorOperator.
Once a “new user”
event is received,
the user ID is saved
in XCom to share it
between the tasks
Logic executed
Related functionality
is executed as the
user progresses through
the application form
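The sensor step can be sketched with a plain-Python stand-in. This is not Airflow code: in Airflow a sensor subclasses `BaseSensorOperator` and implements `poke(context)`, and the user ID would be shared through XCom rather than a dict. The `NewUserSensor` class, the event shape, and the in-memory queue are assumptions made for illustration.

```python
import queue


class NewUserSensor:
    """Stand-in for an Airflow sensor: `poke` is called repeatedly and
    returns True once the awaited event has arrived."""

    def __init__(self, events: "queue.Queue"):
        self.events = events

    def poke(self, xcom: dict) -> bool:
        try:
            event = self.events.get_nowait()
        except queue.Empty:
            return False  # nothing yet; the sensor will be poked again
        if event.get("type") != "new_user":
            return False  # unrelated events are simply dropped in this sketch
        # Share the user ID with downstream tasks (XCom in real Airflow).
        xcom["user_id"] = event["user_id"]
        return True


events = queue.Queue()
xcom = {}
sensor = NewUserSensor(events)
first_poke = sensor.poke(xcom)   # queue is still empty
events.put({"type": "new_user", "user_id": 1234})
second_poke = sensor.poke(xcom)  # event found, user ID stored
print(first_poke, second_poke, xcom)
```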
Onboarding design
Hitting a performance
wall
The Airflow scheduler took up to
30 seconds to compute the
next task (i.e. step) to
run!
Hack #1 - standalone trigger
Problem
● The Airflow scheduler creates
all task objects when a DAG run starts
● The onboarding DAG has ~40
tasks, and the scheduler works
hard to figure out each task's
dependencies
● A new DAG run starts on every
interval, and its sensor
polls for a new user
● This creates a lot of “live”
pending DAG runs
Solution
● Have a triggering DAG that only
contains a sensor and a
triggering task
● It triggers the large
onboarding DAG
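A back-of-the-envelope model of why the split helps: every run that is only waiting for a user still carries all of its DAG's task instances. The ~40-task figure comes from the slide; the count of 20 waiting runs and the two-task trigger DAG size are illustrative assumptions.

```python
ONBOARDING_TASKS = 40  # from the slide: the onboarding DAG has ~40 tasks
TRIGGER_TASKS = 2      # assumed: one sensor plus one triggering task


def waiting_task_instances(pending_runs: int, tasks_per_dag: int) -> int:
    """Task instances that exist only because runs are waiting for a user."""
    return pending_runs * tasks_per_dag


pending = 20  # hypothetical number of runs waiting on the sensor
before = waiting_task_instances(pending, ONBOARDING_TASKS)
after = waiting_task_instances(pending, TRIGGER_TASKS)
print(before, after)
```

With the standalone trigger, the large DAG is instantiated only when a user actually arrives, so the scheduler evaluates the 40-task graph for real work rather than for idle runs.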
Hack #2 - Archive DB tables
Problem
● Big DB → slower queries →
slower scheduling & execution
● DB contains metadata for all
DAG / task runs
● High DAG frequency + many DAGs
+ many tasks == many rows
● Under our setup, within the first
two months, the DB was over 15
GB in size
Solution
● Archive DB data to keep 1 week
of history
● Gotcha! Make sure to keep
each DAG's last run; otherwise
Airflow will think the DAG
never ran and will rerun it.
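A minimal sketch of the archiving rule and its gotcha, using an in-memory SQLite table. The two-column `dag_run` table is a simplified stand-in for Airflow's metadata schema, and `archive_old_runs` is a hypothetical helper; a real job would also cover the other metadata tables and could copy rows to an archive table before deleting them.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_old_runs(conn: sqlite3.Connection, now: datetime, keep_days: int = 7) -> int:
    """Delete dag_run rows older than the retention window, but always
    keep the most recent run of every DAG. Returns the rows deleted."""
    cutoff = (now - timedelta(days=keep_days)).isoformat()
    cur = conn.execute(
        """
        DELETE FROM dag_run
        WHERE execution_date < ?
          AND execution_date < (
              SELECT MAX(latest.execution_date)
              FROM dag_run AS latest
              WHERE latest.dag_id = dag_run.dag_id
          )
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT)")
now = datetime(2020, 7, 1)
rows = [
    ("onboarding", datetime(2020, 6, 30)),     # recent: kept
    ("onboarding", datetime(2020, 6, 1)),      # old: deleted
    ("monthly_report", datetime(2020, 6, 1)),  # old, but the DAG's last run: kept
    ("monthly_report", datetime(2020, 5, 1)),  # old: deleted
]
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?)", [(d, t.isoformat()) for d, t in rows]
)
deleted = archive_old_runs(conn, now)
remaining = conn.execute(
    "SELECT dag_id, execution_date FROM dag_run ORDER BY dag_id, execution_date"
).fetchall()
print(deleted, remaining)
```

The second condition in the `WHERE` clause is what protects the last run of an infrequent DAG from being purged along with the rest of its history.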
Hack #3 - Patch scheduler DAG’s state queries
Problem
● To determine whether a task
has met its dependencies, the
scheduler queries the DB for each
task in the DAG
● The onboarding DAG has 40 tasks
and can have 20 parallel runs.
● This means ~800 (!) DB queries
every pass just for this one
DAG.
Solution
● Patch Airflow to query the DAG
state by sending one query per
DAG instead of a query per DAG
task.
● PR made to Airflow team:
AIRFLOW-3607, to be released in
Airflow 2.0
● Results:
○ 90th percentile delay
decreased by 30%
○ DB CPU usage decreased by 20%
○ Average delay decreased by 18%
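The idea behind the patch can be illustrated with SQLite: fetch every task state for a run in one query and check dependencies in memory, instead of issuing one query per task. This is not the code from AIRFLOW-3607; the three-column `task_instance` table and both helper functions are simplified assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (run_id TEXT, task_id TEXT, state TEXT)")
task_ids = [f"task_{i}" for i in range(40)]
conn.executemany(
    "INSERT INTO task_instance VALUES ('run_1', ?, 'success')",
    [(task_id,) for task_id in task_ids],
)


def states_per_task(run_id):
    """One query per task: 40 round trips for a 40-task DAG run."""
    states, queries = {}, 0
    for task_id in task_ids:
        row = conn.execute(
            "SELECT state FROM task_instance WHERE run_id = ? AND task_id = ?",
            (run_id, task_id),
        ).fetchone()
        states[task_id] = row[0]
        queries += 1
    return states, queries


def states_per_run(run_id):
    """One query for the whole run; dependencies are then checked in memory."""
    rows = conn.execute(
        "SELECT task_id, state FROM task_instance WHERE run_id = ?", (run_id,)
    ).fetchall()
    return dict(rows), 1


slow_states, slow_queries = states_per_task("run_1")
fast_states, fast_queries = states_per_run("run_1")
print(slow_queries, fast_queries, slow_states == fast_states)
```

Both functions return the same state map; only the number of round trips to the DB differs, which is where the reduction in DB CPU and scheduling delay would come from.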
Hack #4 - Create a dedicated “fast” Airflow
Problem
● Scheduler has to continually
parse all DAGs
● Not all DAGs are equally
latency-sensitive, but all are
given the same scheduling
resources
Solution
● Spin up a 2nd Airflow just for
time-sensitive processes!
● Dedicated instance → fewer DAGs
/ tasks → faster scheduling
● Approx. 60% reduction in average
time spent on transitions
between tasks.
Final results
● Time between dependent
tasks is consistently
under 3 seconds
● Overall runtime is under
3 minutes for 95% of the
cases
Part 3:
Monitoring
Plugin to match users with runs
● Locates the Airflow DAG run for
a given user ID
● Helps to track down issues
found with users
Track scheduler latencies
● Query Airflow DB from Grafana
● Query the delta between the time
a task finishes and the
time the next one starts
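The delta described above can be sketched in a few lines. The sketch assumes a linear chain of tasks within one DAG run and works on `(task_id, start, end)` tuples; the task names and timestamps are made up, and in practice the same calculation would be expressed as a query against the metadata DB.

```python
from datetime import datetime
from statistics import mean


def transition_gaps(task_rows):
    """Given (task_id, start, end) rows for one DAG run, return the seconds
    between each task finishing and the next task (by start time) starting."""
    ordered = sorted(task_rows, key=lambda row: row[1])
    return [
        (nxt[1] - prev[2]).total_seconds()
        for prev, nxt in zip(ordered, ordered[1:])
    ]


run = [
    ("fetch_user", datetime(2020, 7, 1, 12, 0, 0), datetime(2020, 7, 1, 12, 0, 5)),
    ("score_user", datetime(2020, 7, 1, 12, 0, 7), datetime(2020, 7, 1, 12, 0, 20)),
    ("decide", datetime(2020, 7, 1, 12, 0, 23), datetime(2020, 7, 1, 12, 0, 30)),
]
gaps = transition_gaps(run)
print(gaps, mean(gaps))
```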
Scheduler outage alerts
● Airflow’s most critical component
is the scheduler - nothing
happens without it
● The scheduler sends a heartbeat
to the DB
● Grafana polls that table
and sends us an alert if the
scheduler is down
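The alert condition boils down to a staleness check on the last heartbeat. A minimal sketch follows; `scheduler_is_down` and the two-minute threshold are hypothetical, and where the heartbeat timestamp lives (commonly the `latest_heartbeat` column of the `job` table) should be treated as an assumption to verify against your Airflow version.

```python
from datetime import datetime, timedelta
from typing import Optional


def scheduler_is_down(
    latest_heartbeat: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=2),
) -> bool:
    """True when no heartbeat exists or the last one is older than `max_silence`."""
    if latest_heartbeat is None:
        return True
    return now - latest_heartbeat > max_silence


now = datetime(2020, 7, 1, 12, 0, 0)
healthy = scheduler_is_down(datetime(2020, 7, 1, 11, 59, 30), now)
stale = scheduler_is_down(datetime(2020, 7, 1, 11, 50, 0), now)
print(healthy, stale)
```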
Track flow latencies
● The Airflow UI is great! But it
doesn’t let you view
aggregated data
● Querying the DB gives a
great aggregated view
that shows the state of the
system at a glance
● Grafana is great!
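One example of such an aggregated view is the share of runs that finish within a target time. The sketch below uses made-up runtimes and a hypothetical `share_within` helper; a dashboard would compute the same figure from run start and end times in the DB.

```python
def share_within(durations_sec, limit_sec):
    """Fraction of runs whose total runtime is at most `limit_sec`."""
    if not durations_sec:
        raise ValueError("no runs to aggregate")
    return sum(1 for d in durations_sec if d <= limit_sec) / len(durations_sec)


# Hypothetical end-to-end runtimes (seconds) for ten onboarding runs.
durations = [95, 110, 120, 130, 140, 150, 160, 170, 175, 240]
share = share_within(durations, limit_sec=180)
print(f"{share:.0%} of runs finished within 3 minutes")
```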
Questions?
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
