Airflow
Insane power in a tiny box
A BRIEF HISTORY OF DATA
PIPELINES
For real, this is how we used to do it…
The dev’s answer to
EVERYTHING
Cron / crontab
This works great for some use cases,
but falls short in many other ways.
Works great, provided the computer is on;
it will run your job at the time you set,
whenever it can.
But there's no recovery, logs are
self-managed, it's hard to tell when a
job actually ran, and it can only execute
on one computer.
It keeps tasks alive.
Supervisor / Supervisord
A fantastic utility that works as expected,
with an optional embedded UI and CLI util.
Keeps everything up
and lets you see what's
going on. Even rotates
logs and allows groups.
But it still executes on the one
computer, and it isn't more
than it advertises to be.
Limited scope.
Someone said… we can do better.
Airflow is a “workflow
management system”
created by airbnb.com
“Today, we are proud to announce that
we are open sourcing and sharing
Airflow, our workflow management
platform.”
June 2, 2015
https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8
And it’s all written in Python!
What IS Airflow?
BUT REALLY…
Dependency Control
Task Management
Task Recovery
Charting
Logging
Alerting
History
Folder Watching
Trending
Dynamic Tasks
ANYTHING your pipeline may need…
Airflow is NOT…
…perfect
https://airflow.apache.org/
So contribute, and help
it get better!
The Airflow Architecture
Webserver / UI
Scheduler
Worker
WITH VERY LITTLE
WORK…
Airflow can be run locally,
or in much more complex configurations.
Master / Slave / UI
Configuration
With logs being fed to GCS.
How we provision
Airflow.
We place it all on a
single Google Compute
Engine VM.
No bull!
Excuse me?
Machine type: n1-standard-2
2 vCPUs, 7.5 GB memory
HD: 30 GB
Standard Persistent Disk (non-SSD)
LET'S TALK ABOUT AIRFLOW DAGs
A few key Airflow
concepts.
01 DAGs
A Directed Acyclic Graph is a collection of all the tasks you
want to run, organized in a way that reflects their
relationships and dependencies. Written in Python.

02 Operators
An operator describes a single task in a workflow (DAG).
There are many types of operators:
BashOperator, PythonOperator, EmailOperator, HTTPOperator,
MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator,
OracleOperator, JdbcOperator, Sensor, DockerOperator

03 Tasks
Once an operator is instantiated, it's referred to as a task.
import time
from datetime import datetime
from pprint import pprint

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='example_python_operator',
    schedule_interval=None,
    start_date=datetime(2016, 1, 1),  # a start_date is required; any past date works here
)

def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)

def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

# Generate 10 sleeping tasks, sleeping from 0 to 0.9 seconds respectively
for i in range(10):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        dag=dag
    )
    task.set_upstream(run_this)
Stop doing things the way you
always have; think dynamically.
You can automate your tasks by
reading source code or listing the
files in a directory.
You don't have to worry about
execution order; you only need to
present Airflow with the relationships.
Think in terms of how you can
remove human error. Let
Airflow work for you.
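To make that concrete, here is a minimal sketch (not from the deck) of a DAG that builds one task per file found in a directory; the JOBS_DIR path, the process_file callable, and the task naming are assumptions made purely for illustration.

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical location of the job files; adjust to your environment.
JOBS_DIR = '/opt/pipeline/jobs'

dag = DAG(
    dag_id='dynamic_file_tasks',
    schedule_interval='@daily',
    start_date=datetime(2016, 1, 1),
)

def process_file(path, **kwargs):
    """Placeholder: do whatever your pipeline does with one file."""
    print('processing %s' % path)

# One task per file -- the DAG grows and shrinks with the directory,
# so nobody has to hand-edit the pipeline when a job is added.
for filename in sorted(os.listdir(JOBS_DIR)):
    PythonOperator(
        task_id='process_' + filename.replace('.', '_'),
        python_callable=process_file,
        op_kwargs={'path': os.path.join(JOBS_DIR, filename)},
        provide_context=True,
        dag=dag,
    )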
Airflow really shines with
dynamic tasks.
Dictionary (array) of Dependencies
What if you made a script that parsed all your
jobs and detected all their dependencies
automatically?
Now what if you took that dictionary and fed it
into Airflow?
How would that simplify your pipeline?
dependencies = {
    'topic_billing_frequency': [
        'dim_billing_frequency',
        'dim_account'
    ],
    'topic_payment_method': [
        'dim_credit_card_type',
        'dim_payment_accounts'
    ]
}
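As a rough sketch of how such a dictionary might be built automatically (this is not the deck's actual script), you could scan each job's SQL for the tables it reads from; the directory layout, file naming, and the naive FROM/JOIN regex below are assumptions.

import os
import re

# Hypothetical location of the job SQL files.
JOBS_DIR = '/opt/pipeline/sql'

def detect_dependencies(jobs_dir=JOBS_DIR):
    """Build {job_name: [upstream_tables]} by scanning each job's SQL
    for the tables it reads from (very naive FROM/JOIN parsing)."""
    dependencies = {}
    for filename in os.listdir(jobs_dir):
        if not filename.endswith('.sql'):
            continue
        job_name = filename[:-len('.sql')]
        with open(os.path.join(jobs_dir, filename)) as handle:
            sql = handle.read()
        # Anything referenced after FROM or JOIN is treated as an upstream table.
        upstream = re.findall(r'(?:FROM|JOIN)\s+([a-zA-Z_][\w.]*)', sql, re.IGNORECASE)
        dependencies[job_name] = sorted(set(upstream))
    return dependencies

The code on the next slide reads dependencies.all_dependencies, so in practice a script like this would presumably live in a dependencies module that exposes the detected dict under that name.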
Let’s take a look…
Let me show you…
Airflow really shines with
dynamic tasks.
The code to run it all
01 Top Level Dependencies
Top-level dependency tasks are created first. Each of these tasks
runs after the cluster is created and before it is deleted.

02 Child Dependencies
Each child dependency is then iterated over, and a task is
created for it. Each child task is given the delete task as a
downstream, so the delete-cluster task will never run until all
tasks are complete.

03 Connect children to parents
Finally, each child task is set as an upstream of its parent
task, so a parent never runs until its children have completed.
all_tasks = {}

# Create all parent tasks (top level)
for key, value in dependencies.all_dependencies.iteritems():
    if key not in all_tasks:
        all_tasks[key] = PythonOperator(
            task_id=key,
            python_callable=process,
            op_kwargs={},
            provide_context=True,
            dag=dag,
            retries=30,
            retry_delay=timedelta(minutes=10),
            on_retry_callback=airflow_retry_function,
            on_failure_callback=airflow_error_function,
            on_success_callback=airflow_success_function,
        )
        # Every top-level task runs after the cluster is created
        # and before it is deleted.
        all_tasks[key].set_upstream(task_create_cluster)
        all_tasks[key].set_downstream(task_delete_cluster)

# Create all nested dependency tasks
for key, value in dependencies.all_dependencies.iteritems():
    for item in value:
        if item not in all_tasks:
            all_tasks[item] = PythonOperator(
                task_id=item,
                python_callable=process,
                op_kwargs={},
                provide_context=True,
                dag=dag,
                retries=30,
                retry_delay=timedelta(minutes=10),
                on_retry_callback=airflow_retry_function,
                on_failure_callback=airflow_error_function,
                on_success_callback=airflow_success_function,
            )
            all_tasks[item].set_downstream(task_delete_cluster)
        # The child runs before its parent: the parent task won't start
        # until every one of its child dependencies has completed.
        all_tasks[item].set_downstream(all_tasks[key])
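The snippet above references several names it never defines: dag, process, task_create_cluster, task_delete_cluster, and the callback functions. Below is a minimal sketch of what those could look like; the placeholder bodies and the create/delete-cluster idea are assumptions for illustration, not the deck's actual implementation.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='dynamic_dependency_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2016, 1, 1),
)

def create_cluster(**kwargs):
    """Placeholder: spin up the processing cluster."""
    pass

def delete_cluster(**kwargs):
    """Placeholder: tear the cluster back down."""
    pass

def process(**kwargs):
    """Placeholder: run one job; the task_id says which one."""
    pass

def airflow_retry_function(context):
    """Placeholder callback: e.g. post a warning somewhere on retry."""
    pass

def airflow_error_function(context):
    """Placeholder callback for failures."""
    pass

def airflow_success_function(context):
    """Placeholder callback for successes."""
    pass

task_create_cluster = PythonOperator(
    task_id='create_cluster',
    python_callable=create_cluster,
    provide_context=True,
    dag=dag,
)

task_delete_cluster = PythonOperator(
    task_id='delete_cluster',
    python_callable=delete_cluster,
    provide_context=True,
    dag=dag,
)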
What does that code do? This is real code being used today.
Dovy Paukstys
Consultant at Caserta
#geek #bigdata #redux
How can I help?
http://dovy.io
http://twitter.com/simplerain
dovy.paukstys@caserta.com
http://reduxframework.com
https://github.com/dovy/
http://linkedin.com/in/dovyp