Intro to Airflow
or
All the stuff you are now able to do, but you did not know about
Airflow—what is it?
It’s a workflow management platform
[Diagram: five tasks (a, b, c, d, e) connected by dependency arrows]
Tasks and their dependencies form a Directed Acyclic Graph (DAG)
Meant for batch processing
Does not support data streaming
Tasks do not exchange data directly; intermediate results can be passed as files (e.g. .pickle) or through a DB
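Because tasks do not pass data to each other automatically, a common pattern is for one task to write an intermediate result to disk and for the next task to read it back. A minimal sketch, assuming two PythonOperator tasks and a shared file path of our own choosing (/tmp/coordinates.pickle is hypothetical; import path is the Airflow 1.x one):

import pickle
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

# A tiny DAG whose two tasks hand data over through a pickle file on disk.
dag = DAG('pickle_handoff_example', start_date=datetime(2017, 1, 1))

def find_coordinates():
    coordinates = {'lat': 49.23, 'lon': -123.00}      # pretend these were computed
    with open('/tmp/coordinates.pickle', 'wb') as f:  # hypothetical shared location
        pickle.dump(coordinates, f)

def launch_rocket():
    with open('/tmp/coordinates.pickle', 'rb') as f:  # read what the upstream task wrote
        coordinates = pickle.load(f)
    print('Launching towards', coordinates)

find_task = PythonOperator(task_id='find_coordinates',
                           python_callable=find_coordinates, dag=dag)
launch_task = PythonOperator(task_id='launch_rocket',
                             python_callable=launch_rocket, dag=dag)
launch_task.set_upstream(find_task)  # launch only after the coordinates file exists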
Airflow—what is it?
[Diagram: find_coordinates → launch_rocket]

args = {
    'owner': 'username',
    'depends_on_past': True,
    'start_date': datetime(2017, 1, 1),
    'email_on_failure': True,
    'retries': 1,
    'retry_delay': timedelta(hours=1),
}

dag = DAG('rocket_launcher_dag',
          default_args=args,
          schedule_interval=timedelta(days=1))

find_coordinates = BashOperator(
    task_id='find_coordinates',
    bash_command='node find_coordinates.js --place="4501 Kingsway"',
    dag=dag)

launch_rocket = BashOperator(
    task_id='launch_rocket',
    bash_command='java launchRocket --now',
    dag=dag)

launch_rocket.set_upstream(find_coordinates)

DAGs are defined in plain Python
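Because a DAG definition is ordinary Python, dependencies can also be generated programmatically, for instance in a loop. A minimal sketch, reusing the dag and the BashOperator import from the slide above (the process_chunk_* task names are hypothetical):

# Chain a series of hypothetical tasks in a loop instead of
# writing set_upstream once per pair.
previous = None
for i in range(5):
    task = BashOperator(
        task_id='process_chunk_%d' % i,
        bash_command='echo processing chunk %d' % i,
        dag=dag)
    if previous is not None:
        task.set_upstream(previous)  # each chunk waits for the previous one
    previous = task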
Airflow—Internals
Python 3 Runtime
Airflow: Web Server + Scheduler + Executor
SQL Engine: Airflow Metadata DB
DAGs are read from the dags folder
Per task, the metadata DB keeps:
• Status (success / running / failed)
• Runtime
• etc.
Executor options:
• Sequential Executor: single node, one task at a time
• Local Executor: single node, multiprocessing
• Celery Executor: distributes and parallelizes work across multiple nodes
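The executor is selected in Airflow's airflow.cfg. A minimal sketch of the relevant settings, as a rough guide (key names as in Airflow 1.x; the connection string and broker URL are placeholders):

[core]
# SequentialExecutor, LocalExecutor or CeleryExecutor
executor = LocalExecutor
# Metadata DB connection (placeholder credentials and host)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

[celery]
# Only needed for the Celery executor: message broker URL (placeholder)
broker_url = redis://localhost:6379/0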
Airflow—what is it?
Operators
• BashOperator
• HttpSensor
• SSHExecuteOperator
• …
Documentation here:
https://airflow.incubator.apache.org/code.html
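For instance, a sensor can hold back a downstream task until an HTTP endpoint responds. A minimal sketch, reusing dag and launch_rocket from the earlier slide and assuming an http_default connection configured in the Airflow UI (import path is the Airflow 1.x one; the endpoint name is hypothetical):

from airflow.operators.sensors import HttpSensor  # Airflow 1.x import path

wait_for_api = HttpSensor(
    task_id='wait_for_launch_api',
    http_conn_id='http_default',  # connection defined under Admin > Connections (assumed)
    endpoint='status',            # hypothetical endpoint on that connection
    poke_interval=60,             # re-check every 60 seconds
    timeout=600,                  # give up after 10 minutes
    dag=dag)

launch_rocket.set_upstream(wait_for_api)  # only launch once the endpoint is reachable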
Airflow—Demo
What we have so far….
Airflow—Benefits
• Scalable workflows (with Celery)
• Easy parallelizing of workflows
• With Celery: Dedicated low-priority n-thread queue (sketch below)
• Beautiful monitoring from the UI
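The dedicated low-priority queue works by routing a task to a named Celery queue and starting a worker with capped concurrency for it. A minimal sketch, reusing dag and the BashOperator import from earlier (queue name, task, and command are our own choices; CLI flags as in Airflow 1.x):

# Route a heavy, low-priority task to its own Celery queue
heavy_task = BashOperator(
    task_id='nightly_report',              # hypothetical task
    bash_command='python build_report.py',
    queue='low_priority',
    dag=dag)

# Then start a dedicated worker for that queue, capped at 2 concurrent tasks:
#   airflow worker -q low_priority -c 2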
Airflow—Benefits
• Some automation
• Scheduling and triggering of DAGs from the UI
• Auto Retrying
• Auto Notifying (emails, callbacks)
• Some interesting Operators
• Place for contribution
• Automatic way of splitting a workload into n tasks
(Dask style)
Airflow—Who’s using it
Airflow—The case against cron
But I could “just” use Cron
[Image: Cron vs. Airflow]
Supervisor—what is it?
It’s a process daemonization and monitoring tool
Long story short:
We use it to run and monitor Celery, Flower, and Airflow
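A minimal sketch of what the supervisord config for that setup could look like (program names, commands, and log paths are assumptions to adapt to your install):

[program:airflow-webserver]
command=airflow webserver                           ; command/paths are assumed
autostart=true                                      ; start when supervisord starts
autorestart=true                                    ; restart the process if it dies
stderr_logfile=/var/log/airflow-webserver.err.log   ; hypothetical log location

[program:airflow-scheduler]
command=airflow scheduler
autostart=true
autorestart=true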
Supervisor—Demo
What we have so far….
Supervisor—Benefits
• Centralized control
  • The server was restarted? What was running in the background?! Panic!!
    (Chill… start supervisor and everything that needs to be started will be started)
• Web UI (for monitoring and control)
  • Status and logs are visible directly in the UI
  • Can stop/start/restart processes
• Basic automation
  • If an app fails, supervisor can automatically restart it, up to n retries
Supervisor—The case against nohup
But I could “just” use nohup
Editor's Notes

  • #3: Airflow is a … In Airflow, workflows consist of tasks.
  • #4: In Airflow, workflows consist of tasks.
  • #5: And tasks have dependencies. In this manner, tasks can be seen as the vertices of a graph and dependencies as the edges.
  • #6: The only condition is that there cannot be cyclic dependencies. In other words, the dependency graph is a DAG.
  • #7: One thing to keep in mind is that Airflow…. Meaning, tasks do not exchange data. So, for instance, you cannot create a data pipeline where data is passed from task to task.
  • #8: And this is because Airflow is meant for batch processing.
  • #9: If you need to… because you want to create a data pipeline that does some sort of batch processing… then you can always pass data in files. The only drawback with Airflow is that this step is manual… there’s no way to do it from Airflow automatically… you have to create and read your own pickle files… so that’s perhaps a place where we could improve.
  • #10: Or even in the DB… this would be handy if you wanted to persist intermediate results.
  • #11: In some contexts, a DAG is a pipeline. So you can think of DAGs as either workflows or pipelines. Spark is for data streaming?? (Ask Leandro)
  • #12: So this is a sample workflow… on the left you have the task definition for….
  • #13: Then to specify a dependency between two tasks you can say…
  • #14: You can also configure the DAG.
  • #15: And if you haven’t noticed… the language for specifying a DAG is Python… The ability to define a DAG programmatically can be useful when workflows start to grow in complexity. You can, for instance, define dependencies in a loop.
  • #16: As I already mentioned, Airflow is a batch processing platform. To support batch processing, the first thing Airflow has to have is a SCHEDULER. The scheduler is in charge of scheduling DAGs for execution.
  • #17: Airflow also keeps track of task states via its metadata DB, whether…
  • #18: This metadata can then be accessed by the web server to display information about a task in the UI.
  • #19: As for the executor… Airflow supports three options. The sequential executor runs all DAGs in a single process on a single machine.
  • #20: As for the executor… Airflow supports three options. I’m showing here the local executor, which lets you do multiprocessing on a single machine.
  • #21: And there’s the option of a Celery executor, which enables Airflow to distribute and parallelize DAGs.
  • #22: Airflow operators provide you with options so that you can create DAGs…. I have never used any of them (so I can’t tell you about them)… but I can show you a couple of them…. To give you an idea….
  • #23: Consider the previous example… the rocket launcher one. In Java, I would have to create a POST request that looks as follows: SHOW LAUNCH ROCKET TASK IN CHROME ARC
  • #24: It’s a… it sits right between Celery applications and everybody else.
  • #25: It’s a… it sits right between Celery applications and everybody else.
  • #26: And others….
  • #27: Now…. If after this explanation you’re still thinking cron is a good idea to run your batch processes, that means you haven’t listened at all. So this is my closing argument…
  • #28: Cron is like a 1970s Volkswagen. It was good, everybody used it… but we’ve moved on.
  • #29: Supervisor is… Meaning you can create daemon processes and monitor them.
  • #30: Long story short…
  • #31: Consider the previous example… the rocket launcher one. In Java, I would have to create a POST request that looks as follows: SHOW LAUNCH ROCKET TASK IN CHROME ARC
  • #32: Why supervisor… what do we gain?
  • #33: Why supervisor… what do we gain?
  • #34: Why supervisor… what do we gain?
  • #35: Now… If you think… well, I could just use…
  • #36: Please don’t.