Data Model
Data Pipelines with Apache Airflow
John Zhao
Who Am I?
 Sr. Site Reliability Engineer (SRE), SUNVALLEY COMPUTER INC
 Committer – Data & Analytics Engineering
 Have built data automation pipelines
What Is Apache Airflow?
Apache Airflow is an open-source platform for authoring, scheduling and monitoring data and computing workflows. First
developed by Airbnb, it is now under the Apache Software Foundation. Airflow uses Python to create workflows that can
be easily scheduled and monitored. Airflow can run anything—it is completely agnostic to what you are running.
Benefits of Apache Airflow include:
• Ease of use—you only need a little Python knowledge to get started.
• Open-source community—Airflow is free and has a large community of active users.
• Integrations—ready-to-use operators allow you to integrate Airflow with cloud platforms (Google, AWS, Azure,
etc.)
• Coding with standard Python—you can create flexible workflows using Python with no knowledge of additional
technologies or frameworks.
• Graphical UI—monitor and manage workflows, check the status of ongoing and completed tasks.
Airflow is part of the broader practice of machine learning operations (MLOps).
In this presentation, you will learn about:
• Airflow Use Cases
• Workloads
• Airflow Architecture
• Control Flow
• Airflow Components
• Airflow Best Practices
• Live Demo – Local runner
Airflow Use Cases
Airflow is best at handling workflows that run at a specified time or at a regular interval. You can also trigger a
pipeline manually or with an external trigger (e.g., via the REST API).
You can use Apache Airflow to schedule the following:
• ETL pipelines that extract data from multiple sources, and run Spark jobs or other data transformations
• Machine learning model training
• Automated generation of reports
• Backups and other DevOps tasks
Airflow is commonly used to automate machine learning tasks. To understand machine learning automation in more
depth, read our guides to:
• Machine learning workflow
• Machine learning automation
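The external trigger mentioned above can be exercised through Airflow 2's stable REST API. The sketch below starts a DAG run over HTTP; the host, credentials, dag_id, and conf payload are assumptions for illustration, and it presumes the API's basic-auth backend is enabled.

# Hypothetical sketch: trigger a DAG run via Airflow 2's stable REST API.
# Assumes the webserver is at localhost:8080 with basic auth enabled and
# that a DAG with dag_id "daily_etl" exists; all values here are made up.
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/daily_etl/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {"source_path": "s3://example-bucket/raw/"}},
)
response.raise_for_status()
print(response.json()["dag_run_id"])  # id of the newly created DAG run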
Workloads
The DAG runs through a series of Tasks, which may be subclasses of Airflow's BaseOperator, including:
• Operators—predefined tasks that can be strung together quickly
• Sensors—a type of Operator that waits for external events to occur
• TaskFlow—a custom Python function packaged as a task, decorated with @task
Operators are the building blocks of Apache Airflow, as they define how the Tasks run and what they do. The terms Task
and Operator are sometimes used interchangeably, but they should be considered separate concepts, with Operators and
Sensors serving as templates for creating Tasks.
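As a minimal sketch of those building blocks, the fragment below combines a ready-made Operator, a Sensor, and a TaskFlow-decorated function in one DAG. It assumes a recent Airflow 2 release (2.4 or later); the dag_id, file path, and schedule are illustrative, not taken from this deck.

# Sketch: an Operator, a Sensor, and a TaskFlow task in a single DAG.
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="task_types_demo", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False) as dag:

    wait_for_file = FileSensor(          # Sensor: waits for an external event
        task_id="wait_for_file",
        filepath="/tmp/input.csv",       # illustrative path; uses the default fs_default connection
        poke_interval=60,
    )

    extract = BashOperator(              # Operator: a predefined, ready-to-use task
        task_id="extract",
        bash_command="echo 'extracting...'",
    )

    @task                                # TaskFlow: a plain Python function as a task
    def transform():
        return {"rows": 42}

    wait_for_file >> extract >> transform()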
Airflow Architecture
The Airflow platform lets you build and run workflows, which are represented as Directed Acyclic Graphs
(DAGs). A sample DAG is shown in the diagram below.
A DAG contains Tasks (action items) and specifies the dependencies between them and the order in which they
are executed. A Scheduler handles scheduled workflows and submits Tasks to the Executor, which runs them.
The Executor pushes tasks to workers.
Other typical components of an Airflow architecture include a database to store state metadata, a web server
used to inspect and debug Tasks and DAGs, and a folder containing the DAG files.
Control Flow
DAGs can be run multiple times, and multiple DAG runs can happen in parallel. A DAG can take multiple parameters that
indicate how it should operate, but every DAG run carries a mandatory execution_date (the logical date the run covers).
You can indicate dependencies for Tasks in the DAG using the characters >> and << :
first_task >> [second_task, third_task]
third_task << fourth_task
By default, tasks have to wait for all upstream tasks to succeed before they can run, but you can customize how tasks are
executed with features such as LatestOnly, Branching, and Trigger Rules.
To manage complex DAGs, you can use SubDAGs to embed "reusable" DAGs into others. You can also visually group tasks in
the UI using TaskGroups.
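The sketch below exercises three of those features, assuming Airflow 2.4 or later: a @task.branch function picks one path, a TaskGroup groups two reporting tasks in the UI, and a trigger rule lets the join task run even though one branch was skipped. Task ids and the branch condition are made up for illustration.

# Sketch: Branching, a Trigger Rule, and a TaskGroup (assumes Airflow 2.4+).
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup
from airflow.utils.trigger_rule import TriggerRule

with DAG("control_flow_demo", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False) as dag:

    @task.branch                               # Branching: choose which path runs
    def choose_path():
        return "full_load" if datetime.now().weekday() == 0 else "incremental_load"

    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")

    # Trigger Rule: run when no upstream task failed and at least one succeeded,
    # so the skipped branch does not block the join
    join = EmptyOperator(task_id="join",
                         trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS)

    with TaskGroup("reporting") as reporting:  # TaskGroup: visual grouping in the UI
        build = EmptyOperator(task_id="build_report")
        send = EmptyOperator(task_id="send_report")
        build >> send

    choose_path() >> [full_load, incremental_load] >> join >> reporting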
Airflow Components
In addition to DAGs, Operators, and Tasks, Airflow offers the following components:
• User interface—lets you view DAGs, Tasks and logs, trigger runs and debug DAGs. This is the easiest way to
keep track of your overall Airflow installation and dive into specific DAGs to check the status of tasks.
• Hooks—Airflow uses Hooks to interface with third-party systems, enabling connections to external APIs and
databases (e.g. Hive, S3, GCS, MySQL, Postgres). Hooks should not contain sensitive information such as
authentication credentials (see the sketch after this list).
• Providers—packages containing the core Operators and Hooks for a particular service. They are maintained by
the community and can be directly installed on an Airflow environment.
• Plugins—a variety of Hooks and Operators to help perform certain tasks, such as sending data from Salesforce
to Amazon Redshift.
• Connections—these contain the information that enables a connection to an external system, including
authentication credentials and API tokens. You can manage connections directly from the UI, and the sensitive
data is encrypted and stored in the metadata database (e.g. PostgreSQL or MySQL).
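To make the Hook and Connection concepts concrete, the sketch below reads from Postgres through a Hook. It assumes the apache-airflow-providers-postgres provider package is installed and that a Connection with the made-up id my_postgres was created beforehand in the UI or via the CLI; the credentials live in that Connection, not in the DAG file.

# Sketch: a task that uses PostgresHook together with a stored Connection.
# "my_postgres" is a hypothetical conn_id created in the UI or via the CLI.
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@task
def count_orders():
    hook = PostgresHook(postgres_conn_id="my_postgres")         # resolves the stored Connection
    records = hook.get_records("SELECT COUNT(*) FROM orders;")  # credentials never appear in this file
    return records[0][0]

Connections can also be supplied outside the UI, for example through environment variables such as AIRFLOW_CONN_MY_POSTGRES or a secrets backend.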
Airflow Best Practices
Here are a few best practices that will help you make more effective use of Airflow:
• Keep Your Workflow Files Up to Date
• Define the Clear Purpose of your DAG
• Use Variables for More Flexibility (see the sketch after this list)
• Set Priorities
• Define Service Level Agreements (SLAs)
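As an illustration of the Variables and SLA items, the sketch below reads a Variable at run time (with a default) and attaches an SLA to a task. It assumes a recent Airflow 2 release; the variable key, its default value, and the 30-minute window are assumptions for the example.

# Sketch: an Airflow Variable read at run time plus a per-task SLA.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable

with DAG("report_dag", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False) as dag:

    @task(sla=timedelta(minutes=30))    # SLA miss is recorded if the task has not
    def generate_report():              # finished 30 minutes after the scheduled time
        env = Variable.get("target_env", default_var="staging")  # read at run time, not at parse time
        print(f"building report for {env}")

    generate_report()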
Live Demo – Local runner…
git clone https://github.com/johnjzhao/docker-airflow.git
https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
Thank you!