Apache Airflow (incubating) NL HUG Meetup 2016-07-19
https://wepayinc.app.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k
Airflow @ING
ING
Multinational banking and financial services corporation headquartered in Amsterdam. Its primary businesses are retail banking, direct banking, wholesale banking, investment banking, asset management, and insurance services.
Why Apache Airflow (incubating)?
• Cron Replacement
• Fault tolerant
• No XML (looking at you Oozie!)
• Testable
• Python code
• Extendable
• Now Apache (incubating)
• Scale Out
• Complex Dependency Rules
• Pools
• CLI & Web UI
Growing community
Airflow Operational Design
• Airflow Webserver
• Database
• Airflow Scheduler
• Airflow Executor (local/celery/mesos worker)
• Airflow Tasks
• Auth Backend
(Diagram: the arrows between components are labeled "Talks to".)
Choose an executor that fits your environment

               SequentialExecutor   LocalExecutor                         CeleryExecutor
Use case       Mainly testing       Production (~50% of installed base)   Production (~50% of installed base)
Scalability    n/a                  Vertical                              Horizontal and vertical
Complexity     Low                  Medium                                Medium/High
DAG            Local                Local                                 Needs sync / pickle

Configuration

SequentialExecutor
    [core]
    executor = SequentialExecutor

LocalExecutor
    [core]
    executor = LocalExecutor
    parallelism = 32

CeleryExecutor
    [core]
    executor = CeleryExecutor
    [celery]
    celeryd_concurrency = 32
    broker_url = rabbitmq
    celery_result_backend
    default_queue =

Remark: Don’t use num_runs
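The LocalExecutor settings above can be checked quickly with Python's standard configparser. This is only a sketch: it reads an in-memory fragment, not a real airflow.cfg, and it does not use Airflow's own configuration loader. The variable name CFG_TEXT is made up for the example.

```python
import configparser

# In-memory copy of the LocalExecutor fragment from the slide.
# (Assumption: a real installation keeps this in airflow.cfg.)
CFG_TEXT = """
[core]
executor = LocalExecutor
parallelism = 32
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# Read the two settings back with the types we expect.
executor = parser.get("core", "executor")
parallelism = parser.getint("core", "parallelism")
print(executor, parallelism)
```

Reading the values back like this is a cheap way to catch a misspelled section or a non-numeric parallelism before starting the scheduler.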
UTC everywhere
"Engineers here respond in UTC if you ask them what time it is" (Max)
• Airflow assumes every server / worker runs in UTC
• Airflow does not manage time zones (correctly) (to be fixed)
• UTC has no Daylight Saving Time
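A small sketch of why running everything in UTC avoids surprises: the same local wall-clock time maps to different UTC times in winter and summer. The fixed offsets CET (UTC+1) and CEST (UTC+2) are hard-coded here as a stand-in for Amsterdam local time, and the 02:00 job time is an invented example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for Amsterdam local time (assumption for the sketch).
CET = timezone(timedelta(hours=1), "CET")    # winter
CEST = timezone(timedelta(hours=2), "CEST")  # summer, Daylight Saving Time


def to_utc(local_dt):
    """Convert a timezone-aware local datetime to UTC."""
    return local_dt.astimezone(timezone.utc)


# A job scheduled for "02:00 local time" on a winter and a summer day.
winter_utc = to_utc(datetime(2016, 1, 15, 2, 0, tzinfo=CET))
summer_utc = to_utc(datetime(2016, 7, 15, 2, 0, tzinfo=CEST))

print(winter_utc.hour, summer_utc.hour)
```

The job lands at 01:00 UTC in winter and 00:00 UTC in summer, so a schedule expressed in local time drifts against one expressed in UTC.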
Tasks run at the end of the period, not at the start
• First run will be at 2016-06-01 22:00 UTC
• Execution date will be 2016-06-01 21:00 UTC
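The relation between the two timestamps can be sketched with plain datetime arithmetic. The DAG definition from the slide is not in the text, so the start date of 2016-06-01 21:00 UTC and the hourly interval are assumptions inferred from the one-hour gap. The helper first_run is hypothetical and not part of Airflow.

```python
from datetime import datetime, timedelta, timezone


def first_run(start_date, schedule_interval):
    """Return (execution_date, run_at) for the first scheduled run.

    The execution date marks the start of the period; the run itself
    happens once that period has ended.
    """
    execution_date = start_date
    run_at = start_date + schedule_interval
    return execution_date, run_at


# Assumed values: hourly schedule starting 2016-06-01 21:00 UTC.
start = datetime(2016, 6, 1, 21, 0, tzinfo=timezone.utc)
execution_date, run_at = first_run(start, timedelta(hours=1))
print(execution_date.isoformat(), run_at.isoformat())
```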
How to stop/kill a task?
How to force running a task?
Celery only (for now)
Make your tasks and DAGs idempotent
“An idempotent operation is one that has no additional effect if it is called more than once with the same input parameters.”
• DAGs and tasks receive an execution date
• on_retry_callback can be used to do a cleanup before a retry
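One common way to get idempotency is to key the output on the execution date and overwrite it on every run. The sketch below uses only the standard library. The function write_partition, the directory layout and the sample rows are all hypothetical.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_partition(base_dir, execution_date, rows):
    """Write the rows for one execution date, replacing any earlier output."""
    partition = Path(base_dir) / execution_date.strftime("%Y-%m-%d")
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part-0.csv"
    # write_text truncates the file, so a retry or re-run overwrites
    # the previous result instead of appending to it.
    target.write_text("\n".join(rows) + "\n")
    return target


base = tempfile.mkdtemp()
day = datetime(2016, 7, 19)
first = write_partition(base, day, ["a,1", "b,2"]).read_text()
second = write_partition(base, day, ["a,1", "b,2"]).read_text()
print(first == second)
```

Running the task twice for the same execution date leaves exactly one copy of the data, which is what makes retries and backfills safe.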
Generate your tasks programmatically
• List file names on HDFS
• Loop over the file names
• Create a task for each file name
• Assign upstream / downstream
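The steps can be sketched without Airflow installed. Task below is a hypothetical stand-in for an operator, not the real Airflow class. list_hdfs_files returns a hard-coded list where a real DAG file would query HDFS, and the file names and task ids are invented.

```python
class Task:
    """Hypothetical stand-in for an Airflow operator (not the real class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []
        self.downstream = []

    def set_upstream(self, other):
        # Record the dependency in both directions.
        self.upstream.append(other)
        other.downstream.append(self)


def list_hdfs_files():
    # Stand-in: a real DAG file would list a directory on HDFS here.
    return ["transactions.csv", "risk.csv", "products.csv"]


def build_tasks():
    start = Task("wait_for_files")
    end = Task("all_loaded")
    for name in list_hdfs_files():
        # One task per file name, wired between start and end.
        load = Task("load_" + name.replace(".", "_"))
        load.set_upstream(start)
        end.set_upstream(load)
    return start, end


start, end = build_tasks()
print([t.task_id for t in start.downstream])
```

A new file in the listing becomes a new task on the next parse of the DAG file, with no manual change to the workflow.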
When using ExternalTaskSensor, make sure to manually raise the priority of the tasks it is waiting for
• Otherwise scheduling can get deadlocked as the sensors take up all the slots in the scheduler
• Another way to circumvent this issue is to have a separate pool for sensors
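The deadlock can be reproduced with a toy slot-based scheduler. This is a simplified model, not Airflow's scheduling code: run_scheduler, make_tasks and the task names are hypothetical. In Airflow the corresponding per-task settings are priority_weight and pool.

```python
def run_scheduler(tasks, slots, max_ticks=10):
    """Toy slot-based scheduler; returns True if every task finishes.

    tasks maps a task name to {"priority": int, "waits_for": name or None}.
    A task with waits_for set behaves like a sensor: it keeps its slot
    until the task it waits for is done.
    """
    done = set()
    running = []
    # Highest priority first; ties keep their insertion order.
    queued = sorted(tasks, key=lambda name: -tasks[name]["priority"])
    for _ in range(max_ticks):
        # Fill free slots from the front of the queue.
        while queued and len(running) < slots:
            running.append(queued.pop(0))
        still_running = []
        for name in running:
            target = tasks[name]["waits_for"]
            if target is None or target in done:
                done.add(name)
            else:
                still_running.append(name)  # sensor keeps occupying its slot
        running = still_running
        if len(done) == len(tasks):
            return True
    return False


def make_tasks(target_priority):
    return {
        "sensor_a": {"priority": 1, "waits_for": "load_a"},
        "sensor_b": {"priority": 1, "waits_for": "load_b"},
        "load_a": {"priority": target_priority, "waits_for": None},
        "load_b": {"priority": target_priority, "waits_for": None},
    }


# Equal priorities: both sensors grab the two slots and never release them.
deadlocked = not run_scheduler(make_tasks(target_priority=1), slots=2)
# Raised priority for the awaited tasks: they run first, then the sensors.
finished = run_scheduler(make_tasks(target_priority=5), slots=2)
print(deadlocked, finished)
```

A separate pool for sensors has the same effect in this model: the sensors can no longer use up the slots that the awaited tasks need.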
Some last bits
• Do you have longer running tasks? Increase the scheduler heartbeat interval to decrease load
• Smaller tasks make for easier debugging and retrying
• Properly choose your start date: the scheduler will fill gaps
• Changing the schedule requires changing the dag_id
• Backfills are used to add runs for periods the scheduler has already passed
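Why the start date matters can be seen by listing the runs a scheduler would create to fill the gap between the start date and now. This is a simplified model rather than Airflow's actual logic. pending_execution_dates is a hypothetical helper, and the dates and daily interval are invented.

```python
from datetime import datetime, timedelta, timezone


def pending_execution_dates(start_date, schedule_interval, now):
    """List execution dates whose period has fully elapsed before `now`."""
    dates = []
    execution_date = start_date
    # A run is only created once its period has ended.
    while execution_date + schedule_interval <= now:
        dates.append(execution_date)
        execution_date += schedule_interval
    return dates


start = datetime(2016, 7, 1, tzinfo=timezone.utc)
now = datetime(2016, 7, 4, 12, 0, tzinfo=timezone.utc)
runs = pending_execution_dates(start, timedelta(days=1), now)
print([d.date().isoformat() for d in runs])
```

With a start date far in the past, the list grows accordingly, so a carelessly chosen start date can trigger a long series of catch-up runs.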
Use case
(Diagram labels: Transactions, Risk, Products, External; HDFS, Spark, Tez, Postgres; Flume, XFB, Sqoop, Sqoop)
Wait for files to arrive (Sensor)
Copy & clean up
Model creation
• Run Spark
• Tez Sharding
Sqooping to DB
Draft Roadmap
• Apache Release
• Allow auto aligned start_date
• Backfills to use Dag Runs
• Improve pooling
• DAG Parsing Isolation
• REST API
• Further Kerberos Integration
• Schedule Backfill Dag Runs
• Isolation
• DAG syncing across workers
• No direct imports for operators from __init__
• Event Driven Scheduler
• Make tasks not need the database
• Roles / principals
(Four items on the slide are labeled "In progress".)
Aspiring committer? Contributor? User?
http://gitter.im/apache/incubator-airflow/
https://github.com/apache/incubator-airflow/
http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/



Editor's Notes

  • #10: Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
  • #14: One of the most powerful features of a system where workflows are described in code is that you can programmatically generate your DAG. This is very, very useful when you want to automatically pick up new data sources without manual intervention.