How I learned to time travel, or, data pipelining and scheduling with Airflow

Laura Lorenz | @lalorenz6 | github.com/lauralorenz | llorenz@industrydive.com
We’re hiring!

Data is weird & breaks stuff
User data is particularly untrustworthy. Don’t trust it.

Computers/the Internet/third party services/everything will fail

“In the beginning, there was Cron. We had one job, it ran at 1AM, and it was good.”
- Pete Owlett, PyData London 2016, from the outline of his talk “Lessons from 6 months of using Luigi in production”

[Same quote as above, with three corrections marked by carets on the slide: “100”, “depends”, “chaos”]

We had thoughts about how this should go
● Prefer something in open source Python so we know what’s going on and can
easily extend or customize
● Resilient
○ Handles failure well; i.e. retry logic, failure callbacks, alerting
● Deals with Complexity Intelligently
○ Can handle complicated dependencies and only runs what it has to
● Flexibility
○ Can run anything we want
● We knew we had batch tasks on daily and hourly schedules

We travelled the land
One end of the spectrum:
● File based dependencies
● Dependency framework only
● Lightweight, protocols minimally specified
The other end of the spectrum:
● Abstract dependencies
● Ships with scheduling & monitoring
● Heavyweight, batteries included
Tools surveyed: Drake, Make, Pydoit, Pinball, Airflow, Luigi, AWS Data Pipeline
Callout on the slide: Active docs & community

File dependencies/target systems
[Diagram: File dependencies → Recipe/action → Target(s)]

# Makefile
# Target(s) : file dependencies
wrangled.csv : source1.csv source2.csv
# Recipe/action (indented with a tab)
	cat source1.csv source2.csv > wrangled.csv

# Drakefile (target(s) <- file dependencies, with the recipe/action indented below)
wrangled.csv <- source1.csv, source2.csv [shell]
  cat $INPUT0 $INPUT1 > $OUTPUT

# Pydoit
def task_example():
    return {"targets": ["wrangled.csv"],                     # target(s)
            "file_dep": ["source1.csv", "source2.csv"],      # file dependencies (doit's key is file_dep)
            "actions": [concatenate_files_func]}             # recipe/action: a callable defined elsewhere

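# Luigi
class TaskC(luigi.Task):
    def requires(self):
        # file dependencies: the upstream task(s) whose output this task reads
        return output_from_a()
    def output(self):
        # target(s): what this task writes for downstream tasks
        return input_for_e()
    def run(self):
        # recipe/action: input() resolves the outputs of requires()
        do_the_thing(self.input(), self.output())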

Pros
● Work is cached in files
○ Smart rebuilding
● Simple and intuitive configuration, especially for data transformations
Cons
● No native concept of schedule
○ Luigi is the first to introduce this, but lacks a built-in polling process
● Alerting systems too basic
● Design paradigm not broadly applicable to non-target operations
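
“Smart rebuilding” in the file-target tools boils down to a staleness check. Here is a minimal sketch of the idea in plain Python, not the implementation of Make, Drake, Pydoit, or Luigi. The function names are made up for illustration, and it assumes file modification times are a good enough signal.

```python
import os


def needs_rebuild(target, deps):
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in deps)


def build_if_stale(target, deps, action):
    """Run action(deps, target) only when the target is stale; report whether it ran."""
    if needs_rebuild(target, deps):
        action(deps, target)
        return True
    return False


def concatenate_files(sources, target):
    """The recipe: a Python stand-in for `cat source1.csv source2.csv > wrangled.csv`."""
    with open(target, "w") as out:
        for source in sources:
            with open(source) as handle:
                out.write(handle.read())
```

Calling build_if_stale("wrangled.csv", ["source1.csv", "source2.csv"], concatenate_files) twice in a row does the work once; the second call is skipped until a source file changes.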

Abstract orchestration systems
[Diagram: a dependency graph of tasks A, B, C, D, E]

# Pinball
WORKFLOW = {"ex": WorkflowConfig(
    jobs={
        "A": JobConfig(
            JobTemplate(A),
            []),
        "C": JobConfig(
            JobTemplate(C),
            ["A"],
        …,
    schedule=ScheduleConfig(
        recurrence=timedelta(days=1),
        reference_timestamp=datetime(
            year=2016, day=8,
            month=10))
    …,

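# Airflow
from datetime import datetime, timedelta
from airflow import DAG
# Operator import paths vary by Airflow version; this is the 1.7-era form
from airflow.operators import PythonOperator, MySqlOperator

dag = DAG("ex",  # dag_id is required; "ex" is a placeholder name
          schedule_interval=timedelta(days=1),
          start_date=datetime(2015, 10, 6))
a = PythonOperator(
    task_id="A",
    python_callable=ClassA,  # any callable defined elsewhere
    dag=dag)
c = MySqlOperator(
    task_id="C",
    sql="DROP TABLE hello",
    dag=dag)
c.set_upstream(a)  # C runs after A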

Pros
● Support many more types of operations out of the box
● Handles more complicated dependency logic
● Scheduling, monitoring, and alerting services built-in and sophisticated
Cons
● Caching is per service; loses focus on individual data transformations
● Configuration is more complex
● More infrastructure dependencies
○ Database for state
○ Queue for distribution
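
What “abstract dependencies” buy you is that the system can work out a run order from a plain description of who depends on whom. A rough sketch of that resolution step using Kahn's algorithm follows. This is an illustration only, not how Pinball or Airflow is implemented, and the edges in the example graph are assumed because the slide's arrows did not survive.

```python
from collections import deque


def run_order(upstream):
    """Order tasks so that each one comes after all of its upstream tasks.

    `upstream` maps every task name to the list of tasks it depends on.
    """
    remaining = {task: set(deps) for task, deps in upstream.items()}
    downstream = {task: [] for task in upstream}
    for task, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(task)
    # Start with the tasks that depend on nothing.
    ready = deque(sorted(task for task, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)
    if len(order) != len(upstream):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical edges for the five tasks in the diagram.
EXAMPLE = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["B", "C"]}
```

run_order(EXAMPLE) yields an order in which A comes first and E comes after both B and C.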

Armed with knowledge, we had more opinions
● We like the sophistication of the abstract orchestration systems
● But we also like Drake/Luigi-esque file targeting for transparency and data
bug tracking
○ “Intermediate artifacts”
● We (I/devops) don’t want to maintain a separate scheduling service
● We like a strong community and good docs
● We don’t want to be stuck in one ecosystem

Airflow + “smart-airflow” =

Airflow
● Scheduler process handles triggering and executing work specified in DAGs
on a given schedule
● Built in alerting based on service level agreements (SLAs) or task state
● Lots of sexy profiling visualizations
● test, backfill, clear operations convenient from the CLI
● Operators can come from a number of prebuilt classes like PythonOperator,
S3KeySensor, or BaseTransfer, or you can write your own through inheritance
● Can support local or distributed work execution; distributed needs Celery
with a message broker (RabbitMQ, Redis)

Airflow
[Diagram: the Airflow components: webserver, scheduler, executor, and worker, plus a queue and a metadata database]


Let’s talk about DAGs and Tasks
● DAG properties: schedule_interval, start_date, max_active_runs
● A DAG contains tasks: Operators and Sensors
● Task properties: poke_interval, timeout, owner, retries, on_failure_callback, data_dependencies

Let’s talk about DagRuns and TaskInstances
DagRuns:
DAG by “time”
TaskInstances:
Task by DagRun
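
One way to picture the two definitions above is as a cross product. This is a toy model only: the namedtuples below borrow the names DagRun and TaskInstance but are not Airflow's classes, and expand is a made-up helper.

```python
from collections import namedtuple
from datetime import datetime

# Toy stand-ins: a DagRun is a DAG at one "time"; a TaskInstance is a task in one DagRun.
DagRun = namedtuple("DagRun", ["dag_id", "execution_date"])
TaskInstance = namedtuple("TaskInstance", ["task_id", "dag_run"])


def expand(dag_id, task_ids, execution_dates):
    """Build one DagRun per execution_date and one TaskInstance per task per DagRun."""
    runs = [DagRun(dag_id, date) for date in execution_dates]
    instances = [TaskInstance(task_id, run) for run in runs for task_id in task_ids]
    return runs, instances


runs, instances = expand(
    "ex", ["A", "C"], [datetime(2016, 10, 1), datetime(2016, 10, 2)])
```

Two tasks over two execution dates give two DagRuns and four TaskInstances, each with its own state.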

Let’s talk about airflow services
[Diagram: Airflow's webserver, scheduler, executor, and worker, with DAGs as input, a queue (via RabbitMQ), and a metadata database (via MySQL)]

[Animated diagram of the Airflow services (webserver, scheduler, executor, worker, DAGs). Each frame below lists only the speech bubble and the states shown.]
1. “What to do what to do”
2. “Hey do the thing!!!” A “Do the thing” message appears: queued
3. “Okok I told rabbit”. Message: queued. DagRun: running. TaskInstance: queued
4. “What to do what to do”. Message: queued. DagRun: running. TaskInstance: queued
5. “Oo thing! I’M ON IT”. Message: running. DagRun: running. TaskInstance: running
6. “Ack!!(knowledge): success!” Message: success. DagRun: running. TaskInstance: success
7. “Well I did my job”. DagRun: running. TaskInstance: success
8. (no speech bubble) DagRun: running. TaskInstance: success
9. “Whaaaats goin’ on”. DagRun: running. TaskInstance: success
10. “Ok we’re done with that one”. DagRun: success. TaskInstance: success
11. “The people love UIs, I gotta put some data on it”. DagRun: success. TaskInstance: success

Alerting is fun
SLAs
Callbacks
Email on retry/failure/success
Timeouts
SlackOperator
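
Most of these alerting knobs are ordinary task arguments, so they can live in one dictionary that gets passed to a DAG as default_args. Here is a sketch that assumes the standard BaseOperator argument names. The email address and the notify_failure function are placeholders, and the callback just appends to a list where a real one would page someone.

```python
from datetime import timedelta

failures = []  # stand-in for a real alert channel


def notify_failure(context):
    """Failure callbacks receive a context dict; record which task instance failed."""
    failures.append(str(context.get("task_instance")))


default_args = {
    "owner": "data-team",                        # placeholder owner
    "email": ["data-team@example.com"],          # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),                   # alert if a task finishes later than this
    "execution_timeout": timedelta(minutes=30),  # fail a task that runs longer than this
    "on_failure_callback": notify_failure,
}
```

Passing default_args=default_args when constructing the DAG applies these settings to every task in it unless a task overrides them.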

Configuration abounds
Pools
Queues
Max_active_dag_runs
Max_concurrency
DAGs on/off
Retries

Flexibility
● Operators
○ PythonOperator, BashOperator
○ TriggerDagRunOperator, BranchOperator
○ EmailOperator, MySqlOperator, S3ToHiveTransfer
● Sensors
○ ExternalTaskSensor
○ HttpSensor, S3KeySensor
● Extending your own operators and sensors
○ smart_airflow.DivePythonOperator

smart-airflow
● Airflow doesn’t support much data transfer between tasks out of the box
○ only small pieces of data via XCom
● But we liked the file dependency/target concept of checkpoints to cache data
transformations to both save time and provide transparency
● smart-airflow is a plugin to Airflow that supports local file system or
S3-backed intermediate artifact storage
● It leverages Airflow concepts to make file location predictable
○ dag_id/task_id/execution_date

Our smart-airflow backed ETL paradigm
1. Make each task as small as possible while maintaining readability
2. Preserve output for each task as a file-based intermediate artifact in a format
that is consumable by its dependent task
3. Avoid finalizing artifacts for as long as possible (e.g. update a database table
as the last and simplest step in a DAG)

Setting up Airflow at your organization
● pip install airflow gets you started instantly with
○ sqlite metadata database
○ SequentialExecutor
○ Included example DAGs
● Use puckel/docker-airflow to get started quickly with
○ MySQL metadata database
○ CeleryExecutor
○ Celery Flower
○ RabbitMQ messaging backend with Management plugin
● upstart and systemd templates available through the
apache/incubator-airflow repository

Tips, tricks, and gotchas for Airflow
● Minimize your dev environment with SequentialExecutor and use
airflow test {dag_id} {task_id} {execution_date} in early
development to test tasks
● To test your DAG with the scheduler, utilize the @once schedule_interval and
clear the DagRuns and TaskInstances between tests with airflow
clear or the fancy schmancy UI
● Don’t bother with the nascent plugin system, just package your custom
operators with your DAGs for deployment
● No built in log rotation - default logging is pretty verbose and if you add your
own, this might get surprisingly large. As of Airflow 1.7 you can back them up
to S3 with simple configuration

Tips, tricks, and gotchas for Airflow
● We have about 1300 tasks across 8 active DAGs and 27 worker processes
on an m4.xlarge AWS EC2 instance.
○ We utilize pools (which have come a long way since their buggy inception) to manage
resources
● Consider using queues if you have disparate types of work to do
○ our tasks are currently 100% Python but you could support scripts in any other language by
redirecting those messages to worker servers with the proper executables and dependencies
installed
● Your tasks must be idempotent; we’re using retries here
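
Idempotent here means a retried task leaves the same result as a single clean run. One common way to get there for file outputs is to write to a temporary file and rename it into place. The sketch below shows that pattern under the assumption of a local file system; write_artifact_idempotently is a made-up name, not something Airflow or smart-airflow provides.

```python
import os
import tempfile


def write_artifact_idempotently(path, rows):
    """Write rows to path so that a retry after a crash leaves the same final file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write next to the destination so the final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            for row in rows:
                handle.write(row + "\n")
        # Swap the finished file into place; readers never see a partial file.
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise
```

Running it twice with the same rows overwrites the artifact instead of appending to it, which is what makes retries safe.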

Let’s talk about TIME TRAVEL
time.

schedule_interval = timedelta(days=1)

execution_date
[Animated timeline: execution_dates 2016-10-01, 2016-10-02, 2016-10-03, 2016-10-04, 2016-10-05, with blocks of “Data” and a “now” marker]

“allowable” = execution_date + schedule_interval
[Timeline: execution_dates 2016-10-01 through 2016-10-05, with blocks of “Data” and a “now” marker]

execution_date != start_date
[Timeline: execution_dates 2016-10-01 through 2016-10-05, with blocks of “Data” and a “now” marker]
DagRun: 2016-10-01
start_date: 2016-10-02 00:00:05
end_date: 2016-10-02 00:00:10
It’s the same for TaskInstances
DagRuns:
DAG by “time”
TaskInstances:
Task by DagRun

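execution_date != start_date
but…
start_date != start_date
...sorta (a DAG’s start_date sets its first execution_date; a DagRun’s start_date is when that run actually began)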

Still don’t get it?
● Time travel yourself!!!
○ Rewind
○ Ask me I’m friendly
○ Google Groups, Gitter, Airflow docs, dev mailing list archives

Thank you!
Questions?
PS there’s an appendix

Appendix
More details on the pipeline tools not covered in depth
from an earlier draft of this talk

Make
● Originally/often used to compile source code
● Defines
○ targets and any prerequisites, which are potential file paths
○ recipes that can be executed by your shell environment
● Specify batch workflows in stages with file storage as atomic intermediates
(local FS only).
● Rebuilding logic based on target existence and other file dependencies
(“prerequisites”) existence/metadata.
● Supports basic conditionals and parallelism

From http://blog.kaggle.com/2012/10/15/make-for-data-scientists/

Drake
● “Make for data”
● Specify batch workflows in stages with file storage as atomic intermediates
(backend support for local FS, S3, HDFS, Hive)
● Early support for alternate branches and branch merging
● Smart rebuilding against targets and target metadata, or a quite sophisticated
command line specification system
● Parallel execution
● Workflow graph generation
● Expanding protocols support to facilitate common tasks like python, HTTP
GET, etc

drake --graph +workflow/02.model.complete

Pydoit
● “doit comes from the idea of bringing the power of build-tools to execute any
kind of task”
● flexible build tool used to glue together pipelines
● Similar to Drake but much more support for Python as opposed to bash.
● Specify actions, file_dep, and targets for tasks
● Smart rebuilding based on target/file_deps metadata, or your own custom
logic
● Parallel execution
● Watcher process that triggers based on target/file_deps file changes
● Can define failure/success callbacks against the project

Luigi
● Specify batch workflows with different job type classes including
postgres.CopyToTable, hadoop.JobTask, PySparkTask
● Specify dependencies with class requires() method and record via
output() method targets against supported backends such as S3, HDFS,
local or remote FS, MySQL, Redshift, etc.
● Event system provided to add callbacks to task returns, basic email alerting
● A central task scheduler (luigid) that provides a web frontend for task
reporting, prevents duplicate task execution, and basic task history browsing
● Requires a separate triggering mechanism to submit tasks to the central
scheduler
● RangeDaily and RangeHourly parameters as a dependency for backfill or
recovery from extended downtime

AWS Data Pipeline
● Cloud scheduler and resource instigator on hosted AWS hardware
● Can define data pipeline jobs, some of which come built-in (particularly
AWS-to-AWS data transfer, complete with blueprints), but you can run custom
scripts by hosting them on AWS
● Get all the AWS goodies: CloudWatch, IAM roles/policies, Security Groups
● Spins up target computing instances to run your pipeline activities per your
configuration
● No explicit file based dependencies

Pinball
● Central scheduler server with monitoring UI built in
● Parses a file based configuration system into JobToken or EventToken
instances
● Tokens are checked against Events associated with upstream tokens to
determine whether or not they are in a runnable state.
● Tokens specify their dependencies to each other - not file based.
● More of a task manager abstraction than the other tools as it doesn't have a
lot of job templates built in.
● Overrun policies deal with dependencies on past successes
