@erinshellman
Wrangle Conf
July 20th, 2017
Building Robust
Pipelines with Airflow
Zymology is the science of fermentation, and it is applied to
make materials and molecules.
Beer
Insulin
Food
additives
Plastics
Zymergen provides
a platform for
rapid improvement
of microbial strains
through genetic
engineering.
Robotic automation
Our experimentation
is increasingly
orchestrated with
robotics and machine
learning.
Learning how to efficiently
navigate the genome is the mission
of data science at Zymergen
Blocker: process failure
Orchestrating complex experiments with
robots is hard, and there are process failures.
These failures often cause sporadic, extreme
measurement values.
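One common way to flag sporadic, extreme values is a robust score based on the median absolute deviation (MAD). The sketch below is an illustration only: the talk does not say which outlier method is used, and the function name, the 3.5 cutoff, and the sample measurements are all assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value, True where the value looks extreme."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # No spread around the median, so nothing can be scored.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [0.6745 * d / mad > threshold for d in abs_dev]


# Hypothetical plate measurements with one failed well.
measurements = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]
flags = flag_outliers(measurements)
```

Because the median and MAD are barely moved by a single extreme value, the failed well does not hide itself by inflating the scale estimate, which can happen with a mean and standard deviation.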
Blocker: batch effects
We see temporal
effects based on
when experiments
were performed
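A simple way to see what removing a temporal batch effect means is to center each batch on its own median. This is a deliberately basic stand-in, not the approach described in the talk (which later mentions Bayesian hierarchical models); the batch labels and values are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def center_by_batch(records):
    """Subtract each batch's median from its values.

    records is a list of (batch_id, value) pairs; the result keeps the
    same order and the same batch ids.
    """
    by_batch = defaultdict(list)
    for batch_id, value in records:
        by_batch[batch_id].append(value)
    batch_median = {b: median(vs) for b, vs in by_batch.items()}
    return [(b, v - batch_median[b]) for b, v in records]


# Hypothetical data: the same three strains measured in two different weeks.
records = [
    ("week1", 10.0), ("week1", 12.0), ("week1", 11.0),
    ("week2", 20.0), ("week2", 22.0), ("week2", 21.0),
]
centered = center_by_batch(records)
```

After centering, the two weeks are directly comparable even though the raw values in the second week are shifted upward.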
Blocker:
different interpretations of results
We’re building a platform that can
support any microbe and any molecule.
Sometimes that results in a proliferation
of solutions with disagreement on which
is best.
Processing pipeline
1. Identify process failures
2. Quantify and remove process-related bias
3. Identify strains that show improvement using consistent criteria
Clean model inputs
Outlier detection
Normalization
Hit detection
Rolling our own ETL pipeline
There are many
ways to measure
the concentration of
a molecule.
Any microbe, any
molecule… any
experiment, many
data formats.
Describing complex
processing
dependencies is
hard.
Rolling our own ETL pipeline
Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and
monitor workflows.”
Airflow gives us the flexibility to apply a common
set of processing steps to variable data
inputs and to schedule complex processing
workflows, and it has become a delivery
mechanism for our products.
Structure
and Flexibility
e.g. Normalization
Airflow workflows are
described as directed
acyclic graphs (DAGs).
Each task node in the
DAG is an operator.
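To make the DAG idea concrete without depending on an Airflow installation, the sketch below uses Python's standard-library `graphlib` (Python 3.9+) rather than Airflow's own API. The task names `load_measurements` and `report` are hypothetical; the other three follow the processing steps named earlier in the talk.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "outlier_detection": {"load_measurements"},
    "normalization": {"outlier_detection"},
    "hit_detection": {"normalization"},
    "report": {"hit_detection", "normalization"},
}

# A valid execution order: every task appears after all of its upstream tasks.
run_order = list(TopologicalSorter(dependencies).static_order())

# "Acyclic" matters: a graph with a cycle has no valid execution order.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

A scheduler such as Airflow works from the same kind of dependency information, and adds scheduling, retries, and monitoring on top of it.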
The anatomy of a
DAG
Custom operators
Ordering
Instantiate DAG
Modularity and flexibility
Airflow + PyStan
With Bayesian hierarchical models we estimate
(and monitor) the distribution of batch effects.
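For readers unfamiliar with the idea, a generic hierarchical model with a batch effect can be written as follows. This is a textbook form given as an assumption, not the model used at Zymergen; all symbols are introduced here for illustration.

```latex
\begin{aligned}
y_{ij} &= \mu + s_{i} + b_{j} + \varepsilon_{ij}, \\
b_{j} &\sim \mathcal{N}(0, \tau^{2}), \\
\varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^{2})
\end{aligned}
```

Here $y_{ij}$ is the measurement of strain $i$ in batch $j$, $\mu$ is the overall mean, $s_{i}$ is the strain effect, $b_{j}$ is the batch effect, and $\varepsilon_{ij}$ is measurement noise. Estimating $\tau$ describes how widely batch effects are spread, which is the kind of quantity one could track over time.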
Experimental bias
Dropbox
• Scientists at Zymergen work with data using
many different tools including JMP, SQL, and
Excel.
• We use a custom Dropbox hook to make
quick data ingestion pipelines.
Alerting /
Communication
3rd-party hooks & operators
Operator
Pairs well with Superset!
“Apache Superset is a
modern, enterprise-ready
business intelligence web
application”
https://github.com/apache/incubator-superset
Constructing machine
learning workflows
Fairflow: Functional Airflow
• The core of Fairflow is an
abstract base class foperator
that takes care of
instantiating your Airflow
operators and setting their
dependencies.
• In Fairflow, DAGs are
constructed from foperators
that create the upstream
operators when the final
foperator is called.
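The pattern described above can be sketched in plain Python. This is not Fairflow's actual API: `FOperatorSketch`, `Task`, `Step`, and the task ids are hypothetical, and a plain dict stands in for an Airflow DAG so the example runs without Airflow installed.

```python
from abc import ABC, abstractmethod


class Task:
    """Stand-in for an Airflow operator instance (not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_ids = []

    def set_upstream(self, other):
        self.upstream_ids.append(other.task_id)


class FOperatorSketch(ABC):
    """Hypothetical base class: builds its task and wires its upstream tasks."""

    def __init__(self, task_id, upstream=()):
        self.task_id = task_id
        self.upstream = list(upstream)

    @abstractmethod
    def create(self, dag):
        """Return the task this foperator contributes to the DAG."""

    def __call__(self, dag):
        # Reuse a task already built for this DAG (shared upstream step).
        if self.task_id in dag:
            return dag[self.task_id]
        # Build upstream tasks first, then this task, then link them.
        upstream_tasks = [parent(dag) for parent in self.upstream]
        task = self.create(dag)
        dag[self.task_id] = task
        for parent_task in upstream_tasks:
            task.set_upstream(parent_task)
        return task


class Step(FOperatorSketch):
    def create(self, dag):
        return Task(self.task_id)


features = Step("build_features")
model_a = Step("fit_model_a", upstream=[features])
model_b = Step("fit_model_b", upstream=[features])
compare = Step("compare", upstream=[model_a, model_b])

dag = {}      # a plain dict stands in for an Airflow DAG here
compare(dag)  # calling the final foperator creates everything upstream
```

Only the final `compare` step is called explicitly. The feature step and both model steps are created as a side effect, and the shared feature step is created once even though two models depend on it.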
Configuring complex ML
workflows… functionally
Defining ML workflows
In the DAG
definition, create
an instance of the
task.
Then, instantiate a
DAG like usual and
call the compare
task on the DAG.
Defining ML workflows
The design allows for simple creation of
complicated experimental workflows with arbitrary
sets of models, parameters, and evaluation metrics.
Is Airflow for you?
Do you have heterogeneous data sources?
Do you have complex dependencies between
processing tasks?
Do you have data with different velocities?
Do you have constraints on your time?
Probably!
Thanks team!