Data Pipelines with Python
NWA TechFest 2017
What is a Data Pipeline?
• Discrete set of dependent operations
• Directional (Inputs -> [Operations] -> Outputs)
• One or more input sources and one or more
outputs
Pipelines are Used For
• Data aggregation / augmentation
• Data cleansing / de-duplication
• Data copying / synchronization
• Analytics processing
• AI Modeling
Sources and Targets
• Sources: Initial inputs into a pipeline
• REST API, Excel Sheet, Filesystem, HDFS,
RDBMS, etc.
• Targets: Terminal outputs of a pipeline
• REST API, Excel Sheet, [...], Email, Slack
Operations
• Operations are the fundamental units of work
within a pipeline.
• Operations can be domain specific.
• Operations can be composable.
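As a rough illustration of composable operations, the sketch below chains two small functions into one larger operation. The names compose, drop_missing_email, and normalize_email are hypothetical and not from the talk; any callable that takes and returns a list of records would fit.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
Operation = Callable[[Iterable[Record]], Iterable[Record]]


def compose(*operations: Operation) -> Operation:
    """Chain operations left to right into a single operation."""
    def pipeline(records):
        return reduce(lambda data, op: op(data), operations, records)
    return pipeline


def drop_missing_email(records):
    """Domain-specific cleansing step (hypothetical)."""
    return [r for r in records if r.get("email")]


def normalize_email(records):
    """Trim whitespace and lowercase each email address."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]


# The composed result is itself an operation and can be composed again.
clean_contacts = compose(drop_missing_email, normalize_email)

rows = [{"email": " Ada@Example.com "}, {"email": None}]
result = clean_contacts(rows)
```

Because the composed function has the same shape as its parts, it can be reused as a single step in a larger pipeline.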
[Figure: Simple Linear Pipeline. Legend: S = Source, O = Operation, T = Target.]
[Figure: Complex Pipeline. Legend: S = Source, O = Operation, T = Target.]
DAGs
• Directed Acyclic Graphs
• Transitive reduction enables smart
dependency resolution
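A minimal sketch of both ideas, assuming a made-up three-step pipeline: transitive_reduction drops direct dependencies that are already implied by another dependency, and the standard library's graphlib.TopologicalSorter (Python 3.9+) then yields a valid run order. The operation names are illustrative only.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key depends on the operations in its set.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "report": {"extract", "clean"},  # "extract" is already implied by "clean"
}


def reachable(graph, start):
    """Return every node reachable from start by following dependency edges."""
    seen, stack = set(), list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def transitive_reduction(graph):
    """Drop any direct dependency that another dependency already implies."""
    reduced = {}
    for node, deps in graph.items():
        implied = set()
        for dep in deps:
            implied |= reachable(graph, dep)
        reduced[node] = deps - implied
    return reduced


reduced = transitive_reduction(dependencies)
order = list(TopologicalSorter(reduced).static_order())
```

The reduction assumes the graph really is acyclic; TopologicalSorter raises graphlib.CycleError if it is not.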
[Figure: a DAG and its Reduced Form.]
Atomicity
• An entire operation fails or succeeds as a
whole.
• There is no partial state in the event of a
failure.
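One common way to get this all-or-nothing behavior for file outputs is to write to a temporary file and rename it into place. The sketch below assumes JSON output on a local filesystem; write_atomically is a hypothetical helper, not something shown in the talk.

```python
import json
import os
import tempfile


def write_atomically(path, records):
    """Write records so readers see the old file or the new one, never half of it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(records, handle)
        # os.replace is an atomic rename when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, discard the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If serialization fails halfway through, the target file keeps its previous contents and no partial output is left behind.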
"the state or fact of being
composed of indivisible units."
Atomicity
https://www.postgresql.org/docs/8.3/static/tutorial-transactions.html
Idempotency
• An operation can be run multiple times without failure.
• An operation can be run multiple times without
duplication of output.
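A small sketch of an idempotent load step, assuming a SQLite target and a hypothetical users table keyed by id. Writing with an upsert means a rerun neither fails on existing rows nor duplicates them.

```python
import sqlite3


def load_users(conn, users):
    """Upsert users by primary key so reruns neither fail nor duplicate rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
        [(u["id"], u["name"]) for u in users],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
batch = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
load_users(conn, batch)
load_users(conn, batch)  # running again leaves the table unchanged
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```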
Q: What is the correct way to pronounce 'idempotent'?
A: The same way every time.
"denoting an element of a set that is
unchanged in value when multiplied or
otherwise operated on by itself."
Idempotency
Idempotent
Concurrency
• Execute an operation that is not resource-bound via
many threads on the same core.
• Performant pipelines find concurrency within
an operation.
"the decomposability property of a
program, algorithm, or problem into
order-independent or partially-ordered
components or units."
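As a sketch of finding concurrency within a single operation, the example below fans an I/O-bound call out over a thread pool using the standard library. The fetch function is a stand-in that only sleeps; in practice it might be an HTTP request or a database read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(record_id):
    """Stand-in for an I/O-bound call such as an HTTP request (hypothetical)."""
    time.sleep(0.05)  # simulate waiting on the network
    return {"id": record_id, "status": "ok"}


def fetch_all(record_ids, max_workers=8):
    """Run fetch across a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, record_ids))


results = fetch_all(range(8))
```

Threads help here because each call spends its time waiting rather than computing; CPU-heavy work is a better fit for the parallelism described next.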
Parallelism
• Execute operations on multiple cores /
machines simultaneously.
• Operators can operate in parallel as soon as a
new input is available.
"a computation architecture in which
many calculations or the execution of
processes are carried out simultaneously"
Design Patterns
Periodic Workflows
• Pipeline executes on a timed interval
• Great for exhaustive data processing
• Easy backfilling
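Backfilling a periodic workflow can be sketched as replaying the same run over past time windows. The helper names and dates below are made up for illustration; run_for_window stands in for whatever the real pipeline does with one interval.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (window_start, window_end) pairs, one per day, end exclusive."""
    current = start
    while current < end:
        yield current, current + timedelta(days=1)
        current += timedelta(days=1)


def run_for_window(window_start, window_end):
    """Placeholder for the real pipeline run (hypothetical)."""
    return f"processed {window_start.isoformat()}"


# Backfill three missed days by replaying the run over each past window.
backfill = [
    run_for_window(start, end)
    for start, end in daily_windows(date(2017, 4, 1), date(2017, 4, 4))
]
```

This only works cleanly when each run depends on its window alone, which is one reason idempotent operations matter.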
Event-Driven Workflows
• Pipeline handles inputs (events) as they are
received
• Real time data
• Best suited for non-exhaustive data processing
• Backfills?
ETL
• Extract, Transform, Load are distinct steps with
no shared operations
• Each step can be performed one or more times
before the following step is performed.
Extract, Transform, Load
ETL
• Intermediate data stored between steps and
audit data is tracked for each step.
• Enables independent processing of Extract,
Transform, and Load.
• Don't transform during extraction.
• Don't transform during loading!
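The separation described above might look like the following sketch, where each step reads and writes its own intermediate file. The file names, the CSV source, and the list used as a target are all assumptions made for the example.

```python
import csv
import io
import json
import os
import tempfile


def extract(raw_csv, staging_dir):
    """Copy source rows to staging as raw strings; no transformation here."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    path = os.path.join(staging_dir, "extracted.json")
    with open(path, "w") as handle:
        json.dump(rows, handle)
    return path


def transform(extracted_path, staging_dir):
    """Read the staged extract and write a cleaned copy alongside it."""
    with open(extracted_path) as handle:
        rows = json.load(handle)
    cleaned = [
        {"name": r["name"].strip().title(), "age": int(r["age"])} for r in rows
    ]
    path = os.path.join(staging_dir, "transformed.json")
    with open(path, "w") as handle:
        json.dump(cleaned, handle)
    return path


def load(transformed_path, target):
    """Append already-transformed rows to the target; no reshaping here."""
    with open(transformed_path) as handle:
        target.extend(json.load(handle))
    return target


source = "name,age\n ada lovelace ,36\ngrace hopper,85\n"
with tempfile.TemporaryDirectory() as staging:
    warehouse = load(transform(extract(source, staging), staging), [])
```

Because the extract is stored untouched, the transform step can be rerun or fixed later without going back to the source.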
[Figure: Complex Pipeline (shown again for comparison). Legend: S = Source, O = Operation, T = Target.]
[Figure: ETL Pipeline. Legend: Source, ETL Operation (E, T, L), Target.]
[Figure: ETL Pipeline(s). Legend: Source, ETL Operation (E, T, L), Target.]
Why Python?
Scientific / Stats Ecosystem
• NumPy and Pandas
• scikit-learn
• spaCy
Web Development Ecosystem
• Django, Flask, Pyramid
• Django REST Framework
• Scrapy
• Celery
In web development, we started solving distributed processing
problems a long time ago.
Numba: JIT compiler to LLVM
http://numba.pydata.org/
Data Pipelines with Python - NWA TechFest 2017
Python Libraries
Celery
• Task queueing / Asynchronous processing
• Native Python executed on distributed workers
• Retrying, Throttling, Pooling
🐶>😿
Luigi
• Open sourced by Spotify in 2012
• Lightweight configuration
• Does not support worker pooling
"Luigi is a Python package that helps you build complex pipelines
of batch jobs. It handles dependency resolution, workflow
management, visualization, handling failures, command line
integration, and much more."
Airflow
• Open sourced by Airbnb in 2015
• Apache Incubation since March 2016
• Implements workflows as strict DAGs
• Visualization / Audit / Backfill tools
• Scales with Celery
"Airflow is a platform to programmatically
author, schedule and monitor data pipelines."
Our Tools of Choice:
Celery + Airflow
How To Attack a Pipeline
Problem
1. Pure Python functions
2. Convert to Celery (Parallel for free!)
3. Layer in Concurrency / Optimizations
4. Escalate to Airflow
Things are going to fail.
• Log early and often.
• Remember Atomicity.
• Leverage aggregation / visualization tools.
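One lightweight way to log every operation is a decorator built on the standard logging module. This is only a sketch: the logger name "pipeline" and the logged helper are assumptions, and a real setup would likely ship these records to an aggregation tool.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def logged(operation):
    """Log the start, success, and failure of a pipeline operation."""
    @functools.wraps(operation)
    def wrapper(*args, **kwargs):
        logger.info("starting %s", operation.__name__)
        try:
            result = operation(*args, **kwargs)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("%s failed", operation.__name__)
            raise
        logger.info("finished %s", operation.__name__)
        return result
    return wrapper


@logged
def parse_amount(text):
    """Example operation: convert a string amount to a float."""
    return float(text)
```

The decorator re-raises the error after logging it, so a failed operation still fails as a whole.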
Pipeline Takeaways
• Build operations with Atomicity and
Idempotency in mind.
• Optimize throughput with concurrency and
parallelism.
• Log and visualize (or just use Airflow).
Come talk to us about your data.
Casey Kinsey, Principal Consultant
hirelofty.com
@loftylabs
@quesokinsey
