Managing data workflows with Luigi
by Teemu Kurppa
www.teemukurppa.net
Metrics Monday at Custobar, Helsinki, 30.5.2016
I work at www.ouraring.com: the world's first wellness ring
Head of Software: Cloud & Mobile
teemu@ouraring.com
I'm an advisor at your host, Custobar:
customer analytics and marketing tool for retailers
Introducing
Data Workflows
Simple data workflow
gunzip -c /var/log/syslog.3.gz | grep -e UFW
Complex data workflow
Let’s analyse if the weather affects sleep quality:
• Get sleep data of all study participants
• Get location data of all study participants
• Fetch weather data for each day and location
• Fetch historical weather data for each location
• Calculate the difference from average weather for each data point
• Do a statistical analysis over users and days, comparing weather data and sleep quality data
A lot can go wrong at each step. Rerunning takes time.
Case Custobar: ETL
Extract - Transform - Load
1. Fetch custom sales.csv from SFTP
2. Transform custom sales.csv to standard sales.json
3. Validate and throw away invalid fields
4. Load valid sales data to database
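The four steps above can be sketched in plain Python. This is only an illustration, not Custobar's code: the column names, the `COLUMN_MAP` mapping, the "total must be a number" rule and the SQLite table are all assumptions, and the SFTP fetch is replaced by parsing a CSV string.

```python
import csv
import io
import json
import sqlite3

# Hypothetical mapping from a retailer's custom column names to standard ones.
COLUMN_MAP = {"SaleID": "sale_id", "Total": "total", "Note": "note"}

def extract(csv_text):
    """Stand-in for the SFTP fetch: parse CSV text into row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Rename custom columns to the standard names, dropping unknown ones."""
    return [
        {COLUMN_MAP[key]: value for key, value in row.items() if key in COLUMN_MAP}
        for row in rows
    ]

def validate(records):
    """Throw away invalid fields: here, a total that is not a number."""
    cleaned = []
    for record in records:
        record = dict(record)
        try:
            record["total"] = float(record["total"])
        except (KeyError, ValueError):
            record.pop("total", None)
        cleaned.append(record)
    return cleaned

def load(records, conn):
    """Store each record as a JSON document keyed by sale_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_id TEXT PRIMARY KEY, doc TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales VALUES (?, ?)",
        [(record["sale_id"], json.dumps(record)) for record in records],
    )
    conn.commit()

def run_etl(csv_text, conn):
    """Chain the four steps: fetch, transform, validate, load."""
    load(validate(transform(extract(csv_text))), conn)
```

Each function consumes the previous one's result, which is the chain a workflow tool later splits into separate, restartable tasks.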
Do this for millions of rows of initial data, and continue doing it every day, for:
• products
• customers
• sales
Luigi
by Spotify
Data workflow tools
• Pinball by Pinterest
• Luigi by Spotify
• Airflow by Airbnb
Luigi Concepts
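Tasks
• Get Changed Customers
• Export Changed Customers to FTP
Targets
• sql: Customers table
• file://data/customers.csv
• sftp://data/customers.csv
Dependencies
• sql: Customers table → Get Changed Customers → file://data/customers.csv → Export Changed Customers to FTP → sftp://data/customers.csv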
input() and output() link a task to its targets
requires() links a task to the tasks it depends on
Parameters
Both tasks take the same parameters:
• company: Parameter
• date: DateParameter
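To make the relationship between tasks, targets, dependencies and parameters concrete, here is a toy model in plain Python. It does not use Luigi's classes. The task names follow the diagram, while the `run_order` helper and the per-company, per-date path layout are assumptions made for illustration.

```python
import dataclasses
import datetime

@dataclasses.dataclass(frozen=True)
class GetChangedCustomers:
    company: str
    date: datetime.date

    def requires(self):
        return []  # reads straight from the Customers table

    def output(self):
        # Hypothetical layout; the slide only shows file://data/customers.csv
        return f"file://data/{self.company}/{self.date:%Y%m%d}/customers.csv"

@dataclasses.dataclass(frozen=True)
class ExportChangedCustomers:
    company: str
    date: datetime.date

    def requires(self):
        # Pass the same parameters down to the upstream task.
        return [GetChangedCustomers(self.company, self.date)]

    def input(self):
        # A task's inputs are the outputs of the tasks it requires.
        return [task.output() for task in self.requires()]

    def output(self):
        return f"sftp://data/{self.company}/{self.date:%Y%m%d}/customers.csv"

def run_order(task, seen=None):
    """Return tasks in dependency order: requirements first, each task once."""
    seen = [] if seen is None else seen
    for dependency in task.requires():
        run_order(dependency, seen)
    if task not in seen:
        seen.append(task)
    return seen
```

Because the parameters are part of each task's identity, the export for one company and date maps to its own targets and never collides with another day's run.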
Concepts: Target
Target
A target is simply something that either exists or doesn't exist.
For example:
• a file in a local file system
• a file in a remote file system
• a file in an Amazon S3 bucket
• a database row in a SQL database
Target
import luigi
from pymongo import MongoClient

class MongoTarget(luigi.Target):
    """Exists once a document matching the predicate is in the collection."""

    def __init__(self, database, collection, predicate):
        self.client = MongoClient()
        self.database = database
        self.collection = collection
        self.predicate = predicate

    def exists(self):
        db = self.client[self.database]
        one = db[self.collection].find_one(self.predicate)
        return one is not None
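The same exists-or-not idea can be tried without a MongoDB server. The sketch below uses the standard library's sqlite3 module and a plain class rather than a `luigi.Target` subclass; `SqliteRowTarget` and the `customers` table are hypothetical names.

```python
import sqlite3

class SqliteRowTarget:
    """Exists when at least one row in `table` matches `where`."""

    def __init__(self, conn, table, where, params=()):
        self.conn = conn
        self.table = table    # trusted identifier, not user input
        self.where = where    # SQL condition with ? placeholders
        self.params = params

    def exists(self):
        query = f"SELECT 1 FROM {self.table} WHERE {self.where} LIMIT 1"
        return self.conn.execute(query, self.params).fetchone() is not None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
target = SqliteRowTarget(conn, "customers", "id = ?", (42,))

before = target.exists()   # no matching row yet
conn.execute("INSERT INTO customers VALUES (42, 'Ada')")
after = target.exists()    # the row is there now
```

A task whose output is such a target counts as done exactly when the row is present.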
Target
Lots of ready-made targets in Luigi:
• local file
• HDFS file
• S3 key/value target
• SSH remote target
• SFTP remote target
• SQL table row target
• Amazon Redshift table row target
• Elasticsearch target
Concepts: Task
Task: basic structure
import luigi

class TransformDailySalesCSVtoJSON(luigi.Task):
    def requires(self): ...  # the tasks this task depends on
    def run(self): ...       # the work itself
    def output(self): ...    # the target this task produces
Task: parameters
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...
    def run(self): ...
    def output(self): ...
Task: requires
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ImportDailyCSVFromSFTP(self.date)

    def run(self): ...
    def output(self): ...
Task: output
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...
    def run(self): ...

    def output(self):
        path = "/d/sales_%s.json" % self.date.strftime("%Y%m%d")
        return luigi.LocalTarget(path)
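The path built in output() is what ties the date parameter to a distinct target. A small standalone check of that formatting, with `sales_json_path` as a hypothetical helper name:

```python
import datetime

def sales_json_path(day):
    """Build the per-day output path in the same format as the slide."""
    return "/d/sales_%s.json" % day.strftime("%Y%m%d")

talk_day = sales_json_path(datetime.date(2016, 5, 30))
next_day = sales_json_path(datetime.date(2016, 5, 31))
```

Since each day maps to its own file, finishing one day's task says nothing about the next day's, and both can be rerun independently.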
Task: run
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...

    def run(self):
        # Note: the targets returned by input() and output() take care of atomic writes
        with self.input().open("r") as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open("w") as outfile:
            json.dump(data, outfile)

    def output(self): ...
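Atomicity matters here because a task counts as done when its output exists, so a half-written file would look like a finished task. One common way to get atomic file writes is to write to a temporary file and rename it into place on success. The sketch below illustrates that idea with the standard library; it is not Luigi's source, and `atomic_write` is a hypothetical helper.

```python
import json
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write(path):
    """Yield a file object; the data appears at `path` only on success."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            yield tmp_file
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)         # a failed run leaves no output behind
        raise
```

If the body of the `with` block raises, the temporary file is deleted and the final path is never created, so a rerun starts from a clean state.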
Task
import json
import luigi

class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ImportDailyCSVFromSFTP(self.date)

    def run(self):
        with self.input().open("r") as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open("w") as outfile:
            json.dump(data, outfile)

    def output(self):
        path = "/d/sales_%s.json" % self.date.strftime("%Y%m%d")
        return luigi.LocalTarget(path)
Tasks
Lots of ready-made tasks in Luigi:
• dump data to SQL table
• copy to Redshift table
• run Hadoop job
• query Salesforce
• load Elasticsearch index
• …
Dependency patterns
Multiple dependencies
class TransformAllSales(luigi.Task):
    def requires(self):
        # Return every dependency; a return inside the loop would stop at the first
        return [ImportInitialSaleFile(index=i) for i in range(1000)]

    def run(self): ...
    def output(self): ...
Dynamic dependencies
import glob

class LoadDailyAPIData(luigi.Task):
    date = luigi.DateParameter()

    def run(self):
        # Dependencies discovered at run time are yielded from run()
        for filepath in glob.glob("/d/api_data/*.json"):
            yield TransformDailyAPIData(filepath)
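In Luigi, dependencies that are only known at run time are yielded from run(), and the yield expression evaluates to the dependency's output once it is available. The toy below shows in plain Python how a caller can drive such a generator. It is not how Luigi's worker is implemented, and `drive`, `transform` and `load_daily_api_data` are hypothetical names.

```python
import glob
import os
import tempfile

def transform(path):
    """Hypothetical stand-in for a dependent task: upper-case one file."""
    with open(path) as infile:
        return infile.read().upper()

def load_daily_api_data(directory):
    """Generator-style run(): yield one requirement per file found at run time."""
    results = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        # Hand the requirement to the caller and receive its result back.
        results.append((yield path))
    return results

def drive(run_generator):
    """Satisfy each yielded requirement, then resume the generator."""
    try:
        requirement = next(run_generator)
        while True:
            requirement = run_generator.send(transform(requirement))
    except StopIteration as done:
        return done.value

# Demo: two JSON files are picked up, the text file is ignored.
demo_dir = tempfile.mkdtemp()
for name, text in [("a.json", "first"), ("b.json", "second"), ("notes.txt", "skip")]:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)
loaded = drive(load_daily_api_data(demo_dir))
```

Note that the file pattern is expanded with glob.glob; os.listdir takes a directory name and does not expand wildcards.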
Wrapper task
class LoadAllDailyData(luigi.WrapperTask):
    date = luigi.DateParameter()

    def requires(self):
        # A WrapperTask only groups other tasks, so they are listed in requires()
        yield LoadDailyProducts(self.date)
        yield LoadDailyCustomers(self.date)
        yield LoadDailySales(self.date)
Why use
data workflow tools?
1. Resume the data workflow after a failure
2. Parametrize and rerun tasks every day
3. Organise code with shared patterns
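The first two points can be illustrated with a toy runner in plain Python: each step leaves a marker once it succeeds, markers are keyed by day, and a rerun skips whatever is already done. This is a sketch of the idea only. `run_pipeline`, the `.done` marker files and the simulated failure are assumptions, not Luigi's mechanism, which checks each task's output target instead.

```python
import os
import tempfile

def run_pipeline(steps, workdir, day, log):
    """Run (name, function) steps in order, skipping those already done for `day`."""
    for name, step in steps:
        marker = os.path.join(workdir, "%s_%s.done" % (name, day))
        if os.path.exists(marker):
            continue                   # finished in an earlier run: skip
        step(day)
        log.append((name, day))
        open(marker, "w").close()      # mark as done only after the step succeeded

workdir = tempfile.mkdtemp()
log = []
healthy = {"ok": False}

def fetch(day):
    pass

def transform(day):
    if not healthy["ok"]:
        raise RuntimeError("simulated failure")

def load(day):
    pass

steps = [("fetch", fetch), ("transform", transform), ("load", load)]
try:
    run_pipeline(steps, workdir, "20160530", log)
except RuntimeError:
    pass                               # the first run dies in the second step
first_run = list(log)

healthy["ok"] = True
run_pipeline(steps, workdir, "20160530", log)  # the rerun resumes at "transform"
```

After the rerun, "fetch" has still run only once, and running the same steps for another day starts from scratch because its markers do not exist yet.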
Thanks! Questions?
Custobar is hiring!
Approach Juha, Tatu or me to learn more
Follow @teemu on Twitter to stay in touch.
