Erik Bernhardsson
erikbern@spotify.com
Batch data processing in Python
Focusing mostly on music discovery and large scale machine learning
Previously managed the “Analytics team” in Stockholm
I’m at Spotify in NYC
Btw, I’m Erik Bernhardsson
Background
Billions of log messages (several TBs) every day
Usage and backend stats, debug information
What we want to do
A/B testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards
We experiment a lot – need quick development cycles
We crunch a lot of data
Why did we build Luigi?
Our second cluster (in 2009):
We like Hadoop
Long story short :)
Our fifth cluster
Running one job is easy
Lots of long-running processes with dependencies
Need monitoring
Handle failures
Go from experimentation to production easily
But what about running 1000s of jobs every day?
But also non-Hadoop stuff
Most things are Python Map/Reduce jobs
Also Pig, Hive
SCP files from one host to another
Train a machine learning model
Put data in Cassandra
In the pre-Luigi world
How not to do workflows
“Streams” is a list of (username, track, artist, timestamp) tuples
Example: Artist Toplist
Streams → Artist Aggregation → Top 10 → Database
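The toplist example can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the code from the talk: the sample tuples and the function names `aggregate_artists` and `top_artists` are made up here, and the database step is left out.

```python
from collections import Counter

# Hypothetical sample data: (username, track, artist, timestamp) tuples.
streams = [
    ("alice", "Track A", "Artist X", 1362441600),
    ("bob", "Track B", "Artist Y", 1362441660),
    ("carol", "Track C", "Artist X", 1362441720),
    ("alice", "Track D", "Artist Z", 1362441780),
    ("bob", "Track A", "Artist X", 1362441840),
    ("carol", "Track B", "Artist Y", 1362441900),
]


def aggregate_artists(streams):
    """Count how many streams each artist received."""
    return Counter(artist for _user, _track, artist, _ts in streams)


def top_artists(counts, n=10):
    """Return the n most-streamed artists as (artist, count) pairs."""
    return counts.most_common(n)


counts = aggregate_artists(streams)
toplist = top_artists(counts)
print(toplist)
```

Each function corresponds to one box in the flow, which is the shape the later slides turn into separate tasks.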
Pre-Luigi example of artist toplists
Don’t do this at home
OK, so chain the tasks
Cron nicer, yay!
That’s OK, but don’t leave broken data somewhere
(btw, Luigi gives you atomic file operations locally and in HDFS)
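One common way to avoid leaving broken data behind is to write to a temporary file and rename it into place once the write has succeeded. The sketch below shows that general technique for local files only. The helper name `atomic_write` is hypothetical, and this is not a description of how Luigi implements it.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(text)
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # On failure, clean up the temporary file and leave `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: the output either exists completely or not at all.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "artist_counts.tsv")
    atomic_write(target, "Artist X\t3\n")
    with open(target) as f:
        print(f.read(), end="")
```

Because the output file only appears when it is complete, "does the output exist?" becomes a safe test for "did this step finish?".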
Errors will occur
The second step fails, you fix it, then you want to resume
Don’t run things twice
To use data flows as command line tools
Parametrize tasks
You want to run the data flow for a set of similar inputs
Put tasks in loops
Plumbing sucks
Graph algorithms rock!
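The graph algorithm that matters most here is topological ordering: given which task depends on which, compute an order in which every task runs after its prerequisites. A minimal sketch using the standard library's `graphlib` (Python 3.9 or later), with the task names taken from the toplist example in this talk:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on.
dependencies = {
    "AggregateArtists": {"Streams"},
    "Top10Artists": {"AggregateArtists"},
    "ArtistToplistToDatabase": {"Top10Artists"},
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

This only illustrates the idea of ordering by dependencies. It says nothing about how Luigi's scheduler is written.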
Plumbing sucks...
Who’s the world’s second most famous plumber?
Hint: he wears green
A Python framework for data flow definition and execution
Introducing Luigi
On steroids and PCP
...with a toolbox of mainly Hadoop related stuff
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration
Main features
Luigi is “kind of like Makefile” in Python
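The Makefile analogy can be sketched with a toy task class: each task names its prerequisites, its output file, and how to produce it, and a task is skipped when its output already exists. The `ToyTask` class and `build` function below are stand-ins written for this illustration, not Luigi's classes. The method names follow Luigi's `requires()`, `output()`, `run()` and `complete()`, but the implementation is an assumption.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a workflow task; not Luigi's actual class."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # Like a Makefile target: done if the output file exists.
        return os.path.exists(self.output())


class Streams(ToyTask):
    def output(self):
        return os.path.join(self.workdir, "streams.tsv")

    def run(self):
        # Hypothetical sample rows: username, track, artist.
        with open(self.output(), "w") as f:
            f.write("alice\tTrack A\tArtist X\n")
            f.write("bob\tTrack B\tArtist X\n")


class AggregateArtists(ToyTask):
    def requires(self):
        return [Streams(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "artist_counts.tsv")

    def run(self):
        counts = {}
        for dep in self.requires():
            with open(dep.output()) as f:
                for line in f:
                    artist = line.rstrip("\n").split("\t")[2]
                    counts[artist] = counts.get(artist, 0) + 1
        with open(self.output(), "w") as f:
            for artist, count in sorted(counts.items()):
                f.write(f"{artist}\t{count}\n")


def build(task, log):
    """Run prerequisites first, then the task; skip anything complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(AggregateArtists(workdir), first_run)
build(AggregateArtists(workdir), second_run)  # nothing left to do
print(first_run, second_run)
```

The second call does no work because the output file is already there. This is the "don't run things twice" and "resume after fixing a failure" behavior described earlier.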
Luigi Task
Luigi - AggregateArtists
Run on the command line:
$ python dataflow.py AggregateArtists
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Top 10 artists - Wrapped arbitrary Python code
Completing the top list
Basic functionality for exporting to Postgres. Cassandra support is in the works
Database support
Running it all...
DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Imagine how cool this would be with real data...
The results
Luigi Presentation at OSCON 2013
Tasks have implicit __init__
Task Parameters
Generates command line interface with typing and documentation
Class variables with some magic
$ python dataflow.py AggregateArtists --date 2013-03-05
Combined usage example
Task Parameters
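The “class variables with some magic” idea can be approximated in plain Python: parameters are declared on the class, a shared `__init__` turns them into instance attributes, and the same declarations drive an `argparse` command line. Everything below (`Parameter`, `ToyTask`, `from_command_line`, `parse_date`) is a hypothetical sketch of the pattern, not Luigi's parameter classes.

```python
import argparse
import datetime


class Parameter:
    """Toy class-level parameter declaration (hypothetical, not Luigi's)."""

    def __init__(self, parse=str, default=None, help=""):
        self.parse = parse
        self.default = default
        self.help = help


def parse_date(text):
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


class ToyTask:
    def __init__(self, **kwargs):
        # The "implicit __init__": every class-level Parameter becomes an
        # instance attribute, taken from kwargs or from its default.
        for name, param in self.params().items():
            setattr(self, name, kwargs.get(name, param.default))

    @classmethod
    def params(cls):
        # Only parameters declared directly on this class are collected.
        return {name: value for name, value in vars(cls).items()
                if isinstance(value, Parameter)}

    @classmethod
    def from_command_line(cls, argv):
        # The declarations double as a typed, documented command line.
        parser = argparse.ArgumentParser(prog=cls.__name__)
        for name, param in cls.params().items():
            parser.add_argument("--" + name, type=param.parse,
                                default=param.default, help=param.help)
        return cls(**vars(parser.parse_args(argv)))


class AggregateArtists(ToyTask):
    date = Parameter(parse=parse_date, help="day to aggregate, YYYY-MM-DD")


task = AggregateArtists.from_command_line(["--date", "2013-03-05"])
print(task.date)
```

Subclasses never write their own constructor, and the parser validates the input type before the task is created.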
Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres
Luigi comes with a toolbox of abstract Tasks for...
...how to run anything, really
Task templates and targets
Writing new ones is as easy as defining an interface and implementing run()
Built-in Hadoop Streaming Python framework
Hadoop MapReduce
Tiny interface – just implement mapper and reducer
Fetches error logs from Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins
Easy to send along Python modules that might not be installed on the cluster
Support for counters, secondary sort, combiners, distributed cache, etc.
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Features
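To show what “just implement mapper and reducer” means, here is a single-process simulation of the map, sort, and reduce phases. The job class `ArtistCount`, the helper `run_locally`, and the input lines are hypothetical. Nothing here talks to Hadoop or uses Luigi's Hadoop classes.

```python
from itertools import groupby
from operator import itemgetter


class ArtistCount:
    """Toy job: only mapper and reducer are job-specific."""

    def mapper(self, line):
        _user, _track, artist, _timestamp = line.split("\t")
        yield artist, 1

    def reducer(self, key, values):
        yield key, sum(values)


def run_locally(job, lines):
    """Simulate map, shuffle/sort, and reduce in a single process."""
    mapped = [pair for line in lines for pair in job.mapper(line)]
    mapped.sort(key=itemgetter(0))  # stand-in for Hadoop's shuffle and sort
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(job.reducer(key, (value for _key, value in group)))
    return results


# Hypothetical input: username, track, artist, timestamp (tab-separated).
lines = [
    "alice\tTrack A\tArtist X\t2013-03-05T10:00:00",
    "bob\tTrack B\tArtist Y\t2013-03-05T10:01:00",
    "carol\tTrack C\tArtist X\t2013-03-05T10:02:00",
]
result = run_locally(ArtistCount(), lines)
print(result)
```

Because the mapper and reducer are ordinary methods, instance variables such as a lookup dictionary are visible inside them, which is what makes the map side joins mentioned above convenient.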
Built-in Hadoop Streaming Python framework
Hadoop MapReduce
More features
Luigi’s “visualiser”
Dive into any task
Basic multi-processing
Multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08
Great for automated execution
Error notifications
Prevents two identical tasks from running simultaneously
Process Synchronization
[Diagram, shown in four steps: Luigi worker 1 and Luigi worker 2 each hold part of a task graph with nodes A, B, C and F, coordinated by the Luigi central planner]
...what happens
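The role of a central planner can be illustrated with two threads sharing one planner object: a worker asks for work, the planner hands out a task only if its prerequisites are done and nobody else is running it, and the worker reports back when finished. `ToyPlanner`, `worker`, and the edges between A, B, C and F are assumptions made for this sketch. Luigi's real planner is a separate process and is not modeled here.

```python
import threading
import time


class ToyPlanner:
    """Toy central planner (hypothetical): one worker per task at a time."""

    def __init__(self, dependencies):
        self.dependencies = dependencies  # task -> set of prerequisites
        self.done = set()
        self.running = set()
        self.lock = threading.Lock()

    def get_work(self):
        """Return a runnable task that nobody else is running, or None."""
        with self.lock:
            for task, deps in self.dependencies.items():
                if task in self.done or task in self.running:
                    continue
                if deps <= self.done:
                    self.running.add(task)
                    return task
            return None

    def mark_done(self, task):
        with self.lock:
            self.running.discard(task)
            self.done.add(task)

    def all_done(self):
        with self.lock:
            return len(self.done) == len(self.dependencies)


def worker(name, planner, log):
    while not planner.all_done():
        task = planner.get_work()
        if task is None:
            time.sleep(0.001)  # another worker holds what we need
            continue
        log.append((name, task))  # "run" the task
        planner.mark_done(task)


# Assumed edges: A needs B and C, and C needs F.
planner = ToyPlanner({"B": set(), "F": set(), "C": {"F"}, "A": {"B", "C"}})
log = []
threads = [threading.Thread(target=worker, args=(f"worker {i}", planner, log))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)
```

Which worker ends up with which task varies between runs, but each task is handed out exactly once and A always runs last.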
Large data flows
(Screenshot from web interface)
Things Luigi is not
Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.
Luigi is not trying to replace mrjob
You still need to figure out how each task runs
Luigi does not give you scalability
Mapreduce/Pig/Hive/etc are wonderful tools for doing this and Luigi is more than happy to delegate it to them.
Luigi does not help you transform the data
Although Oozie is kind of annoying
...but it’s sort of like Oozie
Oozie Luigi
Only Hadoop Yes!
Horrible XML Yes!
Easy Yes!
Fun & powerful Yes!
“Oozie example”
<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>
<start to='getDirInfo' />
<!-- STEP ONE -->
<action name='getDirInfo'>
<!--writes 2 properties: dir.num-files: returns -1 if dir doesn't exist,
otherwise returns # of files in dir dir.age: returns -1 if dir doesn't exist,
otherwise returns age of dir in days -->
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>com.navteq.oozie.GetDirInfo</main-class>
<arg>${inputDir}</arg>
<capture-output />
</java>
<ok to="makeIngestDecision" />
<error to="fail" />
</action>
<!-- STEP TWO -->
<decision name="makeIngestDecision">
<switch>
<!-- empty or doesn't exist -->
<case to="end">
${wf:actionData('getDirInfo')['dir.num-files'] lt 0 ||
(wf:actionData('getDirInfo')['dir.age'] lt 1 and
wf:actionData('getDirInfo')['dir.num-files'] lt 24)}
</case>
<!-- # of files >= 24 -->
<case to="ingest">
${wf:actionData('getDirInfo')['dir.num-files'] gt 23 ||
wf:actionData('getDirInfo')['dir.age'] gt 6}
</case>
<default to="sendEmail"/>
</switch>
</decision>
<!--EMAIL-->
<action name="sendEmail">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<main-class>com.navteq.oozie.StandaloneMailer</main-class>
<arg>probedata2@navteq.com</arg>
<arg>gregory.titievsky@navteq.com</arg>
<arg>${inputDir}</arg>
<arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg>
<arg>${wf:actionData('getDirInfo')['dir.age']}</arg>
Instead, focus on ridiculously little boilerplate code
General so you can build whatever on top of it
As well as rapid experimentation cycle
Once things work, trivial to put in production
Luigi does not have 999 features
What we use Luigi for
Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Train machine learning models
Import/export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Be one of the cool kids!
Originated at Spotify
Mainly built by me and Elias Freider
Based on many years of experience with data processing
Open source since September 2012
https://github.com/spotify/luigi
Luigi is open source
• Pig
• EC2
• Scalding
• Cassandra
Future plans!
For more information feel free to reach out at
http://github.com/spotify/luigi
Thank you!
Oh, and we’re hiring – http://spotify.com/jobs
Erik Bernhardsson
erikbern@spotify.com
