Building AI data pipelines
using PySpark
Matúš Cimerman, Exponea, #PyDataBA, 15. 5. 2017
(PyData Bratislava Meetup #3, Nervosa)
Building AI data pipelines using PySpark
Matus Cimerman, matus.cimerman@{gmail,exponea}.com
About me
● 1+ year of data science @ Exponea; before that, a BI intern and other roles @ Orange.
● FIIT STU, studies focused on data streams
● Some links:
○ https://www.facebook.com/matus.cimerman
○ https://twitter.com/MatusCimerman
○ https://www.linkedin.com/in/mat%C3%BA%C5%A1-cimerman-4b08b352/
○ GitHub link soon
A few words about this talk
1. The spoken word will be in Slovak this time.
2. This talk is not about machine learning algorithms, methods, hyperparameter tuning, or anything similar.
3. I am still a newbie and a learner, so don't hesitate to correct me.
4. The goal is to show you the overall basics; experts, hold your hate.
5. This is my first time speaking publicly, so prepare your tomatoes.
6. Comic Sans is used intentionally; I was told it's OK for slides.
Aren't you doing ML? WTF mate?
Data preprocessing nightmare*
*I do like all of those libraries
Pattern
NLTK
Dora
DataCleaner
Scikit-learn
Orange
Gensim
A practical ML pipeline
Data pipeline
Data pipeline
Lambda architecture
http://lambda-architecture.net/, https://mapr.com/developercentral/lambda-architecture/
Connecting the dots is not easy for large-scale datasets
Apache Spark basics
http://why-not-learn-something.blogspot.sk/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html
Resilient Distributed Datasets (RDDs)
• Distributed, immutable
• Lazy execution
• Fault-tolerant
• Functional-style programming (transformations, actions)
• Can be persisted/cached in memory/disk for fast iterative programming
http://vishnuviswanath.com/spark_rdd.html
Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
Resilient Distributed Datasets (RDDs)
Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings
of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
RDD example (1)
from random import random

def f(number):
    return (0, number) if number < 0.5 else (1, number)

# `sc` is the SparkContext provided by the PySpark shell / application
rdd = sc.parallelize([random() for i in range(1000)])
rdd.take(2)  # [0.8528183968066678, 0.3513345834291187]
rdd.filter(lambda x: x > 0.95).count()  # 53
new_rdd = rdd.persist().map(f)  # Nothing happens yet - lazy evaluation
new_rdd.countByKey()  # {0: 481, 1: 519}
new_rdd.reduceByKey(lambda a, b: a + b).take(2)  # [(0, 110.02773787885363), (1, 408.68609250249494)]
RDD example (2)
from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("/apps/tmp/text8").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
synonyms = model.findSynonyms('research', 5)
for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))
"""
studies: 0.774848618142
institute: 0.71544256553
interdisciplinary: 0.684204944488
medical: 0.659590635889
informatics: 0.648754791132
"""
DataFrames
• Schema view of data
• Lazy, like RDDs
• Significant performance improvement compared to RDDs (Tungsten & Catalyst optimizer)
• No serialization between stages
• Great for semi-structured data
• Backed by RDDs under the hood
• Can be created from an RDD (see the sketch below)
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
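A minimal sketch of that last point, with made-up column names and values; it assumes `spark` is a SparkSession and `sc` its SparkContext:
from pyspark.sql import Row

# Hypothetical example data; any RDD of Rows (or of tuples plus a schema) works
rdd_rows = sc.parallelize([Row(id="u1", age=28), Row(id="u2", age=35)])
df = spark.createDataFrame(rdd_rows)
df.printSchema()
df.show()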
DataFrame schema
root
|-- create_time: long (nullable = true)
|-- id: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- age: string (nullable = true)
| |-- birthday: long (nullable = true)
| |-- city: string (nullable = true)
| |-- cookie_id: string (nullable = true)
| |-- created_ts: double (nullable = true)
| |-- email: string (nullable = true)
|-- raw: string (nullable = true)
|-- ids: struct (nullable = true)
| |-- cookie: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- registered: array (nullable = true)
| | |-- element: string (containsNull = true)
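The tree above is the output of printSchema(); assuming the users DataFrame loaded on the next slide, it is produced by:
users.printSchema()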
DataFrame example
users = spark.read.parquet('project_id=9e898732-a289-11e6-bc55-14187733e19e')
users.count() # 49130
# SQL style operations
users.filter("properties.gender == 'female'").count() # 590
# Expression builder operations
users.filter(users.properties.gender.like("female")).count() # 590
# Show results
users.filter(users.properties.gender.like("female")).select('properties.age').describe('age').show()
+-------+------------------+
|summary| age|
+-------+------------------+
| count | 590|
| mean | 50.3762|
| stddev | 20.5902|
| min | 15|
| max | 85|
+-------+------------------+
RDD, DF performance comparison
https://www.slideshare.net/databricks/2015-0616-spark-summit
Datasets
• Strongly typed
• …so, not available in Python
Spark ML pipelines – high-level API
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import Tokenizer, HashingTF

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(numFeatures=1000,
                      inputCol=tokenizer.getOutputCol(),
                      outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(trainingDataset)
https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
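The fitted pipeline is then used for scoring; as a quick usage sketch, where testDataset is a hypothetical DataFrame with a "text" column:
predictions = model.transform(testDataset)
predictions.select("text", "probability", "prediction").show(5)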
Spark ML pipelines – cross-validation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [10, 20, 40])
             .addGrid(lr.regParam, [0.01, 0.1, 1.0])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cv.save('cv-pipeline.parquet')
cvModel = cv.fit(trainingDataset)
cvModel.save('cv-model.parquet')
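After fitting, the winning model and the per-grid-point metrics can be inspected; a short sketch using the cvModel from above:
best_model = cvModel.bestModel   # PipelineModel refit with the best parameter combination
print(cvModel.avgMetrics)        # average evaluator metric for each point in paramGrid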
ML persistence
paramGrid = ParamGridBuilder()...
cv = CrossValidator().setEstimator(pipeline)...
cv.save('cv-pipeline.parquet')
cvModel = cv.fit(trainingDataset)
cvModel.save('cv-model.parquet')
• You can create a model in Python and deploy it in a Java/Scala app
• Support for almost all MLlib algorithms
• Support for both fitted and unfitted Pipelines, so Pipelines themselves are also exchangeable (see the load sketch below)
• Suited for large distributed models – model data is stored in the binary Parquet format
• Metadata and model parameters are stored in JSON format
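A minimal sketch of loading the persisted objects back, assuming a Spark version recent enough that the Python tuning classes support load() (plain Pipeline/PipelineModel persistence has been available longer):
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel

cv_loaded = CrossValidator.load('cv-pipeline.parquet')         # unfitted definition, can be fit again
cvModel_loaded = CrossValidatorModel.load('cv-model.parquet')  # fitted model, ready for transform()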
Submitting apps
1. Locally, suitable for dev (avoid this in production)
2. Cluster in client mode
3. Cluster in cluster mode
Running Spark locally
The fastest way to get started
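A minimal sketch of what this looks like in Spark 2.x with pyspark installed on the local machine; local[*] uses all local cores, and the app name is made up:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pydata-local-dev")
         .getOrCreate())
sc = spark.sparkContext  # the same entry point the RDD examples above use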
Submitting apps to YARN
config.yml
spark.app.name: "PyData Bratislava 2017"
spark.master: "yarn"
spark.submit.deployMode: "client"
spark.yarn.dist.files: "file:/pyspark.zip,file:py4j-0.10.3-src.zip"
spark.executorEnv.PYTHONPATH: "pyspark.zip:py4j-0.10.3-src.zip"
spark.executorEnv.PYTHONHASHSEED: "0"
spark.executor.instances: "12"
spark.executor.cores: "3"
spark.executor.memory: "6g"

import yaml
from pyspark import SparkConf, SparkContext

with open('config.yml') as f:
    config = yaml.safe_load(f)  # parse the settings above from the file
sparkConf = SparkConf().setAll([(k, v) for k, v in config.items()])
spark_context = SparkContext.getOrCreate(conf=sparkConf)
Submitting apps to YARN
#!/usr/bin/env bash
# 1. Create a virtualenv
virtualenv venv --python=/usr/bin/python3.5
# 2. Create a zip file with all your Python code
zip -ru pyfiles.zip * -x "*.pyc" -x "*.log" -x "venv/*"
# 3. Submit your app, pointing PYSPARK_PYTHON at the virtualenv's interpreter
PYSPARK_PYTHON=venv/bin/python /spark/spark-2.0.1/bin/spark-submit --py-files pyfiles.zip pipeline.py
Note:
• The venv should be created on each physical node (or on HDFS) so that every worker can use it
• Any external config files should also be either manually distributed or kept on HDFS
Jobs scheduling
• Jenkins hell
• Luigi
• Airflow (see the sketch below)
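As one hedged example of the last option, a minimal Airflow DAG (Airflow 1.x style imports, made-up DAG and task names) that runs the spark-submit command from the earlier slide once a day could look like:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="pyspark_pipeline",
    start_date=datetime(2017, 5, 1),
    schedule_interval="@daily",
)

submit_pipeline = BashOperator(
    task_id="submit_pipeline",
    bash_command=("PYSPARK_PYTHON=venv/bin/python "
                  "/spark/spark-2.0.1/bin/spark-submit --py-files pyfiles.zip pipeline.py"),
    dag=dag,
)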
Invitation to Hacking Thursday
• Topic: offline evaluation of recommender systems
• This Thursday, 5 pm
Interested in a demo? Let us know:
matus.cimerman@exponea.com