Cloud computing made easy in Joblib
Alexandre Abadie
Outline
An overview of Joblib
Joblib for cloud computing
Future work
Joblib in a word
A Python package to make your algorithms run faster
http://joblib.readthedocs.io
The ecosystem
54 different contributors since the project started in 2008 (chart: contributors per month)
Joblib is the computing backend used by Scikit-Learn
Stable and mature code base
https://github.com/joblib/joblib
Why Joblib?
Because we want to make use of all available computing resources
⇒ And ensure algorithms run as fast as possible
Because we work on large datasets
⇒ Data that just fits in RAM
Because we want the internal algorithm logic to remain unchanged
⇒ Adapted to embarrassingly parallel problems
Because we love simple APIs
⇒ And parallel programming is not user-friendly in general
How?
Embarrassingly Parallel computing helper
⇒ make parallel computing easy
Efficient disk caching to avoid recomputation
⇒ computation resource friendly
Fast I/O persistence
⇒ limit cache access time
No dependencies, optimized for numpy arrays
⇒ simple installation and integration in other projects
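As a quick taste, the two main helpers compose naturally. A minimal sketch (not from the slides; the path and the slow_transform name are illustrative):

# Combine disk caching with the parallel helper; the threading backend
# avoids pickling concerns for this interactively defined function.
from joblib import Memory, Parallel, delayed

mem = Memory(cachedir='/tmp/joblib_demo')

@mem.cache                  # memoize results on disk, keyed by arguments
def slow_transform(x):
    return x ** 2           # stand-in for a costly computation

# First run computes in parallel; rerunning hits the disk cache instead.
results = Parallel(n_jobs=2, backend='threading')(
    delayed(slow_transform)(i) for i in range(8))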
Overview
Parallel helper
>>> from joblib import Parallel, delayed
>>> from math import sqrt
>>> Parallel(n_jobs=3, verbose=50)(delayed(sqrt)(i**2) for i in range(6))
[Parallel(n_jobs=3)]: Done 1 tasks | elapsed: 0.0s
[...]
[Parallel(n_jobs=3)]: Done 6 out of 6 | elapsed: 0.0s finished
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
⇒ API can be extended with external backends
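As a hedged illustration of that extension point (a sketch against joblib's ParallelBackendBase interface; SyncBackend, _SyncResult and 'sync-demo' are made-up names, not from the slides):

# A toy external backend that runs each task immediately in the calling
# thread. Sketch only; a real backend would dispatch work elsewhere.
from joblib import Parallel, delayed, register_parallel_backend, parallel_backend
from joblib.parallel import ParallelBackendBase

class _SyncResult:
    def __init__(self, batch):
        self.value = batch()        # execute the task batch right away
    def get(self):
        return self.value

class SyncBackend(ParallelBackendBase):
    def effective_n_jobs(self, n_jobs):
        return 1                    # everything runs in the calling thread
    def apply_async(self, func, callback=None):
        result = _SyncResult(func)
        if callback:
            callback(result)        # notify Parallel that the task completed
        return result

register_parallel_backend('sync-demo', SyncBackend)
with parallel_backend('sync-demo'):
    print(Parallel()(delayed(abs)(-i) for i in range(4)))   # [0, 1, 2, 3]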
Parallel backends
Single machine backends: works on a Laptop
⇒ threading, multiprocessing and soon Loky
Multi machine backends: available as optional extensions
⇒ distributed, ipyparallel, CMFActivity, Hadoop Yarn
>>> from distributed.joblib import DistributedBackend
>>> from joblib import (Parallel, delayed,
...                     register_parallel_backend, parallel_backend)
>>> register_parallel_backend('distributed', DistributedBackend)
>>> with parallel_backend('distributed', scheduler_host='dscheduler:8786'):
...     Parallel(n_jobs=3)(delayed(sqrt)(i**2) for i in range(6))
[...]
Future: new backends for Celery, Spark
Caching on disk
Use a memoize pattern with the Memory object
>>> from joblib import Memory
>>> import numpy as np
>>> a = np.vander(np.arange(3)).astype(float)
>>> mem = Memory(cachedir='/tmp/joblib')
>>> square = mem.cache(np.square)
>>> b = square(a)
________________________________________________________________________________
[Memory] Calling square...
square(array([[ 0., 0., 1.],
[ 1., 1., 1.],
[ 4., 2., 1.]]))
___________________________________________________________square - 0...s, 0.0min
>>> c = square(a) # no recomputation
array([[ 0., 0., 1.],
[...]
Least Recently Used (LRU) cache replacement policy
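mem.cache also works as a decorator, which reads nicely for user-defined functions. A minimal sketch (expensive_square is an illustrative name):

# Memory.cache used as a decorator on a user-defined function.
from joblib import Memory
import numpy as np

mem = Memory(cachedir='/tmp/joblib')

@mem.cache
def expensive_square(x):
    return np.square(x)                  # only recomputed for unseen inputs

b = expensive_square(np.arange(3))       # computed and written to the cache
c = expensive_square(np.arange(3))       # loaded back from the disk cache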
Persistence
Convert an arbitrary object to/from a string of bytes
Streamable persistence to/from file or socket objects
>>> import numpy as np
>>> import joblib
>>> obj = [('a', [1, 2, 3]), ('b', np.arange(10))]
>>> joblib.dump(obj, '/tmp/test.pkl')
['/tmp/test.pkl']
>>> with open('/tmp/test.pkl', 'rb') as f:
...     joblib.load(f)
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
Use compression for fast I/O:
   support for zlib, gzip, bz2, xz and lzma compressors
>>> joblib.dump(obj, '/tmp/test.pkl.gz', compress=True, cache_size=0)
['/tmp/test.pkl.gz']
>>> joblib.load('/tmp/test.pkl.gz')
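Since dump and load also accept open file objects, the same data can be streamed to any writable target. A minimal sketch (the /tmp path is illustrative):

# Streamed persistence through file objects rather than filenames.
import numpy as np
import joblib

obj = [('a', [1, 2, 3]), ('b', np.arange(10))]
with open('/tmp/test_stream.pkl', 'wb') as f:
    joblib.dump(obj, f)                   # write straight to the file object
with open('/tmp/test_stream.pkl', 'rb') as f:
    restored = joblib.load(f)             # read it back the same way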
Outline
Joblib in a word
⇒ Joblib for cloud computing
Future work
The Cloud trend
Lots of Cloud providers on the market
Existing solutions for processing Big Data
Existing container orchestration solutions: Docker SWARM, Kubernetes
How can Joblib be used with them?
The general idea
Use pluggable multi-machine parallel backends
Principle: configure your backend and wrap the calls to Parallel
>>> import time
>>> import ipyparallel as ipp
>>> from ipyparallel.joblib import register as register_joblib
>>> from joblib import parallel_backend, Parallel, delayed
# Setup ipyparallel backend
>>> register_joblib()
>>> dview = ipp.Client()[:]
# Start the job
>>> with parallel_backend("ipyparallel", view=dview):
...     Parallel(n_jobs=20, verbose=50)(delayed(time.sleep)(1) for i in range(10))
Complete examples exist for:
Dask distributed: https://github.com/ogrisel/docker-distributed
Hadoop Yarn: https://github.com/joblib/joblib-hadoop
Use pluggable store backends
Extends the Memory API with other store providers
Not available upstream yet:
⇒ PR opened at https://github.com/joblib/joblib/pull/397
>>> import numpy as np
>>> from joblib import Memory
>>> from joblibhadoop.hdfs import register_hdfs_store_backend
# Register HDFS store backend provider
>>> register_hdfs_store_backend()
# Persist data in hdfs://namenode:9000/user/john/cache/joblib
>>> mem = Memory(location='cache', backend='hdfs',
...              host='namenode', port=9000, user='john', compress=True)
>>> multiply = mem.cache(np.multiply)
Store backends available:
Amazon S3: https://github.com/aabadie/joblib-s3
Hadoop HDFS: https://github.com/joblib/joblib-hadoop
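For S3 the pattern should mirror the HDFS example above. A sketch only: the joblibs3 helper and Memory parameters below are assumptions based on the joblib-s3 repository, so check its README before use.

# Assumed joblib-s3 usage, mirroring the HDFS store backend example.
import numpy as np
from joblib import Memory
from joblibs3 import register_s3_store_backend   # name assumed from joblib-s3

register_s3_store_backend()
mem = Memory(location='joblib_cache', backend='s3',
             bucket='my-bucket', compress=True)  # 'my-bucket' is illustrative
multiply = mem.cache(np.multiply)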
Using Hadoop with Joblib
joblib-hadoop package: https://github.com/joblib/joblib-hadoop
Provides Docker container helpers for development and testing
⇒ no need for a production Hadoop cluster
⇒ makes developer life easier: CI on Travis is possible
⇒ the local repository on the host is shared with the joblib-hadoop-node container
Outline
Joblib in a word
Joblib for cloud computing
⇒ Future work and conclusion
Future work
In-memory object caching
⇒ Should save RAM during a parallel job
Allow overriding of parallel backends
⇒ See PR: https://github.com/joblib/joblib/pull/524
⇒ Seamless distributed computing in scikit-learn
Replace multiprocessing parallel backend with Loky
⇒ See PR: https://github.com/joblib/joblib/pull/516
Extend Cloud provider support
⇒ Using Apache Libcloud gives access to many more Cloud providers
Conclusion
The Parallel helper is adapted to embarrassingly parallel problems
Already a lot of parallel backends available
⇒ threading, multiprocessing, loky, CMFActivity, distributed, ipyparallel, Yarn
Use caching techniques to avoid recomputation
Extra store backends available ⇒ HDFS (Hadoop) and AWS S3
Use Joblib either on your laptop or in the Cloud with very few code changes
Thanks!