On the code of data science
Gaël Varoquaux
import data_science
data_science.discover()
Disclaimer:
A bit of a Python bias
Key ideas carry over
What is data science?
G Varoquaux 2
Data science = statistics + code
Larger datasets
Automating insights
Data science is disruptive when it enables
real-time or personalized decisions
G Varoquaux 3
1 Writing code for data science
2 Machine learning in Python
3 Outlook: infrastructure
Make you more productive as a data scientist
G Varoquaux 4
1 Writing code for data science
Productivity & Quality
G Varoquaux 5
1 A data-science workflow
Work based on intuition
and experimentation
Conjecture
Experiment
⇒ Interactive & framework-less
Yet
needs consolidation
keeping flexibility
G Varoquaux 6
1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model
Manages the data
and rules of the
application
View
Output representation
Possibly several views
Controller
Accepts input
and converts it to
commands
for model and view
Photo-editing software
Filters Canvas Tool palettes
Typical web application
Database Web-pages URLs
G Varoquaux 7
For data science:
Model: numerical, data-processing, & experimental logic
  Module with functions
View: results, as files. Data & plots
  Post-processing script; CSV & data files
Controller: imperative API. Avoid input as files: not expressive
  Script ⇒ for loops
A recipe
3 types of files:
• modules • command scripts • post-processing scripts
CSVs & intermediate data files
Separate computation from analysis / plotting
Code and text (and data) ⇒ version control
Decouple steps
Goals: Reuse code
Mitigate compute time
G Varoquaux 7
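A minimal sketch of the three roles in the recipe, collapsed into one file for brevity. The experiment, the function names and the file name `results.csv` are all hypothetical, chosen only to show computation writing a CSV that a separate analysis step reads back without recomputing anything.

```python
import csv
import statistics
import tempfile
from pathlib import Path


# --- "module": pure functions holding the numerical logic (Model) ---
def run_experiment(noise_level, n_points=50):
    """Hypothetical experiment: variance of a ramp plus a fixed perturbation."""
    values = [i / n_points + noise_level * ((i * 7919) % 13 - 6) / 6
              for i in range(n_points)]
    mean = statistics.fmean(values)
    return statistics.fmean([(v - mean) ** 2 for v in values])


# --- "command script": loops over parameters, writes results (Controller) ---
def command_script(csv_path, noise_levels):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["noise_level", "variance"])
        for noise in noise_levels:
            writer.writerow([noise, run_experiment(noise)])


# --- "post-processing script": reads the CSV, never recomputes (View) ---
def post_processing_script(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {float(r["noise_level"]): float(r["variance"]) for r in rows}


results_file = Path(tempfile.mkdtemp()) / "results.csv"
command_script(results_file, noise_levels=[0.0, 0.5, 1.0])
summary = post_processing_script(results_file)
for noise, variance in sorted(summary.items()):
    print(f"noise={noise:.1f}  variance={variance:.4f}")
```

In a real project each of the three parts would live in its own file, so that plots can be redone from the CSV without paying the compute time again.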
1 How I work: progressive consolidation
Start with a script, playing to understand the problem
Identify blocks/operations ⇒ move to a function
Use functions
Obstacle: local scope
requires identifying input and output variables
That’s a good thing
Solution: Interactive debugging / understanding
inside a function: %debug in IPython
Functions are the basic reusable abstraction
As they stabilize, move to a module
Modules
enable sharing between experiments
⇒ avoid 1000-line scripts + commented-out code
enable testing
Fast experiments as tests
⇒ gives confidence, hence refactorings
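As an illustration of a block moved into a function and then checked by a fast experiment, here is a small sketch. The function `normalize` and its test are hypothetical examples, not taken from the talk.

```python
def normalize(values):
    """Rescale a list of numbers to zero mean and unit variance.

    Inputs and outputs are explicit: nothing is read from global scope.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("cannot normalize constant data")
    scale = variance ** 0.5
    return [(v - mean) / scale for v in values]


def test_normalize_small_experiment():
    """A tiny, fast experiment doubling as a test."""
    out = normalize([1.0, 2.0, 3.0, 4.0])
    assert abs(sum(out)) < 1e-12                                 # zero mean
    assert abs(sum(v * v for v in out) / len(out) - 1) < 1e-12   # unit variance


test_normalize_small_experiment()
```

Once such a function lives in a module, the same test can be collected by a test runner and rerun after every refactoring.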
Clean: delete code & files; you have version control
Attentional load makes it impossible
to find or understand things
Where’s Waldo?
Why is it hard?
Long compute times
make us unadventurous
Know your tools
Refactoring editor
Version control
G Varoquaux 8
1 Caching to tame computation time
The memoize pattern
import joblib
mem = joblib.Memory(location='.')  # named `cachedir` in older joblib releases
g = mem.cache(f)
b = g(a)  # computes b by calling f(a)
c = g(a)  # retrieves the result from the cache
For data science
a & b can be big
a & b arbitrary objects: no change in workflow
Results stored on disk
Fits in experimentation loop
Helps decrease re-run times
Black-boxy, persistence only implicit
G Varoquaux 9
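To make the memoize pattern less of a black box, here is a deliberately naive, standard-library sketch of memoizing to disk. It is not how joblib is implemented; `disk_cache` and `slow_square` are hypothetical names, and the sketch assumes the arguments and results can be pickled.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Decorator factory: memoize a function's results as pickle files."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        def wrapper(*args):
            # Key the cache on the function name and the pickled arguments.
            key = hashlib.sha1(
                pickle.dumps((func.__name__, args))).hexdigest()
            path = cache_dir / f"{key}.pkl"
            if path.exists():
                return pickle.loads(path.read_bytes())   # cache hit
            result = func(*args)                          # cache miss
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


calls = []


@disk_cache(tempfile.mkdtemp())
def slow_square(x):
    calls.append(x)          # records real computations only
    return x * x


b = slow_square(12)   # computed by calling the function
c = slow_square(12)   # read back from disk, function not called again
print(b, c, len(calls))
```

The persistence is implicit here too: the caller never names a file, which is convenient in an experimentation loop and is also why the pattern can feel black-boxy.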
Adopting software-engineering best practices
G Varoquaux 10
1 The ladder of code quality
Use linting/code analysis in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Increasing cost as you climb the ladder
Avoid premature software engineering
Over versus under engineering
When the goal is generating insights
Experimentation to develop intuitions
⇒ new ideas
As the path becomes clear: consolidation
Heavy engineering too early freezes bad ideas
G Varoquaux 11
1 Libraries
A library
G Varoquaux 12
2 Machine learning in Python
scikit-learn
G Varoquaux 13
2 Tradeoffs
Experimentation Production-
scikit
G Varoquaux 14
2 My stack for data science
Python, what else?
General-purpose language
Interactive
Easy to read / write
The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared across languages
Connecting to
pandas
Columnar data
scikit-image
Images
scipy
Numerics, signal processing
...
G Varoquaux 15
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
"Prediction is very difficult, especially about the future." (Niels Bohr)
Learn as much as possible from the data
but not too much
[Figure: two candidate models fitted to the same (x, y) data]
Which model do you prefer?
Minimizing train error ≠ generalization: overfit
Adapting model complexity to data – regularization
G Varoquaux 16
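One concrete instance of regularization, offered only as an illustration since the slide does not name a method: a one-parameter least-squares fit with an L2 (ridge) penalty. The function name and the toy data are assumptions. A larger penalty pulls the slope toward zero, trading train error for a simpler model.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Least-squares slope through the origin with an L2 penalty.

    Minimizes sum((y - w * x) ** 2) + penalty * w ** 2, whose closed-form
    solution is w = sum(x * y) / (sum(x * x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for penalty in (0.0, 1.0, 10.0):
    print(penalty, fit_ridge_1d(xs, ys, penalty))
```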
2 Machine learning without learning the machinery
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
let’s disrupt something new
As easy as py
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, y_train)
Y_test = classifier.predict(X_test)
G Varoquaux 17
2 Show me your data: the samples × features matrix
Data input: a 2D numerical array
Requires transforming your problem
[Figure: a numerical matrix with one row per sample and one column per feature]
With text documents:
[Figure: a matrix with one row per document and one column per word: the, Python, performance, profiling, module, is, code, can, a]
sklearn.feature_extraction.text.TfidfVectorizer
G Varoquaux 18
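A sketch of what "transforming your problem" means for text, using plain Python lists. It builds raw word counts only; a TF-IDF vectorizer additionally reweights these counts. The function name and the two example documents are hypothetical.

```python
def count_matrix(documents):
    """Build a documents x words count matrix with plain Python lists."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, words in enumerate(tokenized):
        for word in words:
            matrix[i][column[word]] += 1
    return vocabulary, matrix


docs = ["Python code is code", "profiling the Python module"]
vocabulary, X = count_matrix(docs)
print(vocabulary)
for row in X:
    print(row)
```

Each row is a sample (a document) and each column a feature (a word), which is the 2D numerical layout that estimators expect.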
“Big” data
Engineering efficient processing pipelines
Many samples or many features
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
G Varoquaux 19
2 Many samples: on-line algorithms
estimator.partial_fit(X_train, Y_train)
Supervised models: predicting
sklearn.naive_bayes...
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering: grouping samples
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch
Linear decompositions: finding new representations
sklearn.decomposition.IncrementalPCA
sklearn.decomposition.MiniBatchDictionaryLearning
sklearn.decomposition.LatentDirichletAllocation
G Varoquaux 20
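The on-line idea can be sketched without any library: an estimator that sees the data one batch at a time and never holds all samples in memory. The class `RunningMean` and the `batches` generator are toy stand-ins invented for this example; only the method name `partial_fit` mirrors the convention shown above.

```python
class RunningMean:
    """Toy on-line estimator: per-feature mean, updated batch by batch."""

    def __init__(self):
        self.n_samples_ = 0
        self.mean_ = None

    def partial_fit(self, X_batch):
        for row in X_batch:
            if self.mean_ is None:
                self.mean_ = [0.0] * len(row)
            self.n_samples_ += 1
            # Incremental update: mean += (x - mean) / n
            self.mean_ = [m + (x - m) / self.n_samples_
                          for m, x in zip(self.mean_, row)]
        return self


def batches():
    """Stand-in for chunks read one at a time from disk or a stream."""
    yield [[1.0, 10.0], [2.0, 20.0]]
    yield [[3.0, 30.0]]
    yield [[4.0, 40.0], [5.0, 50.0]]


estimator = RunningMean()
for X_batch in batches():
    estimator.partial_fit(X_batch)
print(estimator.n_samples_, estimator.mean_)
```

Memory use depends on the batch size and the number of features, not on the total number of samples.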
2 Many features: on-the-fly data reduction
⇒ Reduce the data as it is loaded
X_small = estimator.transform(X_big, y)
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
G Varoquaux 21
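A rough sketch of the hashing idea for observations of varying size, using only the standard library. It is not scikit-learn's implementation, whose details differ; the function name, the choice of `zlib.crc32` and `n_features=8` are assumptions made for the example.

```python
import zlib


def hashing_vectorize(tokens, n_features=8):
    """Map a variable-length token list to a fixed-length count vector."""
    vector = [0] * n_features
    for token in tokens:
        # crc32 is deterministic across processes, unlike built-in hash()
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


short_doc = "python code".split()
long_doc = "python code can profile python code".split()
v_short = hashing_vectorize(short_doc)
v_long = hashing_vectorize(long_doc)
print(v_short)
print(v_long)
```

No vocabulary is learned or stored, so the function holds no state: workers can vectorize different documents in parallel and still agree on the columns.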
More gems in scikit-learn
SAG:
linear_model.LogisticRegression(solver='sag')
Fast linear model on biggish data
PCA == RandomizedPCA: (0.18)
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data
New cross-validation objects (0.18)
Old API (sklearn.cross_validation, deprecated in 0.18):
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
New API (sklearn.model_selection):
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
Data-independent ⇒ better nested-CV
Outlier detection and isolation forests (0.18)
G Varoquaux 22
3 Outlook: infrastructure
No strings attached
G Varoquaux 23
3 Dataflow is key to scale
Array computing (CPU)
Data parallel
Streaming
Parallel computing
Data + code transfer Out-of-memory persistence
These patterns can yield horrible code
G Varoquaux 24
3 Parallel-computing engine: joblib
sklearn.Estimator(n_jobs=2)
Under the hood: joblib
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...     for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel
import distributed.joblib
from joblib import Parallel, parallel_backend
with parallel_backend('dask.distributed',
                      scheduler_host='HOST:PORT'):
    # normal Joblib code
Algorithmic plain-Python code
with optional bells and whistles
G Varoquaux 25
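For readers without joblib at hand, the same "plain algorithmic code, parallelism as an option" shape can be approximated with the standard library. This is an analogue, not joblib's mechanism; `parallel_map` is a hypothetical helper and threads are chosen here only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def parallel_map(func, inputs, n_jobs=2):
    """Apply func to each input using a pool of n_jobs worker threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves the input order in its results
        return list(pool.map(func, inputs))


results = parallel_map(sqrt, [i ** 2 for i in range(8)])
print(results)
```

The calling code stays a simple expression over inputs; how many workers run, and where, is a separate knob.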
3 Persistence to disk
Persisting any Python object fast, with little overhead
Streamed compression
I/O from/in open file handles
⇒ In S3, HDFS
G Varoquaux 26
3 joblib.Memory as a storage pool in dev
S3/HDFS/cloud backend:
joblib.Memory('uri', backend='s3')
https://github.com/joblib/joblib/pull/397
Cache replacement:
mem = joblib.Memory('uri', bytes_limit='1G')
mem.reduce_size()  # Remove oldest
Out-of-memory computing
>>> result = mem.cache(g).call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g", argument_hash="...")
>>> c = result.get()
G Varoquaux 27
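The shelving idea, computing a value, leaving it on disk, and passing around only a small handle, can be sketched with the standard library. The names `ShelvedResult` and `compute_and_shelve` are hypothetical and this is not joblib's implementation; it assumes results can be pickled.

```python
import pickle
import tempfile
import uuid
from pathlib import Path


class ShelvedResult:
    """Lightweight handle to a result stored on disk."""

    def __init__(self, path):
        self.path = Path(path)

    def get(self):
        """Load the stored value back into memory on demand."""
        return pickle.loads(self.path.read_bytes())


def compute_and_shelve(func, *args, directory):
    """Run func, write its output to disk, and return only a handle."""
    path = Path(directory) / f"{uuid.uuid4().hex}.pkl"
    path.write_bytes(pickle.dumps(func(*args)))
    return ShelvedResult(path)


store = tempfile.mkdtemp()
handle = compute_and_shelve(lambda n: list(range(n)), 5, directory=store)
c = handle.get()
print(c)
```

Many handles can be kept around cheaply, with each large result loaded only when `get()` is called.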
3 joblib as bricks of a compute engine: vision
Parallel on a cloud/cluster
Memory + shelving on distributed stores,
as out-of-core & common computation
Simple algorithmic code
Using infrastructure without buying into it
G Varoquaux 28
Time to wrap up
Code, code, code
Scipy-lectures: learning numerical Python
Many problems are better solved by
documentation than new code
Comprehensive document: numpy, scipy, ...
1. Getting started with Python for science
2. Advanced topics
3. Packages and applications
http://scipy-lectures.org
Code examples
On the code of data science
@GaelVaroquaux
I believe in code
without compromises
Libraries
with side-effect free code
tight APIs
decoupling of analysis
and plotting
Software engineering
version control
Code empowers
enables automation
opens new applications
Disrupt something new
We use scikit-learn for markers
of neuropsychiatric diseases
Software purity holds back
Interactivity:
brittle and costly
but necessary for insights
Refusing syntax shortcuts
⇒ verbosity
The cost is worth the benefit in the long run

More Related Content

PDF
Computational practices for reproducible science
PDF
Open Source Scientific Software
PDF
Succeeding in academia despite doing good_software
PDF
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
PDF
Processing biggish data on commodity hardware: simple Python patterns
PDF
Building a cutting-edge data processing environment on a budget
PDF
Python for brain mining: (neuro)science with state of the art machine learnin...
PDF
Deep Learning Cases: Text and Image Processing
Computational practices for reproducible science
Open Source Scientific Software
Succeeding in academia despite doing good_software
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Processing biggish data on commodity hardware: simple Python patterns
Building a cutting-edge data processing environment on a budget
Python for brain mining: (neuro)science with state of the art machine learnin...
Deep Learning Cases: Text and Image Processing

Viewers also liked (15)

PDF
Scientist meets web dev: how Python became the language of data
PDF
Simple big data, in Python
PDF
Scikit-learn for easy machine learning: the vision, the tool, and the project
PDF
Scikit-learn: the state of the union 2016
PDF
Brain maps from machine learning? Spatial regularizations
PDF
Inter-site autism biomarkers from resting state fMRI
PDF
Machine learning and cognitive neuroimaging: new tools can answer new questions
PDF
A hand-waving introduction to sparsity for compressed tomography reconstruction
PDF
Advanced network modelling 2: connectivity measures, goup analysis
PDF
Brain network modelling: connectivity metrics and group analysis
PDF
Social-sparsity brain decoders: faster spatial sparsity
PDF
Connectomics: Parcellations and Network Analysis Methods
PDF
Scikit learn: apprentissage statistique en Python
PDF
Brain reading, compressive sensing, fMRI and statistical learning in Python
PDF
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scientist meets web dev: how Python became the language of data
Simple big data, in Python
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn: the state of the union 2016
Brain maps from machine learning? Spatial regularizations
Inter-site autism biomarkers from resting state fMRI
Machine learning and cognitive neuroimaging: new tools can answer new questions
A hand-waving introduction to sparsity for compressed tomography reconstruction
Advanced network modelling 2: connectivity measures, goup analysis
Brain network modelling: connectivity metrics and group analysis
Social-sparsity brain decoders: faster spatial sparsity
Connectomics: Parcellations and Network Analysis Methods
Scikit learn: apprentissage statistique en Python
Brain reading, compressive sensing, fMRI and statistical learning in Python
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Ad

Similar to On the code of data science (20)

PDF
S2-Programming_with_Data_Computational_Physics.pdf
PDF
Learn To Code Like A Professional With Pythonan Open Source Versatile And Pow...
PDF
Software maintenance PyConPL 2016
PDF
High Performance Python 2nd Edition Micha Gorelick
PDF
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
PDF
Python Machine Learning Sebastian Raschka Vahid Mirjalili
PPTX
Best practices in coding for beginners
PDF
Software Engineering for Data Scientists (MEAP V2) Andrew Treadway
PDF
(Ebook) Data Science with Python by coll.
PDF
Coding for science and innovation
PPTX
Beggining your career in python programming
PDF
(Ebook) High Performance Python by Micha Gorelick, Ian Ozsvald
PDF
Python Essentials For Dummies John C Shovic Alan Simpson
PDF
Effective Python 90 specific ways to write better Python Second Edition Brett...
PPTX
Artificial Intelligence, Machine Learning and Deep Learning
PPT
Introduction to the intermediate Python - v1.1
PDF
python-programming-3-books-in-ryan-turner_compress.pdf
PDF
Software Engineering For Data Scientists Meap V2 Chapters 1 To 7 Of 14 Andrew...
PDF
Effective Python 90 specific ways to write better Python Second Edition Brett...
PPTX
Clean code in Jupyter notebooks
S2-Programming_with_Data_Computational_Physics.pdf
Learn To Code Like A Professional With Pythonan Open Source Versatile And Pow...
Software maintenance PyConPL 2016
High Performance Python 2nd Edition Micha Gorelick
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Python Machine Learning Sebastian Raschka Vahid Mirjalili
Best practices in coding for beginners
Software Engineering for Data Scientists (MEAP V2) Andrew Treadway
(Ebook) Data Science with Python by coll.
Coding for science and innovation
Beggining your career in python programming
(Ebook) High Performance Python by Micha Gorelick, Ian Ozsvald
Python Essentials For Dummies John C Shovic Alan Simpson
Effective Python 90 specific ways to write better Python Second Edition Brett...
Artificial Intelligence, Machine Learning and Deep Learning
Introduction to the intermediate Python - v1.1
python-programming-3-books-in-ryan-turner_compress.pdf
Software Engineering For Data Scientists Meap V2 Chapters 1 To 7 Of 14 Andrew...
Effective Python 90 specific ways to write better Python Second Edition Brett...
Clean code in Jupyter notebooks
Ad

More from Gael Varoquaux (14)

PDF
Evaluating machine learning models and their diagnostic value
PDF
Measuring mental health with machine learning and brain imaging
PDF
Machine learning with missing values
PDF
Dirty data science machine learning on non-curated data
PDF
Representation learning in limited-data settings
PDF
Better neuroimaging data processing: driven by evidence, open communities, an...
PDF
Functional-connectome biomarkers to meet clinical needs?
PDF
Atlases of cognition with large-scale human brain mapping
PDF
Similarity encoding for learning on dirty categorical variables
PDF
Machine learning for functional connectomes
PDF
Towards psychoinformatics with machine learning and brain imaging
PDF
Simple representations for learning: factorizations and similarities
PDF
A tutorial on Machine Learning, with illustrations for MR imaging
PDF
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Evaluating machine learning models and their diagnostic value
Measuring mental health with machine learning and brain imaging
Machine learning with missing values
Dirty data science machine learning on non-curated data
Representation learning in limited-data settings
Better neuroimaging data processing: driven by evidence, open communities, an...
Functional-connectome biomarkers to meet clinical needs?
Atlases of cognition with large-scale human brain mapping
Similarity encoding for learning on dirty categorical variables
Machine learning for functional connectomes
Towards psychoinformatics with machine learning and brain imaging
Simple representations for learning: factorizations and similarities
A tutorial on Machine Learning, with illustrations for MR imaging
Estimating Functional Connectomes: Sparsity’s Strength and Limitations

Recently uploaded (20)

PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Encapsulation theory and applications.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
DOCX
The AUB Centre for AI in Media Proposal.docx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPT
Teaching material agriculture food technology
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Big Data Technologies - Introduction.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Empathic Computing: Creating Shared Understanding
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Spectral efficient network and resource selection model in 5G networks
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Encapsulation theory and applications.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
The AUB Centre for AI in Media Proposal.docx
“AI and Expert System Decision Support & Business Intelligence Systems”
Advanced methodologies resolving dimensionality complications for autism neur...
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Teaching material agriculture food technology
Digital-Transformation-Roadmap-for-Companies.pptx
Reach Out and Touch Someone: Haptics and Empathic Computing
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation_ Review paper, used for researhc scholars
Big Data Technologies - Introduction.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Machine learning based COVID-19 study performance prediction
Empathic Computing: Creating Shared Understanding
Per capita expenditure prediction using model stacking based on satellite ima...
Spectral efficient network and resource selection model in 5G networks

On the code of data science

  • 1. On the code of data science Varoquaux Ga¨el import data science data science.discover()
  • 2. On the code of data science Varoquaux Ga¨el import data science data science.discover() Disclaimer: A bit of a Python bias Key ideas carry over
  • 3. What is data science? G Varoquaux 2
  • 4. What is data science? G Varoquaux 2
  • 5. Data science = statistics + code Larger datasets Automating insights Data science is disruptive when it enables real-time or personalized decisions G Varoquaux 3
  • 6. 1 Writing code for data science 2 Machine learning in Python 3 Outlook: infrastructure Make you more productive as a data scientist G Varoquaux 4
  • 7. 1 Writing code for data science Productivity & Quality G Varoquaux 5
  • 8. 1 A data-science workflow Work based on intuition and experimentation Conjecture Experiment ñ Interactive & framework-less Yet needs consolidation keeping flexibility G Varoquaux 6
  • 9. 1 A design pattern in computational experiments MVC pattern from Wikipedia: Model Manages the data and rules of the application View Output represen- tation Possibly several views Controller Accepts input and converts it to commands for model and view Photo-editing software Filters Canvas Tool palettes Typical web application Database Web-pages URLs G Varoquaux 7
  • 12. 1 A design pattern in computational experiments. MVC for data science: the Model is the numerical, data-processing, & experimental logic (a module with functions); the View is the results, as files: data & plots (a post-processing script, CSV & data files); the Controller is an imperative API (a script ⇒ for loops); avoid input as files: not expressive. A recipe: 3 types of files (modules, command scripts, post-processing scripts), plus CSVs & intermediate data files. Separate computation from analysis / plotting. Code and text (and data) ⇒ version control. Decouple steps. Goals: reuse code, mitigate compute time.
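The recipe on this slide can be sketched in a single file; in practice the three parts would live in three separate files. This is a minimal illustration, not the author's code: the names `simulate`, `run_experiments`, `summarize` and `results.csv` are hypothetical.

```python
import csv
import statistics
import tempfile
from pathlib import Path


# --- module: numerical logic only, no I/O, no plotting ---
def simulate(n_samples, scale):
    """Toy 'experiment': return a deterministic score for the given parameters."""
    return scale * sum(1.0 / (i + 1) for i in range(n_samples))


# --- command script: for loops over parameters, results written as files ---
def run_experiments(out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["n_samples", "scale", "score"])
        for n_samples in (10, 100):
            for scale in (0.5, 1.0):
                writer.writerow([n_samples, scale, simulate(n_samples, scale)])


# --- post-processing script: reads the CSV, never recomputes ---
def summarize(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return statistics.mean(float(row["score"]) for row in rows)


results_file = Path(tempfile.mkdtemp()) / "results.csv"
run_experiments(results_file)
mean_score = summarize(results_file)
print(f"mean score over {results_file.name}: {mean_score:.3f}")
```

Because the post-processing step only reads the CSV, analysis and plots can be redone without paying the compute time again.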
  • 13. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem.
  • 14. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem. Identify blocks/operations ⇒ move to a function. Use functions. Obstacle: local scope requires identifying input and output variables. That's a good thing. Solution: interactive debugging / understanding inside a function with %debug in IPython. Functions are the basic reusable abstraction.
  • 15. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem. Identify blocks/operations ⇒ move to a function. As they stabilize, move to a module. Modules enable sharing between experiments ⇒ avoid 1000-line scripts + commented code. Modules enable testing: fast experiments as tests ⇒ gives confidence, hence refactorings.
  • 16. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem. Identify blocks/operations ⇒ move to a function. As they stabilize, move to a module. Clean: delete code & files, you have version control. Attentional load makes it impossible to find or understand things (Where's Waldo?).
  • 17. 1 How I work: progressive consolidation. Start with a script, playing to understand the problem. Identify blocks/operations ⇒ move to a function. As they stabilize, move to a module. Clean: delete code & files, you have version control. Why is it hard? Long compute times make us unadventurous. Know your tools: refactoring editor, version control.
  • 18. 1 Caching to tame computation time. The memoize pattern:
        import joblib
        mem = joblib.Memory(cachedir='.')
        g = mem.cache(f)
        b = g(a)  # computes b using f
        c = g(a)  # retrieves the result from the cache
    For data science: a & b can be big; a & b can be arbitrary objects; no change in workflow; results stored on disk.
  • 19. 1 Caching to tame computation time. The memoize pattern (g = mem.cache(f), with mem = joblib.Memory(cachedir='.')) for data science: fits in the experimentation loop; helps decrease re-run times; but black-boxy, persistence only implicit.
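To make the "black-boxy" part less opaque, here is a minimal sketch of disk-backed memoization using only the standard library. It is not joblib's implementation: the decorator `disk_cache`, the pickle-based hashing of arguments and the cache layout are assumptions made for illustration, and the sketch ignores changes to the function's code. (Recent joblib releases spell the constructor argument `location` rather than `cachedir`.)

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Return a decorator that stores the results of a function on disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        def wrapper(*args):
            # Key the cache on the function name and the pickled arguments.
            key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
            path = cache_dir / f"{key}.pkl"
            if path.exists():
                return pickle.loads(path.read_bytes())  # cache hit: load from disk
            result = func(*args)
            path.write_bytes(pickle.dumps(result))  # cache miss: compute and store
            return result
        return wrapper
    return decorator


calls = []


def f(a):
    calls.append(a)  # record each real computation
    return [x ** 2 for x in a]


g = disk_cache(tempfile.mkdtemp())(f)
a = [1, 2, 3]
b = g(a)  # computes b using f
c = g(a)  # retrieves the result from disk; f is not called again
```

The calling code is unchanged apart from wrapping `f`, which is what lets the pattern fit into an experimentation loop.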
  • 20. Adopting software-engineering best practices
  • 24. 1 The ladder of code quality, by increasing cost: use linting/code analysis in your editor (seriously); coding convention, good naming; version control: use git + github; code review; unit testing: if it's not tested, it's broken or soon will be; make a package: controlled dependencies and compilation; ... Avoid premature software engineering. Over- versus under-engineering: when the goal is generating insights, experimentation develops intuitions ⇒ new ideas. As the path becomes clear: consolidation. Heavy engineering too early freezes bad ideas.
  • 25. 1 Libraries. At the top of the ladder of increasing cost (linting/code analysis, coding convention and good naming, version control, code review, unit testing, making a package, ...): a library.
  • 26. 2 Machine learning in Python: scikit-learn
  • 28. 2 My stack for data science. Python, what else? General-purpose language; interactive; easy to read / write.
  • 29. 2 My stack for data science. The scientific Python stack: numpy arrays. Mostly a float**; no annotation / structure; universal across applications; easily shared across languages.
  • 30. 2 My stack for data science. The scientific Python stack: numpy arrays, connecting to pandas (columnar data), scikit-image (images), scipy (numerics, signal processing), ...
  • 33. 2 Machine learning in a nutshell. Machine learning is about making predictions from data, e.g. learning to distinguish apples from oranges. "Prediction is very difficult, especially about the future." (Niels Bohr) Learn as much as possible from the data, but not too much. [Figure: two panels with axes x and y] Which model do you prefer?
  • 34. 2 Machine learning in a nutshell. Learn as much as possible from the data, but not too much. [Figure: two panels with axes x and y] Minimizing train error ≠ generalization: overfit.
  • 35. 2 Machine learning in a nutshell. Learn as much as possible from the data, but not too much. [Figure: two panels with axes x and y] Adapting model complexity to data: regularization.
  • 38. 2 Machine learning without learning the machinery. A library, not a program: more expressive and flexible; easy to include in an ecosystem ("let's disrupt something new"). As easy as py:
        from sklearn import svm
        classifier = svm.SVC()
        classifier.fit(X_train, y_train)
        Y_test = classifier.predict(X_test)
  • 39. 2 Show me your data: the samples × features matrix. Data input: a 2D numerical array (rows: samples, columns: features). Requires transforming your problem.
  • 40. 2 Show me your data: the samples × features matrix. Data input: a 2D numerical array. Requires transforming your problem. With text documents: rows are documents, columns are words (the, Python, performance, profiling, module, is, code, can, a); see sklearn.feature_extraction.text.TfidfVectorizer.
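A minimal sketch of turning text into a samples × features matrix, using plain word counts rather than TF-IDF weighting. The toy documents and the helper `count_matrix` are made up for illustration; this is not how scikit-learn implements its vectorizers.

```python
def count_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, words in enumerate(tokenized):
        for word in words:
            matrix[i][column[word]] += 1  # count occurrences of each word
    return vocabulary, matrix


documents = [
    "the Python code is code",
    "the profiling module can profile Python code",
]
vocabulary, X = count_matrix(documents)
for row in X:
    print(row)
```

Every document, whatever its length, becomes a row with the same number of columns, which is the 2D numerical array that estimators expect.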
  • 41. "Big" data: engineering efficient processing pipelines. Many samples or many features. See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
  • 42. 2 Many samples: on-line algorithms: estimator.partial_fit(X_train, Y_train)
  • 43. 2 Many samples: on-line algorithms: estimator.partial_fit(X_train, Y_train). Supervised models (predicting): sklearn.naive_bayes..., sklearn.linear_model.SGDRegressor, sklearn.linear_model.SGDClassifier. Clustering (grouping samples): sklearn.cluster.MiniBatchKMeans, sklearn.cluster.Birch. Linear decompositions (finding new representations): sklearn.decomposition.IncrementalPCA, sklearn.decomposition.MiniBatchDictionaryLearning, sklearn.decomposition.LatentDirichletAllocation.
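The calling convention behind on-line algorithms can be sketched with a toy estimator. `RunningMean` and `stream_chunks` are hypothetical names, not scikit-learn objects; the sketch only mimics the `partial_fit` pattern of updating a model one chunk at a time.

```python
class RunningMean:
    """Toy on-line estimator: tracks the per-feature mean of the samples seen."""

    def __init__(self):
        self.n_samples_seen_ = 0
        self.mean_ = None

    def partial_fit(self, X_chunk):
        """Update the mean with a chunk of samples (a list of feature lists)."""
        for row in X_chunk:
            if self.mean_ is None:
                self.mean_ = [0.0] * len(row)
            self.n_samples_seen_ += 1
            # Incremental update: no past sample needs to stay in memory.
            self.mean_ = [m + (x - m) / self.n_samples_seen_
                          for m, x in zip(self.mean_, row)]
        return self


def stream_chunks(n_chunks, chunk_size):
    """Stand-in for data read chunk by chunk from disk or a network."""
    for start in range(0, n_chunks * chunk_size, chunk_size):
        yield [[float(i), 2.0 * i] for i in range(start, start + chunk_size)]


estimator = RunningMean()
for X_chunk in stream_chunks(n_chunks=4, chunk_size=5):
    estimator.partial_fit(X_chunk)
print(estimator.n_samples_seen_, estimator.mean_)
```

Only one chunk is held in memory at any time, so the number of samples is bounded by patience rather than by RAM.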
  • 44. 2 Many features: on-the-fly data reduction ⇒ reduce the data as it is loaded: X_small = estimator.transform(X_big, y)
  • 45. 2 Many features: on-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.FeatureAgglomeration; on images: super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel.
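Why hashing is stateless can be shown with a short standard-library sketch. The function `hash_vectorize`, the choice of MD5 and the 16-column output are assumptions for illustration, not the library's implementation.

```python
import hashlib


def hash_vectorize(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector."""
    vector = [0] * n_features
    for token in tokens:
        # Use a stable hash: the built-in hash() of str is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


short_doc = hash_vectorize("fast python code".split())
long_doc = hash_vectorize("the python profiling module is python code".split())
print(short_doc)
print(long_doc)
```

No vocabulary is learned or stored, so any worker can vectorize any document independently; the price is that distinct words may collide in the same column.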
  • 47. More gems in scikit-learn. SAG: linear_model.LogisticRegression(solver='sag'), a fast linear model on biggish data. PCA == RandomizedPCA (0.18): heuristic to switch PCA to random linear algebra; fights global warming; huge speed gains for biggish data.
  • 48. More gems in scikit-learn. New cross-validation objects (0.18). Old API:
        from sklearn.cross_validation import StratifiedKFold
        cv = StratifiedKFold(y, n_folds=2)
        for train, test in cv:
            X_train = X[train]
            y_train = y[train]
    The new objects are data-independent ⇒ better nested-CV.
  • 49. More gems in scikit-learn. New cross-validation objects (0.18). New API:
        from sklearn.model_selection import StratifiedKFold
        cv = StratifiedKFold(n_splits=2)
        for train, test in cv.split(X, y):
            X_train = X[train]
            y_train = y[train]
    Data-independent ⇒ better nested-CV.
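A sketch of why data-independent splitters help nested cross-validation: one splitter object, configured without seeing any data, can be reused on the outer data and on each inner training set. `SimpleKFold` is a hypothetical, unstratified toy, not scikit-learn's class.

```python
class SimpleKFold:
    """Toy, unstratified K-fold splitter configured without seeing any data."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def split(self, X):
        """Yield (train_indices, test_indices) for any sequence X."""
        indices = list(range(len(X)))
        for k in range(self.n_splits):
            test = indices[k::self.n_splits]  # every n_splits-th sample
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


cv = SimpleKFold(n_splits=2)  # one object, no data attached
X_outer = list("abcdefgh")
nested_sizes = []
for outer_train, outer_test in cv.split(X_outer):
    X_inner = [X_outer[i] for i in outer_train]
    # The same splitter is reused on the smaller, inner training set.
    for inner_train, inner_test in cv.split(X_inner):
        nested_sizes.append((len(inner_train), len(inner_test)))
print(nested_sizes)
```

With a splitter that takes the labels in its constructor, a new object would have to be built for every inner training set.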
  • 50. More gems in scikit-learn. Outlier detection and isolation forests (0.18).
  • 51. 3 Outlook: infrastructure. No strings attached.
  • 52. 3 Dataflow is key to scale. Array computing (CPU), data parallel, streaming. Parallel computing; data + code transfer; out-of-memory persistence. These patterns can yield horrible code.
  • 57. 3 Parallel-computing engine: joblib. sklearn.Estimator(n_jobs=2). Under the hood: joblib.
        >>> from math import sqrt
        >>> from joblib import Parallel, delayed
        >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
        ...     for i in range(8))
        [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    Threading and multiprocessing mode. New: distributed computing backends: Yarn, dask.distributed, IPython.parallel.
        import distributed.joblib
        from joblib import Parallel, parallel_backend
        with parallel_backend('dask.distributed', scheduler_host='HOST:PORT'):
            ...  # normal Joblib code
    Algorithmic plain-Python code with optional bells and whistles.
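The idea that the algorithmic code stays plain Python while the execution engine is swapped can be illustrated with the standard library alone. This is an analogue using `concurrent.futures`, not joblib; the helper name `squared_then_root` is made up.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def squared_then_root(i):
    return sqrt(i ** 2)


# Sequential version: a plain list comprehension.
sequential = [squared_then_root(i) for i in range(8)]

# Parallel version: same function, same inputs, only the "engine" changes.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(squared_then_root, range(8)))

print(parallel)
```

`pool.map` returns results in input order, so the parallel run is a drop-in replacement for the loop.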
  • 58. 3 Persistence to disk. Persisting any Python object fast, with little overhead. Streamed compression. I/O from/in open file handles ⇒ in S3, HDFS.
  • 61. 3 joblib.Memory as a storage pool (in dev). S3/HDFS/cloud backend: joblib.Memory('uri', backend='s3'), see https://github.com/joblib/joblib/pull/397. Cache replacement:
        mem = joblib.Memory('uri', bytes_limit='1G')
        mem.reduce_size()  # Remove oldest
    Out-of-memory computing:
        >>> result = mem.cache(g).call_and_shelve(a)
        >>> result
        MemorizedResult(cachedir="...", func="g", argument_hash="...")
        >>> c = result.get()
  • 63. 3 joblib as bricks of a compute engine: vision. Parallel on a cloud/cluster. Memory + shelving on distributed stores, as out-of-core & common computation. Simple algorithmic code. Using infrastructure without buying into it.
  • 65. Time to wrap up: code, code, code.
  • 66. Scipy-lectures: learning numerical Python. Many problems are better solved by documentation than new code.
  • 67. Scipy-lectures: learning numerical Python. Comprehensive document: numpy, scipy, ... 1. Getting started with Python for science; 2. Advanced topics; 3. Packages and applications. http://scipy-lectures.org
  • 68. Scipy-lectures: learning numerical Python. Code examples.
  • 70. On the code of data science (@GaelVaroquaux). I believe in code without compromises. Libraries with side-effect-free code, tight APIs, decoupling of analysis and plotting. Software engineering: version control.
  • 71. On the code of data science (@GaelVaroquaux). I believe in code without compromises. Code empowers: enables automation, opens new applications. Disrupt something new: we use scikit-learn for markers of neuropsychiatric diseases.
  • 72. On the code of data science (@GaelVaroquaux). I believe in code without compromises. Code empowers. Software purity holds back. Interactivity: brittle and costly but necessary for insights. Refusing syntax shortcuts ⇒ verbosity.
  • 73. On the code of data science (@GaelVaroquaux). I believe in code without compromises. Code empowers. Software purity holds back. The cost is worth the benefit in the long run.