On the code of data science

Gaël Varoquaux
import data_science
data_science.discover()
Disclaimer:
A bit of a Python bias
Key ideas carry over

What is data science?
G Varoquaux 2

Data science = statistics + code
Larger datasets
Automating insights
Data science is disruptive when it enables
real-time or personalized decisions

1 Writing code for data science
2 Machine learning in Python
3 Outlook: infrastructure
Make you more productive as a data scientist

1 Writing code for data science
Productivity & Quality

1 A data-science workflow
Work based on intuition
and experimentation
Conjecture
Experiment
⇒ Interactive & framework-less
Yet
needs consolidation
keeping flexibility

1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model: manages the data and rules of the application
View: output representation; possibly several views
Controller: accepts input and converts it to commands for model and view
Photo-editing software: filters (model), canvas (view), tool palettes (controller)
Typical web application: database (model), web pages (view), URLs (controller)

For data science:
Model: numerical, data-processing, & experimental logic. A module with functions.
View: results, as files: data & plots. Post-processing scripts; CSV & data files.
Controller: imperative API. Avoid input as files: not expressive. A script ⇒ for loops.
A recipe
3 types of files:
• modules • command scripts • post-processing scripts
CSVs & intermediate data files
Separate computation from analysis / plotting
Code and text (and data) ⇒ version control
Decouple steps
Goals: Reuse code
Mitigate compute time
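The recipe above could look roughly like the sketch below. For brevity the three kinds of files are shown as three functions in one block; the names (noisy_mean, run_experiments, summarize, results.csv) are hypothetical and only illustrate the separation of computation, command script, and post-processing.

```python
import csv
import statistics
import tempfile
from pathlib import Path


# "Model": would live in a module of functions (e.g. a hypothetical experiment.py)
def noisy_mean(values, shift):
    """Toy computation standing in for the numerical logic."""
    return statistics.mean(v + shift for v in values)


# "Controller": a command script, essentially for loops writing a CSV
def run_experiments(out_path, shifts=(0.0, 0.5, 1.0)):
    data = [1.0, 2.0, 3.0]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["shift", "mean"])
        for shift in shifts:
            writer.writerow([shift, noisy_mean(data, shift)])


# "View": a post-processing script that only reads the intermediate CSV
def summarize(csv_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {float(r["shift"]): float(r["mean"]) for r in rows}


with tempfile.TemporaryDirectory() as tmp:
    results_file = Path(tmp) / "results.csv"
    run_experiments(results_file)     # the expensive step, run once
    summary = summarize(results_file)  # cheap, can be re-run at will
```

Because the post-processing step only depends on the CSV, plots and tables can be reworked without recomputing anything.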

1 How I work: progressive consolidation
Start with a script, playing to understand the problem

Identify blocks/operations ⇒ move to a function
Use functions
Obstacle: local scope
requires identifying input and output variables
That's a good thing
Solution: interactive debugging / understanding
inside a function: %debug in IPython
Functions are the basic reusable abstraction

As they stabilize, move to a module
Modules
enable sharing between experiments
⇒ avoid 1000-line scripts + commented-out code
enable testing
Fast experiments as tests
⇒ gives confidence, hence enables refactoring
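A minimal sketch of "fast experiments as tests", assuming a hypothetical function standardize that has stabilized enough to move into a module. The test is just a tiny experiment with a known answer, written so that a test runner such as pytest could collect it.

```python
import math


# Hypothetical function that would live in a module, e.g. analysis.py
def standardize(values):
    """Center values to zero mean and scale them to unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# A fast experiment on tiny data, kept as a test
def test_standardize():
    out = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(sum(out)) < 1e-12                                # zero mean
    assert abs(sum(v ** 2 for v in out) / len(out) - 1.0) < 1e-12  # unit variance


test_standardize()
```

Once such a test exists, the function body can be refactored freely: the test says within a fraction of a second whether the behavior changed.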

Clean: delete code & files; you have version control
Attentional load makes it impossible
to find or understand things
Where's Waldo?
Why is it hard?
Long compute times
make us unadventurous
Know your tools
Refactoring editor
Version control

1 Caching to tame computation time
The memoize pattern
import joblib
mem = joblib.Memory(location='.')  # the argument was named `cachedir` in older joblib versions
g = mem.cache(f)
b = g(a) # computes b using f
c = g(a) # retrieves the result from the cache
For data science
a & b can be big
a & b arbitrary objects: no change in workflow
Results stored on disk
Fits in experimentation loop
Helps decrease re-run times
Black-boxy, persistence only implicit
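A self-contained sketch of the memoize pattern, assuming a recent joblib where the cache directory is passed as `location`. The function slow_square and the temporary cache directory are illustrative choices, not part of the original slides.

```python
import tempfile

import joblib


def slow_square(x):
    """Stand-in for an expensive computation."""
    return x ** 2


with tempfile.TemporaryDirectory() as cache_dir:
    mem = joblib.Memory(location=cache_dir, verbose=0)
    cached_square = mem.cache(slow_square)
    b = cached_square(3)  # computed by slow_square, result stored on disk
    c = cached_square(3)  # same arguments: the stored result is reused
```

In a real project the cache directory would be persistent, so that results survive across interpreter sessions and re-runs of a script.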

Adopting software-engineering best practices

1 The ladder of code quality
Increasing cost:
Use linting/code analysis in your editor, seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it's not tested, it's broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
Over- versus under-engineering
When the goal is generating insights
Experimentation to develop intuitions
⇒ new ideas
As the path becomes clear: consolidation
Heavy engineering too early freezes bad ideas

1 Libraries
Increasing cost:
Code review
Unit testing
Make a package
...
A library

2 Machine learning in Python
scikit-learn

2 Tradeoffs
Experimentation / Production
scikit-learn

2 My stack for data science
Python, what else?
General-purpose language
Interactive
Easy to read / write

The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared across languages
[Figure: a 2D array of numbers]
Connecting to
pandas: columnar data
scikit-image: images
scipy: numerics, signal processing
...

2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges

"Prediction is very difficult, especially about the future." (Niels Bohr)
Learn as much as possible from the data
but not too much
[Figure: two plots of y against x, the same data fitted by two different models]
Which model do you prefer?
Minimizing train error ≠ generalization: overfit
Adapting model complexity to data: regularization
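The gap between train error and generalization can be sketched with polynomial fits of two complexities. The data-generating process (a noisy line), the sample sizes, and the degrees below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of a linear relationship."""
    x = rng.uniform(0, 1, size=n)
    y = 2 * x + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)


def errors(degree):
    """Mean squared error on train and held-out data for one polynomial degree."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse


train_simple, test_simple = errors(degree=1)
train_complex, test_complex = errors(degree=8)
```

The more complex model always fits the training points at least as well; comparing test_simple with test_complex shows whether that extra flexibility helped or hurt on new data.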

2 Machine learning without learning the machinery
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
let's disrupt something new
As easy as py
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, y_train)
Y_test = classifier.predict(X_test)
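A runnable version of the snippet above, assuming the iris dataset bundled with scikit-learn as example data; the split proportions and random seed are arbitrary.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Example data: 150 flowers, 4 numerical features, 3 classes
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Fraction of held-out samples predicted correctly
accuracy = (y_pred == y_test).mean()
```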

2 Show me your data: the samples × features matrix
Data input: a 2D numerical array
Requires transforming your problem
[Figure: a sparse numerical matrix; rows are samples, columns are features]
With text documents:
[Figure: the same matrix; rows are documents, columns are words: the, Python, performance, profiling, module, is, code, can, a]
sklearn.feature_extraction.text.TfidfVectorizer
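A small sketch of turning text documents into a samples × features matrix with TfidfVectorizer. The three documents are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Python code can be profiled",
    "the profiling module is a Python module",
    "performance of the code",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # sparse matrix: documents x words
n_samples, n_features = X.shape

# vocabulary_ maps each word to its column index in X
python_column = vectorizer.vocabulary_["python"]
```

With the default tokenizer, words are lowercased and single-character tokens such as "a" are dropped, so they do not get a column.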

"Big" data
Engineering efficient processing pipelines
Many samples or many features
[Figure: samples × features matrices, one with many samples, one with many features]
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns

2 Many samples: on-line algorithms
estimator.partial_fit(X_train, Y_train)
[Figure: the data matrix streamed to the estimator in chunks of samples]
Supervised models: predicting
sklearn.naive_bayes...
sklearn.linear_model.SGDRegressor
sklearn.linear_model.SGDClassifier
Clustering: grouping samples
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch
Linear decompositions: finding new representations
sklearn.decomposition.IncrementalPCA
sklearn.decomposition.MiniBatchDictionaryLearning
sklearn.decomposition.LatentDirichletAllocation
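A sketch of on-line learning with partial_fit, assuming data arrives in chunks. The generator data_chunks is hypothetical and produces synthetic, linearly separable data in place of chunks read from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def data_chunks(n_chunks=20, chunk_size=100):
    """Yield (X, y) chunks, standing in for data read piece by piece."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 2))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


estimator = SGDClassifier(random_state=0)
for X_chunk, y_chunk in data_chunks():
    # All classes are declared up front, since one chunk may not contain them all
    estimator.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))

# Evaluate on a fresh chunk that was never used for fitting
X_test, y_test = next(data_chunks(n_chunks=1, chunk_size=500))
accuracy = estimator.score(X_test, y_test)
```

Only one chunk is held in memory at a time, so the total number of samples is bounded by patience rather than by RAM.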

2 Many features: on-the-fly data reduction
⇒ Reduce the data as it is loaded
X_small = estimator.transform(X_big, y)
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
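A sketch combining two of the reductions listed above: hashing for text, then a random projection. The batches, n_features, and n_components are arbitrary illustrative values.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.random_projection import SparseRandomProjection

batch_1 = ["python code", "profiling python"]
batch_2 = ["python performance"]

# Stateless: there is no fit step, so separate workers can hash
# separate batches and still produce compatible columns
X_1 = HashingVectorizer(n_features=2 ** 10).transform(batch_1)
X_2 = HashingVectorizer(n_features=2 ** 10).transform(batch_2)

# Random linear combinations of the 1024 hashed features, down to 16
projector = SparseRandomProjection(n_components=16, random_state=0)
X_small = projector.fit_transform(X_1)
```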

More gems in scikit-learn
SAG:
linear_model.LogisticRegression(solver='sag')
Fast linear model on biggish data
PCA == RandomizedPCA: (0.18)
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data

New cross-validation objects (0.18)
Old API:
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(y, n_folds=2)
for train, test in cv:
    X_train = X[train]
    y_train = y[train]
New API:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    X_train = X[train]
    y_train = y[train]
Data-independent ⇒ better nested-CV
Outlier detection and isolation forests (0.18)
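A runnable sketch of the data-independent cross-validation object, on a tiny made-up dataset with two balanced classes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The splitter is built without seeing any data...
cv = StratifiedKFold(n_splits=2)

# ...and only receives X and y when splitting
folds = list(cv.split(X, y))

# Number of samples of each class in every test fold
test_class_counts = [np.bincount(y[test]).tolist() for _, test in folds]
```

Because the same cv object can be applied to any (X, y), it can be reused for the inner loop of a nested cross-validation, where the data changes at each outer split.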

3 Outlook: infrastructure
No strings attached

3 Dataflow is key to scale
Array computing
Data parallel
Streaming
[Figure: data arrays flowing to the CPU under each of the three patterns]
Parallel computing
Data + code transfer
Out-of-memory persistence
These patterns can yield horrible code

3 Parallel-computing engine: joblib
sklearn.Estimator(n_jobs=2)
Under the hood: joblib
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel
import distributed.joblib
from joblib import Parallel, parallel_backend
with parallel_backend('dask.distributed',
                      scheduler_host='HOST:PORT'):
    # normal Joblib code
Algorithmic plain-Python code
with optional bells and whistles
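A sketch showing that the parallel version stays plain algorithmic Python: the same generator expression runs sequentially or through joblib. The threading backend is chosen here only to keep the example lightweight.

```python
from math import sqrt

from joblib import Parallel, delayed

inputs = [i ** 2 for i in range(8)]

# Plain Python loop
sequential = [sqrt(x) for x in inputs]

# Same computation dispatched to 2 workers; results keep the input order
parallel = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(x) for x in inputs
)
```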

3 Persistence to disk
Persisting any Python object fast, with little overhead
Streamed compression
I/O from/in open file handles
⇒ In S3, HDFS
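A sketch of joblib persistence on a local disk, with compression and with open file handles. The object and file names are made up; a remote store such as S3 or HDFS would need a file-like object from the corresponding client library, which is not shown.

```python
import os
import tempfile

import joblib

data = {"weights": list(range(1000)), "name": "model"}

with tempfile.TemporaryDirectory() as tmp:
    # Compressed dump to a path
    path = os.path.join(tmp, "data.joblib")
    joblib.dump(data, path, compress=3)
    from_path = joblib.load(path)

    # Dump to and load from already-open file handles
    handle_path = os.path.join(tmp, "data_handle.joblib")
    with open(handle_path, "wb") as f:
        joblib.dump(data, f)
    with open(handle_path, "rb") as f:
        from_handle = joblib.load(f)
```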

3 joblib.Memory as a storage pool in dev
S3/HDFS/cloud backend:
joblib.Memory('uri', backend='s3')
https://github.com/joblib/joblib/pull/397
Cache replacement:
mem = joblib.Memory('uri', bytes_limit='1G')
mem.reduce_size() # Remove oldest
Out-of-memory computing
>>> result = mem.cache(g).call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g", argument_hash="...")
>>> c = result.get()
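A local sketch of the shelving pattern, assuming a recent joblib and a temporary directory as the store. The function g here is a trivial stand-in for a computation with a large output.

```python
import tempfile

import joblib


def g(a):
    """Stand-in for a computation whose output is too big to keep around."""
    return [x * 2 for x in a]


with tempfile.TemporaryDirectory() as cache_dir:
    mem = joblib.Memory(location=cache_dir, verbose=0)
    # The output is written to the store; only a light reference comes back
    result = mem.cache(g).call_and_shelve([1, 2, 3])
    # The actual value is loaded only when asked for
    c = result.get()
```

The reference can be passed around cheaply, for instance between parallel workers, and each consumer loads the value only when it needs it.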

3 joblib as bricks of a compute engine: vision
Parallel on a cloud/cluster
Memory + shelving on distributed stores
as out-of-core & common computation
Simple algorithmic code
Using infrastructure without buying into it

Time to wrap up
Code, code, code

Scipy-lectures: learning numerical Python
Many problems are better solved by
documentation than new code

Comprehensive document: numpy, scipy, ...
1. Getting started with Python for science
2. Advanced topics
3. Packages and applications
http://scipy-lectures.org

Code examples

@GaelVaroquaux
I believe in code
without compromises
Libraries
with side-effect free code
tight APIs
decoupling of analysis
and plotting
Software engineering
version control
Code empowers
enables automation
opens new applications
Disrupt something new
We use scikit-learn for markers
of neuropsychiatric diseases
Software purity holds back
Interactivity:
brittle and costly
but necessary for insights
Refusing syntax shortcuts
⇒ verbosity
The cost is worth the benefit in the long run
