PyData 2015 Keynote: "A Systems View of Machine Learning"

Joshua Bloom, Ph.D.
CTO, Co-founder
PyData, Seattle, July 2015
A Systems View of Machine Learning
in science & industry
Gordon & Betty
Moore Foundation
Data-Driven Investigator
UC Berkeley, Astronomy
@pro6sb

http://research.google.com/pubs/pub43146.html
• Complex models erode abstraction boundaries
• Data dependencies cost more than code dependencies
• System-level Spaghetti
• Changing External World
“It may be surprising to the academic community to know that only a fraction of the code … is actually doing ‘machine learning’. A mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code.”

Algorithms
Software
Hardware
Project Staff
Consumers
Organization + Society
ML System Components
Agenda
- inside-out discussion of component parts & some interconnects
- presentation of some facilitating new tools
- impact on problem definition, teams

Linear/Logistic Regression
Naive Bayes
Decision Trees
SVMs
Bagging
Boosting
Decision Forests
Neural Nets
Deep Learning
Nearest Neighbors
Gaussian/Dirichlet Processes
Splines
Lasso
XGBoost
LDA/LSI
RNN
BOW
word2vec
….
Some Algos/Models/Approaches Used in Practice
Software Instantiations in the Python Ecosystem

Nguyen et al, CVPR 2015
All Models of Learning Have Flaws
http://hunch.net/?p=224
“It’s common to forget the flaws of the model that you are most familiar…while the flaws of new models get exaggerated.”
- John Langford (2007, Microsoft Research)
Concepts ≠ Statistics
Convolutional networks can be fooled.

Nguyen et al, CVPR 2015
The impact of dataset bias
Training/testing on biased datasets gives unrealistic results.
E.g.: Torralba and Efros, Unbiased look at dataset bias, CVPR 2011.
Torralba/Efros11 via L. Bottou (ICML 2015)
Magritte, ICML, 1929

What are you optimizing for?
Component: What
Algorithm/Model: Learning rate, convexity, error bounds, scaling, …
+ Software/Hardware: Accuracy, Memory usage, Disk usage, CPU needs, time to learn, time to predict
+ Project Staff: time to implement, people/resource costs, reliability, maintainability, experimentability
+ Consumers: direct value, usability, explainability, actionability
+ Society: indirect value
- multi-axis optimizations in a given component
- highly coupled optimization considerations between components
- myopic view can be costly further up the stack

Scalar proxies:
- RMSE
- RMSLE
- [adjusted] R2
- ...
R2=0.91
RMSE = 692.3
Pearson R=0.96
Optimization Metric:
What’s the essence of what I care about?

scatter
outliers
bias
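A minimal sketch of why a single scalar proxy can hide scatter, outliers, and bias. The data below is made up for illustration, and the helper names (`mean_bias`, `max_abs_error`) are assumptions, not anything from the talk. Both prediction sets are constructed to share the same RMSE while failing in different ways.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rmsle(y_true, y_pred):
    """Root-mean-square log error; all inputs must be greater than -1."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_bias(y_true, y_pred):
    """Average signed error; non-zero means systematic over/under-prediction."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def max_abs_error(y_true, y_pred):
    """Largest single miss; sensitive to outliers."""
    return max(abs(p - t) for t, p in zip(y_true, y_pred))


# Illustrative data only.
y_true = [100.0, 200.0, 300.0, 400.0, 500.0]
biased = [t + 50.0 for t in y_true]                         # every prediction 50 too high
one_outlier = y_true[:-1] + [500.0 + 50.0 * math.sqrt(5)]   # one large miss, rest exact

for name, y_pred in [("biased", biased), ("one_outlier", one_outlier)]:
    print(name,
          "rmse=%.3f" % rmse(y_true, y_pred),
          "bias=%.3f" % mean_bias(y_true, y_pred),
          "max_err=%.3f" % max_abs_error(y_true, y_pred))
```

Reporting bias and worst-case error next to RMSE is one cheap way to keep the "essence of what I care about" visible; which extra diagnostics matter depends on the application.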

which classifier is best?
depends...

>$50k Prize
<$50k Prize
Netflix
winning metric
best benchmark
many teams get within ~few % of optimum
so which is easier to put into production?
Leaderboard data from Kaggle & Netflix
Optimization Metric
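One way to act on "which is easier to put into production?" is to treat every candidate within a small tolerance of the best score as equivalent, then pick the cheapest to deploy. This is only a sketch: the `Candidate` fields, the `deploy_cost` score, the 2% tolerance, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    error: float        # lower is better, e.g. RMSE on a held-out set
    deploy_cost: float  # hypothetical engineering-effort score; lower is cheaper


def pick_for_production(candidates, tolerance=0.02):
    """Return the cheapest candidate whose error is within `tolerance`
    (relative) of the best error; ties are broken by lower error."""
    best_error = min(c.error for c in candidates)
    near_best = [c for c in candidates if c.error <= best_error * (1.0 + tolerance)]
    return min(near_best, key=lambda c: (c.deploy_cost, c.error))


# Illustrative numbers only.
candidates = [
    Candidate("large_ensemble", error=0.850, deploy_cost=10.0),
    Candidate("single_model", error=0.860, deploy_cost=2.0),
    Candidate("baseline", error=0.950, deploy_cost=1.0),
]
print(pick_for_production(candidates).name)
```

Setting `tolerance=0` recovers the pure leaderboard choice, which makes the trade-off explicit rather than implicit.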

“We evaluated some of the new methods
offline but the additional accuracy gains
that we measured did not seem to justify the
engineering effort needed to bring them into
a production environment.”
Xavier Amatriain and Justin Basilico (April 2012)
On the Prize

WiseFactory
automated feature extraction, learning, prediction, deployment
WiseTransfer
efficient manipulation of large objects
WiseDataSet
WiseML
high-productivity data science in Python
WiseAlgorithm
WindTunnel
detect drift in CPU, Mem,
Accuracy, Statistics
Quality
Wrapping
High-Level API
Deployment &
Monitoring
C++ SDK
Core ML Stack at Wise.io
G. Blanco
D. Eads
J. Richards P. Baines H. Brink

Wise DataSet
BaseVariableGroup BaseVariableGroup BaseVariableGroup
InstanceGroup
InstanceGroup
InstanceGroup
RowSparse
RowMajor
HeterogeneousCache
AlgoRepo
ColSparse
MemMapped
Variable Mapper
Level Mapper
• fast, highly memory-efficient
• heterogeneous
• distributed
Goal: easily surface algorithms (written in C++ to be cache exploitative) to Python
WiseDataSets

Language-agnostic C++ Base Classes
Python-specific Derived Classes
Output
Input
Iterator
Processor
Array
Processor
String
Processor
FrameBuilder SeriesBuilder StringBuilder
Frame
Processor
R-specific Derived Classes
• expose flexible interface from Python to high-performance, Python-agnostic C++ code
• pass arbitrary data between layers using “Protocol Master” (like Protobufs)
• write C++ code generically for GraphLab, Spark, pandas, and Wise
WiseTransfer

Datasets for Data Science Comparison
Dataset | Slicing Induces Copy | Immutable Columns | Query | Transfer Speed to Python | C++ SDK | Distributed | Memory Efficiency | Categorical Optimized | Sparse & Dense
Pandas DataFrame | Sequences | No | Yes | N/A | No | No | Medium | Medium | Yes
GraphLab SFrame | Yes | Yes | Yes | Low | Yes | Yes | High | No | Yes
Spark DataFrame | Yes | Yes | Yes | Very Low | No | Yes | Low | No | Yes
Dask | Yes | No | Yes | N/A | No | Yes | Medium | No | No
Blaze | No | No | Yes | N/A | No | Yes | Medium | No | No
Wise DataSet | Copy-on-write | No | Yes | Very High | Yes | Yes | Very High | High | Yes
See also: Rob Story, today

Enforcing (Weak) Contracts: Monitoring Deployments
Build DS workflow → on test set, like the offline testing accuracy → deploy & start monitoring results → online, accuracy is worse than expected → ?
What to do:
1. Bang head to find (subtle) overfitting in model
2. Retrain: with new data (mo’ data, better answers)
3. Concept Drift: if retraining doesn’t help, jigger the DS workflow
4. Maybe that’s ok: Prediction influenced outcome. Hold out some live.
see also, Chris Harland’s talk yesterday; Mike Manapat, today
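"Hold out some live" can be sketched as a stable, hash-based assignment: a small fraction of live items is never acted on by the model, so their outcomes stay uninfluenced by predictions and can serve as an unbiased accuracy check. The function name, the 5% fraction, and the salt are assumptions for illustration.

```python
import hashlib


def in_holdout(item_id: str, fraction: float = 0.05, salt: str = "holdout-v1") -> bool:
    """Deterministically assign roughly `fraction` of items to a live holdout group.

    The same item_id always lands in the same group, so the assignment
    survives restarts and does not need to be stored.
    """
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < fraction


# Usage: skip acting on the prediction for held-out items, but still log it.
for ticket_id in ["ticket-1", "ticket-2", "ticket-3"]:
    group = "holdout" if in_holdout(ticket_id) else "treated"
    print(ticket_id, group)
```

Changing the salt reshuffles the groups, which is useful when a new experiment should not reuse the old holdout population.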

Enforcing (Weak) Contracts: Monitoring Deployments
Software Tests: unit tests, Regression Tests, Integration Tests
Of course you’re doing this…
ETL Testing: is my contract affected by the (changing) update?
Model Deployment Testing
@treycausey (yesterday)
some tools: Engarde, Hypothesis, Feature Forge
1. Need to know when things are too different than before
2. Then alert a real human
3. Use automated tools to try to isolate cause of change: data or code.
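Steps 1 and 2 above can be sketched with a two-sample Kolmogorov-Smirnov check on a single feature or score distribution. This is one possible technique chosen for illustration, not necessarily what any tool named here does; the function names and the `alpha` default are assumptions, and the critical value uses the standard large-sample approximation.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Maximum gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    for x in ref + cur:  # the supremum is attained at an observed value
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def drift_alert(reference, current, alpha=0.05):
    """Return (alert, statistic, critical_value) using the asymptotic KS threshold."""
    n, m = len(reference), len(current)
    d = ks_statistic(reference, current)
    critical = math.sqrt(-math.log(alpha / 2.0) / 2.0) * math.sqrt((n + m) / (n * m))
    return d > critical, d, critical


# Illustrative data: the "current" window is shifted relative to the reference.
reference = [float(i) for i in range(100)]
shifted = [x + 50.0 for x in reference]

alert, d, critical = drift_alert(reference, shifted)
if alert:
    print("ALERT a human: KS=%.3f exceeds %.3f" % (d, critical))
```

With many monitored features, the per-feature `alpha` would need a multiple-comparison adjustment to avoid paging a human on noise.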

reproducibility
• every deployment & drift test given unique hash
• generate data files & script with hash
• Perform sampling on known-good deployments
• Monitor RAM, CPU, accuracy metrics over time
• Probabilistic testing component of our continuous integration of ML, 10k++ tests
Wise “WindTunnel”
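A minimal sketch of the "unique hash" idea, assuming a deployment is described by a JSON-serializable config, the raw bytes of its data, and a code version string. This is not WindTunnel's implementation; `fingerprint` and `changed_components` are hypothetical names. Hashing the parts separately also gives a first hint about whether a change came from data or code.

```python
import hashlib
import json


def _sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fingerprint(config: dict, data_bytes: bytes, code_version: str):
    """Return (short_id, parts), where parts holds one hash per component.

    The config is serialized with sorted keys so that key order
    does not change the hash.
    """
    parts = {
        "config": _sha(json.dumps(config, sort_keys=True).encode("utf-8")),
        "data": _sha(data_bytes),
        "code": _sha(code_version.encode("utf-8")),
    }
    combined = _sha("|".join(parts[k] for k in sorted(parts)).encode("utf-8"))
    return combined[:16], parts


def changed_components(old_parts: dict, new_parts: dict):
    """List which components differ between two fingerprints."""
    return sorted(k for k in old_parts if old_parts[k] != new_parts.get(k))


# Illustrative inputs only.
run_a = fingerprint({"trees": 100, "depth": 8}, b"rows-v1", "abc123")
run_b = fingerprint({"trees": 100, "depth": 8}, b"rows-v2", "abc123")
print(run_a[0], run_b[0], changed_components(run_a[1], run_b[1]))
```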

“Weak Contracts”
i.e., abstractions within components bleed through to other components
cf. Sculley …
1. A smart programmer makes an inventive use of a trained object recognizer.
2. The object recognizer receives data that does not resemble the testing data and outputs nonsense.
3. The code of the smart programmer does not work.
Example (via Bottou)
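One hedged way to make such a contract explicit is to record what the training inputs looked like and refuse to predict on inputs that fall outside that envelope. The sketch below uses per-feature min/max bounds with a margin; `RangeGuard`, `guarded_predict`, and the 10% margin are illustrative assumptions, and a range check is a deliberately crude stand-in for "resembles the testing data".

```python
class RangeGuard:
    """Remember per-feature bounds seen in training, padded by a margin."""

    def __init__(self, training_rows, margin=0.1):
        self.bounds = []
        for column in zip(*training_rows):
            lo, hi = min(column), max(column)
            pad = (hi - lo) * margin
            self.bounds.append((lo - pad, hi + pad))

    def violations(self, row):
        """Indices of features outside the training envelope (NaN also fails)."""
        return [i for i, (value, (lo, hi)) in enumerate(zip(row, self.bounds))
                if not lo <= value <= hi]


def guarded_predict(model, guard, row):
    """Return (prediction, []) for in-range input, else (None, violating_indices)."""
    bad = guard.violations(row)
    if bad:
        return None, bad
    return model(row), []


# Illustrative data and a stand-in "model".
training_rows = [(0.0, 10.0), (1.0, 20.0), (0.5, 15.0)]
guard = RangeGuard(training_rows)
model = lambda row: sum(row)

print(guarded_predict(model, guard, (0.5, 12.0)))
print(guarded_predict(model, guard, (5.0, 12.0)))
```

Returning the violating indices, rather than silently predicting, gives the downstream "smart programmer" a signal that the recognizer is being used outside its contract.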

Platonic Form
Data, as we act like it is…
Plutonic Form
…as it is.

NLP: {broken: 3, “blue screen”: 2, ...} (Sparse)
computer vision: {eyes: [{“location”: [21,13], “bounding”: [...]}]..} (Nested)
metadata: {num_pages: 12, channel: “email”...} (Dense)
3rd party: {author_klout: 34.0, ...} (Missing/Noisy)
timeseries: [2014-12-01T12:03:12, 2014-12-01T12:05:12] (Streaming)
Real Data != Benchmark Data
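A small sketch of what handling such records can involve before any learning happens: flattening nested, partially missing input into a flat feature dictionary. The record below reuses the example values above, except that `author_klout` is set to `None` here to illustrate a missing field; the `flatten` helper and the `text_counts` key are assumptions.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value}; None values are dropped."""
    flat = {}
    if isinstance(record, dict):
        for key, value in record.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(record, (list, tuple)):
        for index, value in enumerate(record):
            flat.update(flatten(value, f"{prefix}{index}."))
    elif record is not None:
        flat[prefix[:-1]] = record  # strip the trailing "."
    return flat


record = {
    "text_counts": {"broken": 3, "blue screen": 2},  # sparse
    "eyes": [{"location": [21, 13]}],                # nested
    "num_pages": 12,                                 # dense
    "channel": "email",
    "author_klout": None,                            # missing (illustrative)
}
print(flatten(record))
```

Dropping missing values keeps the result sparse; a downstream learner then has to treat an absent key as "unknown" rather than zero.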


Seismology
Neuroscience
Klein et al.
Astronomy

http://mltsp.io
pip install mltsp
MLTSP: Machine Learning Time-Series Platform
R. Allen, M. Silver, F. Pérez, JSB
Domain scientists
Astro, Seismo, Neuro
Funding bodies
S. van der Walt
A. Crellin-Quick
Comp/Stat/Eng
An open-source web platform for distributed time-series analysis
→
• Selection of sophisticated feature extraction algorithms
• Distributed computation
• Sandboxed execution of custom code
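To give a flavor of time-series feature extraction, here is a minimal sketch that turns an irregularly sampled series into a handful of summary features. It does not use or imitate the MLTSP API; `extract_features` and the chosen feature set are assumptions for illustration.

```python
from statistics import mean, median, pstdev


def extract_features(times, values):
    """Summarize an irregularly sampled series (needs at least two epochs)."""
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values)
    # Skewness is undefined for a constant series; report 0.0 in that case.
    skew = sum(((v - mu) / sigma) ** 3 for v in values) / n if sigma > 0 else 0.0
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    return {
        "n_epochs": n,
        "mean": mu,
        "std": sigma,
        "amplitude": (max(values) - min(values)) / 2.0,
        "skew": skew,
        "median_dt": median(gaps),  # typical sampling interval
    }


# Illustrative data only.
times = [0.0, 1.0, 2.5, 4.0, 5.0]
values = [1.0, 3.0, 1.0, 3.0, 2.0]
print(extract_features(times, values))
```

The resulting dictionary is a fixed-length representation, which is what lets standard classifiers be applied to series of different lengths and cadences.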

Flask
CLI
(under development)
REST
/learn
/upload UI
Disco
W1 W2 Wn
Disco worker pool
datastore
DB
Demo!

MLTSP Continuous Integration
github.com/drone
github.com/mltsp/mltsp
Test 
Container 
with
MLTSP
Custom
Feature
Extractor
Sandbox
Worker
Pull request triggers
webhook
Workers-
Disco
SSH
Drone calls GitHub
status API

http://bigmacc.info
Results from MLTSP
The Astrophysical Journal Supplement Series, 203:32 (27pp), 2012 December
Published Work before MLTSP
MVP: Reproduce main results of a scientific paper

Probabilistic Classification of Variable Stars
Shivvers, JSB, Richards, MNRAS, 2014
106 “DEB” candidates
12 new mass-radii
15 “RCB/DYP” candidates
8 new discoveries
Triple # of Galactic DYPer Stars
Miller, Richards, JSB, .. ApJ 2012
5400 Spectroscopic Targets
Miller, JSB, Richards, .. ApJ 2015
Turn synoptic imagers into ~spectrographs

WISE SUPPORT
FROM
SUBJECT
DESCRIPTION
Support Ticket
DATE
TIER 1
AUTOMATED
RESPONSE
CUSTOMER
FROM COMPLEXITY TO CLARITY
INTELLIGENT
ROUTING
RECOMMENDED
RESPONSE
AUTOMATED
REPLY
Wise Support
>30% faster avg. response time
more consistent answers
faster scaling of support teams

Fault Tolerant ML
augmentation vs. full automation
Random forest prediction of body segment in Xbox Kinect
gmail
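Augmentation vs. full automation can be sketched as confidence-based routing: automate only when the model is very sure, suggest when it is fairly sure, and otherwise leave the decision to a person. The thresholds, labels, and function name below are hypothetical and would need tuning against the real cost of a wrong automated action.

```python
def route(prediction, confidence, auto_threshold=0.95, suggest_threshold=0.60):
    """Decide how much to trust a prediction.

    Returns (mode, payload):
      "automated_reply"      - act without a human in the loop
      "recommended_response" - show the prediction to a human as a suggestion
      "human_only"           - withhold the prediction entirely
    """
    if confidence >= auto_threshold:
        return "automated_reply", prediction
    if confidence >= suggest_threshold:
        return "recommended_response", prediction
    return "human_only", None


# Illustrative calls only.
for confidence in (0.99, 0.75, 0.30):
    print(confidence, route("reset-password template", confidence))
```

The middle tier is what makes the system fault tolerant for the end user: a wrong suggestion costs a glance, while a wrong automated reply reaches the customer.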

https://www.reddit.com/r/funny/comments/3e7gy4/yes_netﬂix_because_my_6_year_old_will_enjoy_the/
“Yes Netflix, because my 6 year old will enjoy the animated fun of Sons of Anarchy”

“Machine learning disrupts software engineering. [So] what should be the machine learning engineering process?”
- Leon Bottou (Facebook)

π vs. Γ
(or “Data Science is a Team Sport”)
π: deep domain skill/knowledge/training + deep methodological knowledge/skill
Γ: deep domain or methodological skill/knowledge/training + strong methodological or domain knowledge/skill
Goal: empower teams of gammas to excel
ML Systems: It Takes a Village

Parting Thoughts
‣ Novel testing can strengthen abstractions within components, and contracts between them
‣ Machine Learning Systems require optimizations across components - so we’d better understand the true loss function
‣ (End user) fault tolerance is a must
‣ Build ML into systems because you have to…

Area Man
Bites off more
than he can chew
PyData 2014

Thanks!
@pro6sb
A Systems View of Machine Learning
in science & industry
