Thrifty Machine Learning
Cultivating cost-conscious modeling habits
PyData Global 2020
Overview
1. An Embarrassment of Riches
2. Machine Learning on a Budget
3. The Mother of Invention
Dr. Rebecca Bilbro
Machine Learning Engineer @ Unisys
Applied Text Analysis with Python
Scikit-Yellowbrick
@rebeccabilbro
An Embarrassment of Riches
● Open source software
● Hardware / cloud market: CPUs, GPUs, TPUs
● Arxiv & Reddit & Kaggle & Medium & Twitter & Google & StackOverflow
“A fool and [their] money are soon parted.”
“By sowing frugality we reap liberty, a golden harvest.”
- Agesilaus
Machine Learning on a Budget
The Supervised Learning Problem
● Labeled Training Data: Define a set of target classes & build a training dataset that has been annotated with those class labels.
● Feature Transformation(s): Take raw data and convert it into vector form ahead of model training.
● Classifier Algorithm: Train a model to recognize target classes using labeled training data. Tune parameters to reduce false positives and/or false negatives.
This part is slow and boring.
“Mais il faut cultiver notre jardin.” (“But we must cultivate our garden.”)
- Voltaire, Candide
Getting Labeled Data
● Started with Amazon Mechanical Turk.
● Now there are many commercial providers of data labeling
and data annotation services.
● It can be quite expensive.
○ 100,000 samples for $25,000 - $75,000
● It’s just people, actually…
○ Semuels, Alana. “The Internet Is Enabling a New Kind of Poorly Paid
Hell.” The Atlantic, January 23, 2018.
● Doesn’t usually work for domain-specific data.
● Quality tends to vary.
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
for this dataset/problem space
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# X, y: your feature matrix and label vector
kfold = ms.KFold(n_splits=12, shuffle=True, random_state=42)
best = max(
    ms.cross_val_score(model, X, y, cv=kfold).mean()  # average score across folds
    for model in classifiers
)
print(best)
How to Select a Model
● Start with a simple model, or better yet, try several in
parallel!
● Filter out the weak performers, and only tune the best.
● Set an initial baseline (see the sketch below).
● Use these preliminary steps to prepare for hyperparameter
tuning.
Hyperparameter Tuning
● Grid search
● Randomized search
● Bayesian optimization
● Evolutionary optimization
● Population-based training
● Gradient-based optimization
● “Auto ML” (see above, but pay $$$)
Unfortunately...
● Search is difficult, particularly in high-dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the search time increases exponentially.
● Reduce feature space
● Evaluate train/test ratio
● Understand parameter ranges
Thoughtful Tuning
● Only tune the best performing models.
● Try to reduce your feature space.
● Understand the parameter ranges you’re searching.
● Move towards complexity purposefully.
○ Understand error from variance vs. error from bias.
○ Add complexity when the model underfits or the error doesn’t converge.
● Move towards complexity gradually.
○ Continue while both train and test scores are increasing (or error is decreasing); see the sketch below.
Development Environments
Prototype Locally First
● Consider: are these conveniences really necessary/useful
at the prototyping phase?
○ Probably not
● Don’t default to using cloud-hosted, Spark-running
notebooks for everything!
● Configure Python to run locally (one-time cost).
● VSCode, PyCharm, etc., support Jupyter notebooks now.
● Downsampling your data is cheap!
“Those who cannot
remember the past are
condemned to repeat it.”
- George Santayana
Serialize Everything
● The model
● Engineered features
● Feature vectors/embeddings
● Stopwords
● Lexicons
● Scores
● Diagnostic plots
● Training times
And any other artifacts or metadata!
When is an ML Model “done”?
● When you have achieved an accuracy measure above your
threshold.
● When your error bounds are within your pre-defined target
range.
● When your cross-validation demonstrates a convergence in
training and test data.
● When the sprint is over.
● When the project is due.
Plan for this, not this
The Mother of Invention…
When we shift our collective mindset toward model thriftiness rather than the relentless pursuit of a tiny bit more F1, there’s no telling what new things we might discover…
Thank you!
@rebeccabilbro
Slide template by Slidesgo
Icons by Flaticon
Images & infographics by Freepik
Hardware ● Serialization ● Objectivity ● Simplicity ● “Done” ● Annotation