Thrifty Machine Learning
Cultivating cost-conscious modeling habits
PyData Global 2020
Overview
1. An Embarrassment of Riches
2. Machine Learning on a Budget
3. The Mother of Invention
Dr. Rebecca Bilbro
Machine Learning Engineer @ Unisys
Applied Text Analysis with Python
Scikit-Yellowbrick
@rebeccabilbro
An Embarrassment of Riches
● Open source software
● Hardware / cloud market: CPUs, GPUs, TPUs
● Arxiv & Reddit & Kaggle & Medium & Twitter & Google & StackOverflow
“A fool and [their] money are soon parted.”
“By sowing frugality we reap liberty, a golden harvest.”
- Agesilaus
Machine Learning on a Budget
The Supervised Learning Problem
● Labeled Training Data: Define a set of target classes & build a training dataset that has been annotated with those class labels.
● Feature Transformation(s): Take raw data and convert it into vector form ahead of model training.
● Classifier Algorithm: Train a model to recognize target classes using labeled training data. Tune parameters to reduce false positives and/or false negatives.
This part is slow and boring.
“Mais il faut cultiver notre jardin.” (“But we must cultivate our garden.”)
- Voltaire, Candide
Getting Labeled Data
● Started with Amazon Mechanical Turk.
● Now there are many commercial providers of data labeling
and data annotation services.
● It can be quite expensive.
○ 100,000 samples for $25,000 - $75,000
● It’s just people, actually…
○ Semuels, Alana. “The Internet Is Enabling a New Kind of Poorly Paid
Hell.” The Atlantic, January 23, 2018.
● Doesn’t usually work for domain-specific data.
● Quality tends to vary.
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
for this dataset/problem space
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# X, y: your feature matrix and label vector
kfold = ms.KFold(n_splits=12, shuffle=True, random_state=42)
best = max(
    ms.cross_val_score(model, X, y, cv=kfold).mean()  # average score across folds
    for model in classifiers
)
print(best)
How to Select a Model
● Start with a simple model, or better yet, try several in
parallel!
● Filter out the weak performers, and only tune the best.
● Set an initial baseline (see the sketch below).
● Use these preliminary steps to prepare for hyperparameter
tuning.
Hyperparameter Tuning
● Grid search
● Randomized search
● Bayesian optimization
● Evolutionary optimization
● Population-based training
● Gradient-based optimization
● “Auto ML” (see above, but pay $$$)
Unfortunately...
● Search is difficult, particularly in high-dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the search time increases exponentially.
● Reduce feature space
● Evaluate train/test ratio
● Understand parameter ranges
Thoughtful Tuning
● Only tune the best performing models.
● Try to reduce your feature space.
● Understand the parameter ranges you’re searching.
● Move towards complexity purposefully.
○ Understand error from variance vs. error from bias.
○ Add complexity when the model underfits or the error doesn’t converge.
● Move towards complexity gradually.
○ Continue while both train and test scores are increasing (or error is decreasing); see the sketch below.
Development Environments
Prototype Locally First
● Consider: are these conveniences really necessary/useful
at the prototyping phase?
○ Probably not
● Don’t default to using cloud-hosted, Spark-running
notebooks for everything!
● Configure Python to run locally (one-time cost).
● VSCode, PyCharm, etc., support Jupyter notebooks now.
● Downsampling your data is cheap!
“Those who cannot
remember the past are
condemned to repeat it.”
- George Santayana
Serialize Everything
● The model
● Engineered features
● Feature vectors/embeddings
● Stopwords
● Lexicons
● Scores
● Diagnostic plots
● Training times
And any other artifacts or metadata!
When is an ML Model “done”?
● When you have achieved an accuracy measure above your
threshold.
● When your error bounds are within your pre-defined target
range.
● When your cross-validation demonstrates a convergence in
training and test data.
● When the sprint is over.
● When the project is due.
Plan for this, not this
The Mother of Invention…
When we shift our collective mindset toward model thriftiness rather than the relentless pursuit of a tiny bit more F1, there’s no telling what new things we might discover…
Thank you!
@rebeccabilbro
Slide template by Slidesgo
Icons by Flaticon
Images & infographics by Freepik
Hardware ● Serialization ● Objectivity ● Simplicity ● “Done” ● Annotation