Visual Diagnostics
at Scale
SciPy 2019
Dr. Rebecca Bilbro
Chief Data Scientist, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro
A tale of
three datasets
Census Dataset
500K instances
50 features
(age, occupation,
education, sex, ethnicity,
marital status)
Sarcasm Dataset
50K instances
5K features
(“love”, 🙄, “totally”, “best”,
“surprise”, “Sherlock”,
capitalization, timestamp)
Sensor Dataset
5M instances
15 features
(Ammonia, Acetaldehyde,
Acetone, Ethylene, Ethanol,
Toluene ppmv)
Scaling pain points are
dataset-specific
● Many features
● Many instances
● Feature variance
● Heteroskedasticity
● Covariance
● Noise
Logistic Regression Fit Times (seconds)
500 - 5M instances / 5 - 50 features
10 seconds
Multilayer Perceptron Fit Times (seconds)
500 - 5M instances / 5 - 50 features
5 min, 48
seconds
Support Vector Machine Fit Times (seconds)
500 - 500K instances / 5 - 50 features
5 hours, 24
seconds
😵
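A minimal sketch of how fit times like these could be measured. This is not the benchmark behind the slides: the synthetic data from make_classification, the small grid of sizes, and the helper name time_fit are all assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def time_fit(model, n_samples, n_features, seed=0):
    """Return wall-clock seconds needed to fit `model` on synthetic data."""
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features, random_state=seed
    )
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Hypothetical grid, kept far smaller than the 500 - 5M range on the slides
fit_times = {
    (n, d): time_fit(LogisticRegression(max_iter=1000), n, d)
    for n in (500, 5000)
    for d in (5, 50)
}

for (n, d), seconds in sorted(fit_times.items()):
    print(f"{n:>6} instances / {d:>2} features: {seconds:.4f} s")
```

Swapping in another estimator (for example an SVM) and widening the grid shows how quickly fit time grows for that model family.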
How to
optimize?
● Be patient
● Be wrong
● Be rich
● Steer
The Model
Selection
Triple
Arun Kumar, et al. http://bit.ly/2abVNrI
Models are aggregations
So are visualizations
Use visualizations
to steer model selection
Adventures in
Model Visualization
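import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

data = load_iris()

# A single axes to draw on (the slide indexed into a grid of subplots)
_, ax = plt.subplots()

# fast=True groups instances by class instead of drawing each one separately
oz = ParallelCoordinates(ax=ax, fast=True)
oz.fit_transform(data.data, data.target)
oz.finalize()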
Each point drawn individually as a connected line segment
With standardization
Points grouped by class, each class drawn as a single segment
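One way to get the "each class drawn as a single segment" behavior in plain matplotlib is to concatenate every instance of a class into one line, separated by NaN breaks. The sketch below illustrates that idea only; it is not Yellowbrick's implementation, and the function name draw_classes_fast is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris


def draw_classes_fast(ax, X, y):
    """Draw parallel coordinates with one Line2D per class."""
    n_features = X.shape[1]
    # x positions for one instance, followed by a NaN that breaks the line
    xs = np.append(np.arange(n_features), np.nan)
    for label in np.unique(y):
        rows = X[y == label]
        # Pad each row with NaN so consecutive instances are not connected
        breaks = np.full((rows.shape[0], 1), np.nan)
        ys = np.hstack([rows, breaks]).ravel()
        ax.plot(np.tile(xs, rows.shape[0]), ys, alpha=0.25, label=str(label))
    return ax


data = load_iris()
fig, ax = plt.subplots()
draw_classes_fast(ax, data.data, data.target)
n_lines = len(ax.lines)  # one line object per class, not per instance
```

Fewer artists on the axes means less work for the renderer, which is where the speedup for large datasets would come from.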
from yellowbrick.features import Rank2D
from yellowbrick.pipeline import VisualPipeline
from yellowbrick.model_selection import CVScores
from yellowbrick.regressor import PredictionError

# features, model, and cv are assumed to be defined earlier
viz_pipe = VisualPipeline([
    ('rank2d', Rank2D(features=features, algorithm='covariance')),
    ('prederr', PredictionError(model)),
    ('cvscores', CVScores(model, cv=cv, scoring='r2'))
])
Visual
Pipelines
Bumps
Machine learning is not particularly
well-suited to object-oriented
programming
class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred


class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime


class Pipeline(Transformer):

    @property
    def named_steps(self):
        """
        Returns a sequence of estimators
        """
        return self.steps

    @property
    def _final_estimator(self):
        """
        Terminating estimator
        """
        return self.steps[-1]
The scikit-learn API
self.X
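As a concrete counterpart to the schematic classes above, here is a small sketch of a custom transformer chained with an estimator in a real scikit-learn Pipeline. The class name MeanCenterer and the synthetic data are illustrative assumptions; BaseEstimator, TransformerMixin, Pipeline, and LinearRegression are the actual scikit-learn classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-feature training mean."""

    def fit(self, X, y=None):
        # Learned state lives on self, with a trailing underscore by convention
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_


rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])  # noiseless linear target

pipe = Pipeline([
    ("center", MeanCenterer()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
r2 = pipe.score(X, y)
```

fit returns self so calls can be chained, and the final step is the only one that needs predict or score.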
class Visualizer(Estimator):

    def draw(self):
        """
        Draw called from scikit-learn methods.
        """
        return self.ax

    def finalize(self):
        self.set_title()
        self.legend()

    def poof(self):
        self.finalize()
        plt.show()
import matplotlib.pyplot as plt
from yellowbrick.base import Visualizer

class MyVisualizer(Visualizer):

    def __init__(self, ax=None, **kwargs):
        super(MyVisualizer, self).__init__(ax, **kwargs)

    def fit(self, X, y=None):
        self.draw(X)
        return self

    def draw(self, X):
        # Fall back to the current matplotlib axes if none was supplied
        if self.ax is None:
            self.ax = plt.gca()
        self.ax.plot(X)
        return self.ax

    def finalize(self):
        self.set_title("My Visualizer")
The Yellowbrick API
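The fit / draw / finalize pattern can be sketched without Yellowbrick at all, which makes the two wrapped APIs easier to see. The class below is a hypothetical stand-in (ResidualsSketch is not a Yellowbrick class): it delegates fit and score to a scikit-learn model and draws onto a matplotlib axes.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso


class ResidualsSketch:
    """Hypothetical visualizer: wraps an estimator and a matplotlib axes."""

    def __init__(self, model, ax=None):
        self.model = model
        self.ax = ax

    def fit(self, X, y):
        self.model.fit(X, y)  # scikit-learn side
        return self

    def score(self, X, y):
        y_pred = self.model.predict(X)
        self.draw(y_pred, y - y_pred)  # matplotlib side
        return self.model.score(X, y)

    def draw(self, y_pred, residuals):
        if self.ax is None:
            self.ax = plt.gca()
        self.ax.scatter(y_pred, residuals, alpha=0.5)
        self.ax.axhline(0, color="black", linewidth=1)
        return self.ax

    def finalize(self):
        self.ax.set_title(f"Residuals for {type(self.model).__name__}")
        self.ax.set_xlabel("Predicted value")
        self.ax.set_ylabel("Residual")


rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

viz = ResidualsSketch(Lasso(alpha=0.01))
viz.fit(X[:150], y[:150])
r2 = viz.score(X[150:], y[150:])
viz.finalize()
```

Keeping draw separate from finalize lets several visualizers share one figure before titles and legends are applied.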
A tool for students
vs.
A tool for practitioners?
Yellowbrick Quick Methods
from sklearn.linear_model import Lasso
from yellowbrick.regressor import ResidualsPlot
# Option 1: scikit-learn style
viz = ResidualsPlot(Lasso())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
from sklearn.linear_model import Lasso
from yellowbrick.regressor import residuals_plot
# Option 2: Quick Method
viz = residuals_plot(
    Lasso(), X_train, y_train, X_test, y_test
)
Progress
vs.
Documentation
.. plot::
    :context: close-figs
    :include-source: False
    :alt: Recursive Feature Elimination

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from yellowbrick.features import RFECV

    # Create a dataset with only 3 informative features
    X, y = make_classification(
        n_samples=1000, n_features=25, n_informative=3,
        n_redundant=2, n_repeated=0, n_classes=8,
        n_clusters_per_class=1, random_state=0
    )

    viz = RFECV(SVC(kernel='linear', C=1))
    viz.fit(X, y)
    viz.poof()
The Plot Directive
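For the directive to run during a docs build, Sphinx needs the matplotlib plot extension enabled. The fragment below is a sketch of what the relevant part of a conf.py might look like; the specific option values are assumptions, not the project's actual settings.

```python
# conf.py (fragment): enable matplotlib's plot directive for Sphinx

extensions = [
    "sphinx.ext.autodoc",
    "matplotlib.sphinxext.plot_directive",
]

# Defaults applied to every ".. plot::" block unless overridden inline
plot_include_source = False         # hide the generating code by default
plot_html_show_source_link = False  # no "Source code" link under figures
plot_formats = ["png"]              # image formats rendered at build time
```

With this in place the figures are regenerated on each build, so the images in the docs cannot drift from the code that produces them.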
=========================================== test session starts ============================================
platform darwin -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
rootdir: /Users/rbilbro/pyjects/yb, inifile: setup.cfg
plugins: flakes-4.0.0, cov-2.7.1
collected 932 items
tests/__init__.py s... [ 0%]
tests/base.py s [ 0%]
tests/conftest.py s [ 0%]
tests/fixtures.py s [ 0%]
tests/images.py s [ 0%]
tests/rand.py s [ 0%]
tests/test_base.py s............ [ 2%]
...........................................................................................................
...........................................................................................................
...........................................................................................................
...........................................................................................................
tests/test_utils/test_target.py s............ [ 68%]
tests/test_utils/test_timer.py s..... [ 68%]
tests/test_utils/test_types.py s.................................................................... [ 70%]
....x................................x.............................................................. [ 72%]
.... [ 73%]
tests/test_utils/test_wrapper.py s....
===================== 854 passed, 72 skipped, 6 xfailed, 33 warnings in 225.96 seconds =====================
Also Testing
Roadmap
Machine-learning oriented aggregation
YB (current) Seaborn
Brushing and Filtering
OK for only 5 features; not good for 23 features
Parallelization with joblib
Elbow Curve Validation Curve
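A sketch of the idea behind parallelizing an elbow curve: each value of k is an independent fit, so the fits can be dispatched with joblib. This illustrates the roadmap item only and is not Yellowbrick code; the helper inertia_for_k, the synthetic blobs, and the choice of a thread-based backend are assumptions.

```python
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_for_k(X, k, seed=0):
    """Fit KMeans with k clusters and return the within-cluster sum of squares."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_


X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)
ks = list(range(2, 8))

# Each k is fit independently; results come back in the same order as ks
inertias = Parallel(n_jobs=2, prefer="threads")(
    delayed(inertia_for_k)(X, k) for k in ks
)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The same pattern would apply to a validation curve, where each hyperparameter value is likewise an independent fit.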
Figures & Axes
YB wraps a matplotlib axes.Axes object
● Visualizers behave as part of larger fig
● Make multi-axis plots for publications, etc.
● Give users control over size, style, interaction
But what to do as visualizers become
more complex, e.g. multi-axis in their
own right?
➔ AxesGrid Toolkit (e.g.
make_axes_locatable)
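A minimal sketch of make_axes_locatable in use: a visualizer that receives a single axes can carve a second, attached axes out of it, here a side histogram next to a scatter plot. The random data and variable names are illustrative assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable

rng = np.random.RandomState(0)
y_pred = rng.uniform(0, 10, size=300)
residuals = rng.normal(scale=1.0, size=300)

# The caller still hands over just one axes...
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.5)

# ...and the visualizer splits off a sibling that shares its y-axis
divider = make_axes_locatable(ax)
hax = divider.append_axes("right", size="20%", pad=0.1, sharey=ax)
hax.hist(residuals, bins=20, orientation="horizontal")
hax.tick_params(labelleft=False)

n_axes = len(fig.axes)
```

Because the new axes is appended relative to the original one, the pair still behaves as a single unit when placed inside a larger figure.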
Other
places we’re
looking
● Altair
● Bokeh
● Pandas
● Seaborn
● Datashader
● ...suggestions?
Main Points
● ML experimentation is in tension with time, $$$, reality.
● Human-driven steering is useful for data of any size.
● The stakes are much higher for big data.
● Scikit-YB supports visual steering via Visualizer objects.
● Wrapping both scikit-learn and Matplotlib APIs is tricky!
● The path forward includes optimized aggregations, including zoom-and-filter, brushing, parallelization, and multi-axis plotting.
Thank
you!
