Visual diagnostics at scale

Visual Diagnostics at Scale
SciPy 2019

Dr. Rebecca Bilbro
Chief Data Scientist, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro

Census Dataset
500K instances
50 features
(age, occupation, education, sex, ethnicity, marital status)
Sarcasm Dataset
50K instances
5K features
(“love”, 🙄, “totally”, “best”, “surprise”, “Sherlock”, capitalization, timestamp)
Sensor Dataset
5M instances
15 features
(Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, Toluene ppmv)

Scaling pain points are dataset-specific
● Many features
● Many instances
● Feature variance
● Heteroskedasticity
● Covariance
● Noise

Logistic Regression Fit Times (seconds)
500 - 5M instances / 5 - 50 features
10 seconds
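
A minimal sketch of how fit times like these could be measured. The synthetic data from `make_classification`, the sizes, and the helper name `time_fits` are assumptions for illustration, not the benchmark behind the chart.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def time_fits(sizes, n_features=20, seed=0):
    """Return {n_instances: seconds} for fitting LogisticRegression."""
    timings = {}
    for n in sizes:
        # Synthetic stand-in data; the talk's datasets are not reproduced here
        X, y = make_classification(
            n_samples=n, n_features=n_features, random_state=seed
        )
        start = time.perf_counter()
        LogisticRegression(max_iter=1000).fit(X, y)
        timings[n] = time.perf_counter() - start
    return timings


timings = time_fits([500, 5_000, 50_000])
for n, secs in timings.items():
    print(f"{n:>7} instances: {secs:.3f} s")
```

Swapping in another estimator (for example an SVM) in place of `LogisticRegression` gives a rough feel for how differently models scale with instance count.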

Multilayer Perceptron Fit Times (seconds)
500 - 5M instances / 5 - 50 features
5 min, 48 seconds

Support Vector Machine Fit Times (seconds)
500 - 500K instances / 5 - 50 features
5 hours, 24 seconds
😵

How to optimize?
● Be patient
● Be wrong
● Be rich
● Steer

The Model Selection Triple
Arun Kumar, et al. http://bit.ly/2abVNrI

Models are aggregations
So are visualizations
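
One way to read this slide: a plot of a million points can be reduced to a fixed-size summary before anything is drawn. The sketch below uses `numpy.histogram2d` as one such aggregation. The random data and the 50-bin grid are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two correlated synthetic features standing in for a large dataset
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Aggregate one million points into a fixed 50 x 50 grid of counts
counts, x_edges, y_edges = np.histogram2d(x, y, bins=50)

print(counts.shape)
```

The grid of counts can then be drawn as a heatmap, so the rendering cost depends on the number of bins rather than the number of instances.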

Use visualizations
to steer model selection

Adventures in Model Visualization

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

data = load_iris()

# Draw onto an explicit axes; the slide indexed into a grid of axes (axes[idx])
_, ax = plt.subplots()
oz = ParallelCoordinates(ax=ax, fast=True)
oz.fit_transform(data.data, data.target)
oz.finalize()
Each point drawn individually
as connected line segment
With standardization
Points grouped by class, each class
drawn as single segment
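
The idea of drawing each class as a single segment can be sketched in plain matplotlib. This is an illustration of the general technique, not necessarily how Yellowbrick implements `fast=True`: each instance's values are followed by a NaN, which makes matplotlib break the line, so one `plot` call covers a whole class.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target
positions = np.arange(X.shape[1])

fig, ax = plt.subplots()
for label in np.unique(y):
    rows = X[y == label]
    # Append a NaN column so the line breaks between consecutive instances
    ys = np.hstack([rows, np.full((len(rows), 1), np.nan)]).ravel()
    xs = np.tile(np.append(positions, np.nan), len(rows))
    # One plot call (one Line2D artist) per class instead of one per instance
    ax.plot(xs, ys, alpha=0.3, label=data.target_names[label])

ax.set_xticks(positions)
ax.set_xticklabels(data.feature_names)
ax.legend()
```

With three classes this creates three line artists rather than 150, which is where the savings come from as the instance count grows.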

from yellowbrick.features import Rank2D
from yellowbrick.pipeline import VisualPipeline
from yellowbrick.model_selection import CVScores
from yellowbrick.regressor import PredictionError

# features, model, and cv are assumed to be defined earlier
viz_pipe = VisualPipeline([
    ('rank2d', Rank2D(features=features, algorithm='covariance')),
    ('prederr', PredictionError(model)),
    ('cvscores', CVScores(model, cv=cv, scoring='r2'))
])
Visual Pipelines

Machine learning is not particularly
well-suited to object-oriented
programming

class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred


class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime


class Pipeline(Transformer):

    @property
    def named_steps(self):
        """
        Returns a sequence of estimators
        """
        return self.steps

    @property
    def _final_estimator(self):
        """
        Terminating estimator
        """
        return self.steps[-1]
The scikit-learn API
self.X
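
The interface outlined on this slide can be exercised with scikit-learn's own base classes. The following is a small sketch under stated assumptions: `MeanCenterer` is a hypothetical transformer written for illustration, and the iris data and `LogisticRegression` are arbitrary choices.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-feature training mean."""

    def fit(self, X, y=None):
        # Learned state lives on self, with a trailing underscore by convention
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_


X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("center", MeanCenterer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
accuracy = pipe.score(X, y)
print(f"training accuracy: {accuracy:.2f}")
```

Because every step follows the same `fit` / `transform` / `predict` contract, the pipeline can chain them without knowing what any step does internally.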

class Visualizer(Estimator):

    def draw(self):
        """
        Draw called from scikit-learn methods.
        """
        return self.ax

    def finalize(self):
        self.set_title()
        self.legend()

    def poof(self):
        self.finalize()
        plt.show()


import matplotlib.pyplot as plt
from yellowbrick.base import Visualizer


class MyVisualizer(Visualizer):

    def __init__(self, ax=None, **kwargs):
        super(MyVisualizer, self).__init__(ax, **kwargs)

    def fit(self, X, y=None):
        self.draw(X)
        return self

    def draw(self, X):
        if self.ax is None:
            self.ax = self.gca()
        # Axes.plot, not Axes.plt
        self.ax.plot(X)

    def finalize(self):
        self.set_title("My Visualizer")
The Yellowbrick API

A tool for students
vs.
A tool for practitioners?

Yellowbrick Quick Methods
from sklearn.linear_model import Lasso
from yellowbrick.regressor import ResidualsPlot
# Option 1: scikit-learn style
viz = ResidualsPlot(Lasso())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
from sklearn.linear_model import Lasso
from yellowbrick.regressor import residuals_plot
# Option 2: Quick Method
viz = residuals_plot(
    Lasso(), X_train, y_train, X_test, y_test
)

.. plot::
    :context: close-figs
    :include-source: False
    :alt: Recursive Feature Elimination

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from yellowbrick.features import RFECV

    # Create a dataset with only 3 informative features
    X, y = make_classification(
        n_samples=1000, n_features=25, n_informative=3,
        n_redundant=2, n_repeated=0, n_classes=8,
        n_clusters_per_class=1, random_state=0
    )

    viz = RFECV(SVC(kernel='linear', C=1))
    viz.fit(X, y)
    viz.poof()
The Plot Directive

=========================================== test session starts ============================================
platform darwin -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
rootdir: /Users/rbilbro/pyjects/yb, inifile: setup.cfg
plugins: flakes-4.0.0, cov-2.7.1
collected 932 items
tests/__init__.py s... [ 0%]
tests/base.py s [ 0%]
tests/conftest.py s [ 0%]
tests/fixtures.py s [ 0%]
tests/images.py s [ 0%]
tests/rand.py s [ 0%]
tests/test_base.py s............ [ 2%]
...........................................................................................................
...........................................................................................................
...........................................................................................................
...........................................................................................................
tests/test_utils/test_target.py s............ [ 68%]
tests/test_utils/test_timer.py s..... [ 68%]
tests/test_utils/test_types.py s.................................................................... [ 70%]
....x................................x.............................................................. [ 72%]
.... [ 73%]
tests/test_utils/test_wrapper.py s....
===================== 854 passed, 72 skipped, 6 xfailed, 33 warnings in 225.96 seconds =====================
Also Testing

Machine-learning oriented aggregation
YB (current) Seaborn

Brushing and Filtering
OK for only 5 features
Not good for 23 features

Parallelization with joblib
Elbow Curve Validation Curve
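
A rough sketch of the kind of parallelization an elbow curve allows: each value of k is an independent fit, so the fits can be dispatched with joblib's `Parallel` and `delayed`. The helper `inertia_for_k`, the blob data, and the range of k are assumptions for illustration, and this is not a description of Yellowbrick's internals.

```python
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_for_k(X, k):
    """Fit KMeans with k clusters and return its inertia."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return model.inertia_


X, _ = make_blobs(n_samples=2_000, centers=4, random_state=0)
ks = list(range(2, 9))

# prefer="threads" keeps the sketch lightweight; omitting it lets joblib
# use its default process-based workers instead
inertias = Parallel(n_jobs=2, prefer="threads")(
    delayed(inertia_for_k)(X, k) for k in ks
)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Results come back in the same order as `ks`, so they can be plotted directly as the elbow curve.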

Figures & Axes
YB wraps a matplotlib axes.Axes object
● Visualizers behave as part of larger fig
● Make multi-axis plots for publications, etc.
● Give users control over size, style, interaction
But what to do as visualizers become more complex, e.g. multi-axis in their own right?
➔ AxesGrid Toolkit (e.g. make_axes_locatable)
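
As a sketch of the AxesGrid approach, `make_axes_locatable` can carve a second axes out of the one a visualizer was handed, so a multi-axis plot still fits inside a single slot of a larger figure. The random "predicted" and "residual" values below are placeholders invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable

rng = np.random.default_rng(0)
predicted = rng.normal(size=500)
residuals = rng.normal(scale=0.5, size=500)

fig, ax = plt.subplots()
ax.scatter(predicted, residuals, alpha=0.5)

# Split off a narrow axes on the right that shares the main y-axis
divider = make_axes_locatable(ax)
hist_ax = divider.append_axes("right", size="20%", pad=0.1, sharey=ax)
hist_ax.hist(residuals, bins=30, orientation="horizontal")
hist_ax.tick_params(labelleft=False)
```

The caller still passes in one `ax`, and the extra axes is positioned relative to it, which keeps the visualizer usable inside subplots.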

Other places we’re looking
● Altair
● Bokeh
● Pandas
● Seaborn
● Datashader
● ...suggestions?

● ML experimentation is in tension with time, $$$, reality.
● Human-driven steering is useful for data of any size.
● The stakes are much higher for big data.
● Scikit-YB supports visual steering via Visualizer objects.
● Wrapping both scikit-learn and Matplotlib APIs is tricky!
● The path forward includes optimized aggregations, including zoom-and-filter, brushing, parallelization, and multi-axis plotting.
Main Points
