Words in Space
A Visual Exploration of Distance, Documents, and Distributions for Text Analysis
PyData DC 2018
Dr. Rebecca Bilbro
Head of Data Science, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro
Machine Learning Review
The Machine Learning Problem:
Given a set of n samples of data such that each sample is
represented by more than a single number (e.g. multivariate
data that has several attributes or features), create a model
that is able to predict unknown properties of each sample.
Spatial interpretation:
Given data points in a bounded, high-dimensional space, define decision regions for any point in that space.
Instances are composed of features that make up our dimensions.
Feature space is the n-dimensional space where our variables live (not including the target).
Feature extraction is the art of creating a space with decision boundaries.
Example

Target:
Y ≡ Thickness of car tires after some testing period

Variables:
X₁ ≡ distance travelled in test
X₂ ≡ time duration of test
X₃ ≡ amount of chemical C in tires

The feature space is R³, or more accurately, the positive quadrant in R³, as all the X variables can only be positive quantities.

Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, X₄ (this is the feature extraction part):

X₄ = X₁ / X₂ ≡ the speed of the vehicle during testing.

This extends our old feature space into a new one, the positive part of R⁴.

A mapping is a function, ϕ, from R³ to R⁴:

ϕ(x₁, x₂, x₃) = (x₁, x₂, x₃, x₁/x₂)
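As a quick sketch of this feature extraction step in Python (the measurement values below are invented for illustration):

import numpy as np

# hypothetical raw measurements, one entry per tire test
X1 = np.array([1200.0, 800.0, 1500.0])  # distance travelled in test
X2 = np.array([10.0, 8.0, 12.0])        # time duration of test
X3 = np.array([0.3, 0.5, 0.2])          # amount of chemical C in tires

# the mapping phi: derive speed X4 = X1 / X2, extending R^3 to R^4
X4 = X1 / X2
features = np.column_stack([X1, X2, X3, X4])  # shape (3, 4)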
Modeling Non-Numeric Data
Real-world data is often not represented numerically out of the box (e.g. text, images), therefore some transformation must be applied in order to do machine learning.
Tricky Part
Machine learning relies on our ability to imagine data as points in space, where the relative closeness of any two points is a measure of their similarity.
So... when we transform those non-numeric features into numeric ones, how should we quantify the distance between instances?
There are many ways of quantifying “distance” (or similarity): Euclidean distance is often the default for numeric data, while cosine similarity is a common rule of thumb for text data.
With text, our choice of distance metric is very
important! Why?
Challenges of Modeling Text Data
● Very high dimensional
○ One dimension for every word (token) in the corpus!
● Sparsely distributed
○ Documents vary in length!
○ Most instances (documents) may be mostly zeros!
● Has some features that are more important than others
○ E.g. the “of” dimension vs. the “basketball” dimension when clustering sports articles.
● Has some feature variations that matter more than others
○ E.g. freq(tree) vs. freq(horticulture) in classifying gardening books.
Help!
scikit-learn
from sklearn.metrics import pairwise_distances

pairwise_distances(X, Y=None, metric='euclidean', n_jobs=None, **kwds)

Compute the distance matrix from a vector array X and optional Y.
Valid values for metric are:
● From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'].
● From scipy.spatial.distance...
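A minimal sketch of how this might be used on text (the three-document corpus is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

corpus = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]
X = CountVectorizer().fit_transform(corpus)  # sparse document-term matrix
D = pairwise_distances(X, metric="cosine")   # 3x3 matrix of pairwise distances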
SciPy!
Distance functions between two numeric vectors
u and v:
● braycurtis(u, v[, w])
● canberra(u, v[, w])
● chebyshev(u, v[, w])
● cityblock(u, v[, w])
● correlation(u, v[, w, centered])
● cosine(u, v[, w])
● euclidean(u, v[, w])
● mahalanobis(u, v, VI)
● minkowski(u, v[, p, w])
● seuclidean(u, v, V)
● sqeuclidean(u, v[, w])
● wminkowski(u, v, p, w)
Distance functions between two boolean vectors
(sets) u and v:
● dice(u, v[, w])
● hamming(u, v[, w])
● jaccard(u, v[, w])
● kulsinski(u, v[, w])
● rogerstanimoto(u, v[, w])
● russellrao(u, v[, w])
● sokalmichener(u, v[, w])
● sokalsneath(u, v[, w])
● yule(u, v[, w])
Note: most don’t support sparse matrix inputs.
Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
Workflow: Feature Analysis → Algorithm Selection → Hyperparameter Tuning
Model selection is iterative, but can be steered!
TSNE (t-distributed Stochastic Neighbor
Embedding)
1. Apply SVD (or PCA) to reduce
dimensionality (for efficiency).
2. Embed vectors using probability
distributions from both the original
dimensionality and the decomposed
dimensionality.
3. Cluster and visualize similar
documents in a scatterplot.
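In code, the pipeline looks roughly like this (corpus and labels are assumed to be a list of raw documents and their class labels; poof() was the Yellowbrick display call at the time of this talk and has since been renamed show()):

from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

docs = TfidfVectorizer().fit_transform(corpus)  # vectorize the raw documents
tsne = TSNEVisualizer()  # applies an SVD decomposition first by default
tsne.fit(docs, labels)   # labels color the resulting clusters
tsne.poof()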
Three Example Datasets
Hobbies corpus
● From the Baleen project
● 448 newspaper/blog articles
● 5 classes: gaming, cooking, cinema, books, sports
● Doc length (in words): 532 avg, 14564 max, 1 min
Farm Ads corpus
● From the UCI Repository
● 4144 ads represented as a list of metadata tags
● 2 classes: accepted, not accepted
● Doc length (in words): 270 avg, 5316 max, 1 min
Dresses Attributes Sales corpus
● From the UCI Repository
● 500 dresses represented as features: neckline, waistline, fabric, size, season
● Doc length (in words): 11 avg, 11 max, 11 min
Euclidean Distance
Euclidean distance is the straight-line distance between two points in Euclidean (metric) space.
tsne = TSNEVisualizer(metric="euclidean")
tsne.fit(docs, labels)
tsne.poof()
[Plot: two documents as points in the plane, Doc 1 at (7, 14) and Doc 2 at (20, 19), joined by the straight-line distance between them.]
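For the two documents in the plot:
d(Doc 1, Doc 2) = √((20 − 7)² + (19 − 14)²) = √(169 + 25) = √194 ≈ 13.9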
Euclidean Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Cityblock (Manhattan) Distance
Manhattan distance between two points is computed as the sum of the absolute
differences of their Cartesian coordinates.
tsne = TSNEVisualizer(metric="cityblock")
tsne.fit(docs, labels)
tsne.poof()
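Using the same two example points as in the Euclidean plot:
d(Doc 1, Doc 2) = |20 − 7| + |19 − 14| = 13 + 5 = 18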
Cityblock (Manhattan) Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Chebyshev Distance
Chebyshev distance is the L∞-norm of the difference between two points (a special
case of the Minkowski distance where p goes to infinity). It is also known as
chessboard distance.
tsne = TSNEVisualizer(metric="chebyshev")
tsne.fit(docs, labels)
tsne.poof()
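Using the same two example points:
d(Doc 1, Doc 2) = max(|20 − 7|, |19 − 14|) = max(13, 5) = 13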
Chebyshev Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Minkowski Distance
Minkowski distance is a generalization of Euclidean, Manhattan, and Chebyshev
distance, and defines distance between points in a normalized vector space as the
generalized Lp-norm of their difference.
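For reference, the general form is:
d(u, v) = (Σᵢ |uᵢ − vᵢ|^p)^(1/p)
where p = 1 gives Manhattan, p = 2 gives Euclidean, and p → ∞ gives Chebyshev.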
tsne = TSNEVisualizer(metric="minkowski")
tsne.fit(docs, labels)
tsne.poof()
Minkowski Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Mahalanobis Distance
A multidimensional generalization
of the distance between a point
and a distribution of points.
tsne = TSNEVisualizer(metric="mahalanobis", method='exact')
tsne.fit(docs, labels)
tsne.poof()
Think: shifting and rescaling coordinates with respect to distribution. Can help find
similarities between different-length docs.
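A minimal SciPy sketch, with random data standing in for dense document vectors (Mahalanobis needs the inverse covariance matrix VI):

import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(100, 5)                   # stand-in document vectors
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
d = mahalanobis(X[0], X[1], VI)              # distance between two rows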
Mahalanobis Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Cosine “Distance”
Cosine similarity is the cosine of the angle between two doc vectors; cosine “distance” is one minus that similarity. The more parallel, the more similar. Corrects for length variations (angles rather than magnitudes). Considers only non-zero elements (efficient for sparse vectors!).
Note: Cosine distance is not technically a distance measure because it doesn’t satisfy the triangle inequality.
tsne = TSNEVisualizer(metric="cosine")
tsne.fit(docs, labels)
tsne.poof()
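Two toy calls that illustrate the definition:

from scipy.spatial.distance import cosine

cosine([1, 2, 0], [2, 4, 0])  # parallel vectors -> distance 0.0
cosine([1, 0, 0], [0, 1, 0])  # orthogonal vectors -> distance 1.0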
Cosine “Distance”
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Canberra Distance
Canberra distance is a weighted version of Manhattan distance. It is often used for data scattered around an origin, as it is biased for measures around the origin and very sensitive to values close to zero.
tsne = TSNEVisualizer(metric="canberra")
tsne.fit(docs, labels)
tsne.poof()
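For reference, the weighting looks like this:
d(u, v) = Σᵢ |uᵢ − vᵢ| / (|uᵢ| + |vᵢ|)
so each term is normalized by the magnitude of its coordinates, which is why values near zero dominate.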
Canberra Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Jaccard Distance
Jaccard distance measures dissimilarity between finite sets: one minus the quotient of the size of their intersection and the size of their union. Effective for detecting things like document duplication.
tsne = TSNEVisualizer(metric="jaccard")
tsne.fit(docs, labels)
tsne.poof()
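A toy example on boolean vectors (SciPy treats nonzero positions as set membership):

from scipy.spatial.distance import jaccard

u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]
jaccard(u, v)  # 2 disagreements over a union of 4 positions -> 0.5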
Jaccard Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Hamming Distance
Hamming distance between two equal-length strings is the number of positions at which the corresponding symbols differ. It measures the minimum number of substitutions required to change one string into the other.
tsne = TSNEVisualizer(metric="hamming")
tsne.fit(docs, labels)
tsne.poof()
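A toy example; note that SciPy returns the proportion of differing positions rather than the raw count:

from scipy.spatial.distance import hamming

u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 0]
hamming(u, v)  # 2 of 5 positions differ -> 0.4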
Hamming Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Other Yellowbrick Text Visualizers
● Intercluster Distance Maps
● Token Frequency Distribution
● Dispersion Plot
“Overview first, zoom and filter, then details-on-demand”
- Ben Shneiderman
Thank you!