Words in Space
A Visual Exploration of Distance, Documents, and Distributions for Text Analysis
PyData DC 2018
Dr. Rebecca Bilbro
Head of Data Science, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro
Machine Learning Review
The Machine Learning Problem:
Given a set of n samples of data such that each sample is
represented by more than a single number (e.g. multivariate
data that has several attributes or features), create a model
that is able to predict unknown properties of each sample.
Spatial interpretation:
Given data points in a bounded, high-dimensional space, define decision regions for any point in that space.
Instances are composed of features that make up our dimensions.
Feature space is the n-dimensional space where our variables live (not including the target).
Feature extraction is the art of creating a space with decision boundaries.
Example

Target:
Y ≡ Thickness of car tires after some testing period

Variables:
X₁ ≡ distance travelled in test
X₂ ≡ time duration of test
X₃ ≡ amount of chemical C in tires

The feature space is R³, or more accurately, the positive quadrant in R³, as all the X variables can only be positive quantities.

Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, X₄ (this is the feature extraction part):

X₄ = X₁ / X₂ ≡ the speed of the vehicle during testing.

This extends our old feature space into a new one, the positive part of R⁴.

A mapping is a function, ϕ, from R³ to R⁴:

ϕ(x₁, x₂, x₃) = (x₁, x₂, x₃, x₁/x₂)
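As a quick sketch of this feature extraction step in Python (the measurement values below are invented for illustration):

import numpy as np

# hypothetical raw measurements, one entry per tire test
X1 = np.array([1200.0, 800.0, 1500.0])  # distance travelled in test
X2 = np.array([10.0, 8.0, 12.0])        # time duration of test
X3 = np.array([0.3, 0.5, 0.2])          # amount of chemical C in tires

# the mapping phi: derive speed X4 = X1 / X2, extending R^3 to R^4
X4 = X1 / X2
features = np.column_stack([X1, X2, X3, X4])  # shape (3, 4)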
Modeling Non-Numeric Data
Real-world data is often not represented numerically out of the box (e.g. text, images), therefore some transformation must be applied in order to do machine learning.
Tricky Part
Machine learning relies on our ability to imagine data as points in space, where the relative closeness of any two points is a measure of their similarity.
So... when we transform those non-numeric features into numeric ones, how should we quantify the distance between instances?
There are many ways of quantifying “distance” (or similarity): Euclidean distance is often the default for numeric data, while cosine similarity is a common rule of thumb for text data.
With text, our choice of distance metric is very
important! Why?
Challenges of Modeling Text Data
● Very high dimensional
○ One dimension for every word (token) in the corpus!
● Sparsely distributed
○ Documents vary in length!
○ Most instances (documents) may be mostly zeros!
● Has some features that are more important than others
○ E.g. the “of” dimension vs. the “basketball” dimension when clustering sports articles.
● Has some feature variations that matter more than others
○ E.g. freq(tree) vs. freq(horticulture) in classifying gardening books.
Help!
scikit-learn
from sklearn.metrics import pairwise_distances

pairwise_distances(X, Y=None, metric='euclidean', n_jobs=None, **kwds)

Compute the distance matrix from a vector array X and optional Y.
Valid values for metric are:
● From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'].
● From scipy.spatial.distance...
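A minimal sketch of how this might be used on text (the three-document corpus is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

corpus = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]
X = CountVectorizer().fit_transform(corpus)  # sparse document-term matrix
D = pairwise_distances(X, metric="cosine")   # 3x3 matrix of pairwise distances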
SciPy!
Distance functions between two numeric vectors
u and v:
● braycurtis(u, v[, w])
● canberra(u, v[, w])
● chebyshev(u, v[, w])
● cityblock(u, v[, w])
● correlation(u, v[, w, centered])
● cosine(u, v[, w])
● euclidean(u, v[, w])
● mahalanobis(u, v, VI)
● minkowski(u, v[, p, w])
● seuclidean(u, v, V)
● sqeuclidean(u, v[, w])
● wminkowski(u, v, p, w)
Distance functions between two boolean vectors
(sets) u and v:
● dice(u, v[, w])
● hamming(u, v[, w])
● jaccard(u, v[, w])
● kulsinski(u, v[, w])
● rogerstanimoto(u, v[, w])
● russellrao(u, v[, w])
● sokalmichener(u, v[, w])
● sokalsneath(u, v[, w])
● yule(u, v[, w])
Note: most don’t support sparse matrix inputs.
Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
Workflow: Feature Analysis → Algorithm Selection → Hyperparameter Tuning
Model selection is iterative, but can be steered!
TSNE (t-distributed Stochastic Neighbor
Embedding)
1. Apply SVD (or PCA) to reduce
dimensionality (for efficiency).
2. Embed vectors using probability
distributions from both the original
dimensionality and the decomposed
dimensionality.
3. Cluster and visualize similar
documents in a scatterplot.
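In code, the pipeline looks roughly like this (corpus and labels are assumed to be a list of raw documents and their class labels; poof() was the Yellowbrick display call at the time of this talk and has since been renamed show()):

from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

docs = TfidfVectorizer().fit_transform(corpus)  # vectorize the raw documents
tsne = TSNEVisualizer()  # applies an SVD decomposition first by default
tsne.fit(docs, labels)   # labels color the resulting clusters
tsne.poof()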
Three Example Datasets
Hobbies corpus
● From the Baleen project
● 448 newspaper/blog articles
● 5 classes: gaming, cooking, cinema, books, sports
● Doc length (in words): 532 avg, 14564 max, 1 min
Farm Ads corpus
● From the UCI Repository
● 4144 ads represented as a list of metadata tags
● 2 classes: accepted, not accepted
● Doc length (in words): 270 avg, 5316 max, 1 min
Dresses Attributes Sales corpus
● From the UCI Repository
● 500 dresses represented as features: neckline, waistline, fabric, size, season
● Doc length (in words): 11 avg, 11 max, 11 min
Euclidean Distance
Euclidean distance is the straight-line distance between two points in Euclidean (metric) space.
tsne = TSNEVisualizer(metric="euclidean")
tsne.fit(docs, labels)
tsne.poof()
[Plot: two documents as points in the plane, Doc 1 at (7, 14) and Doc 2 at (20, 19), joined by the straight-line distance between them.]
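For the two documents in the plot:
d(Doc 1, Doc 2) = √((20 − 7)² + (19 − 14)²) = √(169 + 25) = √194 ≈ 13.9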
Euclidean Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Cityblock (Manhattan) Distance
Manhattan distance between two points is computed as the sum of the absolute
differences of their Cartesian coordinates.
tsne = TSNEVisualizer(metric="cityblock")
tsne.fit(docs, labels)
tsne.poof()
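Using the same two example points as in the Euclidean plot:
d(Doc 1, Doc 2) = |20 − 7| + |19 − 14| = 13 + 5 = 18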
Cityblock (Manhattan) Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Chebyshev Distance
Chebyshev distance is the L∞-norm of the difference between two points (a special
case of the Minkowski distance where p goes to infinity). It is also known as
chessboard distance.
tsne = TSNEVisualizer(metric="chebyshev")
tsne.fit(docs, labels)
tsne.poof()
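Using the same two example points:
d(Doc 1, Doc 2) = max(|20 − 7|, |19 − 14|) = max(13, 5) = 13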
Chebyshev Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Minkowski Distance
Minkowski distance is a generalization of Euclidean, Manhattan, and Chebyshev
distance, and defines distance between points in a normalized vector space as the
generalized Lp-norm of their difference.
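For reference, the general form is:
d(u, v) = (Σᵢ |uᵢ − vᵢ|^p)^(1/p)
where p = 1 gives Manhattan, p = 2 gives Euclidean, and p → ∞ gives Chebyshev.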
tsne = TSNEVisualizer(metric="minkowski")
tsne.fit(docs, labels)
tsne.poof()
Minkowski Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Mahalanobis Distance
A multidimensional generalization
of the distance between a point
and a distribution of points.
tsne = TSNEVisualizer(metric="mahalanobis", method='exact')
tsne.fit(docs, labels)
tsne.poof()
Think: shifting and rescaling coordinates with respect to distribution. Can help find
similarities between different-length docs.
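A minimal SciPy sketch, with random data standing in for dense document vectors (Mahalanobis needs the inverse covariance matrix VI):

import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(100, 5)                   # stand-in document vectors
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
d = mahalanobis(X[0], X[1], VI)              # distance between two rows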
Mahalanobis Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Cosine “Distance”
Cosine similarity is the cosine of the angle between two doc vectors; cosine “distance” is one minus that similarity. The more parallel, the more similar. Corrects for length variations (angles rather than magnitudes). Considers only non-zero elements (efficient for sparse vectors!).
Note: Cosine distance is not technically a distance measure because it doesn’t satisfy the triangle inequality.
tsne = TSNEVisualizer(metric="cosine")
tsne.fit(docs, labels)
tsne.poof()
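Two toy calls that illustrate the definition:

from scipy.spatial.distance import cosine

cosine([1, 2, 0], [2, 4, 0])  # parallel vectors -> distance 0.0
cosine([1, 0, 0], [0, 1, 0])  # orthogonal vectors -> distance 1.0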
Cosine “Distance”
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Canberra Distance
Canberra distance is a weighted version of Manhattan distance. It is often used for data scattered around an origin, as it is biased for measures around the origin and very sensitive to values close to zero.
tsne = TSNEVisualizer(metric="canberra")
tsne.fit(docs, labels)
tsne.poof()
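For reference, the weighting looks like this:
d(u, v) = Σᵢ |uᵢ − vᵢ| / (|uᵢ| + |vᵢ|)
so each term is normalized by the magnitude of its coordinates, which is why values near zero dominate.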
Canberra Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Jaccard Distance
Jaccard distance measures dissimilarity between finite sets: one minus the quotient of the size of their intersection and the size of their union. Effective for detecting things like document duplication.
tsne = TSNEVisualizer(metric="jaccard")
tsne.fit(docs, labels)
tsne.poof()
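A toy example on boolean vectors (SciPy treats nonzero positions as set membership):

from scipy.spatial.distance import jaccard

u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]
jaccard(u, v)  # 2 disagreements over a union of 4 positions -> 0.5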
Jaccard Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Hamming Distance
Hamming distance between two equal-length strings is the number of positions at which the corresponding symbols differ. It measures the minimum number of substitutions required to change one string into the other.
tsne = TSNEVisualizer(metric="hamming")
tsne.fit(docs, labels)
tsne.poof()
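A toy example; note that SciPy returns the proportion of differing positions rather than the raw count:

from scipy.spatial.distance import hamming

u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 0]
hamming(u, v)  # 2 of 5 positions differ -> 0.4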
Hamming Distance
[t-SNE plots: Hobbies Corpus | Ads Corpus | Dresses Corpus]
Other Yellowbrick Text Visualizers
● Intercluster Distance Maps
● Token Frequency Distribution
● Dispersion Plot
“Overview first, zoom and filter, then details-on-demand”
- Ben Shneiderman
Thank you!