Text Similarity Measures
Text Similarity Measures
• What are Text Similarity Measures?
▪ Text similarity measures are metrics that quantify the similarity or
distance between two text strings.
▪ They can be based on the surface closeness of the text strings (lexical
similarity) or on closeness of meaning (semantic similarity)
• In this class, we will be discussing lexical word similarity and lexical
document similarity.
• Measuring similarity between documents is fundamental to most forms of
document analysis. Applications that use document similarity measures
include information retrieval, text classification, document clustering,
topic modeling, topic tracking, and matrix decomposition
Text Similarity Measures
• Word Similarity
▪ Levenshtein distance
• Document Similarity
▪ Count vectorizer and the document-term matrix
▪ Bag of words
▪ Cosine similarity
▪ Term frequency-inverse document frequency (TF-IDF)
Text Similarity Measures
• Word Similarity
▪ Levenshtein distance
• Document Similarity
▪ Count vectorizer and the document-term matrix
▪ Bag of words
▪ Cosine similarity
▪ Term frequency-inverse document frequency (TF-IDF)
Word Similarity
Why is word similarity important? It can be used for the following:
▪ Spell check
▪ Speech recognition
▪ Plagiarism detection
What is a common way of quantifying word similarity?
▪ Levenshtein distance
▪ Also known as edit distance in computer science
Word Similarity
How similar are the following pairs of words?
MATH MATH
MATH BATH
MATH BAT
MATH SMASH
Word Similarity
Levenshtein distance: the minimum number of operations needed to turn one
word into another. The Levenshtein operations are:
▪ Deletions: delete a character
▪ Insertions: insert a character
▪ Mutations (substitutions): change one character into another
Example: kitten —> sitting
▪ kitten —> sitten (1 letter change)
▪ sitten —> sittin (1 letter change)
▪ sittin —> sitting (1 letter insertion)
Levenshtein distance = 3
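For reference, here is a minimal sketch of the standard dynamic-programming way to compute Levenshtein distance (the function name and structure are illustrative, not part of the original slides):

def levenshtein(s, t):
    # dp[i][j] = edit distance between the first i characters of s and the first j of t
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(t) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1  # 0 if the characters match, else one mutation
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # mutation (substitution)
    return dp[len(s)][len(t)]

print(levenshtein('kitten', 'sitting'))  # 3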
Word Similarity
How similar are the following pairs of words?
MATH vs MATH: Levenshtein distance = 0
MATH vs BATH: Levenshtein distance = 1
MATH vs BAT: Levenshtein distance = 2
MATH vs SMASH: Levenshtein distance = 2
TextBlob
Another NLP toolkit, besides NLTK
▪ Wraps around NLTK and makes it easier to use
TextBlob capabilities
▪ Tokenization
▪ Part-of-speech tagging
▪ Sentiment analysis
▪ Spell check
▪ … and more
TextBlob Demo: Tokenization
Input:
# Command line: pip install textblob
from textblob import TextBlob
my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words
Output:
WordList(['We', "'re", 'moving', 'from', 'NLTK', 'to', 'TextBlob', 'How', 'fun'])
TextBlob Demo: Spell Check
Input:
blob = TextBlob("I'm graat at speling.")
print(blob.correct())  # print function requires Python 3
Output:
I'm great at spelling.
How does the correct function work?
▪ Calculates the Levenshtein distance between the word ‘graat’ and all words in its word list
▪ Of the words with the smallest Levenshtein distance, it outputs the most popular word
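As a rough illustration of that idea (an assumption about the mechanism, not TextBlob's actual implementation), a toy corrector could reuse the levenshtein() function sketched earlier together with a hypothetical word-frequency table:

# hypothetical word counts; a real spell checker derives these from a large corpus
word_freq = {'great': 500, 'groat': 5, 'grate': 80, 'spelling': 300}

def correct_word(word, freq=word_freq):
    # pick the candidate with the smallest Levenshtein distance,
    # breaking ties by choosing the most frequent (most popular) word
    return min(freq, key=lambda w: (levenshtein(word, w), -freq[w]))

print(correct_word('graat'))  # 'great'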
Text Similarity Measures Checkpoint
• Word Similarity
▪ Levenshtein distance
• Document Similarity
▪ Count vectorizer and the document-term matrix
▪ Bag of words
▪ Cosine similarity
▪ Term frequency-inverse document frequency (TF-IDF)
Document Similarity
When is document similarity used?
▪ When sifting through a large number of documents and trying to find similar ones
▪ When trying to group, or cluster, together similar documents
To compare documents, the first step is to put them in a similar format so they
can be compared
▪ Tokenization
▪ Count vectorizer and the document-term matrix
Text Format for Analysis
There are a few ways that text data can be put into a standard format for analysis:
“This is an example”
Step 1, Tokenization: split the text into words
['This', 'is', 'an', 'example']
Step 2, One-Hot Encoding: numerically encode the words
This [1, 0, 0, 0]
is [0, 1, 0, 0]
an [0, 0, 1, 0]
example [0, 0, 0, 1]
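A minimal sketch of those two steps in plain Python (the variable names are illustrative):

sentence = 'This is an example'
tokens = sentence.split()                 # tokenization: ['This', 'is', 'an', 'example']
vocab = list(dict.fromkeys(tokens))       # vocabulary in order of first appearance
one_hot = {w: [1 if w == v else 0 for v in vocab] for w in tokens}
for w, vec in one_hot.items():
    print(w, vec)
# This [1, 0, 0, 0]
# is [0, 1, 0, 0]
# an [0, 0, 1, 0]
# example [0, 0, 0, 1]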
Text Format for Analysis: Count Vectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']
cv = CountVectorizer()
X = cv.fit_transform(corpus)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names())  # note: newer scikit-learn versions use get_feature_names_out()
Input: the code above
Output: the document-term matrix shown below
▪ A corpus is a collection of texts
▪ The resulting table is called a document-term matrix
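For reference, the document-term matrix produced by the code above looks roughly like this (CountVectorizer lowercases by default and orders the columns alphabetically):

       and  document  first  fun  is  one  second  the  third  this
Doc 0   0       1       1     0    1    0     0      1     0     1
Doc 1   0       1       0     0    1    0     1      1     0     1
Doc 2   1       0       0     1    1    2     0      1     1     0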
Text Format for Analysis: Key Concepts
The Count Vectorizer helps us create a Document-Term Matrix
• Rows = documents
• Columns = terms
corpus = ['This is the first document.',
'This is the second document.',
'And the third one. One is fun.']
Text Format for Analysis: Key Concepts
Bag of Words Model
• A simplified representation of text, where each document is treated as a bag of its words
• Grammar and word order are disregarded, but multiplicity is kept
Document Similarity Checkpoint
What was our original goal? Finding similar documents.
To compare documents, the first step is to put them in a similar format so they
can be compared
▪ Tokenization
▪ Count vectorizer and the document-term matrix
The big assumption that we’re making here is that each document is just a
Bag of Words
Document Similarity: Cosine Similarity
Cosine Similarity is a way to quantify the similarity between documents
• Step 1: Put each document in vector format
• Step 2: Find the cosine of the angle between the documents
“I love you”  ->  Doc 1  ->  a = [1, 1, 1, 0]
“I love NLP”  ->  Doc 2  ->  b = [1, 1, 0, 1]

        i   love  you  nlp
Doc 1   1    1     1    0
Doc 2   1    1     0    1

cos(a, b) = (a · b) / (||a|| * ||b||) = 2 / (√3 * √3) = 0.667
Cosine similarity measures the similarity between two non-zero vectors as the cosine of the angle between them.
Document Similarity: Cosine Similarity
Input:
from numpy import dot
from numpy.linalg import norm
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
cosine([1, 1, 1, 0], [1, 1, 0, 1])
Output:
0.667
Document Similarity: Example
Here are five documents. Which ones seem most similar to you?
“The weather is hot under the sun”
“I make my hot chocolate with milk”
“One hot encoding”
“I will have a chai latte with milk”
“There is a hot sale today”
Let’s see which ones are most similar using a mathematical approach.
Document Similarity: Example
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']
# create the document-term matrix with count vectorizer
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()
dt = pd.DataFrame(X, columns=cv.get_feature_names())
dt
Input:
Document Similarity: Example
Output:
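Roughly, the document-term matrix for this corpus looks like this after scikit-learn's English stop-word removal (note that "one" is on scikit-learn's English stop-word list, so it is removed too):

       chai  chocolate  encoding  hot  latte  make  milk  sale  sun  today  weather
Doc 0    0       0          0      1     0     0     0     0    1     0       1
Doc 1    0       1          0      1     0     1     1     0    0     0       0
Doc 2    0       0          1      1     0     0     0     0    0     0       0
Doc 3    1       0          0      0     1     0     1     0    0     0       0
Doc 4    0       0          0      1     0     0     0     1    0     1       0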
Document Similarity: Example
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
# list all of the combinations of 5 take 2 as well as the pairs of phrases
pairs = list(combinations(range(len(corpus)),2))
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
# calculate the cosine similarity for all pairs of phrases and sort by most similar
results = [cosine_similarity([X[a_index]], [X[b_index]])[0][0]
           for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)
Input:
Document Similarity: Example
[(0.40824829, ('The weather is hot under the sun', 'One hot encoding')),
(0.40824829, ('One hot encoding', 'There is a hot sale today')),
(0.35355339, ('I make my hot chocolate with milk', 'One hot encoding')),
(0.33333333, ('The weather is hot under the sun', 'There is a hot sale today')),
(0.28867513, ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
(0.28867513, ('I make my hot chocolate with milk', 'There is a hot sale today')),
(0.28867513, ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
(0.0, ('The weather is hot under the sun', 'I will have a chai latte with milk')),
(0.0, ('One hot encoding', 'I will have a chai latte with milk')),
(0.0, ('I will have a chai latte with milk', 'There is a hot sale today'))]
Output:
▪ These two documents are rated most similar, but only because the term “hot” is a popular word
▪ “Milk” seems to be a better differentiator, so how can we mathematically highlight that?
Document Similarity: Beyond Count Vectorizer
Downsides of Count Vectorizer
• Counts can be too simplistic
• High counts can dominate, especially for high-frequency words or long documents
• Each word is treated equally, even though some terms might be more important than others
We want a metric that accounts for these issues
• Introducing Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency
TF-IDF = (Term Frequency) * (Inverse Document Frequency)
▪ Term Frequency = (term count in document) / (total terms in document)
▪ Inverse Document Frequency = log( (total documents + 1) / (documents containing the term + 1) )
▪ The TF-IDF score takes a different value for every document/term combination
Term Frequency-Inverse Document Frequency
Term Frequency
• So far, we’ve been recording the term (word) count
“This is an example”
• However, if there were two documents, one very long and one very short, it
wouldn’t be fair to compare them by word count alone
• A better way to compare them is by a normalized term frequency, which is
(term count) / (total terms).
• There are many ways to do this. Another example is log(count +1)
Raw term counts: This = 1, is = 1, an = 1, example = 1
Normalized term frequency: This = 0.25, is = 0.25, an = 0.25, example = 0.25
Term Frequency-Inverse Document Frequency
Inverse Document Frequency
• Besides term frequency, another thing to consider is how common a word is
among all the documents
• Rare words should get additional weight
IDF = log( (total documents + 1) / (documents containing the term + 1) )
▪ The +1 makes sure that the denominator is never 0
▪ The log dampens the effect of IDF
Term Frequency-Inverse Document Frequency
TF-IDF = (Term Frequency) * (Inverse Document Frequency)
▪ Term Frequency = (term count in document) / (total terms in document)
▪ Inverse Document Frequency = log( (total documents + 1) / (documents containing the term + 1) )
▪ The TF-IDF score takes a different value for every document/term combination
Term Frequency-Inverse Document Frequency
TF-IDF Intuition:
• TF-IDF assigns more weight to rare words and less weight to commonly
occurring words.
• Tells us how frequent a word is in a document relative to its frequency in
the entire corpus.
• Tells us that two documents are similar when they have more rare words
in common.
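A minimal sketch of the slide's formula computed by hand (note: scikit-learn's TfidfVectorizer uses a slightly different smoothing and L2-normalizes each document vector, so its numbers will not match this exactly):

import math

docs = [['this', 'is', 'the', 'first', 'document'],
        ['this', 'is', 'the', 'second', 'document'],
        ['and', 'the', 'third', 'one', 'one', 'is', 'fun']]

def tf(term, doc):
    return doc.count(term) / len(doc)                      # term count in document / total terms in document

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)         # documents containing the term
    return math.log((len(docs) + 1) / (containing + 1))    # +1 keeps the denominator from being 0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf('the', docs[0], docs))    # 0.0   -> 'the' appears in every document
print(tfidf('first', docs[0], docs))  # ~0.14 -> 'first' appears in only one document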
Count Vectorizer vs TF-IDF Vectorizer
import pandas as pd
corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']
# original Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names())
# new TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names())
Count Vectorizer vs TF-IDF Vectorizer
Count Vectorizer Output: a document-term matrix of whole-number counts
TF-IDF Vectorizer Output: a document-term matrix of floating-point TF-IDF scores
Document Similarity: Example
Let’s go back to the problem we were originally trying to solve.
Here are five documents. Which ones seem most similar to you?
“The weather is hot under the sun”
“I make my hot chocolate with milk”
“One hot encoding”
“I will have a chai latte with milk”
“There is a hot sale today”
With Count Vectorizer,
these two documents
were the most similar
Document Similarity: Example with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf,columns=cv_tfidf.get_feature_names())
dt_tfidf
Input:
Output:
Document Similarity: Example with TF-IDF
# calculate the cosine similarity for all pairs of phrases and sort by most similar
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]])[0][0]
                 for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)
[(0.23204485, ('I make my hot chocolate with milk', 'I will have a chai latte with
milk')),
(0.18165505, ('The weather is hot under the sun', 'One hot encoding')),
(0.18165505, ('One hot encoding', 'There is a hot sale today')),
(0.16050660, ('I make my hot chocolate with milk', 'One hot encoding')),
(0.13696380, ('The weather is hot under the sun', 'There is a hot sale today')),
(0.12101835, ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
(0.12101835, ('I make my hot chocolate with milk', 'There is a hot sale today')),
(0.0, ('The weather is hot under the sun', 'I will have a chai latte with milk')),
(0.0, ('One hot encoding', 'I will have a chai latte with milk')),
(0.0, ('I will have a chai latte with milk', 'There is a hot sale today'))]
By weighting “milk” (rare) > “hot” (popular), we get a smarter similarity score
Text Similarity Measures Summary
• Word Similarity
▪ Levenshtein distance is a popular way to calculate word similarity
▪ TextBlob, another NLP library, uses this concept for its spell check function
• Document Similarity
▪ Cosine similarity is a popular way to calculate document similarity
▪ To compare documents, they need to be put in document-term matrix form
▪ The document-term matrix can be made using Count Vectorizer or TF-IDF Vectorizer
Editor's Notes

  • #2: Welcome to Week 3! Today we’ll be reviewing some key terms in NLP.
  • #4: Today, we’ll be comparing how similar two pieces of text are. To do this, we’ll introduce the concepts of Levenshtein distance, document-term matrices and cosine similarity as ways of quantifying how ‘similar’ two pieces of text are. In addition, we’ll go over what the ‘Bag of Words’ assumption in NLP is and how Count Vectorizers and Term frequency-inverse document frequency help us create document-term matrices.
  • #5: Before we start comparing two entire pieces of text, let’s start small: how do we even tell how ‘similar’ two individual words are?
  • #6: Word similarity has a host of applications in natural language processing. <example?> Today, we’ll be covering the Levenshtein distance, the standard word similarity measure in practice.
  • #7: This seems pretty easy at first. ‘MATH’ and ‘MATH’ are the same word, so they’re 100% similar! ‘MATH’ and ‘BATH’ are only one letter apart, so they must be pretty similar too. However, as you go further down the list, it’s a bit trickier to quantify. There is actually a standard way of measuring this distance, though, called the Levenshtein distance.
  • #8: As you can see, Levenshtein distance is pretty intuitive, there are only three operations and the computer finds the shortest way to edit one word into another. It’s similar to a childhood game - what’s the minimum number of steps you can take to change a word to another word?
  • #9: Now, let’s reapply this to the example we showed earlier. Can you find the operations used to edit one word into the other with ‘Levenshtein distance’ number of steps?
  • #10: Now, let’s go over TextBlob, a toolkit that wraps around NLTK, making it easier to use. It has many of the same functions as NLTK, as well as a few new ones, which make it worth examining. You may also want to note that spaCy is an up-and-coming NLP library in Python. It’s very powerful and, like TextBlob, it’s easier to use than NLTK.
  • #11: As you can see, TextBlob is able to tokenize our text pretty easily!
  • #12: As you can see, TextBlob is able to understand what we were trying to say! How does it do this? The Levenshtein distance we just introduced! Here are some more details on how TextBlob’s spell check works: http://norvig.com/spell-correct.html This is a very basic spell checker. It could definitely be smarter.
  • #13: Now we move into the meat of the similarity measures section. Often, you’ll want to be finding the similarity between documents instead of just words.
  • #14: Document similarity is used when you have a large number of texts (a corpus) and you would like to perform some kind of analysis on them without having to read them all in person. Document similarity measures allow you to encode the texts in a similar format, and using some cool maths, we can find a ‘distance’ between the texts to rate their similarity. To start though, we need to actually encode the documents. To do this, we’ll be leveraging the tokenization idea from last week and introducing the idea of a ‘document-term’ matrix.
  • #15: First let’s start with encoding a simple sentence (a short text). Let’s say we wanted to split up the sentence above. If we first extracted the words in the sentence (via tokenization) and we had a dictionary of all possible words, we could encode these tokens as vectors. This is known as ‘one-hot encoding’ because, as you can see, there is a ‘1’ in the position corresponding to the word and a ‘0’ in every other position.
  • #16: Now let’s move on to multiple texts! We’ll be using what’s known as a CountVectorizer to produce a document-term matrix, which you can see at the bottom of the slide. Instantiation is easy as you can see, and it’s easy to turn the corpus that we made into a document-term matrix. Let’s analyze more closely. As you can see, our corpus has three texts, and our document-term matrix has three rows. In addition, our matrix has words along the top. Hopefully you can start to get an intuition for how our document-term matrix relates to our corpus from these insights.
  • #17: So if you guessed that each ‘row’ corresponded to a document and each column corresponded to the terms in all the documents, you’d be right! What do the numbers mean? Well, if we look at the top left entry, there’s a 0. That means that there are 0 occurrences of the term ‘and’ in document 0. Doing a quick visual check, yep, that seems right! Now, let’s do another example: look at the bottom left entry. It has a 1, meaning that document 2 (or text #3, because the matrix rows are 0-indexed) has exactly one occurrence of the word ‘and’. That also seems right! Quickly check the other entries if you aren’t convinced!
  • #18: Now, this may seem like an overly simplistic encoding of a document to some of you. Why? Because all the information in our document is reduced to a single vector. The fact is that the vector is able to hold some of the complexity of the text, but it misses out on a lot. All that is encoded in a document-term matrix is 1) which terms appear in a document and 2) how many times they appear. If you had two documents [‘This is a text’, ‘Text this is a’], according to our encoding, they would be exactly the same! Basically, if you took the words in a text, put them in a bag, shook them around and looked at the output, to our encoding it would be exactly the same. Thus, grammar and word order are ignored in the Bag of Words model. Although this seems extremely simplistic, we are actually able to perform some pretty powerful analysis using this assumption.
  • #19: Let’s review what we’ve covered. To start, we tokenize the text into words, then used the count vectorizer to encode the texts into a document-term matrix. This means that we are using a ‘Bag of Words’ assumption because our encoding only cares about which words appear and how many times they appear.
  • #20: How do we compare texts? We use cosine (yes the same formula from Linear Algebra). Why does that make sense? Remember, we encoded our text into vectors so to compare these vectors, we compute the cosine as a measure of how far our vectors are in our ‘concept space’. In our document-term matrix, the vector for document 1 would be the numbers in the row corresponding to document 1. This is true for every document, so we now have a vector for every document.
  • #21: To make a cosine function, we just reproduce the formula using numpy functions.
  • #22: Now, let’s try this on an example! Here are 5 texts that we want to compare. Let’s ask the question, which of these two texts seem the most similar? While many of the documents have the word “hot” in it, the two documents with “milk” actually seem most similar. Let’s see what our formula tells us.
  • #23: First, we encode our corpus into a document-term matrix. The reason we use the parameter stop_words="english" is that our CountVectorizer has an automatic feature to remove stop words in certain languages. Less work for us! Then, we convert this document-term matrix into a pandas DataFrame! Pandas is the premier data science library in Python, and if you aren’t familiar with it at all, here’s a quick introduction included in the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/10min.html
  • #24: Here is our document-term matrix.
  • #25: We use itertools to find a list of all the numerical combinations of size 2 over the range of the corpus length, using the ‘combinations’ function. Next, we use those numerical combinations as indices and create a list of all the actual pairs of sentences. Then, we compute the cosine similarity for all these pairs and create a list from the results. Finally, we sort them by their cosine similarity (in reverse, so that the largest cosine similarity is presented first).
  • #26: It seems that our cosine similarity chose these two documents as being the most similar, but looking at them, they don’t seem terribly similar. If we examine our document-term matrix by hand for these two documents, you can see that after removing stop words, these documents have very few differences. However, the word ‘milk’ seems to be a better differentiator of documents. I wonder if there’s a way to capture that?
  • #27: As you just saw, the Count Vectorizer alone may be too simplistic for some purposes because every word is treated equally. A high frequency word that is not good for differentiation may be what’s dominating our similarity measure. Thus, we introduce the Term Frequency-Inverse Document Frequency metric to combat this.
  • #28: Let’s break down this long and complicated name in the next two slides.
  • #29: Term Frequency: How many times a term appears in a document. However, we don’t want to simply deal with raw counts; we want to weight the counts by how many terms are in the document in total. One option is a sublinear transformation of the word count, such as log(count + 1).
  • #30: Remember how we said we wanted a better differentiation metric? This is where the magic happens: we divide the total documents by the number of documents containing the term we’re looking at. What happens as the number of documents containing the term grows towards the total number of documents? The rarer the word, the higher the IDF, because it appears in fewer documents.
  • #31: What does a TF-IDF score tell us about a word? What does it tell us about a document? TF-IDF assigns more weight to rare words and less weight to commonly occurring words. IDF is often written as log((D+1)/(d+1)) where: D = total documents (a document could be a line, paragraph, page etc. that represents a record/row in the dataset) and d = documents containing the term.
  • #32: Notice that TF-IDF produces a score for each term/document combo. Thus, this is what we will use in place of the ‘count’ from before. So, our document vector will now be its TF-IDF score for every term. Notice that the score will still be 0 if the term doesn’t appear at all in the document. Our document similarity will now also be weighted towards rare words, and ‘similar’ documents will contain more ‘rare’ words in common.
  • #33: Thankfully, not much is different from a code standpoint except for the Vectorizer that you import!
  • #34: Our document-term matrix also looks very similar but instead of whole numbers as our counts, we now have TF-IDF scores which are floating point.
  • #35: Remember, these were the two documents chosen as most similar by the CountVectorizer.
  • #36: The code is the same as before except for the TF-IDF vectorizer! Note that the values in the matrix are now TF-IDF values instead of counts.
  • #37: Now, let’s find the most similar documents! The two documents containing the word milk were rated as most similar because our TF-IDF encoding allowed us to weight the rarer words, so we got the metric we wanted!
  • #38: In summary, today we covered Levenshtein distance as a measure of word similarity and looked briefly at TextBlob. For document similarity, we introduced the idea of a document-term matrix, which can be produced either through a Count Vectorizer or a TF-IDF Vectorizer, which weights rarer words. In addition, we looked at using the cosine similarity between two document vectors to compare their similarity. Don’t forget that when we’re using a document-term matrix, we are under the ‘Bag of Words’ assumption: our encoding does not care about word order or grammar!