TF-IDF
Natalie Parde
UIC CS 421
What other ways can we build vector representations for words?

[Schematic: a word-word matrix with rows w1 … wn and columns c1 … cn; the row for “critique” is highlighted, its cell values still unknown.]
One Approach: TF-IDF
• Term Frequency * Inverse Document Frequency
• Meaning of a word is defined by the counts of words in the same document, as well as overall
• To do this, a co-occurrence matrix is needed
TF-IDF originated as a tool for information retrieval.
• Rows: Words in a vocabulary
• Columns: Documents in a collection

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7           13
good           114               80              62           89
fool            36               58               1            4
wit             20               15               2            3

“wit” appears 3 times in Henry V
In a term-document matrix, rows can be viewed as word vectors.
• Each dimension corresponds to a document
• Words with similar vectors occur in similar documents

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7           13
good           114               80              62           89
fool            36               58               1            4
wit             20               15               2            3
Considering only the Julius Caesar and Henry V dimensions, each word reduces to a two-dimensional vector:

          Julius Caesar   Henry V
battle       [7,             13]
good         [62,            89]
fool         [1,              4]
wit          [2,              3]
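To make the row-vector view concrete, here is a minimal numpy sketch (not from the slides; the values are copied from the table above) that stores the term-document matrix and slices out individual word vectors:

```python
import numpy as np

docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
words = ["battle", "good", "fool", "wit"]

# Term-document matrix: rows = words, columns = documents.
M = np.array([
    [  1,  0,  7, 13],   # battle
    [114, 80, 62, 89],   # good
    [ 36, 58,  1,  4],   # fool
    [ 20, 15,  2,  3],   # wit
])

# Each row is a len(docs)-dimensional vector for the corresponding word.
battle = M[words.index("battle")]

# Keeping only the last two columns reproduces the 2-D vectors above,
# e.g., battle -> [7, 13].
print(battle[2:])   # [ 7 13]
```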
Different Types of Context
• Documents aren’t the most common type of context used to represent meaning in word vectors
• More common: word context
  • Referred to as a term-term matrix, word-word matrix, or term-context matrix
• In a word-word matrix, the columns are also labeled by words
  • Thus, dimensionality is |V| x |V|
• Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in a training corpus
How can you decide if two words occur in the same context?
• Common context windows:
  • Entire document
    • Cell value = # times the words co-occur in the same document
  • Predetermined span surrounding the target
    • Cell value = # times the words co-occur in this span of words
Example Context Window (Size = 4)
• Take each occurrence of a word (e.g., strawberry)
• Count the context words in the four-word spans before and after it to get a word-word co-occurrence matrix (a counting sketch follows the examples below)

is traditionally followed by cherry pie, a traditional dessert
often mixed, such as strawberry rhubarb pie. Apple pie
computer peripherals and personal digital assistants. These devices usually
a computer. This includes information available on the internet
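As a concrete illustration, here is a minimal counting sketch (my own, not from the slides; it assumes simple whitespace tokenization with punctuation stripped, and ignores sentence boundaries) of a symmetric size-4 context window:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Count (target, context) co-occurrences within +/- `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "often mixed such as strawberry rhubarb pie apple pie".split()
counts = cooccurrence_counts(tokens)
print(counts["strawberry"]["pie"])   # 2: both "pie" tokens fall in the window
```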
Example Context Window (Size = 4)
• A simplified subset of a word-word co-occurrence matrix could appear as follows, given a sufficient corpus

              aardvark   …   computer   data   result   pie   sugar   …
cherry            0      …        2        8       9    442     25    …
strawberry        0      …        0        0       1     60     19    …
digital           0      …     1670     1683      85      5      4    …
information       0      …     3325     3982     378      5     13    …

The highlighted row [0, …, 0, 0, 1, 60, 19, …] is the vector for “strawberry”.
So far, our co-occurrence matrices have contained raw frequency counts of word co-occurrences.
• However, this isn’t the best measure of association between words
• Some words co-occur frequently with many words, so they won’t be very informative
  • the, it, they
• We want to know about words that co-occur frequently with one another, but less frequently across all texts
This is where TF-IDF comes in handy!
• Term Frequency: The frequency of the word t in the document d
  • $tf_{t,d} = \text{count}(t, d)$
• Document Frequency: The number of documents in which the word t occurs
  • Different from collection frequency (the number of times the word occurs in the entire collection of documents)
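The document frequency vs. collection frequency distinction is worth pinning down; a tiny sketch (the toy corpus is invented for this illustration):

```python
corpus = [["wit", "and", "wit"],   # d1: "wit" occurs twice
          ["no", "jokes"],         # d2
          ["such", "wit"]]         # d3

df = sum(1 for doc in corpus if "wit" in doc)   # document frequency: 2
cf = sum(doc.count("wit") for doc in corpus)    # collection frequency: 3
print(df, cf)                                   # 2 3
```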
Computing TF-IDF
• Inverse Document Frequency: The inverse of document frequency, where N is the total number of documents in the collection
  • $idf_t = \frac{N}{df_t}$
• IDF is higher when the term occurs in fewer documents
• What is a document?
  • An individual instance in your corpus (e.g., book, play, sentence, etc.)
• It is often useful to perform these computations in log space
  • TF: $\log_{10}(tf_{t,d} + 1)$
  • IDF: $\log_{10}(idf_t)$
Computing TF-IDF
• TF-IDF is then simply the combination of TF and IDF
  • $tfidf_{t,d} = tf_{t,d} \times idf_t$
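Putting the definitions together, a minimal sketch of the log-space formulas above (the function name and toy corpus are illustrative assumptions, not from the slides):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`; `corpus` is a list of token lists."""
    tf = math.log10(doc.count(term) + 1)               # log10(count(t, d) + 1)
    df = sum(1 for d in corpus if term in d)           # docs containing the term
    idf = math.log10(len(corpus) / df) if df else 0.0  # log10(N / df_t)
    return tf * idf

corpus = [["battle", "of", "wits"], ["good", "wit"], ["battle", "cry"]]
print(tf_idf("battle", corpus[0], corpus))  # log10(2) * log10(3/2) ≈ 0.053
```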
Example: Computing TF-IDF
• TF-IDF(battle, d1) = ?

          d1    d2    d3    d4
battle     1     0     7    13
good     114    80    62    89
fool      36    58     1     4
wit       20    15     2     3
Document frequencies, computed over the full 37-document corpus:

word      df
battle    21
good      37
fool      36
wit       34

• TF(battle, d1) = 1
• IDF(battle) = N/DF(battle) = 37/21 = 1.76
• TF-IDF(battle, d1) = 1 * 1.76 = 1.76
• Alternately, in log space: TF-IDF(battle, d1) = $\log_{10}(1 + 1) \times \log_{10}(1.76)$ = 0.074
Filling the log-space value into the matrix:

          d1       d2    d3    d4
battle   0.074      0     7    13
good       114     80    62    89
fool        36     58     1     4
wit         20     15     2     3
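The arithmetic above is easy to verify directly; a quick check in Python:

```python
import math

N, df_battle, tf_battle_d1 = 37, 21, 1

idf = N / df_battle                  # 1.7619... ≈ 1.76
print(tf_battle_d1 * idf)            # raw TF-IDF: 1.76...
print(math.log10(tf_battle_d1 + 1) * math.log10(idf))  # log space: 0.0740...
```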
To convert our entire word co-occurrence matrix to a TF-IDF matrix, we need to repeat this calculation for each element:

          d1      d2      d3      d4
battle   0.074   0.000   0.220   0.280
good     0.000   0.000   0.000   0.000
fool     0.019   0.021   0.004   0.008
wit      0.049   0.044   0.018   0.022
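Rather than repeating the calculation cell by cell, the whole conversion can be vectorized; a sketch with numpy (N and the df column come from the worked example above; results match the slide’s matrix up to rounding):

```python
import numpy as np

counts = np.array([[  1,  0,  7, 13],    # battle
                   [114, 80, 62, 89],    # good
                   [ 36, 58,  1,  4],    # fool
                   [ 20, 15,  2,  3]],   # wit
                  dtype=float)
N = 37.0
df = np.array([21, 37, 36, 34], dtype=float)

# log-space TF times log-space IDF, broadcast across the document columns
tfidf = np.log10(counts + 1) * np.log10(N / df)[:, None]
print(tfidf.round(3))
# [[0.074 0.    0.222 0.282]   <- the slide rounds these to 0.220 and 0.280
#  [0.    0.    0.    0.   ]
#  [0.019 0.021 0.004 0.008]
#  [0.049 0.044 0.018 0.022]]
```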
How does the TF-IDF matrix compare to the original term frequency matrix?

Raw term frequencies:
          d1    d2    d3    d4
battle     1     0     7    13
good     114    80    62    89
fool      36    58     1     4
wit       20    15     2     3

TF-IDF values:
          d1      d2      d3      d4
battle   0.074   0.000   0.220   0.280
good     0.000   0.000   0.000   0.000
fool     0.019   0.021   0.004   0.008
wit      0.049   0.044   0.018   0.022

• “good” occurs in every document, so it’s not important in the overall scheme of things: its TF-IDF weights are all 0
• TF-IDF increases the importance of rarer words like “battle”
Note that the TF-IDF model produces a sparse vector.
• Sparse: Many (usually most) cells have values of 0

          d1      d2      d3      d4
battle   0.074   0.000   0.220   0.280
good     0.000   0.000   0.000   0.000
fool     0.019   0.021   0.004   0.008
wit      0.049   0.044   0.018   0.022

With more documents, the sparsity becomes even more apparent:

          d1    d2    d3    d4    d5    d6    d7
battle   0.1   0.0   0.0   0.0   0.2   0.0   0.3
good     0.0   0.0   0.0   0.0   0.0   0.0   0.0
fool     0.0   0.0   0.0   0.0   0.0   0.0   0.0
wit      0.0   0.0   0.0   0.0   0.0   0.0   0.0
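One common way to cope with all those zeros is a sparse storage format that keeps only the nonzero cells; a sketch with scipy (the values are the seven-document example above; the slides themselves don’t prescribe a storage scheme):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0.1, 0.0, 0.0, 0.0, 0.2, 0.0, 0.3],   # battle
                  [0.0] * 7,                              # good
                  [0.0] * 7,                              # fool
                  [0.0] * 7])                             # wit
sparse = csr_matrix(dense)
print(sparse.nnz, "nonzero cells out of", dense.size)   # 3 nonzero cells out of 28
```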
This can be problematic!
• However, TF-IDF remains a useful starting point for vector space models
• Generally combined with standard machine learning algorithms (a pipeline sketch follows below)
  • Logistic Regression
  • Naïve Bayes
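To close the loop, here is a hedged sketch of the pairing described above: TF-IDF features feeding a logistic regression classifier via scikit-learn. The slides don’t prescribe a library, the texts and labels are toy data, and note that scikit-learn’s TfidfVectorizer uses a smoothed variant of the IDF formula rather than the exact one from these slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the battle was fierce", "what a good fool",
         "wit and good humor", "a battle of wits"]
labels = [1, 0, 0, 1]   # hypothetical binary labels (e.g., history vs. comedy)

# Vectorize with TF-IDF, then classify with logistic regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["a fierce battle"]))   # likely [1] on this toy data
```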