Text Classification and Naive Bayes
The Task of Text Classification
Is this spam?
Who wrote which Federalist Papers?
1787–88: essays anonymously written by Alexander Hamilton, James Madison, and John Jay to convince New York to ratify the U.S. Constitution
Authorship of 12 of the essays was unclear between James Madison and Alexander Hamilton
1963: solved by Mosteller and Wallace using Bayesian methods
Positive or negative movie review?
unbelievably disappointing
Full of zany characters and richly applied satire, and
some great plot twists
this is the greatest screwball comedy ever filmed
It was pathetic. The worst part about it was the
boxing scenes.
What is the subject of this article?
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
…
(Figure: a MEDLINE article to be assigned to a category in the MeSH Subject Category Hierarchy)
Text Classification
Assigning subject categories, topics, or genres
Spam detection
Authorship identification (who wrote this?)
Language Identification (is this Portuguese?)
Sentiment analysis
…
Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class c ∈ C
Basic Classification Method:
Hand-coded rules
Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “have been selected”)
Accuracy can be high
• In very specific domains
• If rules are carefully refined by experts
But:
• building and maintaining rules is expensive
• they are too literal and specific: "high-precision, low-recall"
Classification Method:
Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ A training set of m hand-labeled documents (d1,c1),....,(dm,cm)
Output:
◦ a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
Many kinds of classifiers!
• Naïve Bayes (this lecture)
• Logistic regression
• Neural networks
• k-nearest neighbors
• …
We can also use pretrained large language models!
• Fine-tuned as classifiers
• Prompted to give a classification
Text Classification and Naive Bayes
The Naive Bayes Classifier
Naive Bayes Intuition
Simple ("naive") classification method based on Bayes' rule
Relies on very simple representation of document
◦ Bag of words
The Bag of Words Representation
The bag of words representation
γ(d) = c, with the document d represented by its word counts:
seen 2
sweet 1
whimsical 1
recommend 1
happy 1
... ...
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c: P(c|d) = P(d|c) P(c) / P(d)
Naive Bayes Classifier (I)
MAP is "maximum a posteriori" = the most likely class
Apply Bayes' rule, then drop the denominator P(d), since it is the same for every class.
Naive Bayes Classifier (II)
Document d is represented as features x1..xn; the two terms are the "likelihood" P(x1,…,xn | c) and the "prior" P(c).
Naïve Bayes Classifier (IV)
The prior P(c) — how often does this class occur? — can be estimated by just counting relative frequencies in a corpus.
The full likelihood P(x1,…,xn | c) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples were available.
Multinomial Naive Bayes Independence
Assumptions
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature
probabilities P(xi|cj) are independent given the class c.
Multinomial Naive Bayes Classifier
Applying Multinomial Naive Bayes Classifiers to Text
Classification
positions ← all word positions in test document
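The equations on these classifier slides were images and did not survive extraction; the standard multinomial Naive Bayes formulation they describe is:

c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(x_1, …, x_n | c) P(c)

c_NB = argmax_{c ∈ C} P(c) ∏_{i ∈ positions} P(w_i | c)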
Problems with multiplying lots of probs
There's a problem with this:
Multiplying lots of probabilities can result in floating-point underflow!
.0006 * .0007 * .0009 * .01 * .5 * .000008….
Idea: Use logs, because log(ab) = log(a) + log(b)
We'll sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space
Instead of maximizing a product of probabilities, we maximize the equivalent sum of log probabilities.
Notes:
1) Taking log doesn't change the ranking of classes!
The class with highest probability also has highest log probability!
2) It's a linear model:
Just a max of a sum of weights: a linear function of the inputs
So naive Bayes is a linear classifier
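A minimal sketch of log-space scoring in Python; the function name and the dictionary format for the log prior and log likelihoods are my own assumptions, not from the slides:

import math

def nb_classify(doc_tokens, logprior, loglikelihood, classes):
    """Score a tokenized document in log space: log P(c) plus the sum of log P(w | c)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = logprior[c]                      # log P(c)
        for w in doc_tokens:
            if (w, c) in loglikelihood:          # unknown words are simply skipped
                score += loglikelihood[(w, c)]   # log P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class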
Text Classification and Naive Bayes
The Naive Bayes Classifier
Text Classification and Naive Bayes
Naive Bayes: Learning
Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data
Sec.13.3
$\hat{P}(c_j) = \frac{N_{c_j}}{N_{total}}$
Parameter estimation
Create mega-document for topic j by concatenating all
docs in this topic
◦ Use frequency of w in mega-document
◦ P(wi | cj) = fraction of times word wi appears among all words in documents of topic cj
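Written out (the slide's equation was an image), the maximum-likelihood estimate described above is:

$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$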
Problem with Maximum Likelihood
What if we have seen no training documents that contain the word fantastic and are classified as positive (thumbs-up)?
Zero probabilities cannot be conditioned away, no matter the other
evidence!
Sec.13.3
Laplace (add-1) smoothing for Naïve Bayes
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  ◦ For each cj in C do
    docsj ← all docs with class = cj
• Calculate P(wk | cj) terms
  ◦ Textj ← single doc containing all docsj
  ◦ For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
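A minimal training sketch in Python following this pseudocode, with the add-1 (Laplace) smoothing from the previous slide. The function name and the input format (token lists plus a parallel label list) are assumptions for illustration; the output pairs with the log-space scoring sketch shown earlier:

import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns log priors, add-1 smoothed log likelihoods, and the vocabulary."""
    vocab = {w for doc in docs for w in doc}
    classes = set(labels)
    logprior, loglikelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        logprior[c] = math.log(len(class_docs) / len(docs))      # P(c) = N_c / N_total
        big_doc = Counter(w for d in class_docs for w in d)      # counts in the "mega-document"
        denom = sum(big_doc.values()) + len(vocab)               # add |V| for add-1 smoothing
        for w in vocab:
            loglikelihood[(w, c)] = math.log((big_doc[w] + 1) / denom)
    return logprior, loglikelihood, vocab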
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocabulary?
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all!
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally helpful!
Stop words
Some systems ignore stop words
◦ Stop words: very frequent words like the and a.
◦ Sort the vocabulary by word frequency in training set
◦ Call the top 10 or 50 words the stopword list.
◦ Remove all stop words from both training and test sets
◦ As if they were never there!
But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't
use stopword lists
Text Classification and Naive Bayes
Naive Bayes: Learning
Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!
A worked sentiment example with add-1 smoothing
1. Prior from training:
P(-) = 3/5
P(+) = 2/5
2. Drop "with" (it does not appear in the training vocabulary)
3. Likelihoods from training:
4. Scoring the test set:
$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\left(\sum_{w \in V} count(w, c)\right) + |V|}$

$\hat{P}(c_j) = \frac{N_{c_j}}{N_{total}}$
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence seems to
be more important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Remove duplicates in each doc:
  ◦ For each word type w in docj
  ◦ Retain only a single instance of w
• Calculate P(cj) terms
  ◦ For each cj in C do
    docsj ← all docs with class = cj
• Calculate P(wk | cj) terms
  ◦ Textj ← single doc containing all docsj
  ◦ For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
Binary Multinomial Naive Bayes
on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:
Binary multinomial naive Bayes (worked example)
Counts can still be 2! Binarization is within-doc!
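A small sketch of within-document binarization (the function name and toy documents are illustrative):

from collections import Counter

def binarize(doc_tokens):
    """Binary NB: clip counts at 1 within each document (keep one instance per word type)."""
    return list(set(doc_tokens))

# Corpus-level counts can still exceed 1, because binarization is per document:
docs = [["great", "great", "acting"], ["great", "plot"]]
corpus_counts = Counter(w for d in docs for w in binarize(d))
# corpus_counts["great"] == 2, since "great" occurs in two different documents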
Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes
Text Classification and Naive Bayes
More on Sentiment Classification
Sentiment Classification: Dealing with Negation
I really like this movie
I really don't like this movie
Negation changes the meaning of "like" to negative.
Negation can also change negative to positive-ish
◦ Don't dismiss this film
◦ Doesn't let us get bored
Sentiment Classification: Dealing with Negation
Simple baseline method:
Add NOT_ to every word between negation and following punctuation:
didn’t like this movie , but I
didn’t NOT_like NOT_this NOT_movie but I
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In
Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using
Machine Learning Techniques. EMNLP-2002, 79—86.
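A rough sketch of this baseline in Python; the negation-word list and the punctuation pattern are assumptions for illustration, not the exact rules from Das & Chen or Pang et al.:

import re

NEGATIONS = {"not", "no", "never"}  # illustrative list

def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation word and the next punctuation mark."""
    out, in_scope = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            in_scope = False
            out.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            in_scope = True
            out.append(tok)
        elif in_scope:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
    return out

# mark_negation("didn't like this movie , but I".split())
# -> ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']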
Sentiment Classification: Lexicons
Sometimes we don't have enough labeled training data.
In that case, we can make use of pre-built word lists, called lexicons.
There are various publicly available lexicons.
MPQA Subjectivity Cues Lexicon
Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
6885 words from 8221 lemmas, annotated for intensity (strong/weak)
◦ 2718 positive
◦ 4912 negative
+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
− : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
The General Inquirer
◦ Home page: http://www.wjh.harvard.edu/~inquirer
◦ List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
◦ Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories:
◦ Positiv (1915 words) and Negativ (2291 words)
◦ Strong vs Weak, Active vs Passive, Overstated versus Understated
◦ Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
Free for Research Use
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General
Inquirer: A Computer Approach to Content Analysis. MIT Press
Using Lexicons in Sentiment Classification
Add a feature that gets a count whenever a word
from the lexicon occurs
◦ E.g., a feature called "this word occurs in the positive
lexicon" or "this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful,
wonderful) or negative words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of the
test set, dense lexicon features can help
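A minimal sketch of the two dense lexicon-count features described above; the word lists here are stand-ins for entries from a real lexicon such as MPQA:

POSITIVE_LEXICON = {"good", "great", "beautiful", "wonderful"}   # illustrative entries
NEGATIVE_LEXICON = {"awful", "bad", "catastrophe", "hate"}

def lexicon_features(tokens):
    """Two features: counts of tokens found in the positive / negative lexicons."""
    return {
        "count_positive_lexicon": sum(t.lower() in POSITIVE_LEXICON for t in tokens),
        "count_negative_lexicon": sum(t.lower() in NEGATIVE_LEXICON for t in tokens),
    }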
Naive Bayes in Other tasks: Spam Filtering
SpamAssassin Features:
◦ Mentions millions of dollars ($NN,NNN,NNN.NN)
◦ From: starts with many numbers
◦ Subject is all capitals
◦ HTML has a low ratio of text to image area
◦ "One hundred percent guaranteed"
◦ Claims you can be removed from the list
Naive Bayes in Language ID
Determining what language a piece of text is written in.
Features based on character n-grams do very well
Important to train on lots of varieties of each language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
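A quick sketch of the character n-gram features mentioned above (trigrams here; the padding choice is an assumption):

from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, a strong feature set for language identification."""
    padded = f"  {text.lower()}  "                 # light padding so word edges form n-grams
    return Counter(padded[i:i+n] for i in range(len(padded) - n + 1))

# char_ngrams("isto é português?")  -> Counter of character trigrams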
Summary: Naive Bayes is Not So Naive
Very Fast, low storage requirements
Works well with very small amounts of training data
Robust to Irrelevant Features
Irrelevant Features cancel each other without affecting results
Very good in domains with many equally important features
Decision Trees suffer from fragmentation in such cases – especially if little data
Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
A good dependable baseline for text classification
◦ But we will see other classifiers that give better accuracy
Slide from Chris Manning
Text Classification and Naive Bayes
More on Sentiment Classification
Text Classification and Naïve Bayes
Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
c = China
X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
• URL, email address, dictionaries, network features
• But if, as in the previous slides
• We use only word features
• we use all of the words in the text (not a subset)
• Then
• Naïve Bayes has an important similarity to language modeling.
Each class = a unigram language model
• Assigning each word: P(word | c)
• Assigning each sentence: P(s | c) = Π P(word | c)
Class pos unigram model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1, …
Sentence s: I love this fun film
P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.0000005
Sec.13.2.1
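A quick numeric check of the sentence probability above, using the toy "pos" unigram model from this slide:

probs_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}  # toy pos unigram LM
p = 1.0
for w in "I love this fun film".split():
    p *= probs_pos[w]          # P(s | pos) = product of the per-word probabilities
# p is 0.1 * 0.1 * 0.01 * 0.05 * 0.1, i.e. about 5e-07 = 0.0000005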
Naïve Bayes as a Language Model
• Which class assigns the higher probability to s?
Model pos: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Model neg: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1
Sentence s: I love this fun film
P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.0000005
P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 0.000000001
P(s | pos) > P(s | neg)
Sec.13.2.1
Text Classification and Naïve Bayes
Naïve Bayes: Relationship to Language Modeling
Text Classification and Naive Bayes
Precision, Recall, and F1
Evaluating Classifiers: How well does our classifier work?
Let's first address binary classifiers:
• Is this email spam?
spam (+) or not spam (-)
• Is this post about Delicious Pie Company?
about Del. Pie Co (+) or not about Del. Pie Co(-)
We'll need to know
1. What did our classifier say about each email or post?
2. What should our classifier have said, i.e., the correct
answer, usually as defined by humans ("gold label")
First step in evaluation: The confusion matrix
Accuracy on the confusion matrix
Why don't we use accuracy?
Accuracy doesn't work well when we're dealing with
uncommon or imbalanced classes
Suppose we look at 1,000,000 social media posts to find
Delicious Pie-lovers (or haters)
• 100 of them talk about our pie
• 999,900 are posts about something unrelated
Imagine the following simple classifier
Every post is "not about pie"
Accuracy re: pie posts (100 posts are about pie; 999,900 aren't)
Why don't we use accuracy?
Accuracy of our "nothing is pie" classifier
999,900 true negatives and 100 false negatives
Accuracy is 999,900/1,000,000 = 99.99%!
But useless at finding pie-lovers (or haters)!!
Which was our goal!
Accuracy doesn't work well for unbalanced classes
Most tweets are not about pie!
Instead of accuracy we use precision and recall
Precision: % of selected items that are correct
Recall: % of correct items that are selected
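A minimal sketch of these two definitions in Python (the function name is illustrative; 0/0 cases, which are undefined, are returned as 0.0 here):

def precision_recall(tp, fp, fn):
    """Precision: fraction of selected (predicted-positive) items that are correct.
    Recall: fraction of truly positive items that were selected."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# The "nothing is pie" classifier selects nothing: tp=0, fp=0, fn=100
# precision_recall(0, 0, 100) -> (0.0, 0.0), correctly flagging it as useless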
Precision/Recall aren't fooled by the "just call everything negative" classifier!
Stupid classifier: Just say no: every tweet is "not about pie"
• 100 tweets talk about pie, 999,900 tweets don't
• Accuracy = 999,900/1,000,000 = 99.99%
But the Recall and Precision for this classifier are terrible:
A combined measure: F1
F1 is a combination of precision and recall.
F1 is a special case of the general "F-measure"
F-measure is the (weighted) harmonic mean of precision and recall
F1 is a special case of F-measure with β=1, α=½
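Written out (the slide equations were images), the standard F-measure definitions are:

$F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}$

$F_1 = \frac{2 P R}{P + R} \quad (\beta = 1,\ \alpha = \tfrac{1}{2})$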
Suppose we have more than 2 classes?
Lots of text classification tasks have more than two classes.
◦ Sentiment analysis (positive, negative, neutral), named entities (person, location, organization)
We can define precision and recall for multiple classes like this 3-way
email task:
How to combine P/R values for different classes:
Microaveraging vs Macroaveraging
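A sketch of the two ways of combining per-class results; the input format (per-class tp/fp/fn counts) and the function name are assumptions for illustration:

def macro_micro(per_class_counts):
    """per_class_counts: {class: (tp, fp, fn)}.
    Macroaveraging: average the per-class precisions/recalls (every class weighted equally).
    Microaveraging: pool the counts first, then compute precision/recall once."""
    precisions, recalls = [], []
    tp_sum = fp_sum = fn_sum = 0
    for tp, fp, fn in per_class_counts.values():
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro_p = sum(precisions) / len(precisions)
    macro_r = sum(recalls) / len(recalls)
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    return macro_p, macro_r, micro_p, micro_r

Because the counts are pooled, the microaverage is dominated by the most frequent class, while the macroaverage weights every class equally.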
Text Classification and Naive Bayes
Precision, Recall, and F1
Text Classification and Naive Bayes
Avoiding Harms in Classification
Avoiding Harms in Classification
Harms of classification
Classifiers, like any NLP algorithm, can cause harms
This is true for any classifier, whether Naive Bayes or
other algorithms
Representational Harms
• Harms caused by a system that demeans a social group
• Such as by perpetuating negative stereotypes about them.
• Kiritchenko and Mohammad 2018 study
• Examined 200 sentiment analysis systems on pairs of sentences
• Identical except for names:
• common African American (Shaniqua) or European American (Stephanie).
• Like "I talked to Shaniqua yesterday" vs "I talked to Stephanie yesterday"
• Result: systems assigned lower sentiment and more negative
emotion to sentences with African American names
• Downstream harm:
• Perpetuates stereotypes about African Americans
• African Americans treated differently by NLP tools like sentiment (widely
used in marketing research, mental health studies, etc.)
Harms of Censorship
• Toxicity detection is the text classification task of detecting hate speech,
abuse, harassment, or other kinds of toxic language.
• Widely used in online content moderation
• Toxicity classifiers incorrectly flag non-toxic sentences that simply mention
minority identities (like the words "blind" or "gay")
• women (Park et al., 2018),
• disabled people (Hutchinson et al., 2020)
• gay people (Dixon et al., 2018; Oliva et al., 2021)
• Downstream harms:
• Censorship of speech by disabled people and other groups
• Speech by these groups becomes less visible online
• Writers might be nudged by these algorithms to avoid these words
Performance Disparities
1. Text classifiers perform worse on many languages
of the world due to lack of data or labels
2. Text classifiers perform worse on varieties of
even high-resource languages like English
• Example task: language identification, a first step in
NLP pipeline ("Is this post in English or not?")
• English language detection performance worse for
writers who are African American (Blodgett and
O'Connor 2017) or from India (Jurgens et al., 2017)
Harms in text classification
• Causes:
• Issues in the data; NLP systems amplify biases in training data
• Problems in the labels
• Problems in the algorithms (like what the model is trained to
optimize)
• Prevalence: The same problems occur throughout NLP
(including large language models)
• Solutions: There are no general mitigations or solutions
• But harm mitigation is an active area of research
• And there are standard benchmarks and tools that we can use for
measuring some of the harms
Text Classification and Naive Bayes
Avoiding Harms in Classification
Editor's Notes

  • #1: In this lecture we define the Naive Bayes classifier, a basic text classifier that will allow us to introduce many of the fundamental issues in text classification.
  • #11: In this lecture we define the Naive Bayes classifier, a basic text classifier that will allow us to introduce many of the fundamental issues in text classification.
  • #24: We've now seen the basic principles of naïve bayes classification!
  • #34: In this lecture we'll do a worked example of naïve bayes sentiment analysis, and also introduce the binary multinominal naïve bayes algorithm.
  • #62: In this lecture we introduce the concepts of precision and recall, and the F1 metric that combines them. These concepts are central to the evaluation of classifiers, and we'll use them throughout NLP, not just for Naïve Bayes but also for logistic regression and neural models.
  • #63: Let's first consider binary classifiers. So that might include a simple binary decision about email (spam or not-spam). Or imagine that we are the proprietors of the Delicious Pie Company and we want to find out what people are saying about our pies on social media. We want to know if a particular social media post is talking about our pies (positive) or isn't (negative). To evaluate such a binary classifier, we'll need to know two things. What our classifier said about each email or post, and what it should have said, i.e. the correct answer, usually as defined by human labelers.
  • #64: The first step is the confusion matrix, a table for visualizing how an algorithm performs with respect to the human gold labels. We use two dimensions (system output and gold labels), and each cell labels a set of possible outcomes. In the pie detection case, for example, true positives are posts that are indeed about Delicious Pie (indicated by human-created gold labels) that our system correctly said were about pie. False negatives are posts that are indeed about pie but our system incorrectly labeled as not about pie. False positives are posts that aren't about pie but our system incorrectly said they were. And true negatives are non-pie posts that our system correctly said were not about pie.
  • #65: Here is the equation for accuracy: what percentage of all the observations (for the spam or pie examples that means all emails or tweets) our system labeled correctly. Although accuracy might seem a natural metric, we generally don't use it for text classification tasks.
  • #66: Why don't we use accuracy? …Accuracy doesn't work well when we're dealing with uncommon or imbalanced classes. Suppose we look at a million social media posts to find Delicious Pie-lovers. Now most posts on the web are not about our pie, so let's imagine that only 100 posts are about our pie, and 999,900 are about something else. Now imagine we have the following really simple classifier: Every post is "not about pie"
  • #67: Let's see what happens.
  • #68: But this fabulous ‘no pie’ classifier would be completely useless, since it wouldn’t find a single one of the customer comments we are looking for. In other words, accuracy is not a good metric when the goal is to discover something that is rare, or at least not completely balanced in frequency, which is a very common situation in the world.
  • #69: Precision asks: out of the things the system selected (the set of emails or tweets the system claimed were positive, i.e. spam or pie-related), how many did it get right? That is, how many were true positives, out of everything selected (true positives + false positives). Recall asks: out of all the correct items that should have been positive, what percentage did the system select? So out of all the things that are gold positive, how many did the system find as true positives? Precision is about how much garbage we included in our findings; recall is more about making sure we didn't miss any treasure.
  • #70: Recall and Precision will correctly evaluate our stupid "just say no" classifier as a bad classifier. The recall will be 0, since we returned no true positives out of the 100 true pie tweets (0 + 100). Precision is similarly 0, or in fact undefined, since both the numerator and denominator are 0. The metrics correctly assign bad scores to our useless classifier. [Note that this is a particularly egregious example, but in normal situations there is typically a trade-off between precision and recall!] [To get high precision, a system should be very reluctant to guess – but then it may miss some things and have poor recall.] [To get high recall, a system should be very willing to guess – but then it may return some junk and have poor precision.]
  • #71: A combined measure, F1: why this function instead of just the arithmetic or geometric mean? F1 turns out to be the harmonic mean between precision and recall.
  • #72: F1 is a special case of the F-measure: weighted harmonic mean of precision and recall. The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals. You can see here that F score is the harmonic mean, if we replace alpha with ½ we get 2/ (1/p + 1/r). The Harmonic mean of two values is closer to the minimum of the two numbers than arithmetic or geometric mean, so it weighs the lower of the two numbers more heavily. That is, if P and R are far apart, F will be nearer the lower value, which makes it a kind of conservative mean in this situation. Thus to do well on F1, you have to do well on BOTH P and R. Why the weights? in some applications you may care more about P or R. In practice we mainly use the balanced measure with beta = 1 and alpha =1/2
  • #73: Lots of classification tasks have more than two classes; for example, sentiment could be 3-way. Consider the confusion matrix for a hypothetical 3-way email categorization decision (urgent, normal, spam). Notice that the system mistakenly labeled one spam document as urgent. We can compute distinct precision and recall values for each class. For example, the precision of the urgent category is 8 (the true positive urgents) over the true positives + false positives (the 10 normal and that 1 spam). The result, however, is 3 separate precision values and 3 separate recall values!
  • #74: To derive a single metric from these 3 sets of precisions and recalls, we can combine them in two ways. In macroaveraging, we compute the performance for each class, and then average over classes. That is, we take the same precisions and recalls we computed on the previous slide, one set for Urgent, one set for Normal, and one set for Spam, and we just average those precisions to get a single macro average. In microaveraging, we instead first combine the decisions for all classes in a single confusion matrix, and then compute precision and recall from that table. Note that the microaverage is dominated by the more frequent class (in this case spam), since the counts are pooled and F1 is dominated by the count of true positives. The macroaverage better reflects the statistics of the smaller classes, and so is more appropriate when performance on all the classes is equally important.
  • #75: We'll be using precision, recall, and F1 constantly in evaluating NLP classifiers, since they are more appropriate than accuracy when classes are imbalanced.
  • #76: For any of our NLP tasks, we'll want to think about how to avoid doing harm. Let's discuss some important considerations for classification.
  • #78: One type of harm is called representational harm. These are harms that affect the representation of a group, for example demeaning them by perpetuating negative stereotypes about them. For example, Kiritchenko and Mohammad in 2018 looked at representational harms in 200 sentiment analysis systems. They did this by creating pairs of sentences that were identical except for a name. So a sentence might be "I talked to <person> yesterday". Then they created two versions of this sentence, one with a common African American name (like Shaniqua), and one with a common European American name (like Stephanie). Then they ran all the sentences in these sets of sentence pairs through 200 sentiment analysis systems. What they found is that systems assigned lower sentiment and more negative emotion to sentences with African American names. This not only perpetuates negative stereotypes about African Americans, but means that a standard NLP tool like sentiment analysis, widely used in marketing research, mental health studies, and so on, is likely to treat African Americans differently.
  • #81: What causes these harms in text classification, and what can we do about it? These harms are partly caused by biases in the data; classifiers are trained on text that contains bias already. And research shows that NLP systems have a tendency to amplify biases that already exist in the training data. The bias can come from the labels: labels come from humans too, and humans have biases. And these harms can arise from the algorithms themselves. These problems occur throughout NLP, even in the latest and most powerful neural models. Indeed, some research suggests that the bigger, more powerful NLP models may exhibit even greater examples of these harms. There are no general-purpose mitigation algorithms or solutions. But harm mitigation is an active area of research throughout NLP.
  • #82: Representational harms, censorship, and performance disparities, are just some of the harms that can be caused by text classification. It's important to examine our NLP tools for these harms and work to reduce them.