Naive Bayes and Sentiment Classification
The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the letters in dispute
1963: solved by Mosteller and Wallace using Bayesian methods
[Portraits: James Madison, Alexander Hamilton]
What is the subject of this medical article?
[Figure: a MEDLINE article to be assigned a category from the MeSH Subject Category Hierarchy]
◦ Antagonists and Inhibitors
◦ Blood Supply
◦ Chemistry
◦ Drug Therapy
◦ Embryology
◦ Epidemiology
◦ …
Positive or negative movie review?
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
Why sentiment analysis?
Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Scherer Typology of Affective States
Emotion: brief organically synchronized … evaluation of a major event
◦ angry, sad, joyful, fearful, ashamed, proud, elated
Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
◦ cheerful, gloomy, irritable, listless, depressed, buoyant
Interpersonal stances: affective stance toward another person in a specific interaction
◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
◦ liking, loving, hating, valuing, desiring
Personality traits: stable personality dispositions and typical behavior tendencies
◦ nervous, anxious, reckless, morose, hostile, jealous
Basic Sentiment Classification
Sentiment analysis is the detection of attitudes
Simple task we focus on in this chapter
◦ Is the attitude of this text positive or negative?
We return to affect classification in later chapters
Summary: Text Classification
Sentiment analysis
Spam detection
Authorship identification
Language Identification
Assigning subject categories, topics, or genres
…
Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “you have been selected”)
Accuracy can be high
◦ If rules carefully refined by expert
But building and maintaining these rules is expensive
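The spam rule above can be written directly as a predicate. This is only a minimal sketch: the function name `is_spam`, the `BLACKLIST` set, and the example addresses are hypothetical and not part of the slides.

```python
# Hypothetical blacklist; the address is a placeholder, not from the slides.
BLACKLIST = {"promo@example.com"}

def is_spam(sender: str, body: str) -> bool:
    """Hand-coded rule: black-list-address OR ("dollars" AND "you have been selected")."""
    text = body.lower()
    return sender.lower() in BLACKLIST or (
        "dollars" in text and "you have been selected" in text
    )

print(is_spam("promo@example.com", "hello"))
print(is_spam("a@example.org", "You have been selected to win 500 dollars"))
print(is_spam("a@example.org", "Lunch on Friday?"))
```

Every new spam pattern needs another hand-written clause, which is the maintenance cost the slide refers to.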
Classification Methods: Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ a training set of m hand-labeled documents (d1,c1),...,(dm,cm)
Output:
◦ a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
Any kind of classifier
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
Text Classification and Naive Bayes
Naive Bayes (I)
Naive Bayes Intuition
Simple (“naive”) classification method based on Bayes rule
Relies on very simple representation of document
◦ Bag of words
The Bag of Words Representation
[Figure: a document reduced to a table of word counts and passed to the classifier, γ(document) = c]
seen 2
sweet 1
whimsical 1
recommend 1
happy 1
... ...
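A bag of words is just a table of word counts with position thrown away. The sketch below builds one with `collections.Counter`; the example sentence and the simple lowercase regex tokenizer are assumptions made for illustration, not the document from the slide.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each word type."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical review text, chosen so that "seen" occurs twice.
bag = bag_of_words("Sweet and whimsical. I have seen it and seen it again")
print(bag["seen"], bag["sweet"], bag["whimsical"])
```

Word order is lost: any permutation of the same tokens produces the same bag.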
Formalizing the Naive Bayes Classifier
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
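Naive Bayes Classifier (I)
MAP is “maximum a posteriori” = most likely class
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(c \mid d)$$
Bayes rule:
$$= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$
Dropping the denominator (P(d) is the same for every class, so it does not affect the argmax):
$$= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)$$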
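Naive Bayes Classifier (II)
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)$$
Document d represented as features x1..xn:
$$= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
"Likelihood": P(x1, x2, …, xn | c)
"Prior": P(c)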
Naïve Bayes Classifier (IV)
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
P(c): How often does this class occur?
◦ We can just count the relative frequencies in a corpus
P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters
◦ Could only be estimated if a very, very large number of training examples was available.
Multinomial Naive Bayes Independence Assumptions
$$P(x_1, x_2, \ldots, x_n \mid c)$$
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature probabilities P(xi|cj) are independent given the class c.
$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$$
Multinomial Naive Bayes Classifier
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
Example
Let me explain a Multinomial Naïve Bayes classifier with an example in which we want to filter out spam messages. Initially, we consider eight normal messages and four spam messages.
[Figure: histogram of all the words that occur in the normal messages from family and friends]
The probability of each word given that we saw it in a normal message:
Probability (Dear|Normal) = 8/17 = 0.47
Probability (Friend|Normal) = 5/17 = 0.29
Probability (Lunch|Normal) = 3/17 = 0.18
Probability (Money|Normal) = 1/17 = 0.06
[Figure: histogram of all the words that occur in the spam messages]
The probability of each word given that we saw it in a spam message:
Probability (Dear|Spam) = 2/7 = 0.29
Probability (Friend|Spam) = 1/7 = 0.14
Probability (Lunch|Spam) = 0/7 = 0.00
Probability (Money|Spam) = 4/7 = 0.57
What is the probability of “Dear Friend” as a normal message?
What is the probability of “Dear Friend” as a spam message?
Problems with multiplying lots of probs
There's a problem with this:
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
Multiplying lots of probabilities can result in floating-point underflow!
Luckily, log(ab) = log(a) + log(b)
Let's sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space
Instead of this:
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
This:
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$
This is ok since log doesn't change the ranking of the classes (the class with the highest prob still has the highest log prob)
Model is now just max of sum of weights: a linear function of the inputs
So naive Bayes is a linear classifier
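The underflow problem is easy to reproduce. In this sketch the 200 per-word probabilities of 0.01 are made-up values, chosen only so that their product (1e-400) falls below the smallest positive double; the sum of their logs stays perfectly representable.

```python
import math

# Hypothetical per-word probabilities for a 200-word document.
probs = [0.01] * 200

# Multiplying directly underflows to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and usable for ranking classes.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Once every class scores 0.0 the argmax is meaningless, whereas the log scores can still be compared.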
Naive Bayes: Learning
Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data
$$\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}$$
$$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
◦ P̂(cj): the probability of a class, e.g. P(Normal) = 8/12
◦ P̂(wi | cj): the probability of a word given a class, e.g. a class ∈ {Positive, Negative}
Parameter estimation
$$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
= fraction of times word wi appears among all words in documents of topic cj
Create mega-document for topic j by concatenating all docs in this topic
◦ Use frequency of w in mega-document
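The mega-document idea can be sketched in a few lines: concatenate every document of one class, count the tokens, and divide by the total. The two "normal" messages below are hypothetical and the function name `mle_likelihoods` is made up for this illustration.

```python
from collections import Counter

def mle_likelihoods(docs_in_class):
    """Maximum likelihood P(w | c) from the concatenation of all docs in class c."""
    mega_document = [w for doc in docs_in_class for w in doc.split()]
    counts = Counter(mega_document)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# Hypothetical training messages for a single class.
normal_docs = ["dear friend", "dear lunch"]
likelihoods = mle_likelihoods(normal_docs)
print(likelihoods)
```

A word that never occurs in the class is simply absent from the table, which is the zero-probability issue that smoothing addresses.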
Documents: 12 (Normal = 8, Spam = 4)
P(Normal) = 8/12
P(Spam) = 4/12
Word counts per class:
Word     Normal (17)   Spam (7)
Dear     8             2
Friend   5             1
Lunch    3             0
Money    1             4
Score of “Dear Friend” under each class (proportional to the posterior, since the shared denominator P(d) is dropped):
P(Normal | “Dear Friend”) ∝ (8/17) * (5/17) * (8/12) ≈ 0.092
P(Spam | “Dear Friend”) ∝ (2/7) * (1/7) * (4/12) ≈ 0.014
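The “Dear Friend” scores can be checked with exact fractions. The counts and priors below are the ones from the spam example; the variable and function names are illustrative only, and the estimates are unsmoothed maximum likelihood.

```python
from fractions import Fraction

# Word counts and priors from the spam example (17 normal tokens, 7 spam tokens).
counts = {
    "normal": {"dear": 8, "friend": 5, "lunch": 3, "money": 1},
    "spam": {"dear": 2, "friend": 1, "lunch": 0, "money": 4},
}
priors = {"normal": Fraction(8, 12), "spam": Fraction(4, 12)}

def score(words, label):
    """Prior times the product of unsmoothed word likelihoods."""
    total = sum(counts[label].values())
    result = priors[label]
    for w in words:
        result *= Fraction(counts[label][w], total)
    return result

scores = {label: score(["dear", "friend"], label) for label in counts}
best = max(scores, key=scores.get)
print({label: float(s) for label, s in scores.items()}, best)
```

These are unnormalized scores; dividing each by their sum would give the posterior probabilities.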
Score of “Lunch Money” under each class:
P(Normal | “Lunch Money”) ∝ (3/17) * (1/17) * (8/12) ≈ 0.0069
P(Spam | “Lunch Money”) ∝ (0/7) * (4/7) * (4/12) = 0
Problem with Maximum Likelihood
What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{\text{count}(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} \text{count}(w, \text{positive})} = 0$$
Zero probabilities cannot be conditioned away, no matter the other evidence!
$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$
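Laplace (add-1) smoothing for Naïve Bayes
Maximum likelihood estimate:
$$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c)}{\sum_{w \in V} \text{count}(w, c)}$$
Add-1 estimate:
$$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|}$$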
Applying add-1 smoothing to “Lunch Money” (unique words |V| = 4; Normal has 17 word occurrences, Spam has 7):
P(Normal | “Lunch Money”) ∝ ((3+1)/(17+4)) * ((1+1)/(17+4)) * (8/12) = (4/21) * (2/21) * (8/12) ≈ 0.012
P(Spam | “Lunch Money”) ∝ ((0+1)/(7+4)) * ((4+1)/(7+4)) * (4/12) = (1/11) * (5/11) * (4/12) ≈ 0.014
With smoothing, Spam is no longer forced to zero and now gets the higher score.
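The smoothed “Lunch Money” scores can be reproduced exactly. This sketch reuses the counts from the spam example and applies add-1 smoothing with |V| = 4; the names `smoothed` and `score` are illustrative, not from the slides.

```python
from fractions import Fraction

counts = {
    "normal": {"dear": 8, "friend": 5, "lunch": 3, "money": 1},
    "spam": {"dear": 2, "friend": 1, "lunch": 0, "money": 4},
}
priors = {"normal": Fraction(8, 12), "spam": Fraction(4, 12)}
VOCAB = {"dear", "friend", "lunch", "money"}

def smoothed(word, label, alpha=1):
    """Add-alpha estimate: (count + alpha) / (total tokens + alpha * |V|)."""
    total = sum(counts[label].values())
    return Fraction(counts[label][word] + alpha, total + alpha * len(VOCAB))

def score(words, label):
    result = priors[label]
    for w in words:
        result *= smoothed(w, label)
    return result

scores = {label: score(["lunch", "money"], label) for label in counts}
best = max(scores, key=scores.get)
print({label: round(float(s), 3) for label, s in scores.items()}, best)
```

The exact values are 16/1323 for Normal and 5/363 for Spam, so the zero count for "Lunch" no longer vetoes the Spam class.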
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocab?
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all.
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is not generally a useful thing to know!
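Ignoring unknown words amounts to filtering the test document against the training vocabulary before scoring. A minimal sketch, with a hypothetical vocabulary and test message:

```python
def drop_unknown(tokens, vocabulary):
    """Keep only the tokens that were seen in training; unknown words contribute nothing."""
    return [w for w in tokens if w in vocabulary]

# Hypothetical vocabulary and test message.
vocabulary = {"dear", "friend", "lunch", "money"}
kept = drop_unknown("dear friend send bitcoin money".split(), vocabulary)
print(kept)
```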
Stop words
Some systems ignore another class of words:
Stop words: very frequent words like "the" and "a".
◦ Sort the whole vocabulary by frequency in the training set and call the top 10 or 50 words the stopword list.
◦ Remove all stop words from the training and test sets, as if they were never there.
But in most text classification applications, removing stop words doesn't help, so it's more common to not use stopword lists and to use all the words in naive Bayes.
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
◦ For each cj in C do
   docsj ← all docs with class = cj
$$P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$$
• Calculate P(wk | cj) terms
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
   nk ← # of occurrences of wk in Textj
$$P(w_k \mid c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$$
   (n is the total number of word tokens in Textj)
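The learning steps can be sketched as a pair of small functions: one that estimates log priors and add-α log likelihoods, and one that scores a document in log space while skipping unknown words. The function names and the three-message training set are hypothetical, chosen only to make the sketch runnable.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns (log_prior, log_likelihood, vocab)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)  # Text_j: all docs of class j pooled
    log_prior, log_likelihood = {}, {}
    for c, n_docs in class_docs.items():
        log_prior[c] = math.log(n_docs / len(docs))
        n = sum(word_counts[c].values())   # total tokens in Text_j
        denom = n + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def classify_nb(tokens, log_prior, log_likelihood, vocab):
    """Return the class with the highest log score; unknown words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set.
train = [
    ("dear friend lunch".split(), "normal"),
    ("dear friend".split(), "normal"),
    ("money money dear".split(), "spam"),
]
model = train_nb(train)
print(classify_nb(["money"], *model))
print(classify_nb(["dear", "friend", "hello"], *model))
```

With α = 1 this is the add-1 smoothing shown earlier; other values of α give add-α smoothing with the same code.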
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!
A worked sentiment example
[Figure: table of training words: Just, Plain, Boar, Entire, Predict, And (2), Lack, Energy, No, Surprise, Very, Few, lough]
A worked sentiment example
Prior from training:
P(-) = 3/5
P(+) = 2/5
Drop "with"
Likelihoods from training:
P(Predict|+) = ?    P(Predict|-) = ?
P(No|+) = ?    P(No|-) = ?
P(Fun|+) = ?    P(Fun|-) = ?
Scoring the test set:
P(-) * P(“Predict No Fun” | -)
P(+) * P(“Predict No Fun” | +)
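The slide leaves the likelihoods as exercises, and the training sentences themselves did not survive extraction. The sketch below ASSUMES the five-sentence training set and the test sentence "predictable with no fun" from the standard textbook version of this example, which matches the priors 3/5 and 2/5 and the dropped word "with" shown above; treat the resulting numbers as illustrative for that assumed data, using add-1 smoothing.

```python
from collections import Counter
from fractions import Fraction

# ASSUMPTION: training and test sentences are not given in the slide text.
training = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test_doc = "predictable with no fun"

vocab = {w for _, text in training for w in text.split()}
word_counts = {"+": Counter(), "-": Counter()}
doc_counts = Counter()
for label, text in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def likelihood(word, label):
    """Add-1 smoothed P(word | label)."""
    n = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, n + len(vocab))

def score(label):
    """Prior times smoothed likelihoods of the known test words."""
    result = Fraction(doc_counts[label], len(training))
    for w in test_doc.split():
        if w in vocab:  # "with" never occurs in training, so it is dropped
            result *= likelihood(w, label)
    return result

print(float(score("-")), float(score("+")))
```

Under these assumptions the negative class scores higher, so the test sentence is labeled negative.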
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence is more important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
◦ For each cj in C do
   docsj ← all docs with class = cj
$$P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$$
• Calculate P(wk | cj) terms
◦ Remove duplicates in each doc:
   For each word type w in docj, retain only a single instance of w
◦ Textj ← single doc containing all docsj
◦ For each word wk in Vocabulary
   nk ← # of occurrences of wk in Textj
$$P(w_k \mid c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$$
   (n is the total number of word tokens in Textj)
Binary Multinomial Naive Bayes on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)$$
Binary multinomial naive Bayes
Counts can still be 2! Binarization is within-doc!
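Binarization only removes duplicates inside each document, so a class-level count can still exceed 1 when a word appears in several documents. A minimal sketch with two hypothetical reviews:

```python
from collections import Counter

def binarize(tokens):
    """Keep one instance of each word type, preserving first-seen order."""
    return list(dict.fromkeys(tokens))

# Hypothetical documents from a single class.
docs = [
    "great great great film".split(),
    "great plot great cast".split(),
]

binary_counts = Counter()
for doc in docs:
    binary_counts.update(binarize(doc))

print(binarize(docs[0]))
print(binary_counts["great"])
```

Here "great" occurs five times in the raw text but contributes a count of 2, one per document.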
Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
[Figure: the class c = + generates each word of the document]
X1 = I, X2 = love, X3 = this, X4 = fun, X5 = film
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature
◦ URL, email address, dictionaries, network features
But if, as in the previous slides
◦ we use only word features
◦ we use all of the words in the text (not a subset)
Then
◦ Naive Bayes has an important similarity to language modeling.
Each class = a unigram language model
Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = ∏ P(word | c)
Class pos:
0.1 I
0.1 love
0.01 this
0.05 fun
0.1 film
Sentence:        I     love   this   fun    film
P(word | pos):   0.1   0.1    0.01   0.05   0.1
P(s | pos) = 0.0000005
Naïve Bayes as a Language Model
Which class assigns the higher probability to s = "I love this fun film"?
Word    Model pos   Model neg
I       0.1         0.2
love    0.1         0.001
this    0.01        0.01
fun     0.05        0.005
film    0.1         0.1
P(s|pos) > P(s|neg)
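The comparison can be verified by multiplying the per-word probabilities under each class model. The two tables below are the pos and neg unigram models from the slide; the function name `sentence_prob` is illustrative.

```python
import math

model_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
model_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def sentence_prob(sentence, model):
    """Unigram language model: product of the per-word probabilities."""
    return math.prod(model[w] for w in sentence.split())

s = "I love this fun film"
p_pos = sentence_prob(s, model_pos)
p_neg = sentence_prob(s, model_neg)
print(p_pos, p_neg, p_pos > p_neg)
```

P(s|pos) comes out around 5e-7 and P(s|neg) around 1e-9, so the positive model assigns the higher probability. Class priors are left out here, as on the slide.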
Precision, Recall, and F measure
Evaluation
Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about your pies
So you build a "Delicious Pie" tweet detector
◦ Positive class: tweets about Delicious Pie Co
◦ Negative class: all other tweets
The 2-by-2 confusion matrix
                  gold positive   gold negative
system positive   TP = 10         FP = 2
system negative   FN = 3          TN = 34
Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
◦ 100 of them talked about Delicious Pie Co.
◦ 999,900 talked about something else
We could build a dumb classifier that just labels every tweet "not about pie"
◦ It would get 99.99% accuracy!!! Wow!!!!
◦ But useless! Doesn't return the comments we are looking for!
◦ That's why we use precision and recall instead
Evaluation: Precision
% of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels)
Precision = TP / (TP + FP)
Evaluation: Recall
% of items actually present in the input that were correctly identified by the system.
Recall = TP / (TP + FN)
Why Precision and recall
Our dumb pie-classifier
◦ Just label nothing as "about pie"
Accuracy = 99.99%, but Recall = 0
◦ (it doesn't get any of the 100 Pie tweets)
Precision and recall, unlike accuracy, emphasize true
positives:
◦ finding the things that we are supposed to be looking for.
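The dumb classifier's numbers follow directly from its confusion-matrix counts: it labels nothing as positive, so TP = 0, FP = 0, FN = 100, and TN = 999,900. The `evaluate` helper below is a sketch written for this illustration; it returns `None` for a metric whose denominator is zero.

```python
def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall); a metric is None when undefined."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall

# Label every one of the 1,000,000 tweets "not about pie".
accuracy, precision, recall = evaluate(tp=0, fp=0, fn=100, tn=999_900)
print(accuracy, precision, recall)
```

Precision is undefined here because the system never predicts the positive class, while recall is exactly 0.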
A combined measure: F
F measure: a single number that combines P and R:
$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$
We almost always use balanced F1 (i.e., β = 1):
$$F_1 = \frac{2PR}{P + R}$$
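As a quick check of the F measure, the sketch below computes precision, recall, and F1 from the confusion-matrix counts given earlier (TP = 10, FP = 2, FN = 3). The function name `f_beta` and the zero-guard are choices made for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F measure: (beta^2 + 1) * P * R / (beta^2 * P + R); 0.0 if both are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

tp, fp, fn = 10, 2, 3
precision = tp / (tp + fp)   # 10/12
recall = tp / (tp + fn)      # 10/13
f1 = f_beta(precision, recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With β = 1 the formula reduces to 2TP / (2TP + FP + FN) = 20/25 = 0.8. Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.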
More Related Content

PDF
Gpt models
PDF
자바를 잡아주는 GURU가 있다구!? - 우여명 (아이스크림에듀) :: AWS Community Day 2020
PDF
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
PDF
Hacking Android OS
PDF
word2vec - From theory to practice
PDF
Skip gram and cbow
PDF
Flutter A year of creativity!
PDF
Animations in Flutter
Gpt models
자바를 잡아주는 GURU가 있다구!? - 우여명 (아이스크림에듀) :: AWS Community Day 2020
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
Hacking Android OS
word2vec - From theory to practice
Skip gram and cbow
Flutter A year of creativity!
Animations in Flutter

What's hot (20)

PDF
AI 10 | Naive Bayes Classifier
PPTX
[Paper Reading] Attention is All You Need
PDF
Word2Vec
PPTX
Python Programming Essentials - M20 - Classes and Objects
PDF
Working in NLP in the Age of Large Language Models
PPTX
Deploying & Scaling your Odoo Server
PDF
BERT Finetuning Webinar Presentation
PDF
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
PDF
Improved Trainings of Wasserstein GANs (WGAN-GP)
PDF
Python Variable Types, List, Tuple, Dictionary
PDF
Hello Flutter
PPTX
What is word2vec?
PPT
Expert Systems & Prolog
PDF
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
PPTX
Tokenization using nlp | NLP Course
PDF
PRML 13.2.2: The Forward-Backward Algorithm
PDF
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
PPTX
Natural Language processing Parts of speech tagging, its classes, and how to ...
PPTX
BERT introduction
AI 10 | Naive Bayes Classifier
[Paper Reading] Attention is All You Need
Word2Vec
Python Programming Essentials - M20 - Classes and Objects
Working in NLP in the Age of Large Language Models
Deploying & Scaling your Odoo Server
BERT Finetuning Webinar Presentation
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Improved Trainings of Wasserstein GANs (WGAN-GP)
Python Variable Types, List, Tuple, Dictionary
Hello Flutter
What is word2vec?
Expert Systems & Prolog
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Tokenization using nlp | NLP Course
PRML 13.2.2: The Forward-Backward Algorithm
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Natural Language processing Parts of speech tagging, its classes, and how to ...
BERT introduction
Ad

Similar to Text Classification.pdf (7)

PPTX
Topic_5_NB_Sentiment_Classification_.pptx
PDF
Introduction to text classification using naive bayes
PPTX
Nave Bias algorithm in Nature language processing
DOCX
roman_numerals_buggypackage.bluej#BlueJ package filedepend.docx
PPTX
Probability Assignment Help
PDF
1. The Central Intelligence Agency has specialists who analyze the f.pdf
PDF
12.13.11 classwork tuesday
Topic_5_NB_Sentiment_Classification_.pptx
Introduction to text classification using naive bayes
Nave Bias algorithm in Nature language processing
roman_numerals_buggypackage.bluej#BlueJ package filedepend.docx
Probability Assignment Help
1. The Central Intelligence Agency has specialists who analyze the f.pdf
12.13.11 classwork tuesday
Ad

Recently uploaded (20)

PPTX
OOP with Java - Java Introduction (Basics)
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Sustainable Sites - Green Building Construction
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Construction Project Organization Group 2.pptx
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Lecture Notes Electrical Wiring System Components
PDF
PPT on Performance Review to get promotions
OOP with Java - Java Introduction (Basics)
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Internet of Things (IOT) - A guide to understanding
Sustainable Sites - Green Building Construction
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Mechanical Engineering MATERIALS Selection
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Construction Project Organization Group 2.pptx
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Lecture Notes Electrical Wiring System Components
PPT on Performance Review to get promotions

Text Classification.pdf

  • 3. Who wrote which Federalist papers? 1787-8: anonymous essays try to convince New York to ratify U.S Constitution: Jay, Madison, Hamilton. Authorship of 12 of the letters in dispute 1963: solved by Mosteller and Wallace using Bayesian methods James Madison Alexander Hamilton
  • 4. What is the subject of this medical article? Antogonists and Inhibitors Blood Supply Chemistry Drug Therapy Embryology Epidemiology … 4 MeSH Subject Category Hierarchy ? MEDLINE Article
  • 5. Positive or negative movie review? ...zany characters and richly applied satire, and some great plot twists It was pathetic. The worst part about it was the boxing scenes... ...awesome caramel sauce and sweet toasty almonds. I love this place! ...awful pizza and ridiculously overpriced... 5 + + − −
  • 6. Positive or negative movie review? ...zany characters and richly applied satire, and some great plot twists It was pathetic. The worst part about it was the boxing scenes... ...awesome caramel sauce and sweet toasty almonds. I love this place! ...awful pizza and ridiculously overpriced... 6 + + − −
  • 7. Why sentiment analysis? Movie: is this review positive or negative? Products: what do people think about the new iPhone? Public sentiment: how is consumer confidence? Politics: what do people think about this candidate or issue? Prediction: predict election outcomes or market trends from sentiment 7
  • 8. Scherer Typology of Affective States Emotion: brief organically synchronized … evaluation of a major event ◦ angry, sad, joyful, fearful, ashamed, proud, elated Mood: diffuse non-caused low-intensity long-duration change in subjective feeling ◦ cheerful, gloomy, irritable, listless, depressed, buoyant Interpersonal stances: affective stance toward another person in a specific interaction ◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons ◦ liking, loving, hating, valuing, desiring Personality traits: stable personality dispositions and typical behavior tendencies ◦ nervous, anxious, reckless, morose, hostile, jealous
  • 9. Scherer Typology of Affective States Emotion: brief organically synchronized … evaluation of a major event ◦ angry, sad, joyful, fearful, ashamed, proud, elated Mood: diffuse non-caused low-intensity long-duration change in subjective feeling ◦ cheerful, gloomy, irritable, listless, depressed, buoyant Interpersonal stances: affective stance toward another person in a specific interaction ◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons ◦ liking, loving, hating, valuing, desiring Personality traits: stable personality dispositions and typical behavior tendencies ◦ nervous, anxious, reckless, morose, hostile, jealous
  • 10. Basic Sentiment Classification Sentiment analysis is the detection of attitudes Simple task we focus on in this chapter ◦ Is the attitude of this text positive or negative? We return to affect classification in later chapters
  • 11. Summary: Text Classification Sentiment analysis Spam detection Authorship identification Language Identification Assigning subject categories, topics, or genres …
  • 12. Text Classification: definition Input: ◦ a document d ◦ a fixed set of classes C = {c1, c2,…, cJ} Output: a predicted class c  C
  • 13. Classification Methods: Hand-coded rules Rules based on combinations of words or other features ◦ spam: black-list-address OR (“dollars” AND “you have been selected”) Accuracy can be high ◦ If rules carefully refined by expert But building and maintaining these rules is expensive
  • 14. Classification Methods: Supervised Machine Learning Input: ◦ a document d ◦ a fixed set of classes C = {c1, c2,…, cJ} ◦ A training set of m hand-labeled documents (d1,c1),....,(dm,cm) Output: ◦ a learned classifier γ:d → c 14
  • 15. Classification Methods: Supervised Machine Learning Any kind of classifier ◦ Naïve Bayes ◦ Logistic regression ◦ Neural networks ◦ k-Nearest Neighbors ◦ …
  • 18. Naive Bayes Intuition Simple (“naive”) classification method based on Bayes rule Relies on very simple representation of document ◦ Bag of words
  • 19. The Bag of Words Representation 19
  • 20. The bag of words representation γ( )=c seen 2 sweet 1 whimsical 1 recommend 1 happy 1 ... ...
  • 23. Bayes’ Rule Applied to Documents and Classes •For a document d and a class c P(c | d) = P(d |c)P(c) P(d)
  • 24. Naive Bayes Classifier (I) cMAP = argmax cÎC P(c | d) = argmax cÎC P(d | c)P(c) P(d) = argmax cÎC P(d |c)P(c) MAP is “maximum a posteriori” = most likely class Bayes Rule Dropping the denominator
  • 25. Naive Bayes Classifier (II) cMAP = argmax cÎC P(d | c)P(c) Document d represented as features x1..xn = argmax cÎC P(x1, x2,… , xn | c)P(c) "Likelihood" "Prior"
  • 26. Naïve Bayes Classifier (IV) How often does this class occur? cMAP = argmax cÎC P(x1, x2,… , xn | c)P(c) O(|X|n•|C|) parameters We can just count the relative frequencies in a corpus Could only be estimated if a very, very large number of training examples was available.
  • 27. Multinomial Naive Bayes Independence Assumptions Bag of Words assumption: Assume position doesn’t matter Conditional Independence: Assume the feature probabilities P(xi|cj) are independent given the class c. P(x1, x2,… , xn |c) P(x1,… , xn |c)= P(x1 |c)·P(x2 |c)·P(x3 |c)·...·P(xn |c)
  • 28. Multinomial Naive Bayes Classifier cMAP = argmax cÎC P(x1, x2,… , xn | c)P(c) cNB = argmax cÎC P(cj ) P(x | c) xÎX Õ
  • 29. Applying Multinomial Naive Bayes Classifiers to Text Classification cNB = argmax cjÎC P(cj ) P(xi | cj ) iÎpositions Õ positions  all word positions in test document
  • 30. Example Let me explain a Multinomial Naïve Bayes Classifier where we want to filter out the spam messages. Initially, we consider eight normal messages and four spam messages.
  • 31. Histogram of all the words that occur in the normal messages from family and friends
  • 32. The probability of word dear given that we saw in normal message is- Probability (Dear|Normal) = Probability (Friend|Normal) = Probability (Lunch|Normal) = Probability (Money|Normal) =
  • 33. The probability of word dear given that we saw in normal message is- Probability (Dear|Normal) = 8 /17 = 0.47 Similarly, the probability of word Friend is- Probability (Friend/Normal) = 5/ 17 =0.29 Probability (Lunch/Normal) = 3/ 17 =0.18 Probability (Money/Normal) = 1/ 17 =0.06
  • 35. he probability of word dear given that we saw in spam message is- Probability (Dear|Spam) = Probability (Friend|Spam) = Probability (Lunch|Spam) = Probability (Money|Spam) =
  • 36. he probability of word dear given that we saw in spam message is- Probability (Dear|Spam) = 2 /7 = 0.29 Similarly, the probability of word Friend is- Probability (Friend|Spam) = 1/ 7 =0.14 Probability (Lunch|Spam) = 0/ 7 =0.00 Probability (Money|Spam) = 4/ 7 =0.57
  • 37. What is the probability of “Dear Friend” as normal message?
  • 38. What is the probability of “Dear Friend” as Spam message?
  • 39. Problems with multiplying lots of probs There's a problem with this: Multiplying lots of probabilities can result in floating-point underflow! Luckily, log(ab) = log(a) + log(b) Let's sum logs of probabilities instead of multiplying probabilities! cNB = argmax cjÎC P(cj ) P(xi | cj ) iÎpositions Õ
  • 40. We actually do everything in log space Instead of this: This: This is ok since log doesn't change the ranking of the classes (class with highest prob still has highest log prob) Model is now just max of sum of weights: a linear function of the inputs So naive bayes is a linear classifier cNB = argmax cjÎC P(cj ) P(xi | cj ) iÎpositions Õ
  • 43. Learning the Multinomial Naive Bayes Model First attempt: maximum likelihood estimates ◦ simply use the frequencies in the data Sec.13.3 P̂(wi | cj ) = count(wi,cj ) count(w,cj ) wÎV å P̂(cj ) = doccount(C = cj ) Ndoc
  • 44. Learning the Multinomial Naive Bayes Model First attempt: maximum likelihood estimates ◦ simply use the frequencies in the data Sec.13.3 P̂(wi | cj ) = count(wi,cj ) count(w,cj ) wÎV å P̂(cj ) = doccount(C = cj ) Ndoc Compute the probability of for a class C Compute the probability of a word given a class ∈{ Positive, Negative } P(Normal) = 8/12
  • 45. Parameter estimation Create mega-document for topic j by concatenating all docs in this topic ◦ Use frequency of w in mega-document fraction of times word wi appears among all words in documents of topic cj P̂(wi | cj ) = count(wi,cj ) count(w,cj ) wÎV å
  • 46. Doc 12, Normal 8 , Spam = 4 P ( Normal) = 8/12 P (Spam) = 4/12 Normal (17) Dear – 8 Friend – 5 Lunch – 3 Money – 1 Spam (7) Dear – 2 Friend – 1 Lunch – 0 Money – 4
  • 47. Probability of “Dear Friend” belongs to - P ( Normal| “Dear Friend”) = (8/17) * (5/17) * (8/12) P (Spam| “Dear Friend”) = (2/7) * (1/7) * (4/12) Normal Dear – 8 Friend – 5 Lunch – 3 Money – 1 Spam Dear – 2 Friend – 1 Lunch – 0 Money – 4
  • 48. Probability of “Lunch Money” belongs to - P ( Normal| “Lunch Money”) = (3/17) * (1/17) * (8/12) P (Spam| “Lunch Money”) = (0/7) * (4/7) * (4/12) = 0 Normal Dear – 8 Friend – 5 Lunch – 3 Money – 1 Spam Dear – 2 Friend – 1 Lunch – 0 Money – 4
  • 49. Problem with Maximum Likelihood What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)? Zero probabilities cannot be conditioned away, no matter the other evidence! P̂("fantastic" positive) = count("fantastic", positive) count(w,positive wÎV å ) = 0 cMAP = argmaxc P̂(c) P̂(xi | c) i Õ Sec.13.3
  • 50. Laplace (add-1) smoothing for Naïve Bayes P̂(wi | c) = count(wi,c)+1 count(w,c)+1 ( ) wÎV å = count(wi,c)+1 count(w,c wÎV å ) æ è ç ç ö ø ÷ ÷ + V P̂(wi | c) = count(wi,c) count(w,c) ( ) wÎV å
  • 51. P ( Normal| “Lunch Money”) = (?) * (?) * (8/12) P (Spam| “Lunch Money”) = Normal Dear – 8 Friend – 5 Lunch – 3 Money – 1 Spam Dear – 2 Friend – 1 Lunch – 0 Money – 4 = count(wi,c)+1 count(w,c wÎV å ) æ è ç ç ö ø ÷ ÷ + V P̂(wi | c) = count(wi,c) count(w,c) ( ) wÎV å P(N|Lunch money) = ( (3+1)/ (17+4) ) * (2/21 ) * (8/12) =0.012 P(S|Lunch money) = (1/11) * (5/11) * (4/12) = 0.013 Unique Word = 4, Number of occurrence = 17
  • 52. Unknown words What about unknown words ◦ that appear in our test data ◦ but not in our training data or vocab We ignore them ◦ Remove them from the test document! ◦ Pretend they weren't there! ◦ Don't include any probability for them at all. Why don't we build an unknown word model? ◦ It doesn't help: knowing which class has more unknown words is not generally a useful thing to know!
  • 53. Stop words Some systems ignore another class of words: Stop words: very frequent words like the and a. ◦ Sort the whole vocabulary by frequency in the training set, call the top 10 or 50 words the stopword list. ◦ Now we remove all stop words from the training and test sets as if they were never there. But in most text classification applications, removing stop words doesn't help, so it's more common to not use stopword lists and use all the words in naive Bayes.
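For systems that do use a stopword list, the frequency-based recipe above might look like the sketch below. The function names and the three toy documents are hypothetical, and `k=1` is used only to keep the example unambiguous.

```python
from collections import Counter


def top_k_stopwords(training_docs, k):
    """Return the k most frequent word types in the training documents."""
    frequency = Counter(token for doc in training_docs for token in doc)
    return {word for word, _ in frequency.most_common(k)}


def remove_stopwords(tokens, stopwords):
    """Drop stop words as if they were never there."""
    return [token for token in tokens if token not in stopwords]


training_docs = [
    "the film was a delight".split(),
    "the plot was thin".split(),
    "a thin plot and the worst acting".split(),
]
stopwords = top_k_stopwords(training_docs, k=1)
filtered = remove_stopwords("the plot was thin".split(), stopwords)
print(stopwords, filtered)
```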
  • 54. Multinomial Naïve Bayes: Learning • From training corpus, extract Vocabulary • Calculate P(cj) terms ◦ For each cj in C do: docsj ← all docs with class = cj; P(cj) ← |docsj| / |total # documents| • Calculate P(wk | cj) terms ◦ Textj ← single doc containing all docsj ◦ For each word wk in Vocabulary: nk ← # of occurrences of wk in Textj; P(wk | cj) ← (nk + α) / (n + α |Vocabulary|), where n is the total number of word tokens in Textj
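The learning procedure can be sketched in a few lines. This is one possible reading of the slide, not a reference implementation: the function name `train_multinomial_nb`, the `(tokens, label)` input format, and the toy corpus are all assumptions.

```python
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs, alpha=1.0):
    """labeled_docs is a list of (tokens, label) pairs.

    Returns (vocabulary, priors, likelihoods).
    """
    # From the training corpus, extract the vocabulary.
    vocabulary = sorted({tok for tokens, _ in labeled_docs for tok in tokens})

    # docs_j: how many documents carry each class label.
    docs_per_class = Counter(label for _, label in labeled_docs)

    # Text_j: pool the tokens of every document in class j.
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)

    priors, likelihoods = {}, {}
    for label, n_docs in docs_per_class.items():
        priors[label] = n_docs / len(labeled_docs)
        n = sum(word_counts[label].values())  # total tokens in Text_j
        denominator = n + alpha * len(vocabulary)
        likelihoods[label] = {
            word: (word_counts[label][word] + alpha) / denominator
            for word in vocabulary
        }
    return vocabulary, priors, likelihoods


# Hypothetical toy corpus, not the one used in the slides.
train = [
    ("great fun film".split(), "+"),
    ("boring film".split(), "-"),
    ("no fun".split(), "-"),
]
vocabulary, priors, likelihoods = train_multinomial_nb(train)
print(priors)
print(likelihoods["+"]["fun"], likelihoods["-"]["fun"])
```

Because every vocabulary word receives the same α in the numerator and the denominator grows by α|Vocabulary|, the likelihoods for each class still sum to 1.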
  • 57. Let's do a worked sentiment example!
  • 58. A worked sentiment example [slide table; only this word list survives in the transcript]: Just, Plain, Boar, Entire, Predict, And (2), Lack, Energy, No, Surprise, Very, Few, lough
  • 59. A worked sentiment example Prior from training: P(-) = 3/5, P(+) = 2/5. Drop "with". Likelihoods from training: P(Predict|+) = ?, P(Predict|-) = ?, P(No|+) = ?, P(No|-) = ?, P(Fun|+) = ?, P(Fun|-) = ?. Scoring the test set: P(-) * P(“Predict No Fun” | -) and P(+) * P(“Predict No Fun” | +)
  • 60. A worked sentiment example Prior from training: P(-) = ? P(+) = ? Drop "with" Likelihoods from training: Scoring the test set:
  • 61. Optimizing for sentiment analysis For tasks like sentiment, word occurrence is more important than word frequency. ◦ The occurrence of the word fantastic tells us a lot ◦ The fact that it occurs 5 times may not tell us much more. Binary multinomial naive Bayes, or binary NB ◦ Clip our word counts at 1 ◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
  • 62. Binary Multinomial Naïve Bayes: Learning • From training corpus, extract Vocabulary • Calculate P(cj) terms ◦ For each cj in C do: docsj ← all docs with class = cj; P(cj) ← |docsj| / |total # documents| • Calculate P(wk | cj) terms ◦ Remove duplicates in each doc: for each word type w in docj, retain only a single instance of w ◦ Textj ← single doc containing all docsj ◦ For each word wk in Vocabulary: nk ← # of occurrences of wk in Textj; P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
  • 63. Binary Multinomial Naive Bayes on a test document d. First remove all duplicate words from d. Then compute NB using the same equation: $c_{NB} = \operatorname{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(w_i \mid c_j)$
  • 64. Binary multinomial naive Bayes Counts can still be 2! Binarization is within-doc: a word that appears in two documents of the same class still gets a class count of 2.
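The within-document clipping can be illustrated with a small sketch. The function name `binarized_counts` and the three toy documents are hypothetical; the point is only that duplicates are removed per document before class counts are pooled.

```python
from collections import Counter


def binarized_counts(labeled_docs):
    """Count each word at most once per document, then pool by class."""
    counts = {}
    for tokens, label in labeled_docs:
        # set(tokens) keeps a single instance of each word type in this doc.
        counts.setdefault(label, Counter()).update(set(tokens))
    return counts


docs = [
    ("great great fun".split(), "+"),
    ("great film".split(), "+"),
    ("boring boring boring film".split(), "-"),
]
counts = binarized_counts(docs)
print(counts["+"]["great"], counts["-"]["boring"])
```

Here "great" ends up with a count of 2 in the positive class (once from each positive document), while "boring" is clipped from 3 to 1 because all three occurrences are in the same document.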
  • 66. Text Classification and Naïve Bayes Naïve Bayes: Relationship to Language Modeling
  • 67. Generative Model for Multinomial Naïve Bayes: c = +, X1 = I, X2 = love, X3 = this, X4 = fun, X5 = film
  • 68. Naïve Bayes and Language Modeling Naïve Bayes classifiers can use any sort of feature ◦ URL, email address, dictionaries, network features But if, as in the previous slides ◦ We use only word features ◦ we use all of the words in the text (not a subset) Then ◦ Naïve Bayes has an important similarity to language modeling.
  • 69. Each class = a unigram language model Assigning each word: P(word | c) Assigning each sentence: P(s | c) = Π P(word | c) Class pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1. For s = "I love this fun film": P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005 (Sec. 13.2.1)
  • 70. Naïve Bayes as a Language Model Which class assigns the higher probability to s? Model pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1. Model neg: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1. For s = "I love this fun film": P(s | pos) > P(s | neg) (Sec. 13.2.1)
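The comparison between the two class models is a product of per-word probabilities, as in this sketch. The probabilities are the ones on the slides; the function name `sentence_prob` is an assumption.

```python
import math

# Per-class unigram probabilities copied from the slides.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_prob(sentence, model):
    """P(s | c) under a unigram model: the product of the word probabilities."""
    return math.prod(model[word] for word in sentence.split())


s = "I love this fun film"
p_pos = sentence_prob(s, pos)
p_neg = sentence_prob(s, neg)
print(p_pos, p_neg, p_pos > p_neg)
```

The positive model assigns the sentence a probability of about 5e-7 and the negative model about 1e-9, so the positive model wins. For longer documents a sum of log probabilities is usually preferred to avoid floating-point underflow.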
  • 73. Evaluation Let's consider just binary text classification tasks Imagine you're the CEO of Delicious Pie Company You want to know what people are saying about your pies So you build a "Delicious Pie" tweet detector ◦ Positive class: tweets about Delicious Pie Co ◦ Negative class: all other tweets
  • 75. The 2-by-2 confusion matrix: TP = 10, FP = 2, FN = 3, TN = 34
  • 76. Evaluation: Accuracy Why don't we use accuracy as our metric? Imagine we saw 1 million tweets ◦ 100 of them talked about Delicious Pie Co. ◦ 999,900 talked about something else We could build a dumb classifier that just labels every tweet "not about pie" ◦ It would get 99.99% accuracy!!! Wow!!!! ◦ But useless! Doesn't return the comments we are looking for! ◦ That's why we use precision and recall instead
  • 77. Evaluation: Precision % of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels). Precision = TP / (TP + FP)
  • 78. Evaluation: Recall % of items actually present in the input that were correctly identified by the system. Recall = TP / (TP + FN)
  • 79. Why Precision and recall Our dumb pie-classifier ◦ Just label nothing as "about pie" Accuracy=99.99% but Recall = 0 ◦ (it doesn't get any of the 100 Pie tweets) Precision and recall, unlike accuracy, emphasize true positives: ◦ finding the things that we are supposed to be looking for.
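The numbers for the "label nothing as about pie" classifier can be checked directly. This sketch uses the tweet counts from the slides; the variable names are illustrative.

```python
total_tweets = 1_000_000
pie_tweets = 100  # gold positives

# The dumb classifier predicts "not about pie" for every tweet.
tp = 0                          # pie tweets it found
fp = 0                          # non-pie tweets it flagged
fn = pie_tweets                 # pie tweets it missed
tn = total_tweets - pie_tweets  # non-pie tweets it correctly ignored

accuracy = (tp + tn) / total_tweets
recall = tp / (tp + fn)
# Precision is undefined (0/0) because nothing was labeled positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(accuracy, recall, precision)
```

Accuracy is dominated by the 999,900 true negatives, while recall only looks at the 100 tweets we actually care about.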
  • 80. A combined measure: F F measure: a single number that combines P and R: $F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$. We almost always use balanced F1 (i.e., β = 1): $F_1 = \frac{2PR}{P + R}$
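As a worked instance, the sketch below computes precision, recall, and the F measure for the confusion matrix shown earlier (TP = 10, FP = 2, FN = 3, TN = 34). It assumes the usual weighted form of the F measure, (β² + 1)PR / (β²P + R); the function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    return (beta_sq + 1) * precision * recall / (beta_sq * precision + recall)


# Confusion matrix values from the earlier slide.
tp, fp, fn, tn = 10, 2, 3, 34

precision = tp / (tp + fp)   # 10 / 12
recall = tp / (tp + fn)      # 10 / 13
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f_beta(precision, recall)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

With β = 1 the formula reduces to 2PR / (P + R), which for these counts equals 2·TP / (2·TP + FP + FN) = 20/25 = 0.8.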