Text Classification.pdf

Naive Bayes and Sentiment Classification
The Task of Text Classification

Who wrote which Federalist papers?
1787-8: anonymous essays try to convince New
York to ratify the U.S. Constitution: Jay, Madison,
Hamilton.
Authorship of 12 of the letters in dispute
1963: solved by Mosteller and Wallace using
Bayesian methods
James Madison Alexander Hamilton

What is the subject of this medical article?
Antagonists and Inhibitors
Blood Supply
Chemistry
Drug Therapy
Embryology
Epidemiology
…
MeSH Subject Category Hierarchy
?
MEDLINE Article

Positive or negative movie review?
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...

Why sentiment analysis?
Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from
sentiment

Scherer Typology of Affective States
Emotion: brief organically synchronized … evaluation of a major event
◦ angry, sad, joyful, fearful, ashamed, proud, elated
Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
◦ cheerful, gloomy, irritable, listless, depressed, buoyant
Interpersonal stances: affective stance toward another person in a specific interaction
◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
◦ liking, loving, hating, valuing, desiring
Personality traits: stable personality dispositions and typical behavior tendencies
◦ nervous, anxious, reckless, morose, hostile, jealous

Basic Sentiment Classification
Sentiment analysis is the detection of
attitudes
Simple task we focus on in this chapter
◦ Is the attitude of this text positive or negative?
We return to affect classification in later
chapters

Summary: Text Classification
Sentiment analysis
Spam detection
Authorship identification
Language Identification
Assigning subject categories, topics, or genres
…

Text Classification: definition
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “you have been
selected”)
Accuracy can be high
◦ If rules carefully refined by expert
But building and maintaining these rules is expensive

Classification Methods:
Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ A training set of m hand-labeled documents
(d1,c1),....,(dm,cm)
Output:
◦ a learned classifier γ:d → c

Classification Methods:
Supervised Machine Learning
Any kind of classifier
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …

Text Classification and Naive Bayes
Naive Bayes (I)

Naive Bayes Intuition
Simple (“naive”) classification method based on
Bayes rule
Relies on very simple representation of document
◦ Bag of words

The Bag of Words Representation

The bag of words representation
γ(bag of words) = c, where the document is reduced to its word counts:
Word        Count
seen        2
sweet       1
whimsical   1
recommend   1
happy       1
...         ...
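
A minimal sketch of how such a bag of words could be built in Python. The review text, the regular-expression tokenizer, and the name `bag_of_words` are illustrative assumptions, not part of the original slides; the review was written so that its counts match the table above.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token.

    Word order is discarded: only the counts survive.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical review, constructed to reproduce the counts shown on the slide.
review = (
    "I have seen it twice. Seen again, it is still sweet and whimsical. "
    "I recommend it to anyone who wants to feel happy."
)
bag = bag_of_words(review)

for word in ["seen", "sweet", "whimsical", "recommend", "happy"]:
    print(word, bag[word])
```

The classifier γ then sees only `bag`, never the original word order.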

Text Classification and Naive Bayes
Formalizing the Naive Bayes Classifier

Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c:
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Naive Bayes Classifier (I)
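$$
\begin{aligned}
c_{MAP} &= \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d) && \text{MAP is ``maximum a posteriori'' = most likely class} \\
&= \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes rule} \\
&= \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c) && \text{dropping the denominator}
\end{aligned}
$$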

Naive Bayes Classifier (II)
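Document d represented as features x1..xn:
$$
\begin{aligned}
c_{MAP} &= \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c) \\
&= \underset{c \in C}{\operatorname{argmax}}\; \underbrace{P(x_1, x_2, \ldots, x_n \mid c)}_{\text{likelihood}}\;\underbrace{P(c)}_{\text{prior}}
\end{aligned}
$$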

Naïve Bayes Classifier (IV)
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
◦ Prior P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
◦ Likelihood P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters. Could only be estimated if a very, very large number of training examples was available.

Multinomial Naive Bayes Independence
Assumptions
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature
probabilities P(xi|cj) are independent given the class c.
$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$

Multinomial Naive Bayes Classifier
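$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{x \in X} P(x \mid c)$$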

Applying Multinomial Naive Bayes Classifiers
to Text Classification
positions ← all word positions in test document
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

Example
Consider a multinomial Naive Bayes classifier used to
filter out spam messages.
The training set contains eight normal messages and
four spam messages.

Histogram of all the words that occur in the
normal messages from family and friends

The probability of each word given that it appears in a
normal message (the normal messages contain 17 word occurrences in total):
P(Dear | Normal) = 8/17 = 0.47
P(Friend | Normal) = 5/17 = 0.29
P(Lunch | Normal) = 3/17 = 0.18
P(Money | Normal) = 1/17 = 0.06

The probability of each word given that it appears in a
spam message (the spam messages contain 7 word occurrences in total):
P(Dear | Spam) = 2/7 = 0.29
P(Friend | Spam) = 1/7 = 0.14
P(Lunch | Spam) = 0/7 = 0.00
P(Money | Spam) = 4/7 = 0.57

What is the probability that “Dear Friend” is a
normal message?

What is the probability that “Dear Friend” is a
spam message?

Problems with multiplying lots of probs
There's a problem with this:
Multiplying lots of probabilities can result in floating-point
underflow!
Luckily, log(ab) = log(a) + log(b)
Let's sum logs of probabilities instead of multiplying probabilities!
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

We actually do everything in log space
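Instead of this:
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
This:
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$
This is OK since log doesn't change the ranking of the classes (the class with
the highest probability still has the highest log probability).
The model is now just a max of a sum of weights: a linear function of the inputs.
So Naive Bayes is a linear classifier.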
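
A small Python sketch of why log space is used. The 100 likelihoods of 1e-5 are made-up values chosen only to trigger underflow, and `log_score` is a hypothetical helper name; the “Dear Friend” numbers are the ones from the spam example.

```python
import math

# 100 made-up word likelihoods of 1e-5 each: the true product is 1e-500,
# which is far below the smallest positive double, so it underflows to 0.0.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a comfortable numeric range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29


def log_score(prior, likelihoods):
    """log P(c) + sum of log P(x_i | c): the log-space Naive Bayes score."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# "Dear Friend" from the spam example: the ranking of the classes is the same
# whether we compare products of probabilities or sums of logs.
normal = log_score(8 / 12, [8 / 17, 5 / 17])
spam = log_score(4 / 12, [2 / 7, 1 / 7])
print(normal > spam)  # True
```

Note that `math.log(0)` raises an error, so in practice the log-space form is combined with smoothed (nonzero) probability estimates.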

Text Classification and Naive Bayes
Naive Bayes: Learning

Learning the Multinomial Naive Bayes Model
First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data
Compute the probability of a class:
$$\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}$$
◦ Example: P(Normal) = 8/12
Compute the probability of a word given a class (e.g., a class in {Positive, Negative}):
$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

Parameter estimation
Create mega-document for topic j by concatenating all
docs in this topic
◦ Use frequency of w in mega-document
$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$
i.e., the fraction of times word wi appears
among all words in documents of topic cj

Documents: 12 (Normal = 8, Spam = 4)
P(Normal) = 8/12
P(Spam) = 4/12

Word counts per class:
Word     Normal   Spam
Dear     8        2
Friend   5        1
Lunch    3        0
Money    1        4
Total    17       7

Score of “Dear Friend” under each class (proportional to the posterior, since P(d) is dropped):
P(Normal | “Dear Friend”) ∝ (8/17) × (5/17) × (8/12) ≈ 0.092
P(Spam | “Dear Friend”) ∝ (2/7) × (1/7) × (4/12) ≈ 0.014
Normal has the higher score, so “Dear Friend” is classified as a normal message.

Score of “Lunch Money” under each class:
P(Normal | “Lunch Money”) ∝ (3/17) × (1/17) × (8/12) ≈ 0.0069
P(Spam | “Lunch Money”) ∝ (0/7) × (4/7) × (4/12) = 0
Because “Lunch” never occurs in the spam training messages, the spam score is zero, even though “Money” is strong evidence for spam.
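
The hand calculations for “Dear Friend” and “Lunch Money” can be reproduced with a short Python sketch. The counts are the ones given in the example; the dictionary layout and the name `mle_score` are illustrative assumptions.

```python
# Word counts and document counts from the spam example.
counts = {
    "Normal": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "Spam": {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}
doc_counts = {"Normal": 8, "Spam": 4}


def mle_score(words, label):
    """P(label) times the product of unsmoothed P(word | label)."""
    total_docs = sum(doc_counts.values())
    total_words = sum(counts[label].values())
    score = doc_counts[label] / total_docs  # prior P(label)
    for w in words:
        score *= counts[label][w] / total_words  # maximum likelihood estimate
    return score


for message in (["Dear", "Friend"], ["Lunch", "Money"]):
    for label in ("Normal", "Spam"):
        print(message, label, round(mle_score(message, label), 4))
# ['Dear', 'Friend'] Normal 0.0923
# ['Dear', 'Friend'] Spam 0.0136
# ['Lunch', 'Money'] Normal 0.0069
# ['Lunch', 'Money'] Spam 0.0
```

The last line shows the weakness of unsmoothed estimates: a single unseen word forces the whole class score to zero.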

Problem with Maximum Likelihood
What if we have seen no training documents with the word fantastic
and classified in the topic positive (thumbs-up)?
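$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0$$
Zero probabilities cannot be conditioned away, no matter the other
evidence!
$$c_{MAP} = \underset{c}{\operatorname{argmax}}\; \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$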

Laplace (add-1) smoothing for Naïve Bayes
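Maximum likelihood estimate:
$$\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$
Add-1 estimate:
$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$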

Applying add-1 smoothing to “Lunch Money”:
Vocabulary size |V| = 4 unique words; total word occurrences: Normal = 17, Spam = 7
P(Normal | “Lunch Money”) ∝ ((3+1)/(17+4)) × ((1+1)/(17+4)) × (8/12) = (4/21) × (2/21) × (8/12) ≈ 0.012
P(Spam | “Lunch Money”) ∝ ((0+1)/(7+4)) × ((4+1)/(7+4)) × (4/12) = (1/11) × (5/11) × (4/12) ≈ 0.014
The spam score is no longer zero, and it is now slightly higher than the normal score, so “Lunch Money” is classified as spam.
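
A minimal Python sketch of the smoothed calculation for “Lunch Money”. The counts come from the example; the name `smoothed_score` and the `alpha` parameter (with `alpha=1` giving add-1 smoothing) are illustrative assumptions.

```python
# Word counts and document counts from the spam example.
counts = {
    "Normal": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "Spam": {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}
doc_counts = {"Normal": 8, "Spam": 4}
vocab_size = 4  # Dear, Friend, Lunch, Money


def smoothed_score(words, label, alpha=1.0):
    """P(label) times the product of add-alpha smoothed P(word | label)."""
    total_docs = sum(doc_counts.values())
    total_words = sum(counts[label].values())
    score = doc_counts[label] / total_docs
    for w in words:
        score *= (counts[label][w] + alpha) / (total_words + alpha * vocab_size)
    return score


normal = smoothed_score(["Lunch", "Money"], "Normal")
spam = smoothed_score(["Lunch", "Money"], "Spam")
print(round(normal, 4))  # 0.0121
print(round(spam, 4))    # 0.0138
print("Spam" if spam > normal else "Normal")  # Spam
```

With smoothing, the strong spam evidence from “Money” is no longer wiped out by the unseen word “Lunch”.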

Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocab
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all.
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally a useful thing to know!

Stop words
Some systems ignore another class of words:
Stop words: very frequent words like the and a.
◦ Sort the whole vocabulary by frequency in the training, call the
top 10 or 50 words the stopword list.
◦ Now we remove all stop words from the training and test sets
as if they were never there.
But in most text classification applications, removing
stop words doesn't help, so it's more common to not use
stopword lists and use all the words in naive Bayes.

Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  ◦ For each cj in C do
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
  ◦ Textj ← single doc containing all docsj
  ◦ For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
    where n is the total number of word tokens in Textj
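
The learning procedure can be sketched in Python as follows. This is one possible arrangement, not the slides' own implementation: the function names `train_nb` and `predict_nb`, the tokenized toy training set, and the choice to score in log space and skip unknown words at test time are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, alpha=1.0):
    """Train multinomial Naive Bayes on a list of (tokens, label) pairs.

    Returns log priors, per-class log likelihoods, and the vocabulary.
    """
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_count = Counter(label for _, label in docs)
    word_count = defaultdict(Counter)  # class -> counts over its "mega-document"
    for tokens, label in docs:
        word_count[label].update(tokens)

    log_prior, log_likelihood = {}, {}
    for c in doc_count:
        log_prior[c] = math.log(doc_count[c] / len(docs))
        n = sum(word_count[c].values())  # total tokens in class c
        denom = n + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_count[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(tokens, log_prior, log_likelihood, vocab):
    """Return the class with the highest log score; unknown words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy training set (not the 12 messages from the example).
train = [
    (["dear", "friend", "lunch"], "normal"),
    (["dear", "friend"], "normal"),
    (["money", "money", "dear"], "spam"),
]
model = train_nb(train)
print(predict_nb(["dear", "friend"], *model))   # normal
print(predict_nb(["money", "prize"], *model))   # spam ("prize" is unknown and skipped)
```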

Text Classification and Naive Bayes
Sentiment and Binary Naive Bayes

Let's do a worked sentiment example!

A worked sentiment example
[Figure: training documents with word tallies. Recoverable words: just, plain, boring, entirely, predictable, and (2), lacks, energy, no, surprises, very, few, laughs]

A worked sentiment example
Prior from training:
P(-) = ?
P(+) = ?
Drop "with"
Likelihoods from training:
Scoring the test set:

Optimizing for sentiment analysis
For tasks like sentiment, word occurrence is more
important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different than Bernoulli naive bayes; see the
textbook at the end of the chapter.

Binary Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  ◦ For each cj in C do
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|
• Remove duplicates in each doc:
  ◦ For each word type w in docj
      Retain only a single instance of w
• Calculate P(wk | cj) terms
  ◦ Textj ← single doc containing all docsj
  ◦ For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
    where n is the total number of word tokens in Textj

Binary Multinomial Naive Bayes
on a test document d
First remove all duplicate words from d
Then compute NB using the same equation:
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(w_i \mid c_j)$$

Binary multinomial naive Bayes
Counts can still be 2! Binarization is within-doc: a word that appears in two different documents of the same class still gets a count of 2.
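
A short Python sketch of within-document binarization. The two toy documents and the name `binarized_counts` are hypothetical, chosen only to show that a class-level count can still exceed 1.

```python
from collections import Counter


def binarized_counts(docs):
    """Count each word at most once per document, then sum over documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # set() removes duplicates within one doc
    return counts


# Two hypothetical documents from the same class.
docs = [["great", "great", "film"], ["great", "acting"]]

raw = Counter(w for tokens in docs for w in tokens)
binary = binarized_counts(docs)

print(raw["great"])     # 3: every occurrence counted
print(binary["great"])  # 2: once per document, in two documents
print(binary["film"])   # 1
```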

Text Classification and Naive Bayes
Naive Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes
c=+
X1=I X2=love X3=this X4=fun X5=film

Naïve Bayes and Language Modeling
Naïve bayes classifiers can use any sort of feature
◦ URL, email address, dictionaries, network features
But if, as in the previous slides
◦ We use only word features
◦ we use all of the words in the text (not a subset)
Then
◦ Naive bayes has an important similarity to language
modeling.

Each class = a unigram language model
Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = ∏ P(word | c)
Class pos:
Word   P(word | pos)
I      0.1
love   0.1
this   0.01
fun    0.05
film   0.1
s = “I love this fun film”
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005

Naïve Bayes as a Language Model
Which class assigns the higher probability to s = “I love this fun film”?
Word   Model pos   Model neg
I      0.1         0.2
love   0.1         0.001
this   0.01        0.01
fun    0.05        0.005
film   0.1         0.1
P(s | pos) = 0.0000005 and P(s | neg) = 0.000000001, so P(s | pos) > P(s | neg)
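
The comparison can be checked with a few lines of Python. The word probabilities are the ones in the two unigram models; the name `sentence_prob` and the whitespace tokenization are illustrative assumptions.

```python
import math

# Per-class unigram language models from the example.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_prob(sentence, model):
    """P(s | c) under a unigram model: the product of the word probabilities."""
    return math.prod(model[word] for word in sentence.split())


s = "I love this fun film"
p_pos = sentence_prob(s, pos)
p_neg = sentence_prob(s, neg)

print(p_pos > p_neg)  # True: the positive model assigns s the higher probability
```

Multiplying each value by the class prior P(c) would turn this comparison into the full Naive Bayes decision.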

Text Classification and Naive Bayes
Precision, Recall, and F measure

Evaluation
Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about
your pies
So you build a "Delicious Pie" tweet detector
◦ Positive class: tweets about Delicious Pie Co
◦ Negative class: all other tweets

The 2-by-2 confusion matrix
                    gold positive   gold negative
system positive     TP = 10         FP = 2
system negative     FN = 3          TN = 34

Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
◦ 100 of them talked about Delicious Pie Co.
◦ 999,900 talked about something else
We could build a dumb classifier that just labels every
tweet "not about pie"
◦ It would get 99.99% accuracy!!! Wow!!!!
◦ But useless! Doesn't return the comments we are looking for!
◦ That's why we use precision and recall instead

Evaluation: Precision
% of items the system detected (i.e., items the
system labeled as positive) that are in fact positive
(according to the human gold labels)

Evaluation: Recall
% of items actually present in the input that were
correctly identified by the system.
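
As a sketch, precision and recall (and accuracy, for comparison) can be computed from the four confusion-matrix cells. The counts below are the ones shown in the 2-by-2 confusion matrix earlier; the variable names are assumptions.

```python
# Confusion-matrix cells: true/false positives and negatives.
tp, fp, fn, tn = 10, 2, 3, 34

# Precision: of the items the system labeled positive, how many are positive?
precision = tp / (tp + fp)

# Recall: of the items that are actually positive, how many did the system find?
recall = tp / (tp + fn)

# Accuracy: fraction of all items labeled correctly.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 4))  # 0.8333
print(round(recall, 4))     # 0.7692
print(round(accuracy, 4))   # 0.898
```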

Why Precision and recall
Our dumb pie-classifier
◦ Just label nothing as "about pie"
Accuracy=99.99%
but
Recall = 0
◦ (it doesn't get any of the 100 Pie tweets)
Precision and recall, unlike accuracy, emphasize true
positives:
◦ finding the things that we are supposed to be looking for.

A combined measure: F
F measure: a single number that combines P and R:
$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$
We almost always use balanced F1 (i.e., β = 1):
$$F_1 = \frac{2PR}{P + R}$$
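
A minimal Python sketch of the F measure. The function name `f_beta` and the convention of returning 0 when the denominator is 0 are assumptions; the precision and recall values come from the confusion matrix with TP = 10, FP = 2, FN = 3.

```python
def f_beta(precision, recall, beta=1.0):
    """F measure: (beta^2 + 1) * P * R / (beta^2 * P + R).

    Returns 0.0 when the denominator is 0 (an assumed convention).
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (b2 + 1) * precision * recall / denom


# Precision and recall from the confusion matrix: TP = 10, FP = 2, FN = 3.
p, r = 10 / 12, 10 / 13
f1 = f_beta(p, r)
print(round(f1, 4))  # 0.8

# The "label nothing as pie" classifier has recall 0, so its F1 is 0
# despite 99.99% accuracy (precision is taken as 0 here by convention).
print(f_beta(0.0, 0.0))  # 0.0
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.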
