Logistic Regression
Background: Generative and Discriminative Classifiers
Logistic Regression
Important analytic tool in the natural and social sciences
Baseline supervised machine learning tool for classification
Also the foundation of neural networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier
By contrast, logistic regression is a discriminative classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else
Finding the correct class c from a document d: generative vs. discriminative classifiers
Naive Bayes (generative) models the likelihood and the prior: ĉ = argmax_c P(d|c) P(c)
Logistic regression (discriminative) models the posterior P(c|d) directly: ĉ = argmax_c P(c|d)
Components of a probabilistic machine learning classifier
1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
Given m input/output pairs (x(i), y(i)):
The two phases of logistic regression
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability
Logistic Regression
Classification in Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Given a series of input/output pairs (x(i), y(i)):
For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression
• For feature xi, the weight wi tells us how important xi is
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
Logistic Regression for one observation x
Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias: z = (Σi wi xi) + b = w∙x + b
If this sum z is high, we say y=1; if low, then y=0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!
Solution: use a function of z that goes from 0 to 1
The very useful sigmoid or logistic function: σ(z) = 1/(1 + e^(−z))
It maps any real-valued z to the range (0,1): nearly linear around 0, squashing outliers toward 0 or 1.
Idea of logistic regression
We’ll compute w∙x+b
And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
P(y=1|x) = σ(w∙x+b)
P(y=0|x) = 1 − σ(w∙x+b) = σ(−(w∙x+b))
These two sum to 1, and the second equality holds because the sigmoid has the property 1 − σ(z) = σ(−z).
Turning a probability into a classifier
ŷ = 1 if P(y=1|x) > 0.5, otherwise ŷ = 0
0.5 here is called the decision boundary
[Figure: the probabilistic classifier — the sigmoid curve mapping w∙x+b (x-axis) to P(y=1) (y-axis)]
Turning a probability into a classifier, equivalently in terms of z (since P(y=1|x) > 0.5 exactly when w∙x+b > 0):
ŷ = 1 if w∙x+b > 0
ŷ = 0 if w∙x+b ≤ 0
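Putting the pieces together, here is a minimal sketch of this classifier in Python; the weights, bias, and feature values in the last line are hypothetical placeholders, not values from the slides:

```python
import math

def sigmoid(z):
    # Maps any real-valued z to the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # w and x are equal-length lists of floats; b is the scalar bias.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w.x + b
    p = sigmoid(z)                                 # P(y=1|x)
    return (1 if p > 0.5 else 0), p                # decision boundary at 0.5

label, prob = predict(w=[2.5, -5.0], b=0.1, x=[3.0, 2.0])  # hypothetical numbers
```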
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
We'll represent the input x (the review) by six features x1…x6, following the textbook's worked example:
◦ x1 = count of positive lexicon words = 3
◦ x2 = count of negative lexicon words = 2
◦ x3 = 1 if "no" is in the document, else 0 → 1
◦ x4 = count of 1st and 2nd person pronouns = 3
◦ x5 = 1 if "!" is in the document, else 0 → 0
◦ x6 = log(word count of the document) = ln(66) ≈ 4.19
Classifying sentiment for input x
Suppose we've already learned the weights w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7] and b = 0.1
Then z = w∙x + b ≈ 0.833, so P(+|x) = σ(0.833) ≈ 0.70 and P(−|x) ≈ 0.30
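A few lines of Python reproduce this computation, under the feature values and weights assumed above:

```python
import math

x = [3, 2, 1, 3, 0, math.log(66)]       # the six features; x6 = ln(66) ~ 4.19
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]    # the weights assumed above
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = 1.0 / (1.0 + math.exp(-z))      # P(+|x) = sigma(w.x + b)
print(round(z, 3), round(p_pos, 2))     # 0.833 0.7
```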
We can build features for logistic regression for any classification
task: period disambiguation
"This ends in a period." → end of sentence
"The house at 465 Main St. is new." → not end of sentence (the period is part of the abbreviation "St.")
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
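A hypothetical feature extractor for exactly these two features might look like the sketch below; everything beyond the two definitions above is illustrative:

```python
import math

def extract_features(review: str):
    words = review.lower().split()
    x1 = words.count("awesome")     # x1 = count("awesome")
    x2 = math.log(len(words))       # x2 = log(number of words in review)
    return [x1, x2]
```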
Logistic Regression
Learning: Cross-Entropy Loss
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y
We want to know how far the classifier output ŷ = σ(w∙x+b)
is from the true output y [= either 0 or 1]
We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss
= cross-entropy loss
A case of conditional maximum likelihood estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1) we can express the probability p(y|x) from our classifier (the thing we want to maximize) as
p(y|x) = ŷ^y (1−ŷ)^(1−y)
noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1−ŷ
Deriving cross-entropy loss for a single observation x
Now take the log of both sides (mathematically handy)
Whatever values maximize log p(y|x) will also maximize p(y|x)
Goal: maximize probability of the correct label p(y|x)
Maximize: log p(y|x) = log [ŷ^y (1−ŷ)^(1−y)]
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Deriving cross-entropy loss for a single observation x
Now flip the sign to turn this into a loss: something to minimize
Goal: maximize probability of the correct label p(y|x)
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Minimize the cross-entropy loss (so called because this is the formula for the cross-entropy between y and ŷ):
L_CE(ŷ, y) = −[y log ŷ + (1−y) log(1−ŷ)]
Or, plugging in the definition of ŷ:
L_CE(ŷ, y) = −[y log σ(w∙x+b) + (1−y) log(1−σ(w∙x+b))]
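A direct transcription of this loss into Python; the eps clipping is a standard numerical-safety habit (to keep log() away from 0), not part of the slide's math:

```python
import math

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # L_CE(y_hat, y) = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```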
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another
nice touch is the music . I was overcome with the urge to get off the couch
and start dancing . It sucked me in , and it'll do the same to you .
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing? P(+|x) = σ(w∙x+b) ≈ 0.70
Pretty well! What's the loss? L_CE = −log σ(w∙x+b) = −log(0.70) ≈ 0.36
Let's see if this works for our sentiment example
Suppose the true value instead was y=0.
What's the loss? L_CE = −log(1 − σ(w∙x+b)) = −log(0.30) ≈ 1.2
Let's see if this works for our sentiment example
The loss when the model was right (if true y=1), ≈ 0.36,
is lower than the loss when the model was wrong (if true y=0), ≈ 1.2:
Sure enough, the loss was bigger when the model was wrong!
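Using the cross_entropy_loss sketch from above on this example:

```python
p_pos = 0.70                            # sigma(w.x + b) from the earlier computation
print(cross_entropy_loss(p_pos, 1))     # ~0.36: true y=1, model right, low loss
print(cross_entropy_loss(p_pos, 0))     # ~1.20: true y=0, model wrong, high loss
```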
Logistic Regression
Stochastic Gradient Descent
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious
We want the weights that minimize the loss, averaged over all examples:
θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x(i); θ), y(i))
Intuition of gradient descent
How do I get to the bottom of this river canyon?
Look around me 360°
Find the direction of
steepest slope down
Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Here the slope at the current w is negative, so we'll move w in the positive direction
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
Gradient Descent: Find the gradient of the loss
function at the current point and move in the
opposite direction.
How much do we move in that direction?
• By the value of the gradient (the slope, in our example) weighted by a learning rate η
• A higher learning rate means we move w faster
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i tells
us the slope with respect to that variable.
◦ “How much would a small change in wi influence the total loss function L?”
◦ We express the slope as a partial derivative of the loss with respect to wi: ∂L/∂wi
The gradient is then defined as a vector of these
partials.
The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
∇_θ L(f(x; θ), y) = [∂L/∂w1, ∂L/∂w2, …, ∂L/∂wn, ∂L/∂b]
The final equation for updating θ based on the gradient is thus:
θ^(t+1) = θ^t − η ∇L(f(x; θ), y)
What are these partial derivatives for logistic regression?
The loss function:
L_CE(ŷ, y) = −[y log σ(w∙x+b) + (1−y) log(1−σ(w∙x+b))]
The elegant derivative of this function (see textbook 5.8 for the derivation):
∂L_CE(ŷ, y)/∂wj = (σ(w∙x+b) − y) xj = (ŷ − y) xj
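In code the gradient is just the scalar error (ŷ − y) scaled by each feature value, plus the bare error for the bias. A minimal sketch under those definitions:

```python
import math

def gradients(w, b, x, y):
    # dL/dwj = (sigma(w.x+b) - y) * xj ; dL/db = sigma(w.x+b) - y
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    err = y_hat - y
    return [err * xj for xj in x], err   # gradient w.r.t. each wj, and w.r.t. b
```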
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Logistic Regression
Stochastic Gradient Descent: An example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
The update step for θ is: θ^(t+1) = θ^t − η ∇L(f(x; θ), y)
where the gradient is the one we just derived: ∂L_CE/∂wj = (σ(w∙x+b) − y) xj
The gradient vector has 3 dimensions, for w1, w2, and b. With w1 = w2 = b = 0 and x1 = 3, x2 = 2:
∇ = [ (σ(0) − 1)·x1, (σ(0) − 1)·x2, σ(0) − 1 ]
  = [ (0.5 − 1)·3, (0.5 − 1)·2, 0.5 − 1 ]
  = [ −1.5, −1.0, −0.5 ]
Example of gradient descent
η = 0.1. Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient:
θ1 = θ0 − η·∇ = [0, 0, 0] − 0.1·[−1.5, −1.0, −0.5] = [0.15, 0.1, 0.05]
So after one step of gradient descent: w1 = 0.15, w2 = 0.1, b = 0.05
Note that enough negative examples would eventually make w2 negative
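Reusing the gradients sketch from above, a few lines reproduce this single update step:

```python
w, b = [0.0, 0.0], 0.0     # theta_0: w1 = w2 = b = 0
x, y = [3.0, 2.0], 1       # the mini-sentiment example, true y = 1
eta = 0.1

grad_w, grad_b = gradients(w, b, x, y)               # [-1.5, -1.0], -0.5
w = [wi - eta * gwi for wi, gwi in zip(w, grad_w)]   # ~[0.15, 0.1]
b = b - eta * grad_b                                 # 0.05
```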
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: compute the gradient over the entire dataset
Mini-batch training: compute the gradient over m examples (e.g., 512 or 1024)
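A sketch of one mini-batch update, averaging the per-example gradients; shuffling and the epoch loop are omitted, and gradients is the helper sketched earlier:

```python
def minibatch_step(w, b, batch, eta):
    # batch: list of (x, y) pairs; average their gradients, then update once.
    acc_w, acc_b = [0.0] * len(w), 0.0
    for x, y in batch:
        gw, gb = gradients(w, b, x, y)
        acc_w = [a + g for a, g in zip(acc_w, gw)]
        acc_b += gb
    n = len(batch)
    w = [wi - eta * a / n for wi, a in zip(w, acc_w)]
    b = b - eta * acc_b / n
    return w, b
```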
Logistic Regression
Regularization
Overfitting
A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight
◦ The model will then fail to generalize to a test set without this word
A good model should be able to generalize
Overfitting
Two training examples:
(+) "This movie drew me in, and it'll do the same to you."
(−) "I can't tell you how much I hated this movie. It sucked."
Useful or harmless features:
◦ x1 = "this", x2 = "movie", x3 = "hated"
4-gram features that just "memorize" the training set and might cause problems:
◦ x4 = "drew me in", x5 = "the same to you", x7 = "tell you how much"
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization
A solution for overfitting
Add a regularization term R(θ) to the objective (for now written as maximizing log probability rather than minimizing loss):
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α R(θ)
Idea: choose an R(θ) that penalizes large weights
◦ fitting the data well with lots of big weights is not as good as fitting the data a little less well with small weights
L2 Regularization (= ridge regression)
R(θ) = ||θ||₂² = Σ_{j=1..n} θj² — the sum of the squares of the weights
The name is because this is the (square of the) L2 norm ||θ||₂, the Euclidean distance of θ from the origin.
L2-regularized objective function:
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α Σ_{j=1..n} θj²
L1 Regularization (= lasso regression)
R(θ) = ||θ||₁ = Σ_{j=1..n} |θj| — the sum of the absolute values of the weights
Named after the L1 norm ||θ||₁: the sum of the absolute values of the weights, i.e. the Manhattan distance from the origin.
L1-regularized objective function:
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α Σ_{j=1..n} |θj|
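As a sketch, the two penalties are one-liners added onto the data loss; alpha (the regularization strength) is a hyperparameter, and theta here is the flat list of weights:

```python
def l2_penalty(theta, alpha):
    # ridge: alpha times the sum of squared weights
    return alpha * sum(t * t for t in theta)

def l1_penalty(theta, alpha):
    # lasso: alpha times the sum of absolute weights
    return alpha * sum(abs(t) for t in theta)

# A regularized objective just adds the penalty to the data loss:
# total_loss = data_loss + l2_penalty(theta, alpha)   # or l1_penalty(theta, alpha)
```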
Logistic Regression
Multinomial Logistic Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= softmax regression
= multinomial logit
= (defunct names: maximum entropy modeling or MaxEnt)
So "logistic regression" will just mean binary (2 output classes)
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
Need a generalization of the sigmoid called the softmax
◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
◦ each value in the range [0,1]
◦ all the values summing to 1
The softmax function
Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities:
softmax(zi) = exp(zi) / Σ_{j=1..k} exp(zj), for 1 ≤ i ≤ k
The denominator normalizes all the values into probabilities that sum to 1.
Softmax in multinomial logistic regression
The input is still a dot product between a weight vector w and the input vector x
But now we need a separate weight vector wc (and bias bc) for each of the K classes:
p(y = c | x) = exp(wc∙x + bc) / Σ_{j=1..K} exp(wj∙x + bj)
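A small sketch of softmax and the resulting per-class probabilities; subtracting max(z) before exponentiating is a standard numerical-stability trick, not something the slides discuss:

```python
import math

def softmax(z):
    m = max(z)                              # stability: exp() can't overflow
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]        # values in [0,1], summing to 1

def class_probs(ws, bs, x):
    # ws: one weight vector per class; bs: one bias per class.
    z = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(ws, bs)]
    return softmax(z)                       # p(y=c|x) for each class c
```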
Features in binary versus multinomial logistic regression
Binary: a positive weight is evidence for y=1, a negative weight is evidence for y=0
Multinomial: separate weights for each class, so the same feature gets a different weight per class
Editor's Notes
  • #1: In this lecture we talk about the difference between generative and discriminative classifiers, and the relationship between Naïve Bayes and Logistic Regression.
  • #10: We've now seen the high-level intuition of the components of logistic regression and its relationship to the other classifier we've learned about, Naïve Bayes
  • #11: In this lecture we'll see how to do classification with logistic regression and introduce the important sigmoid function
  • #15: The weight wi represents how important that input feature is to the classification decision, and can be positive (providing evidence that the instance being classified belongs in the positive class) or negative (providing evidence that the instance being classified belongs in the negative class). Thus we might expect in a sentiment task the word awesome to have a high positive weight, and abysmal to have a very negative weight.
  • #17: The bias term, also called the intercept, is another real number that’s added to the weighted inputs. In the rest of the book we’ll represent such sums using the dot product notation from linear algebra. The dot product of two vectors a and b, written as a · b is the sum of the products of the corresponding elements of each vector.
  • #19: nothing in Eq. 5.3 forces z to be a legal probability, that is, to lie between 0 and 1. In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.
  • #20: The sigmoid function σ(z) = 1/(1+e^(−z)) (so named because it looks like an s) is also called the logistic function. It takes a real value and maps it to the range [0, 1]. It is nearly linear around 0, but outlier values get squashed toward 0 or 1.
  • #22: We’re almost there. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows:
  • #23: The sigmoid function has the property 1 − σ(z) = σ(−z).
  • #27: We've seen how logistic regression uses the sigmoid function to take weighted features for an input example x and assign it to the class 1 or 0.
  • #28: Let's walk through an example using logistic regression to do sentiment classification
  • #29: Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class 1=positive or 0=negative to the following review
  • #30: We’ll represent each input observation by the 6 features x1...x6 of the input
  • #31: Let’s assume for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are as follows. The weight w1, for example indicates how important a feature the number of positive lexicon words (great, nice, enjoyable, etc.) is to a positive sentiment decision, while w2 tells us the importance of negative lexicon words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative words are negatively associated with a positive sentiment decision, and are about twice as important as positive words.
  • #32: Given these 6 features and the input review x, we can compute P(+|x) and P(−|x)
  • #33: We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary (“Prof.”) and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties. For example a period following an upper case word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.
  • #35: We've now seen the details of how logistic regression can take feature values and weights to assign a class to an input.
  • #36: Let's now turn to learning the parameters for logistic regression. We'll start with the cross-entropy loss function
  • #37: Logistic regression is an instance of supervised classification in which we know the correct label y (either 0 or 1) for each observation x. But what the system produces is an estimate, ŷ. We want to learn parameters (meaning w and b) that make ŷ for each training observation as close as possible to the true y.
  • #38: This requires two components. The first is a metric for how close the current label (ŷ) is to the true gold label y. Rather than measure similarity, we usually talk about the opposite of this: the distance between the system output and the gold output, and we call this distance the loss function or the cost function. We'll introduce the loss function that is commonly used for logistic regression and also for neural networks, the cross-entropy loss. The second thing we need is an optimization algorithm for iteratively updating the weights so as to minimize this loss function. The standard algorithm for this is gradient descent; we'll introduce the stochastic gradient descent algorithm in the following section.
  • #39: We need a loss function that expresses, for an observation x, how close the classifier output (ŷ = σ(w·x+b)) is to the correct output (y, which is 0 or 1).
  • #40: We do this via a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters w,b that maximize the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
  • #41: Let’s derive this loss function, applied to a single observation x. We’d like to learn weights that maximize the probability of the correct label, that's p(y|x).
  • #46: By contrast, let’s pretend instead that the example was actually negative, i.e., the true y = 0 (perhaps the reviewer went on to say “But bottom line, the movie is terrible! I beg you not to see it!”) . In this case our model is confused, the model is wrong, so we’d want the loss to be higher. Let's plug in y = 0 and 1−σ(w·x+b) = .31
  • #48: We've derived the cross-entropy loss and seen how it applies in our sentiment example. Cross entropy loss is equally important for neural networks.
  • #49: In this lecture we introduce the stochastic gradient descent algorithm used for optimizing the weights for logistic regression and neural networks.
  • #50: Our goal with gradient descent is to find the optimal weights: minimize the loss function we've defined for the model. We'll explicitly represent the fact that the loss function L is parameterized by the weights, which we can refer to in machine learning in general as θ (in the case of logistic regression θ = w, b). And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious. So the goal is to find the set of weights which minimizes the loss function, averaged over all examples: θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x(i); θ), y(i))
  • #51: How do we find the minimum of a loss function? Gradient descent is a method that finds a minimum of a function by figuring out in which direction (in the space of the parameters θ) the function’s slope is rising the most steeply, and moving in the opposite direction. The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.
  • #52: For logistic regression, this loss function is conveniently convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum. (By contrast, the loss for multi-layer neural networks is non-convex, and gradient descent may get stuck in local minima for neural network training and never find the global optimum.)
  • #53: Although the algorithm (and the concept of gradient) are designed for direction vectors, let’s first consider a visualization of the case where the parameter of our system is just a single scalar w. Given a random initialization of w at some value w1, and assuming the loss function L happened to have this shape, we need the algorithm to tell us whether at the next iteration we should move left (making w2 smaller than w1) or right (making w2 bigger than w1) to reach the minimum. The gradient descent algorithm answers this question by finding the gradient of the loss function at the current point and moving in the opposite direction.
  • #54: The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function. The gradient is a multi-variable generalization of the slope, so for a function of one variable like the one in this figure, we can informally think of the gradient as the slope. The dotted line shows the slope of this hypothetical loss function at point w = w1 . You can see that the slope of this dotted line is negative.
  • #55: Thus to find the minimum, gradient descent tells us to go in the opposite direction: moving w in a positive direction.
  • #57: The magnitude of the amount to move in gradient descent is the value of the slope of the loss function (with respect to w) weighted by a learning rate η . A higher (faster) learning rate means that we should move w more on each step. The change we make in our parameter is the learning rate times the gradient (or the slope, in our single-variable example)
  • #58: Now let’s extend the intuition from a function of one scalar variable w to many variables, because we don’t just want to move left or right, we want to know where in the N-dimensional space (of the N parameters that make up θ ) we should move. The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of those N dimensions.
  • #59: If we’re just imagining two weight dimensions (say for one weight w and one bias b), the gradient might be a vector with two orthogonal components, each of which tells us how much the ground slopes in the w dimension and in the b dimension. This figure shows a visualization of the value of a 2-dimensional gradient vector taken at the red point.
  • #60: In an actual logistic regression, the parameter vector w is much longer than 1 or 2, since the input feature vector x can be quite long, and we need a weight wi for each xi. For each dimension/variable wi in w (plus the bias b), the gradient will have a component that tells us the slope with respect to that variable. Essentially we’re asking: “How much would a small change in that variable wi influence the total loss function L?” In each dimension wi, we express the slope as a partial derivative of the loss function with respect to wi. The gradient is then defined as a vector of these partials.
  • #61: Here's a vector of gradients. We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
  • #62: In order to update θ, we need a definition for the gradient ∇L(f(x; θ), y). Recall that for logistic regression, the cross-entropy loss function is: L_CE(ŷ, y) = −[y log σ(w·x+b) + (1−y) log(1−σ(w·x+b))]. It turns out that the derivative of this function for one observation vector x is: ∂L_CE/∂wj = (σ(w·x+b) − y) xj. Note that the gradient with respect to a single weight wj represents a very intuitive value: the difference between the true y and our estimated ŷ = σ(w·x+b) for that observation, multiplied by the corresponding input value xj.
  • #63: Stochastic gradient descent is an online algorithm that minimizes the loss function by computing its gradient after each training example, and nudging θ in the right direction (the opposite direction of the gradient). The algorithm can terminate when it converges (or when the gradient norm < ε), or when progress halts (for example when the loss starts going up on a held-out set).
  • #64: The learning rate η is a hyperparameter that must be adjusted. If it’s too high, the learner will take steps that are too large, overshooting the minimum of the loss function. If it’s too low, the learner will take steps that are too small, and take too long to get to the minimum. It is common to start with a higher learning rate and then slowly decrease it. Hyperparameters are a special kind of parameter for any machine learning model. Unlike regular parameters of a model (weights like w and b), which are learned by the algorithm from the training set, hyperparameters are special parameters chosen by the algorithm designer that affect how the algorithm works.
  • #65: We've now introduced the important stochastic gradient descent algorithm. We'll give some more details in the next lecture.
  • #66: In this lecture we'll walk through an example of stochastic descent and give a few more details.
  • #67: Let’s walk though a single step of the gradient descent algorithm. We’ll use a simplified version of our sentiment classification example as it sees a single observation x, whose correct value is y = 1 (this is a positive review), and with only two features: x1  = 3 (count of positive lexicon words), and x2  = 2 (count of negative lexicon words). Let’s assume the initial weights and bias in θ_0 are all set to 0, and the initial learning rate η is 0.1:
  • #68: Here's our update equation for SGD. In order to do this update to theta, we'll need to know the gradient of the loss function. In our mini example there are three parameters, so the gradient vector has 3 dimensions, for w1, w2, and b. We can compute the first gradient as follows:
  • #76: Note that this observation x happened to be a positive example. We would expect that after seeing more negative examples with high counts of negative words, that the weight w2, the weight for the "negative lexicon feature', would shift to have a negative value.
  • #77: Stochastic gradient descent is called stochastic because it chooses a single random example at a time, moving the weights so as to improve performance on that single example. That can result in very choppy movements, so it's common to compute the gradient over batches of training instances rather than a single instance. For example in batch training we compute the gradient over the entire dataset. By seeing so many examples, batch training offers a superb estimate of which direction to move the weights, at the cost of spending a lot of time processing every single example in the training set to compute this perfect direction. A compromise is mini-batch training: we train on a group of m examples (perhaps 512, or 1024) that is less than the whole dataset. Mini-batch training also has the advantage of computational efficiency. The mini-batches can easily be vectorized, choosing the size of the mini-batch based on the computational resources. This allows us to process all the examples in one mini-batch in parallel and then accumulate the loss, something that's not possible with individual or batch training.
  • #78: We've now seen the stochastic gradient descent algorithm and discussed variants like mini-batch training.