Logistic Regression
Background: Generative and Discriminative Classifiers
Logistic Regression
Important analytic tool in the natural and social sciences
Baseline supervised machine learning tool for classification
Also the foundation of neural networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier
By contrast, logistic regression is a discriminative classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else
Finding the correct class c from a document d: generative vs. discriminative classifiers
Naive Bayes (generative) models the likelihood and the prior: ĉ = argmax_c P(d|c) P(c)
Logistic regression (discriminative) models the posterior P(c|d) directly: ĉ = argmax_c P(c|d)
Components of a probabilistic machine learning classifier
1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
Given m input/output pairs (x(i), y(i)):
The two phases of logistic regression
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability
Logistic Regression
Classification in Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Given a series of input/output pairs (x(i), y(i)):
For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression
• For feature xi, the weight wi tells us how important xi is
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
Logistic Regression for one observation x
Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias: z = (Σi wi xi) + b = w∙x + b
If this sum z is high, we say y=1; if low, then y=0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!
Solution: use a function of z that goes from 0 to 1
The very useful sigmoid or logistic function: σ(z) = 1/(1 + e^(−z))
It maps any real-valued z to the range (0,1): nearly linear around 0, squashing outliers toward 0 or 1.
Idea of logistic regression
We’ll compute w∙x+b
And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
P(y=1|x) = σ(w∙x+b)
P(y=0|x) = 1 − σ(w∙x+b) = σ(−(w∙x+b))
These two sum to 1, and the second equality holds because the sigmoid has the property 1 − σ(z) = σ(−z).
Turning a probability into a classifier
ŷ = 1 if P(y=1|x) > 0.5, otherwise ŷ = 0
0.5 here is called the decision boundary
[Figure: the probabilistic classifier — the sigmoid curve mapping w∙x+b (x-axis) to P(y=1) (y-axis)]
Turning a probability into a classifier, equivalently in terms of z (since P(y=1|x) > 0.5 exactly when w∙x+b > 0):
ŷ = 1 if w∙x+b > 0
ŷ = 0 if w∙x+b ≤ 0
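Putting the pieces together, here is a minimal sketch of this classifier in Python; the weights, bias, and feature values in the last line are hypothetical placeholders, not values from the slides:

```python
import math

def sigmoid(z):
    # Maps any real-valued z to the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # w and x are equal-length lists of floats; b is the scalar bias.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w.x + b
    p = sigmoid(z)                                 # P(y=1|x)
    return (1 if p > 0.5 else 0), p                # decision boundary at 0.5

label, prob = predict(w=[2.5, -5.0], b=0.1, x=[3.0, 2.0])  # hypothetical numbers
```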
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
We'll represent the input x (the review) by six features x1…x6, following the textbook's worked example:
◦ x1 = count of positive lexicon words = 3
◦ x2 = count of negative lexicon words = 2
◦ x3 = 1 if "no" is in the document, else 0 → 1
◦ x4 = count of 1st and 2nd person pronouns = 3
◦ x5 = 1 if "!" is in the document, else 0 → 0
◦ x6 = log(word count of the document) = ln(66) ≈ 4.19
Classifying sentiment for input x
Suppose we've already learned the weights w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7] and b = 0.1
Then z = w∙x + b ≈ 0.833, so P(+|x) = σ(0.833) ≈ 0.70 and P(−|x) ≈ 0.30
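A few lines of Python reproduce this computation, under the feature values and weights assumed above:

```python
import math

x = [3, 2, 1, 3, 0, math.log(66)]       # the six features; x6 = ln(66) ~ 4.19
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]    # the weights assumed above
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = 1.0 / (1.0 + math.exp(-z))      # P(+|x) = sigma(w.x + b)
print(round(z, 3), round(p_pos, 2))     # 0.833 0.7
```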
We can build features for logistic regression for any classification
task: period disambiguation
"This ends in a period." → end of sentence
"The house at 465 Main St. is new." → not end of sentence (the period is part of the abbreviation "St.")
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
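A hypothetical feature extractor for exactly these two features might look like the sketch below; everything beyond the two definitions above is illustrative:

```python
import math

def extract_features(review: str):
    words = review.lower().split()
    x1 = words.count("awesome")     # x1 = count("awesome")
    x2 = math.log(len(words))       # x2 = log(number of words in review)
    return [x1, x2]
```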
Logistic Regression
Learning: Cross-Entropy Loss
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y
We want to know how far the classifier output ŷ = σ(w∙x+b)
is from the true output y [= either 0 or 1]
We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss
= cross-entropy loss
A case of conditional maximum likelihood estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1) we can express the probability p(y|x) from our classifier (the thing we want to maximize) as
p(y|x) = ŷ^y (1−ŷ)^(1−y)
noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1−ŷ
Deriving cross-entropy loss for a single observation x
Now take the log of both sides (mathematically handy)
Whatever values maximize log p(y|x) will also maximize p(y|x)
Goal: maximize probability of the correct label p(y|x)
Maximize: log p(y|x) = log [ŷ^y (1−ŷ)^(1−y)]
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Deriving cross-entropy loss for a single observation x
Now flip the sign to turn this into a loss: something to minimize
Goal: maximize probability of the correct label p(y|x)
Maximize: log p(y|x) = y log ŷ + (1−y) log(1−ŷ)
Minimize the cross-entropy loss (so called because this is the formula for the cross-entropy between y and ŷ):
L_CE(ŷ, y) = −[y log ŷ + (1−y) log(1−ŷ)]
Or, plugging in the definition of ŷ:
L_CE(ŷ, y) = −[y log σ(w∙x+b) + (1−y) log(1−σ(w∙x+b))]
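A direct transcription of this loss into Python; the eps clipping is a standard numerical-safety habit (to keep log() away from 0), not part of the slide's math:

```python
import math

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # L_CE(y_hat, y) = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```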
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another
nice touch is the music . I was overcome with the urge to get off the couch
and start dancing . It sucked me in , and it'll do the same to you .
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing? P(+|x) = σ(w∙x+b) ≈ 0.70
Pretty well! What's the loss? L_CE = −log σ(w∙x+b) = −log(0.70) ≈ 0.36
Let's see if this works for our sentiment example
Suppose the true value instead was y=0.
What's the loss? L_CE = −log(1 − σ(w∙x+b)) = −log(0.30) ≈ 1.2
Let's see if this works for our sentiment example
The loss when the model was right (if true y=1), ≈ 0.36,
is lower than the loss when the model was wrong (if true y=0), ≈ 1.2:
Sure enough, the loss was bigger when the model was wrong!
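Using the cross_entropy_loss sketch from above on this example:

```python
p_pos = 0.70                            # sigma(w.x + b) from the earlier computation
print(cross_entropy_loss(p_pos, 1))     # ~0.36: true y=1, model right, low loss
print(cross_entropy_loss(p_pos, 0))     # ~1.20: true y=0, model wrong, high loss
```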
Logistic Regression
Stochastic Gradient Descent
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious
We want the weights that minimize the loss, averaged over all examples:
θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x(i); θ), y(i))
Intuition of gradient descent
How do I get to the bottom of this river canyon?
Look around me 360°
Find the direction of
steepest slope down
Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Here the slope at the current w is negative, so we'll move w in the positive direction
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
Gradient Descent: Find the gradient of the loss
function at the current point and move in the
opposite direction.
How much do we move in that direction?
• By the value of the gradient (the slope, in our example) weighted by a learning rate η
• A higher learning rate means we move w faster
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i tells
us the slope with respect to that variable.
◦ “How much would a small change in wi influence the total loss function L?”
◦ We express the slope as a partial derivative of the loss with respect to wi: ∂L/∂wi
The gradient is then defined as a vector of these
partials.
The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
∇_θ L(f(x; θ), y) = [∂L/∂w1, ∂L/∂w2, …, ∂L/∂wn, ∂L/∂b]
The final equation for updating θ based on the gradient is thus:
θ^(t+1) = θ^t − η ∇L(f(x; θ), y)
What are these partial derivatives for logistic regression?
The loss function:
L_CE(ŷ, y) = −[y log σ(w∙x+b) + (1−y) log(1−σ(w∙x+b))]
The elegant derivative of this function (see textbook 5.8 for the derivation):
∂L_CE(ŷ, y)/∂wj = (σ(w∙x+b) − y) xj = (ŷ − y) xj
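In code the gradient is just the scalar error (ŷ − y) scaled by each feature value, plus the bare error for the bias. A minimal sketch under those definitions:

```python
import math

def gradients(w, b, x, y):
    # dL/dwj = (sigma(w.x+b) - y) * xj ; dL/db = sigma(w.x+b) - y
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    err = y_hat - y
    return [err * xj for xj in x], err   # gradient w.r.t. each wj, and w.r.t. b
```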
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Logistic Regression
Stochastic Gradient Descent: An example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
The update step for θ is: θ^(t+1) = θ^t − η ∇L(f(x; θ), y)
where the gradient is the one we just derived: ∂L_CE/∂wj = (σ(w∙x+b) − y) xj
The gradient vector has 3 dimensions, for w1, w2, and b. With w1 = w2 = b = 0 and x1 = 3, x2 = 2:
∇ = [ (σ(0) − 1)·x1, (σ(0) − 1)·x2, σ(0) − 1 ]
  = [ (0.5 − 1)·3, (0.5 − 1)·2, 0.5 − 1 ]
  = [ −1.5, −1.0, −0.5 ]
Example of gradient descent
η = 0.1. Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient:
θ1 = θ0 − η·∇ = [0, 0, 0] − 0.1·[−1.5, −1.0, −0.5] = [0.15, 0.1, 0.05]
So after one step of gradient descent: w1 = 0.15, w2 = 0.1, b = 0.05
Note that enough negative examples would eventually make w2 negative
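Reusing the gradients sketch from above, a few lines reproduce this single update step:

```python
w, b = [0.0, 0.0], 0.0     # theta_0: w1 = w2 = b = 0
x, y = [3.0, 2.0], 1       # the mini-sentiment example, true y = 1
eta = 0.1

grad_w, grad_b = gradients(w, b, x, y)               # [-1.5, -1.0], -0.5
w = [wi - eta * gwi for wi, gwi in zip(w, grad_w)]   # ~[0.15, 0.1]
b = b - eta * grad_b                                 # 0.05
```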
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: compute the gradient over the entire dataset
Mini-batch training: compute the gradient over m examples (e.g., 512 or 1024)
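A sketch of one mini-batch update, averaging the per-example gradients; shuffling and the epoch loop are omitted, and gradients is the helper sketched earlier:

```python
def minibatch_step(w, b, batch, eta):
    # batch: list of (x, y) pairs; average their gradients, then update once.
    acc_w, acc_b = [0.0] * len(w), 0.0
    for x, y in batch:
        gw, gb = gradients(w, b, x, y)
        acc_w = [a + g for a, g in zip(acc_w, gw)]
        acc_b += gb
    n = len(batch)
    w = [wi - eta * a / n for wi, a in zip(w, acc_w)]
    b = b - eta * acc_b / n
    return w, b
```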
Logistic Regression
Regularization
Overfitting
A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight
◦ The model will then fail to generalize to a test set without this word
A good model should be able to generalize
Overfitting
Two training examples:
(+) "This movie drew me in, and it'll do the same to you."
(−) "I can't tell you how much I hated this movie. It sucked."
Useful or harmless features:
◦ x1 = "this", x2 = "movie", x3 = "hated"
4-gram features that just "memorize" the training set and might cause problems:
◦ x4 = "drew me in", x5 = "the same to you", x7 = "tell you how much"
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization
A solution for overfitting
Add a regularization term R(θ) to the objective (for now written as maximizing log probability rather than minimizing loss):
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α R(θ)
Idea: choose an R(θ) that penalizes large weights
◦ fitting the data well with lots of big weights is not as good as fitting the data a little less well with small weights
L2 Regularization (= ridge regression)
R(θ) = ||θ||₂² = Σ_{j=1..n} θj² — the sum of the squares of the weights
The name is because this is the (square of the) L2 norm ||θ||₂, the Euclidean distance of θ from the origin.
L2-regularized objective function:
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α Σ_{j=1..n} θj²
L1 Regularization (= lasso regression)
R(θ) = ||θ||₁ = Σ_{j=1..n} |θj| — the sum of the absolute values of the weights
Named after the L1 norm ||θ||₁: the sum of the absolute values of the weights, i.e. the Manhattan distance from the origin.
L1-regularized objective function:
θ̂ = argmax_θ [ Σ_{i=1..m} log P(y(i)|x(i)) ] − α Σ_{j=1..n} |θj|
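As a sketch, the two penalties are one-liners added onto the data loss; alpha (the regularization strength) is a hyperparameter, and theta here is the flat list of weights:

```python
def l2_penalty(theta, alpha):
    # ridge: alpha times the sum of squared weights
    return alpha * sum(t * t for t in theta)

def l1_penalty(theta, alpha):
    # lasso: alpha times the sum of absolute weights
    return alpha * sum(abs(t) for t in theta)

# A regularized objective just adds the penalty to the data loss:
# total_loss = data_loss + l2_penalty(theta, alpha)   # or l1_penalty(theta, alpha)
```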
Logistic Regression
Multinomial Logistic Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= softmax regression
= multinomial logit
= (defunct names: maximum entropy modeling or MaxEnt)
So "logistic regression" will just mean binary (2 output classes)
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
Need a generalization of the sigmoid called the softmax
◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
◦ each value in the range [0,1]
◦ all the values summing to 1
The softmax function
Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities:
softmax(zi) = exp(zi) / Σ_{j=1..k} exp(zj), for 1 ≤ i ≤ k
The denominator normalizes all the values into probabilities that sum to 1.
Softmax in multinomial logistic regression
The input is still a dot product between a weight vector w and the input vector x
But now we need a separate weight vector wc (and bias bc) for each of the K classes:
p(y = c | x) = exp(wc∙x + bc) / Σ_{j=1..K} exp(wj∙x + bj)
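A small sketch of softmax and the resulting per-class probabilities; subtracting max(z) before exponentiating is a standard numerical-stability trick, not something the slides discuss:

```python
import math

def softmax(z):
    m = max(z)                              # stability: exp() can't overflow
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]        # values in [0,1], summing to 1

def class_probs(ws, bs, x):
    # ws: one weight vector per class; bs: one bias per class.
    z = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(ws, bs)]
    return softmax(z)                       # p(y=c|x) for each class c
```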
Features in binary versus multinomial logistic regression
Binary: a positive weight is evidence for y=1, a negative weight is evidence for y=0
Multinomial: separate weights for each class, so the same feature gets a different weight per class
Editor's Notes
  • #1: In this lecture we talk about the difference between generative and discriminative classifiers, and the relationship between Naïve Bayes and Logistic Regression.
  • #10: We've now seen the high-level intuition of the components of logistic regression and its relationship to the other classifier we've learned about, Naïve Bayes
  • #11: In this lecture we'll see how to do classification with logistic regression and introduce the important sigmoid function
  • #15: The weight wi represents how important that input feature is to the classification decision, and can be positive (providing evidence that the instance being classified belongs in the positive class) or negative (providing evidence that the instance being classified belongs in the negative class). Thus we might expect in a sentiment task the word awesome to have a high positive weight, and abysmal to have a very negative weight.
  • #17: The bias term, also called the intercept, is another real number that’s added to the weighted inputs. In the rest of the book we’ll represent such sums using the dot product notation from linear algebra. The dot product of two vectors a and b, written as a · b is the sum of the products of the corresponding elements of each vector.
  • #19: nothing in Eq. 5.3 forces z to be a legal probability, that is, to lie between 0 and 1. In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.
  • #20: The sigmoid function σ(z) = 1/(1+e^(−z)) (so named because it looks like an s) is also called the logistic function. It takes a real value and maps it to the range [0, 1]. It is nearly linear around 0, but outlier values get squashed toward 0 or 1.
  • #22: We’re almost there. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows:
  • #23: The sigmoid function has the property 1 − σ(z) = σ(−z).
  • #27: We've seen how logistic regression uses the sigmoid function to take weighted features for an input example x and assign it to the class 1 or 0.
  • #28: Let's walk through an example using logistic regression to do sentiment classification
  • #29: Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class 1=positive or 0=negative to the following review
  • #30: We’ll represent each input observation by the 6 features x1...x6 of the input
  • #31: Let’s assume for the moment that we’ve already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are as follows. The weight w1, for example indicates how important a feature the number of positive lexicon words (great, nice, enjoyable, etc.) is to a positive sentiment decision, while w2 tells us the importance of negative lexicon words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative words are negatively associated with a positive sentiment decision, and are about twice as important as positive words.
  • #32: Given these 6 features and the input review x, we can compute P(+|x) and P(−|x)
  • #33: We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary (“Prof.”) and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties. For example a period following an upper case word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.
  • #35: We've now seen the details of how logistic regression can take feature values and weights to assign a class to an input.
  • #36: Let's now turn to learning the parameters for logistic regression. We'll start with the cross-entropy loss function
  • #37: Logistic regression is an instance of supervised classification in which we know the correct label y (either 0 or 1) for each observation x. But what the system produces is an estimate, ŷ. We want to learn parameters (meaning w and b) that make ŷ for each training observation as close as possible to the true y.
  • #38: This requires two components. The first is a metric for how close the current label (ŷ) is to the true gold label y. Rather than measure similarity, we usually talk about the opposite of this: the distance between the system output and the gold output, and we call this distance the loss function or the cost function. We'll introduce the loss function that is commonly used for logistic regression and also for neural networks, the cross-entropy loss. The second thing we need is an optimization algorithm for iteratively updating the weights so as to minimize this loss function. The standard algorithm for this is gradient descent; we'll introduce the stochastic gradient descent algorithm in the following section.
  • #39: We need a loss function that expresses, for an observation x, how close the classifier output (ŷ = σ(w·x+b)) is to the correct output (y, which is 0 or 1).
  • #40: We do this via a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters w,b that maximize the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
  • #41: Let’s derive this loss function, applied to a single observation x. We’d like to learn weights that maximize the probability of the correct label, that's p(y|x).
  • #46: By contrast, let’s pretend instead that the example was actually negative, i.e., the true y = 0 (perhaps the reviewer went on to say “But bottom line, the movie is terrible! I beg you not to see it!”) . In this case our model is confused, the model is wrong, so we’d want the loss to be higher. Let's plug in y = 0 and 1−σ(w·x+b) = .31
  • #48: We've derived the cross-entropy loss and seen how it applies in our sentiment example. Cross entropy loss is equally important for neural networks.
  • #49: In this lecture we introduce the stochastic gradient descent algorithm used for optimizing the weights for logistic regression and neural networks.
  • #50: Our goal with gradient descent is to find the optimal weights: minimize the loss function we've defined for the model. We'll explicitly represent the fact that the loss function L is parameterized by the weights, which we can refer to in machine learning in general as θ (in the case of logistic regression θ = w, b). And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious. So the goal is to find the set of weights which minimizes the loss function, averaged over all examples: θ̂ = argmin_θ (1/m) Σ_{i=1..m} L_CE(f(x(i); θ), y(i))
  • #51: How do we find the minimum of a loss function? Gradient descent is a method that finds a minimum of a function by figuring out in which direction (in the space of the parameters θ) the function’s slope is rising the most steeply, and moving in the opposite direction. The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.
  • #52: For logistic regression, this loss function is conveniently convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum. (By contrast, the loss for multi-layer neural networks is non-convex, and gradient descent may get stuck in local minima for neural network training and never find the global optimum.)
  • #53: Although the algorithm (and the concept of gradient) are designed for direction vectors, let’s first consider a visualization of the case where the parameter of our system is just a single scalar w. Given a random initialization of w at some value w1, and assuming the loss function L happened to have this shape, we need the algorithm to tell us whether at the next iteration we should move left (making w2 smaller than w1) or right (making w2 bigger than w1) to reach the minimum. The gradient descent algorithm answers this question by finding the gradient of the loss function at the current point and moving in the opposite direction.
  • #54: The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function. The gradient is a multi-variable generalization of the slope, so for a function of one variable like the one in this figure, we can informally think of the gradient as the slope. The dotted line shows the slope of this hypothetical loss function at point w = w1 . You can see that the slope of this dotted line is negative.
  • #55: Thus to find the minimum, gradient descent tells us to go in the opposite direction: moving w in a positive direction.
  • #57: The magnitude of the amount to move in gradient descent is the value of the slope of the loss function (with respect to w) weighted by a learning rate η . A higher (faster) learning rate means that we should move w more on each step. The change we make in our parameter is the learning rate times the gradient (or the slope, in our single-variable example)
  • #58: Now let’s extend the intuition from a function of one scalar variable w to many variables, because we don’t just want to move left or right, we want to know where in the N-dimensional space (of the N parameters that make up θ ) we should move. The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of those N dimensions.
  • #59: If we’re just imagining two weight dimensions (say for one weight w and one bias b), the gradient might be a vector with two orthogonal components, each of which tells us how much the ground slopes in the w dimension and in the b dimension. This figure shows a visualization of the value of a 2-dimensional gradient vector taken at the red point.
  • #60: In an actual logistic regression, the parameter vector w is much longer than 1 or 2, since the input feature vector x can be quite long, and we need a weight wi for each xi. For each dimension/variable wi in w (plus the bias b), the gradient will have a component that tells us the slope with respect to that variable. Essentially we’re asking: “How much would a small change in that variable wi influence the total loss function L?” In each dimension wi, we express the slope as a partial derivative of the loss function with respect to wi. The gradient is then defined as a vector of these partials.
  • #61: Here's a vector of gradients. We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
  • #62: In order to update θ, we need a definition for the gradient ∇L(f(x; θ), y). Recall that for logistic regression, the cross-entropy loss function is: L_CE(ŷ, y) = −[y log σ(w·x+b) + (1−y) log(1−σ(w·x+b))]. It turns out that the derivative of this function for one observation vector x is: ∂L_CE/∂wj = (σ(w·x+b) − y) xj. Note that the gradient with respect to a single weight wj represents a very intuitive value: the difference between the true y and our estimated ŷ = σ(w·x+b) for that observation, multiplied by the corresponding input value xj.
  • #63: Stochastic gradient descent is an online algorithm that minimizes the loss function by computing its gradient after each training example, and nudging θ in the right direction (the opposite direction of the gradient). The algorithm can terminate when it converges (or when the gradient norm < ε), or when progress halts (for example when the loss starts going up on a held-out set).
  • #64: The learning rate η is a hyperparameter that must be adjusted. If it’s too high, the learner will take steps that are too large, overshooting the minimum of the loss function. If it’s too low, the learner will take steps that are too small, and take too long to get to the minimum. It is common to start with a higher learning rate and then slowly decrease it. Hyperparameters are a special kind of parameter for any machine learning model. Unlike regular parameters of a model (weights like w and b), which are learned by the algorithm from the training set, hyperparameters are special parameters chosen by the algorithm designer that affect how the algorithm works.
  • #65: We've now introduced the important stochastic gradient descent algorithm. We'll give some more details in the next lecture.
  • #66: In this lecture we'll walk through an example of stochastic descent and give a few more details.
  • #67: Let’s walk though a single step of the gradient descent algorithm. We’ll use a simplified version of our sentiment classification example as it sees a single observation x, whose correct value is y = 1 (this is a positive review), and with only two features: x1  = 3 (count of positive lexicon words), and x2  = 2 (count of negative lexicon words). Let’s assume the initial weights and bias in θ_0 are all set to 0, and the initial learning rate η is 0.1:
  • #68: Here's our update equation for SGD. In order to do this update to theta, we'll need to know the gradient of the loss function. In our mini example there are three parameters, so the gradient vector has 3 dimensions, for w1, w2, and b. We can compute the first gradient as follows:
  • #76: Note that this observation x happened to be a positive example. We would expect that after seeing more negative examples with high counts of negative words, that the weight w2, the weight for the "negative lexicon feature', would shift to have a negative value.
  • #77: Stochastic gradient descent is called stochastic because it chooses a single random example at a time, moving the weights so as to improve performance on that single example. That can result in very choppy movements, so it's common to compute the gradient over batches of training instances rather than a single instance. For example in batch training we compute the gradient over the entire dataset. By seeing so many examples, batch training offers a superb estimate of which direction to move the weights, at the cost of spending a lot of time processing every single example in the training set to compute this perfect direction. A compromise is mini-batch training: we train on a group of m examples (perhaps 512, or 1024) that is less than the whole dataset. Mini-batch training also has the advantage of computational efficiency. The mini-batches can easily be vectorized, choosing the size of the mini-batch based on the computational resources. This allows us to process all the examples in one mini-batch in parallel and then accumulate the loss, something that's not possible with individual or batch training.
  • #78: We've now seen the stochastic gradient descent algorithm and discussed variants like mini-batch training.