2. Where are we?
We have seen the following ideas
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
3. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
5. Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℜⁿ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
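To make this setup concrete, here is a tiny synthetic example of how S might be represented in code; the array names X and y and the toy numbers are illustrative, not part of the lecture.

```python
import numpy as np

# A toy training set S = {(x_i, y_i)} with m = 4 examples and n = 2 features.
# Rows of X are the feature vectors x_i; y holds the labels in {-1, +1}.
X = np.array([[ 1.0,  2.0],
              [ 2.0,  0.5],
              [-1.0, -1.5],
              [-0.5, -2.0]])
y = np.array([+1, +1, -1, -1])

m, n = X.shape  # m examples, n features
```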
6. Classification, but…
The output 𝑦 is discrete: Either −1 or +1
Instead of predicting a label, let us try to predict P(𝑦 = +1 ∣ 𝐱)
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: ℜⁿ → {−1, +1}
• Modified problem: ℜⁿ → [0, 1]
• Effectively, this turns the problem into a regression problem
Many hypothesis spaces are possible
10. The Sigmoid function
The hypothesis space for logistic regression: all functions of the form
$\mathbf{x} \mapsto \sigma(\mathbf{w}^T\mathbf{x})$
That is, a linear function composed with the sigmoid function (the logistic function), defined as
$\sigma(z) = \frac{1}{1 + \exp(-z)}$
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later
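As a quick illustration of this hypothesis, here is a minimal NumPy sketch; the names sigmoid and w are illustrative. It also answers the question numerically: σ accepts any real number and always returns a value strictly between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A logistic-regression hypothesis: a linear score passed through the sigmoid.
w = np.array([0.8, -0.3])
x = np.array([1.0, 2.0])
p_plus = sigmoid(w @ x)   # models P(y = +1 | x, w)
print(p_plus)             # a value in (0, 1)
```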
18. Predicting probabilities
According to the logistic regression model, we have
$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Or equivalently, for a label $y \in \{-1, +1\}$,
$P(y \mid \mathbf{x}, \mathbf{w}) = \sigma(y\,\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$
Note that we are directly modeling 𝑃(𝑦 ∣ 𝐱) rather than 𝑃(𝐱 ∣ 𝑦) and 𝑃(𝑦)
20. Predicting a label with logistic regression
• Compute 𝑃(𝑦 = +1 ∣ 𝐱; 𝐰)
• If this is greater than half, predict +1; else predict −1
– What does this correspond to in terms of 𝐰ᵀ𝐱?
– Prediction = sgn(𝐰ᵀ𝐱)
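A small sketch of this decision rule, checking on random inputs that thresholding σ(𝐰ᵀ𝐱) at 1/2 agrees with sgn(𝐰ᵀ𝐱), since σ(z) > 1/2 exactly when z > 0; the code is illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=3)

for _ in range(5):
    x = rng.normal(size=3)
    score = w @ x
    prob_rule = +1 if sigmoid(score) > 0.5 else -1   # threshold P(y = +1 | x; w) at 1/2
    sign_rule = +1 if score > 0 else -1              # sgn(w^T x)
    assert prob_rule == sign_rule
```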
21. This lecture
• Logistic regression
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a posteriori estimation
• Back to loss minimization
22. Maximum likelihood estimation
Let’s address the problem of learning
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and identically distributed (i.i.d.)
– How do we proceed?
23. Maximum likelihood estimation
The usual trick: Convert products to sums by taking the log
Recall that this works only because log is an increasing function and the maximizer will not change
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
25. Maximum likelihood estimation
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
But (by definition) we know that
$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \sigma(y_i\,\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)}$
28. Maximum likelihood estimation
Substituting the definition of $P(y_i \mid \mathbf{x}_i, \mathbf{w})$, maximizing the log-likelihood is equivalent to solving
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
Equivalent to: Training a linear classifier by minimizing the logistic loss.
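A short NumPy sketch of this equivalence: the log-likelihood of the training set is exactly the negative of the summed logistic losses. The helper names and the toy numbers are illustrative.

```python
import numpy as np

def log_likelihood(w, X, y):
    """sum_i log P(y_i | x_i, w) under the logistic model."""
    margins = y * (X @ w)                        # y_i * w^T x_i for each example
    return np.sum(-np.log1p(np.exp(-margins)))   # sum_i log sigma(y_i w^T x_i)

def logistic_loss(w, X, y):
    """sum_i log(1 + exp(-y_i w^T x_i)): the logistic loss over the training set."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

# Maximizing the log-likelihood is exactly minimizing the logistic loss:
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])
y = np.array([+1, +1, -1])
w = np.array([0.5, -0.25])
assert np.isclose(log_likelihood(w, X, y), -logistic_loss(w, X, y))
```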
29. Maximum a posteriori estimation
We could also add a prior on the weights
Suppose each weight in the weight vector is drawn
independently from the normal distribution with zero
mean and standard deviation 𝜎
$p(\mathbf{w}) = \prod_{j=1}^{n} p(w_j) = \prod_{j=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(\frac{-w_j^2}{\sigma^2}\right)$
31. MAP estimation for logistic regression
Let us work through this procedure again to see what changes from maximum likelihood estimation
What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data)
32. MAP estimation for logistic regression
What is the goal of MAP estimation?
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):
$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\, P(\mathbf{w})$
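The proportionality is just Bayes' rule; because the denominator P(S) does not depend on 𝐰, it can be dropped when maximizing over 𝐰:

```latex
P(\mathbf{w} \mid S) \;=\; \frac{P(S \mid \mathbf{w})\, P(\mathbf{w})}{P(S)}
\qquad\Longrightarrow\qquad
\operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid S)
\;=\; \operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w})\, P(\mathbf{w})
```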
35. MAP estimation for logistic regression
Learning by solving
$\operatorname*{argmax}_{\mathbf{w}} \; P(S \mid \mathbf{w})\, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
We have already expanded out the first term:
$\log P(S \mid \mathbf{w}) = \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
36. MAP estimation for logistic regression
Expand the log prior:
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;+\; \sum_{j=1}^{n} \frac{-w_j^2}{\sigma^2} \;+\; \text{constants}$
39. MAP estimation for logistic regression
The sum over the weights simplifies, giving
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;-\; \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$
Maximizing the negative of a function is the same as minimizing the function
42. Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to
solving
$\min_{\mathbf{w}} \; \sum_{i=1}^{m} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;+\; \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$
Where have we seen this before?
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this. (A sketch follows below.)
Other training algorithms exist: for example, LBFGS, a quasi-Newton method. But gradient-based methods like SGD and its variants are far more commonly used.
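One possible answer to the exercise, as a minimal sketch rather than a reference solution: SGD on the objective above, where each step uses one example's logistic-loss gradient plus a 1/m share of the regularizer's gradient. The function name, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, sigma2=10.0, lr=0.1, epochs=100, seed=0):
    """SGD for min_w sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w.

    Each step uses one example's logistic-loss gradient plus a 1/m share of
    the regularizer's gradient, so one pass over the data roughly tracks one
    full-gradient step.
    """
    m, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (w @ X[i])
            grad_loss = -y[i] * X[i] * sigmoid(-margin)  # d/dw log(1 + exp(-y_i w^T x_i))
            grad_reg = (2.0 / (sigma2 * m)) * w          # 1/m share of d/dw (1/sigma2) w^T w
            w -= lr * (grad_loss + grad_reg)
    return w
```

For example, `w = sgd_logistic(X, y)` can be run on the toy training set sketched earlier, after which label predictions are `np.sign(X @ w)`.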
43. Logistic regression is…
• A classifier that predicts the probability that the label is
+1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via MAP or
MLE estimation
• A discriminative classifier that minimizes the logistic loss
over the training set
44. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
45. Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
– But the distribution D is unknown
• Instead, minimize empirical loss on the training set
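In symbols, a sketch of the two objectives being contrasted here; the minimizer labels h* and ĥ are just names introduced for this sketch:

```latex
% Ideal: minimize expected loss over the (unknown) distribution D
h^{*} = \operatorname*{argmin}_{h \in H} \; \mathbb{E}_{\mathbf{x} \sim D}\!\left[ L\big(h(\mathbf{x}), f(\mathbf{x})\big) \right]

% Instead: minimize empirical loss over the m training examples
\hat{h} = \operatorname*{argmin}_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big)
```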
47. Empirical loss minimization
Learning = minimize empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses
• This is achieved using a regularizer, which penalizes complex hypotheses
48. Regularized loss minimization
• Learning: regularized loss minimization
• With linear classifiers: using ℓ₂ regularization (a sketch of both objectives follows below)
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data
• What is the ideal loss function for classification?
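The two learning objectives referred to above can be written out as follows; this is a plausible sketch consistent with the regularized logistic regression objective derived earlier, with a trade-off parameter λ playing the role of 1/σ²:

```latex
% Learning: regularized empirical loss minimization over a hypothesis space H
\min_{h \in H} \; \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big) \;+\; \lambda\,\mathrm{regularizer}(h)

% With linear classifiers and l2 regularization
\min_{\mathbf{w}} \; \sum_{i=1}^{m} L\big(y_i, \mathbf{w}^T\mathbf{x}_i\big) \;+\; \lambda\,\mathbf{w}^T\mathbf{w}
```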
49. The 0-1 loss
Penalize classification mistakes between true label y and
prediction y’
• For linear classifiers, the prediction is y′ = sgn(𝐰ᵀ𝐱)
– Mistake if 𝑦 𝐰ᵀ𝐱 ≤ 0
Minimizing the 0-1 loss is intractable. We need surrogates
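A sketch of the 0-1 loss for a linear classifier, written in terms of the margin 𝑦 𝐰ᵀ𝐱; the function name is illustrative.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Counts mistakes: 1 whenever y_i * w^T x_i <= 0, else 0."""
    margins = y * (X @ w)
    return np.sum(margins <= 0)

# The count is piecewise constant in w (its gradient is 0 almost everywhere),
# which is one way to see why direct minimization is intractable and
# surrogate losses are used instead.
```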
50. The loss function zoo
Many loss functions exist
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
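For comparison, a sketch of these surrogates written as functions of the margin z = 𝑦 𝐰ᵀ𝐱; exact parameterizations vary across sources, and these are the common forms.

```python
import numpy as np

z = np.linspace(-3, 3, 7)                # margin values y * w^T x

perceptron  = np.maximum(0.0, -z)        # perceptron loss
hinge       = np.maximum(0.0, 1.0 - z)   # hinge loss (SVM)
exponential = np.exp(-z)                 # exponential loss (AdaBoost)
logistic    = np.log1p(np.exp(-z))       # logistic loss (logistic regression)
zero_one    = (z <= 0).astype(float)     # 0-1 loss, for reference
```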
59. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
• Connection to Naïve Bayes
60. Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function
$\log\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
Here, the P’s represent the naïve Bayes posterior distribution, and 𝐰 can be used to calculate the priors and the likelihoods.
That is, 𝑃(𝑦 = 1 ∣ 𝐰, 𝐱) is computed using 𝑃(𝐱 ∣ 𝑦 = 1, 𝐰) and 𝑃(𝑦 = 1 ∣ 𝐰)
62. Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$
Substituting in the above expression, we get
$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Exercise: Show this formally
63. Naïve Bayes and Logistic regression
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
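One way to carry out the exercise from the slide above, as a sketch: exponentiate the log-odds and use the fact that the two posterior probabilities sum to one.

```latex
\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})} = e^{\mathbf{w}^T\mathbf{x}}
\;\;\Longrightarrow\;\;
P(y = +1 \mid \mathbf{x}, \mathbf{w})
= \frac{e^{\mathbf{w}^T\mathbf{x}}}{1 + e^{\mathbf{w}^T\mathbf{x}}}
= \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}
= \sigma(\mathbf{w}^T\mathbf{x})
```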