2. Where are we?
We have seen the following ideas
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
3. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
5. Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℜⁿ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
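To make this setup concrete, here is a tiny synthetic example of how S might be represented in code; the array names X and y and the toy numbers are illustrative, not part of the lecture.

```python
import numpy as np

# A toy training set S = {(x_i, y_i)} with m = 4 examples and n = 2 features.
# Rows of X are the feature vectors x_i; y holds the labels in {-1, +1}.
X = np.array([[ 1.0,  2.0],
              [ 2.0,  0.5],
              [-1.0, -1.5],
              [-0.5, -2.0]])
y = np.array([+1, +1, -1, -1])

m, n = X.shape  # m examples, n features
```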
6. Classification, but…
The output 𝑦 is discrete: Either −1 or +1
Instead of predicting a label, let us try to predict P(𝑦 = +1 ∣ 𝐱)
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: ℜⁿ → {−1, +1}
• Modified problem: ℜⁿ → [0, 1]
• Effectively, this turns the problem into a regression problem
Many hypothesis spaces are possible
10. The Sigmoid function
The hypothesis space for logistic regression: all functions of the form
$\mathbf{x} \mapsto \sigma(\mathbf{w}^T\mathbf{x})$
That is, a linear function composed with the sigmoid function (the logistic function), defined as
$\sigma(z) = \frac{1}{1 + \exp(-z)}$
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later
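As a quick illustration of this hypothesis, here is a minimal NumPy sketch; the names sigmoid and w are illustrative. It also answers the question numerically: σ accepts any real number and always returns a value strictly between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A logistic-regression hypothesis: a linear score passed through the sigmoid.
w = np.array([0.8, -0.3])
x = np.array([1.0, 2.0])
p_plus = sigmoid(w @ x)   # models P(y = +1 | x, w)
print(p_plus)             # a value in (0, 1)
```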
18. Predicting probabilities
According to the logistic regression model, we have
$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Or equivalently, for a label $y \in \{-1, +1\}$,
$P(y \mid \mathbf{x}, \mathbf{w}) = \sigma(y\,\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$
Note that we are directly modeling 𝑃(𝑦 ∣ 𝐱) rather than 𝑃(𝐱 ∣ 𝑦) and 𝑃(𝑦)
20. Predicting a label with logistic regression
• Compute 𝑃(𝑦 = +1 ∣ 𝐱; 𝐰)
• If this is greater than half, predict +1; else predict −1
– What does this correspond to in terms of 𝐰ᵀ𝐱?
– Prediction = sgn(𝐰ᵀ𝐱)
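A small sketch of this decision rule, checking on random inputs that thresholding σ(𝐰ᵀ𝐱) at 1/2 agrees with sgn(𝐰ᵀ𝐱), since σ(z) > 1/2 exactly when z > 0; the code is illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=3)

for _ in range(5):
    x = rng.normal(size=3)
    score = w @ x
    prob_rule = +1 if sigmoid(score) > 0.5 else -1   # threshold P(y = +1 | x; w) at 1/2
    sign_rule = +1 if score > 0 else -1              # sgn(w^T x)
    assert prob_rule == sign_rule
```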
21. This lecture
• Logistic regression
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a posteriori estimation
• Back to loss minimization
22. Maximum likelihood estimation
Let’s address the problem of learning
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and identically distributed (i.i.d.)
– How do we proceed?
23. Maximum likelihood estimation
The usual trick: Convert products to sums by taking the log
Recall that this works only because log is an increasing function and the maximizer will not change
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
25. Maximum likelihood estimation
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w}) \;=\; \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
But (by definition) we know that
$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \sigma(y_i\,\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)}$
28. Maximum likelihood estimation
Substituting the definition of $P(y_i \mid \mathbf{x}_i, \mathbf{w})$, maximizing the log-likelihood is equivalent to solving
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
Equivalent to: Training a linear classifier by minimizing the logistic loss.
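A short NumPy sketch of this equivalence: the log-likelihood of the training set is exactly the negative of the summed logistic losses. The helper names and the toy numbers are illustrative.

```python
import numpy as np

def log_likelihood(w, X, y):
    """sum_i log P(y_i | x_i, w) under the logistic model."""
    margins = y * (X @ w)                        # y_i * w^T x_i for each example
    return np.sum(-np.log1p(np.exp(-margins)))   # sum_i log sigma(y_i w^T x_i)

def logistic_loss(w, X, y):
    """sum_i log(1 + exp(-y_i w^T x_i)): the logistic loss over the training set."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

# Maximizing the log-likelihood is exactly minimizing the logistic loss:
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])
y = np.array([+1, +1, -1])
w = np.array([0.5, -0.25])
assert np.isclose(log_likelihood(w, X, y), -logistic_loss(w, X, y))
```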
29. Maximum a posteriori estimation
We could also add a prior on the weights
Suppose each weight in the weight vector is drawn
independently from the normal distribution with zero
mean and standard deviation 𝜎
$p(\mathbf{w}) = \prod_{j=1}^{n} p(w_j) = \prod_{j=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(\frac{-w_j^2}{\sigma^2}\right)$
31. MAP estimation for logistic regression
Let us work through this procedure again to see what changes from maximum likelihood estimation
What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data)
32. MAP estimation for logistic regression
What is the goal of MAP estimation?
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):
$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\, P(\mathbf{w})$
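The proportionality is just Bayes' rule; because the denominator P(S) does not depend on 𝐰, it can be dropped when maximizing over 𝐰:

```latex
P(\mathbf{w} \mid S) \;=\; \frac{P(S \mid \mathbf{w})\, P(\mathbf{w})}{P(S)}
\qquad\Longrightarrow\qquad
\operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid S)
\;=\; \operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w})\, P(\mathbf{w})
```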
35. MAP estimation for logistic regression
Learning by solving
$\operatorname*{argmax}_{\mathbf{w}} \; P(S \mid \mathbf{w})\, P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
We have already expanded out the first term:
$\log P(S \mid \mathbf{w}) = \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
36. MAP estimation for logistic regression
Expand the log prior:
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;+\; \sum_{j=1}^{n} \frac{-w_j^2}{\sigma^2} \;+\; \text{constants}$
39. MAP estimation for logistic regression
The sum over the weights simplifies, giving
$\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;-\; \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$
Maximizing the negative of a function is the same as minimizing the function
42. Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to
solving
$\min_{\mathbf{w}} \; \sum_{i=1}^{m} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) \;+\; \frac{1}{\sigma^2}\,\mathbf{w}^T\mathbf{w}$
Where have we seen this before?
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this. (A sketch follows below.)
Other training algorithms exist: for example, LBFGS, a quasi-Newton method. But gradient-based methods like SGD and its variants are far more commonly used.
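One possible answer to the exercise, as a minimal sketch rather than a reference solution: SGD on the objective above, where each step uses one example's logistic-loss gradient plus a 1/m share of the regularizer's gradient. The function name, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, sigma2=10.0, lr=0.1, epochs=100, seed=0):
    """SGD for min_w sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w.

    Each step uses one example's logistic-loss gradient plus a 1/m share of
    the regularizer's gradient, so one pass over the data roughly tracks one
    full-gradient step.
    """
    m, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (w @ X[i])
            grad_loss = -y[i] * X[i] * sigmoid(-margin)  # d/dw log(1 + exp(-y_i w^T x_i))
            grad_reg = (2.0 / (sigma2 * m)) * w          # 1/m share of d/dw (1/sigma2) w^T w
            w -= lr * (grad_loss + grad_reg)
    return w
```

For example, `w = sgd_logistic(X, y)` can be run on the toy training set sketched earlier, after which label predictions are `np.sign(X @ w)`.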
43. Logistic regression is…
• A classifier that predicts the probability that the label is
+1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via MAP or
MLE estimation
• A discriminative classifier that minimizes the logistic loss
over the training set
44. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
45. Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
– But the distribution D is unknown
• Instead, minimize empirical loss on the training set
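In symbols, a sketch of the two objectives being contrasted here; the minimizer labels h* and ĥ are just names introduced for this sketch:

```latex
% Ideal: minimize expected loss over the (unknown) distribution D
h^{*} = \operatorname*{argmin}_{h \in H} \; \mathbb{E}_{\mathbf{x} \sim D}\!\left[ L\big(h(\mathbf{x}), f(\mathbf{x})\big) \right]

% Instead: minimize empirical loss over the m training examples
\hat{h} = \operatorname*{argmin}_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big)
```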
47. Empirical loss minimization
Learning = minimize empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses
• This is achieved using a regularizer, which penalizes complex hypotheses
48. Regularized loss minimization
• Learning: regularized loss minimization
• With linear classifiers: using ℓ₂ regularization (a sketch of both objectives follows below)
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data
• What is the ideal loss function for classification?
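The two learning objectives referred to above can be written out as follows; this is a plausible sketch consistent with the regularized logistic regression objective derived earlier, with a trade-off parameter λ playing the role of 1/σ²:

```latex
% Learning: regularized empirical loss minimization over a hypothesis space H
\min_{h \in H} \; \sum_{i=1}^{m} L\big(h(\mathbf{x}_i), y_i\big) \;+\; \lambda\,\mathrm{regularizer}(h)

% With linear classifiers and l2 regularization
\min_{\mathbf{w}} \; \sum_{i=1}^{m} L\big(y_i, \mathbf{w}^T\mathbf{x}_i\big) \;+\; \lambda\,\mathbf{w}^T\mathbf{w}
```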
49. The 0-1 loss
Penalize classification mistakes between true label y and
prediction y’
• For linear classifiers, the prediction is y′ = sgn(𝐰ᵀ𝐱)
– Mistake if 𝑦 𝐰ᵀ𝐱 ≤ 0
Minimizing the 0-1 loss is intractable. We need surrogates
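A sketch of the 0-1 loss for a linear classifier, written in terms of the margin 𝑦 𝐰ᵀ𝐱; the function name is illustrative.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Counts mistakes: 1 whenever y_i * w^T x_i <= 0, else 0."""
    margins = y * (X @ w)
    return np.sum(margins <= 0)

# The count is piecewise constant in w (its gradient is 0 almost everywhere),
# which is one way to see why direct minimization is intractable and
# surrogate losses are used instead.
```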
50. The loss function zoo
Many loss functions exist
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
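For comparison, a sketch of these surrogates written as functions of the margin z = 𝑦 𝐰ᵀ𝐱; exact parameterizations vary across sources, and these are the common forms.

```python
import numpy as np

z = np.linspace(-3, 3, 7)                # margin values y * w^T x

perceptron  = np.maximum(0.0, -z)        # perceptron loss
hinge       = np.maximum(0.0, 1.0 - z)   # hinge loss (SVM)
exponential = np.exp(-z)                 # exponential loss (AdaBoost)
logistic    = np.log1p(np.exp(-z))       # logistic loss (logistic regression)
zero_one    = (z <= 0).astype(float)     # 0-1 loss, for reference
```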
59. This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
• Connection to Naïve Bayes
60. Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function
$\log\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
Here, the P’s represent the naïve Bayes posterior distribution, and 𝐰 can be used to calculate the priors and the likelihoods.
That is, 𝑃(𝑦 = 1 ∣ 𝐰, 𝐱) is computed using 𝑃(𝐱 ∣ 𝑦 = 1, 𝐰) and 𝑃(𝑦 = 1 ∣ 𝐰)
62. Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T\mathbf{x}$
But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$
Substituting in the above expression, we get
$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Exercise: Show this formally
63. Naïve Bayes and Logistic regression
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
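One way to carry out the exercise from the slide above, as a sketch: exponentiate the log-odds and use the fact that the two posterior probabilities sum to one.

```latex
\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})} = e^{\mathbf{w}^T\mathbf{x}}
\;\;\Longrightarrow\;\;
P(y = +1 \mid \mathbf{x}, \mathbf{w})
= \frac{e^{\mathbf{w}^T\mathbf{x}}}{1 + e^{\mathbf{w}^T\mathbf{x}}}
= \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}
= \sigma(\mathbf{w}^T\mathbf{x})
```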