Machine Learning
Logistic Regression
1
Where are we?
We have seen the following ideas
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
2
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
3
Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℜᵈ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
5
Classification, but…
The output 𝑦 is discrete: Either −1 or +1
Instead of predicting a label, let us try to predict P(𝑦 = +1 ∣ 𝐱)
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: ℜᵈ → {−1, +1}
• Modified problem: ℜᵈ → [0, 1]
• Effectively, this makes the problem a regression problem
Many hypothesis spaces possible
6
The Sigmoid function
The hypothesis space for logistic regression: All functions of the form
$$\sigma(\mathbf{w}^T \mathbf{x})$$
That is, a linear function, composed with a sigmoid function (the logistic function), defined as
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later
10
The Sigmoid function
[Plot: 𝜎(z) as a function of z]
11
The Sigmoid function
What is its derivative with respect to z?
$$\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$
13
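As a quick numerical check of the derivative identity above, here is a minimal sketch (assuming NumPy; the helper names sigmoid and sigmoid_grad are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 11)
eps = 1e-6
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(finite_diff - sigmoid_grad(z))))  # tiny (about 1e-10): the identity holds
```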
Predicting probabilities
According to the logistic regression model, we have
$$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
Or equivalently
$$P(y \mid \mathbf{x}, \mathbf{w}) = \sigma(y\,\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T \mathbf{x})}$$
Note that we are directly modeling 𝑃(𝑦 ∣ 𝐱) rather than 𝑃(𝐱 ∣ 𝑦) and 𝑃(𝑦)
18
Predicting a label with logistic regression
• Compute 𝑃(𝑦 = +1 ∣ 𝐱; 𝐰)
• If this is greater than half, predict +1; else predict −1
– What does this correspond to in terms of 𝐰ᵀ𝐱?
– Prediction = sgn(𝐰ᵀ𝐱)
20
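A minimal sketch of this prediction rule (assuming NumPy; the weight vector and example below are made-up values for illustration):

```python
import numpy as np

def predict_proba(w, x):
    """P(y = +1 | x; w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    """Threshold the probability at 1/2, which is the same as sgn(w^T x)."""
    return +1 if predict_proba(w, x) > 0.5 else -1

w = np.array([0.5, -1.0, 2.0])   # hypothetical learned weights
x = np.array([1.0, 0.2, 0.3])    # hypothetical feature vector
print(predict_proba(w, x), predict_label(w, x))
```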
This lecture
• Logistic regression
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a Posteriori estimation
• Back to loss minimization
21
Maximum likelihood estimation
Let’s address the problem of learning
• Training data
– S = {(𝐱ᵢ, 𝑦ᵢ)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and identically distributed (i.i.d.)
– How do we proceed?
22
Maximum likelihood estimation
23
The usual trick: Convert products to sums by taking log
Recall that this works only because log is an increasing
function and the maximizer will not change
$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
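As a small sanity check of the product-to-sum step (assuming NumPy; the per-example likelihood values below are made up): the log of the product equals the sum of the logs, so the maximizer is unchanged, and the sum form is also numerically safer when there are many examples.

```python
import numpy as np

p = np.array([0.9, 0.8, 0.95, 0.7])   # hypothetical per-example likelihoods P(y_i | x_i, w)
print(np.log(np.prod(p)))             # log of the product
print(np.sum(np.log(p)))              # sum of the logs -- the same value
```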
Maximum likelihood estimation
24
$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
Equivalent to solving
$$\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
Maximum likelihood estimation
25
$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
$$\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
But (by definition) we know that
$$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \sigma(y_i \mathbf{w}^T \mathbf{x}_i) = \frac{1}{1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)}$$
Maximum likelihood estimation
$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
$$\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
$$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \frac{1}{1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)}$$
Equivalent to solving
$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right)$$
The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
Equivalent to: Training a linear classifier by minimizing the logistic loss.
28
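As a minimal sketch of this equivalence (assuming NumPy; the data below are made-up values): the log-likelihood Σᵢ log σ(yᵢ 𝐰ᵀ𝐱ᵢ) is exactly the negative of the summed logistic loss Σᵢ log(1 + exp(−yᵢ 𝐰ᵀ𝐱ᵢ)), so maximizing one minimizes the other.

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum over examples of log P(y_i | x_i, w) = log sigma(y_i * w^T x_i)."""
    margins = y * (X @ w)
    return np.sum(-np.log1p(np.exp(-margins)))

def logistic_loss(w, X, y):
    """Sum over examples of log(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])  # hypothetical features
y = np.array([+1, -1, +1])                            # labels in {-1, +1}
w = np.array([0.4, -0.1])                             # some weight vector
print(log_likelihood(w, X, y), -logistic_loss(w, X, y))  # identical values
```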
Maximum a posteriori estimation
We could also add a prior on the weights
Suppose each weight in the weight vector is drawn
independently from the normal distribution with zero
mean and standard deviation 𝜎
$$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-w_j^2}{\sigma^2}\right)$$
29
MAP estimation for logistic regression
Let us work through this procedure again to see what changes from maximum likelihood estimation
$$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-w_j^2}{\sigma^2}\right)$$
What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data)
To maximize the posterior probability of the model given the data (i.e. to find the most probable model, given the data):
$$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\, P(\mathbf{w})$$
32
MAP estimation for logistic regression
Learning by solving
$$\arg\max_{\mathbf{w}} P(\mathbf{w} \mid S) = \arg\max_{\mathbf{w}} P(S \mid \mathbf{w})\, P(\mathbf{w})$$
Take log to simplify
$$\max_{\mathbf{w}} \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$$
We have already expanded out the first term:
$$\sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right)$$
Expand the log prior to get
$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \sum_{j=1}^{d} \frac{-w_j^2}{\sigma^2} + \text{constants}$$
37
MAP estimation for logistic regression
Learning by solving
$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) - \frac{1}{\sigma^2} \mathbf{w}^T \mathbf{w}$$
Maximizing a negative function is the same as minimizing the function
39
Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving
$$\min_{\mathbf{w}} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) + \frac{1}{\sigma^2} \mathbf{w}^T \mathbf{w}$$
Where have we seen this before?
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this objective.
Other training algorithms exist. For example, the LBFGS algorithm is a quasi-Newton method. But gradient-based methods like SGD and its variants are far more commonly used.
42
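One possible answer to the exercise above, sketched under stated assumptions (NumPy; the learning rate, epoch count, regularization constant, and toy data are illustrative choices, not values from the lecture):

```python
import numpy as np

def sgd_logistic(X, y, sigma2=1.0, lr=0.1, epochs=100, seed=0):
    """SGD on  sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (X[i] @ w)
            # gradient of the logistic loss on example i
            grad_loss = -y[i] * X[i] / (1.0 + np.exp(margin))
            # gradient of the regularizer, spread across the m per-example updates
            grad_reg = (2.0 / (sigma2 * m)) * w
            w -= lr * (grad_loss + grad_reg)
    return w

# Toy usage on linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w = sgd_logistic(X, y)
print(np.sign(X @ w))  # should recover the labels
```

Splitting the regularizer's gradient across the m per-example updates is one common design choice; other variants apply it once per epoch or fold it into a weight-decay term.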
Logistic regression is…
• A classifier that predicts the probability that the label is
+1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via MAP or
MLE estimation
• A discriminative classifier that minimizes the logistic loss
over the training set
43
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
44
Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
• Instead, minimize empirical loss on the training set
45
But distribution D is unknown
Empirical loss minimization
Learning = minimize empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses
• Achieved using a regularizer, which penalizes complex hypotheses
47
Regularized loss minimization
• Learning:
• With linear classifiers:
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data
• What is the ideal loss function for classification?
48
(using ℓ₂ regularization)
The 0-1 loss
Penalize classification mistakes between true label y and
prediction y’
• For linear classifiers, the prediction y′ = sgn(𝐰ᵀ𝐱)
– Mistake if 𝑦 𝐰ᵀ𝐱 ≤ 0
Minimizing 0-1 loss is intractable. Need surrogates
49
The loss function zoo
Many loss functions exist
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
50
The loss function zoo
[Plot: the loss functions as a function of the margin 𝑦 𝐰ᵀ𝐱 — zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic (logistic regression) — shown zoomed out at increasing scales]
58
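A minimal sketch that evaluates these losses on a grid of margins z = y 𝐰ᵀ𝐱 (assuming NumPy; the grid of margin values is arbitrary):

```python
import numpy as np

z = np.linspace(-2, 2, 9)                  # margins y * w^T x

zero_one    = (z <= 0).astype(float)       # mistake if y * w^T x <= 0
perceptron  = np.maximum(0.0, -z)
hinge       = np.maximum(0.0, 1.0 - z)     # SVM
exponential = np.exp(-z)                   # AdaBoost
logistic    = np.log1p(np.exp(-z))         # logistic regression

for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential), ("logistic", logistic)]:
    print(f"{name:11s}", np.round(loss, 2))
```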
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
• Connection to Naïve Bayes
59
Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function
Here, the P’s represent the Naïve Bayes posterior distribution,
and w can be used to calculate the priors and the likelihoods.
That is, 𝑃(𝑦 = 1 | 𝐰, 𝐱) is computed using
𝑃(𝐱 | 𝑦 = 1, 𝐰) and 𝑃(𝑦 = 1 | 𝐰)
60
$$\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T \mathbf{x}$$
Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function
$$\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T \mathbf{x}$$
But we also know that 𝑃(𝑦 = +1 ∣ 𝐱, 𝐰) = 1 − 𝑃(𝑦 = −1 ∣ 𝐱, 𝐰)
Substituting in the above expression, we get
$$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
Exercise: Show this formally
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs
Naïve Bayes is a generative model. Logistic Regression is the discriminative version.
63
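One way to carry out the exercise above, as a sketch; the steps use only the two displayed relations:

```latex
% From  log [ P(y=+1|x,w) / P(y=-1|x,w) ] = w^T x  and  P(y=-1|x,w) = 1 - P(y=+1|x,w):
\begin{align*}
\frac{P(y=+1 \mid \mathbf{x}, \mathbf{w})}{1 - P(y=+1 \mid \mathbf{x}, \mathbf{w})} &= \exp(\mathbf{w}^T\mathbf{x}) \\
P(y=+1 \mid \mathbf{x}, \mathbf{w}) &= \exp(\mathbf{w}^T\mathbf{x})\,\bigl(1 - P(y=+1 \mid \mathbf{x}, \mathbf{w})\bigr) \\
P(y=+1 \mid \mathbf{x}, \mathbf{w})\,\bigl(1 + \exp(\mathbf{w}^T\mathbf{x})\bigr) &= \exp(\mathbf{w}^T\mathbf{x}) \\
P(y=+1 \mid \mathbf{x}, \mathbf{w}) &= \frac{\exp(\mathbf{w}^T\mathbf{x})}{1 + \exp(\mathbf{w}^T\mathbf{x})}
  = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})} = \sigma(\mathbf{w}^T\mathbf{x}).
\end{align*}
```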
