EE 615: Pattern Recognition & Machine Learning Fall 2016
Lecture 5 — August 18
Lecturer: Dinesh Garg Scribe: Harsha Vardhan Tetali
5.1 Characterizing Bayes Optimal Hypothesis (Contd..)
5.1.1 Binary Classification with Loss Matrix
For a classification problem, as seen in the last lecture, the loss function can also be given in
the form of a loss matrix whose entries capture how significant a particular type of
classification error is. For example, consider the scenario of determining whether a patient
has cancer or not on the basis of various clinical test reports. If an algorithm classifies a
non-cancer patient as having cancer, it is still acceptable to some extent, because the truth
will be unveiled in the later stages of treatment. However, if a patient suffering from cancer
is classified as not having cancer, then it may cost a life. Numerically, the former error type
is therefore given a lower weight in the loss matrix than the latter error type. For the sake of
illustration, let us assume that the cost of misclassifying a patient who does not have cancer
is 10 (i.e., the classifier predicts that the patient has cancer even though he does not suffer
from cancer), and the cost of misclassifying a patient having cancer is 100 (i.e., the classifier
predicts that the patient does not have cancer even though he suffers from cancer).
The table below captures this particular loss matrix.
                      Predicted Label
 True Label      h(x) = 0      h(x) = 1
     0               0            10
     1              100            0

Table 5.1. Loss Matrix
In this table, we have denoted the case of not suffering from cancer as 0, and the case of
the patient having cancer as 1. The off-diagonal elements of this table represent the costs
of misclassification. The diagonal elements are 0 because there is no penalty for a correct
classification. In order to compute the Bayes optimal hypothesis for this scenario, we start
with the expression for Q(h, x) given as follows:
Q(h, x) = ℓ(h, (x, y = 0)) p(y = 0|x) + ℓ(h, (x, y = 1)) p(y = 1|x)    (5.1)
Note, for the binary classification problem, we always have
p(y = 1|x) + p(y = 0|x) = 1
Consider the case of h(x) = 0 as the result of the classifier. Q(h, x) for h(x) = 0 can be
written from the table as follows:
Q(h, x) = 0 · p(y = 0|x) + 100 · p(y = 1|x)    (5.2)
Similarly, Q(h, x) for h(x) = 1 can be written from the table as follows:
Q(h, x) = 10 · p(y = 0|x) + 0 · p(y = 1|x)    (5.3)
For h(x) = 0 to be the outcome of the Bayes hypothesis, we need to have the following (so
as to minimize Q(h, x)).
10 p(y = 0|x) > 100 p(y = 1|x)
⇒ p(y = 0|x) > 10 p(y = 1|x)
But since the total probability sums to 1, we have
1 − p(y = 1|x) > 10 p(y = 1|x)
⇒ p(y = 1|x) < 1/11
Equivalently,
p(y = 0|x) > 10/11    (5.4)
Thus the Bayes optimal classifier for the given problem is

hBayes(x) =  { 0   if p(y = 1|x) < 1/11
             { 1   otherwise                                   (5.5)
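The threshold rule in (5.5) can be verified numerically. Below is a minimal sketch in Python
(assuming NumPy; the function name bayes_predict is ours, not from any library) that hard-codes
the loss matrix of Table 5.1, weights each column by the posterior, and picks the prediction
with the smaller expected loss; the decision flips exactly at p(y = 1|x) = 1/11.

import numpy as np

# Loss matrix from Table 5.1: rows index the true label, columns the predicted label.
L = np.array([[0, 10],
              [100, 0]])

def bayes_predict(p_y1_given_x, loss=L):
    # Return the prediction (0 or 1) with the smaller expected posterior loss.
    posterior = np.array([1.0 - p_y1_given_x, p_y1_given_x])
    # expected_loss[j] = sum_i loss[i, j] * p(y = i | x)
    expected_loss = posterior @ loss
    return int(np.argmin(expected_loss))

print(bayes_predict(0.05))  # 0.05 < 1/11, so the classifier outputs 0
print(bayes_predict(0.10))  # 0.10 > 1/11, so the classifier outputs 1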
5.1.2 Multi-class Classification with Loss Matrix
Now let us derive the Bayes optimal hypothesis for the multi-class classification problem
under a given loss matrix. For the sake of illustration, let us consider the following loss
matrix for a k-class classification problem. Observe that the diagonal elements are all zero.
                Predicted
 True        Ĉ1      Ĉ2     . . .    Ĉk
  C1          0      c12    . . .    c1k
  C2         c21      0     . . .    c2k
 . . .      . . .   . . .   . . .   . . .
  Ck         ck1     ck2    . . .     0

Table 5.2. Loss Matrix for k-class Classification Problem
To find a Bayes optimal hypothesis for this problem setup, we write the expression for Q(h, x)
as follows:
Q(h, x) = Σ_{i=1}^{k} ℓ(h, (x, y = Ci)) p(y = Ci|x)    (5.6)
It is easy to convince oneself that the hypothesis minimizing the above expression is given
by the following expression.

hBayes(x) = arg min_j  Σ_{i=1}^{k} cij p(y = Ci|x)    (5.7)
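Equation (5.7) translates directly into code: weight each column of the loss matrix by the
posterior and take the argmin over columns. The following is a small sketch in Python with
NumPy (the function name and the 3-class loss matrix are made up for illustration, not taken
from the lecture):

import numpy as np

def bayes_predict_multiclass(posterior, loss_matrix):
    # posterior   : length-k array with posterior[i] = p(y = Ci | x)
    # loss_matrix : k x k array with loss_matrix[i, j] = cij (zero diagonal)
    # Returns the index j that minimizes sum_i cij * p(y = Ci | x), as in (5.7).
    expected_loss = posterior @ loss_matrix   # expected_loss[j] = sum_i cij * posterior[i]
    return int(np.argmin(expected_loss))

# Example usage with a hypothetical 3-class loss matrix:
C = np.array([[0, 1, 4],
              [2, 0, 1],
              [6, 3, 0]])
print(bayes_predict_multiclass(np.array([0.2, 0.5, 0.3]), C))   # prints 1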
5.1.3 Regression with Squared Loss (i.e. q = 2)
Let us derive the Bayes optimal hypothesis for linear regression under the squared loss (aka
least-squares regression). Recall, the squared loss function is given by

ℓsq(h, (x, y)) = (y − h(x))²    (5.8)
Like earlier, let us write down the expression for Q(h, x) which we should minimize in order
to find the Bayes optimal hypothesis. For this problem scenario, the function Q(h, x) is
given as follows:

Q(h, x) = ∫_y (y − h(x))² p(y|x) dy    (5.9)
In the above equation, we make the substitution h(x) = t (because x is fixed).

Q(t) = ∫_y (y − t)² p(y|x) dy    (5.10)
Now, we shall minimize the above expression w.r.t. t. For this, we take the partial derivative
of Q(t) with respect to t and equate it to zero. That is,
∂Q/∂t = ∫_y ∂/∂t (y − t)² p(y|x) dy = 0    (5.11)
The above equation implies the following.
∫_y 2(y − t) p(y|x) dy = 0
⇒ ∫_y t · p(y|x) dy = ∫_y y · p(y|x) dy
⇒ t = ∫_y y · p(y|x) dy        (since ∫_y p(y|x) dy = 1)
⇒ t = E[y|x]
⇒ hBayes(x) = E[y|x]
The previous calculation implies that the Bayes optimal rule for the least-squares regression
problem is given by the conditional mean of the target variable.
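As a small numerical sanity check of this result, one can fix x, draw samples from any stand-in
for p(y|x), and scan candidate predictions t: the value minimizing the average squared loss
lands at the sample mean. The sketch below (Python/NumPy; the Gamma distribution is an
arbitrary choice for illustration, not something from the lecture) shows this.

import numpy as np

rng = np.random.default_rng(0)

# Samples from an arbitrary (skewed) stand-in for p(y | x) at a fixed x.
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def expected_squared_loss(t):
    return np.mean((y - t) ** 2)

# Scan candidate predictions t; the minimizer should land at the sample mean.
ts = np.linspace(0.0, 20.0, 2001)
t_best = ts[np.argmin([expected_squared_loss(t) for t in ts])]
print(t_best, y.mean())   # the two values nearly coincide (the mean of Gamma(2, scale 3) is 6)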
5.1.4 Regression with Absolute Loss (i.e. q = 1)
Note, the loss function for this scenario is given as follows:

ℓ(h, (x, y)) = |h(x) − y|    (5.12)
Now let us write the expression for Q(t) as follows (by substituting h(x) = t)
Q(t) = ∫_{y=−∞}^{∞} |t − y| p(y|x) dy
Differentiating the above expression with respect to t and equating the result to zero would
mean the following:
∫_{y=−∞}^{∞} (d/dt) |t − y| p(y|x) dy = 0
⇒ ∫_{y=−∞}^{∞} sign(topt − y) p(y|x) dy = 0
⇒ ∫_{y>topt} p(y|x) dy = ∫_{y≤topt} p(y|x) dy
This shows that topt is the median of the conditional distribution of y given x. Therefore,
the Bayes optimal rule for q = 1 is given by the conditional median. That is,

hBayes(x) = Median[y | x]
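The same kind of numerical check works here: with the same arbitrary stand-in distribution
for p(y|x) as before, the minimizer of the average absolute loss lands at the sample median
rather than the mean. A minimal sketch (Python/NumPy, under the same illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Same arbitrary stand-in for p(y | x); its median (about 5.0) differs from its mean (6.0).
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def expected_absolute_loss(t):
    return np.mean(np.abs(y - t))

ts = np.linspace(0.0, 20.0, 2001)
t_best = ts[np.argmin([expected_absolute_loss(t) for t in ts])]
print(t_best, np.median(y))   # the minimizer sits near the sample median, not the mean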