EE 615: Pattern Recognition & Machine Learning Fall 2016
Lecture 5 — August 18
Lecturer: Dinesh Garg Scribe: Harsha Vardhan Tetali
5.1 Characterizing Bayes Optimal Hypothesis (Contd..)
5.1.1 Binary Classification with Loss Matrix
For a classification problem, as seen in the last lecture, the loss function can also be given in
the form of a loss matrix whose entries capture how significant a particular type of
classification error is. For example, consider the scenario of determining whether a patient
has cancer or not on the basis of various clinical test reports. If an algorithm classifies a
non-cancer patient as having cancer, it is still acceptable to some extent, because the truth
will be unveiled in the later stages of treatment. However, if a patient suffering from cancer
is classified as not having cancer, then it may cost a life. Numerically, the former error type
is therefore given a lower weight in the loss matrix than the latter error type. For the sake of
illustration, let us assume that the cost of misclassifying a patient who does not have cancer
is 10 (i.e., the classifier predicts that the patient has cancer even though he does not suffer
from cancer), and the cost of misclassifying a patient having cancer is 100 (i.e., the classifier
predicts that the patient does not have cancer even though he suffers from cancer).
The table below captures this particular loss matrix.
                      Predicted Label
 True Label      h(x) = 0      h(x) = 1
     0               0            10
     1              100            0

Table 5.1. Loss Matrix
In this table, we have denoted the case of not suffering from cancer as 0, and the case of
the patient having cancer as 1. The off-diagonal elements of this table represent the costs
of misclassification. The diagonal elements are 0 because there is no penalty for a correct
classification. In order to compute the Bayes optimal hypothesis for this scenario, we start
with the expression for Q(h, x) given as follows:
Q(h, x) = ℓ(h, (x, y = 0)) p(y = 0|x) + ℓ(h, (x, y = 1)) p(y = 1|x)    (5.1)
Note, for the binary classification problem, we always have
p(y = 1|x) + p(y = 0|x) = 1
Consider the case of h(x) = 0 as the result of the classifier. Q(h, x) for h(x) = 0 can be
written from the table as follows:
Q(h, x) = 0 · p(y = 0|x) + 100 · p(y = 1|x)    (5.2)
Similarly, Q(h, x) for h(x) = 1 can be written from the table as follows:
Q(h, x) = 10 · p(y = 0|x) + 0 · p(y = 1|x)    (5.3)
For h(x) = 0 to be the outcome of the Bayes hypothesis, we need to have the following (so
as to minimize Q(h, x)).
10 p(y = 0|x) > 100 p(y = 1|x)
⇒ p(y = 0|x) > 10 p(y = 1|x)
But since the total probability sums to 1, we have
1 − p(y = 1|x) > 10 p(y = 1|x)
⇒ p(y = 1|x) < 1/11
Equivalently,
p(y = 0|x) > 10/11    (5.4)
Thus the Bayes optimal classifier for the given problem is

hBayes(x) =  { 0   if p(y = 1|x) < 1/11
             { 1   otherwise                                   (5.5)
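The threshold rule in (5.5) can be verified numerically. Below is a minimal sketch in Python
(assuming NumPy; the function name bayes_predict is ours, not from any library) that hard-codes
the loss matrix of Table 5.1, weights each column by the posterior, and picks the prediction
with the smaller expected loss; the decision flips exactly at p(y = 1|x) = 1/11.

import numpy as np

# Loss matrix from Table 5.1: rows index the true label, columns the predicted label.
L = np.array([[0, 10],
              [100, 0]])

def bayes_predict(p_y1_given_x, loss=L):
    # Return the prediction (0 or 1) with the smaller expected posterior loss.
    posterior = np.array([1.0 - p_y1_given_x, p_y1_given_x])
    # expected_loss[j] = sum_i loss[i, j] * p(y = i | x)
    expected_loss = posterior @ loss
    return int(np.argmin(expected_loss))

print(bayes_predict(0.05))  # 0.05 < 1/11, so the classifier outputs 0
print(bayes_predict(0.10))  # 0.10 > 1/11, so the classifier outputs 1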
5.1.2 Multi-class Classification with Loss Matrix
Now let us derive the Bayes optimal hypothesis for the multi-class classification problem
under a given loss matrix. For the sake of illustration, let us consider the following loss
matrix for a k-class classification problem. Observe that the diagonal elements are all zero.
                Predicted
 True        Ĉ1      Ĉ2     . . .    Ĉk
  C1          0      c12    . . .    c1k
  C2         c21      0     . . .    c2k
 . . .      . . .   . . .   . . .   . . .
  Ck         ck1     ck2    . . .     0

Table 5.2. Loss Matrix for k-class Classification Problem
To find a Bayes optimal hypothesis for this problem setup, we write the expression for Q(h, x)
as follows:
Q(h, x) = Σ_{i=1}^{k} ℓ(h, (x, y = Ci)) p(y = Ci|x)    (5.6)
It is easy to convince oneself that the hypothesis minimizing the above expression is given
by the following expression.

hBayes(x) = arg min_j  Σ_{i=1}^{k} cij p(y = Ci|x)    (5.7)
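Equation (5.7) translates directly into code: weight each column of the loss matrix by the
posterior and take the argmin over columns. The following is a small sketch in Python with
NumPy (the function name and the 3-class loss matrix are made up for illustration, not taken
from the lecture):

import numpy as np

def bayes_predict_multiclass(posterior, loss_matrix):
    # posterior   : length-k array with posterior[i] = p(y = Ci | x)
    # loss_matrix : k x k array with loss_matrix[i, j] = cij (zero diagonal)
    # Returns the index j that minimizes sum_i cij * p(y = Ci | x), as in (5.7).
    expected_loss = posterior @ loss_matrix   # expected_loss[j] = sum_i cij * posterior[i]
    return int(np.argmin(expected_loss))

# Example usage with a hypothetical 3-class loss matrix:
C = np.array([[0, 1, 4],
              [2, 0, 1],
              [6, 3, 0]])
print(bayes_predict_multiclass(np.array([0.2, 0.5, 0.3]), C))   # prints 1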
5.1.3 Regression with Squared Loss (i.e. q = 2)
Let us derive the Bayes optimal hypothesis for linear regression under the squared loss (aka
least-squares regression). Recall, the squared loss function is given by

ℓsq(h, (x, y)) = (y − h(x))²    (5.8)
Like earlier, let us write down the expression for Q(h, x) which we should minimize in order
to find the Bayes optimal hypothesis. For this problem scenario, the function Q(h, x) is
given as follows:

Q(h, x) = ∫_y (y − h(x))² p(y|x) dy    (5.9)
In the above equation, we make the substitution h(x) = t (because x is fixed).

Q(t) = ∫_y (y − t)² p(y|x) dy    (5.10)
Now, we shall minimize the above expression w.r.t. t. For this, we take the partial derivative
of Q(t) with respect to t and equate it to zero. That is,
∂Q/∂t = ∫_y ∂/∂t (y − t)² p(y|x) dy = 0    (5.11)
The above equation implies the following.
∫_y 2(y − t) p(y|x) dy = 0
⇒ ∫_y t · p(y|x) dy = ∫_y y · p(y|x) dy
⇒ t = ∫_y y · p(y|x) dy        (since ∫_y p(y|x) dy = 1)
⇒ t = E[y|x]
⇒ hBayes(x) = E[y|x]
The previous calculation implies that the Bayes optimal rule for the least-squares regression
problem is given by the conditional mean of the target variable.
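As a small numerical sanity check of this result, one can fix x, draw samples from any stand-in
for p(y|x), and scan candidate predictions t: the value minimizing the average squared loss
lands at the sample mean. The sketch below (Python/NumPy; the Gamma distribution is an
arbitrary choice for illustration, not something from the lecture) shows this.

import numpy as np

rng = np.random.default_rng(0)

# Samples from an arbitrary (skewed) stand-in for p(y | x) at a fixed x.
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def expected_squared_loss(t):
    return np.mean((y - t) ** 2)

# Scan candidate predictions t; the minimizer should land at the sample mean.
ts = np.linspace(0.0, 20.0, 2001)
t_best = ts[np.argmin([expected_squared_loss(t) for t in ts])]
print(t_best, y.mean())   # the two values nearly coincide (the mean of Gamma(2, scale 3) is 6)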
5.1.4 Regression with Absolute Loss (i.e. q = 1)
Note, the loss function for this scenario is given as follows:

ℓ(h, (x, y)) = |h(x) − y|    (5.12)
Now let us write the expression for Q(t) as follows (by substituting h(x) = t)
Q(t) = ∫_{y=−∞}^{∞} |t − y| p(y|x) dy
Differentiating the above expression with respect to t and equating the result to zero would
mean the following:
∫_{y=−∞}^{∞} (d/dt) |t − y| p(y|x) dy = 0
⇒ ∫_{y=−∞}^{∞} sign(topt − y) p(y|x) dy = 0
⇒ ∫_{y>topt} p(y|x) dy = ∫_{y≤topt} p(y|x) dy
This shows that topt is the median of the conditional distribution of y given x. Therefore,
the Bayes optimal rule for q = 1 is given by the conditional median. That is,

hBayes(x) = Median[y | x]
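The same kind of numerical check works here: with the same arbitrary stand-in distribution
for p(y|x) as before, the minimizer of the average absolute loss lands at the sample median
rather than the mean. A minimal sketch (Python/NumPy, under the same illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Same arbitrary stand-in for p(y | x); its median (about 5.0) differs from its mean (6.0).
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def expected_absolute_loss(t):
    return np.mean(np.abs(y - t))

ts = np.linspace(0.0, 20.0, 2001)
t_best = ts[np.argmin([expected_absolute_loss(t) for t in ts])]
print(t_best, np.median(y))   # the minimizer sits near the sample median, not the mean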