Machine Learning for Data Mining
Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
Classification problem
Training Data
Samples of the form (d, h(d)), where:
d: the data objects to classify (inputs).
h(d): the correct class label for d, with h(d) ∈ {1, . . . , K}.
Classification Problem
Goal
Given dnew, provide h(dnew).
The machinery in general looks like:
[Figure: a supervised learner maps INPUT to OUTPUT, guided by the training info, i.e., the desired/target output.]
Naive Bayes Model
Task for two classes
Let ω1, ω2 be the two classes to which our samples may belong.
There is a prior probability of belonging to each class:
P (ω1) for class 1.
P (ω2) for class 2.
The rule for classification is the following one:
P (ωi|x) = P (x|ωi) P (ωi) / P (x)   (1)
Remark: Bayes' rule taken to the next level.
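A minimal numeric sketch of Eq. (1). The priors and likelihood values below are invented for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical two-class setup: priors P(wi) and class-conditional
# likelihoods p(x|wi) evaluated at a single observation x.
priors = np.array([2 / 3, 1 / 3])      # P(w1), P(w2)
likelihoods = np.array([0.05, 0.20])   # p(x|w1), p(x|w2)

evidence = np.sum(likelihoods * priors)       # P(x), the normalizer
posteriors = likelihoods * priors / evidence  # P(wi|x) by Bayes' rule

print(posteriors)  # sums to 1: here [1/3, 2/3]
```

Even though class 1 has the larger prior, the larger likelihood of class 2 at this x dominates the posterior.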
In Informal English
We have that
posterior = (likelihood × prior information) / evidence   (2)
Basically
One: if we can observe x,
Two: then we can convert the prior information into posterior information.
We have the following terms...
Likelihood
We call p (x|ωi) the likelihood of ωi given x:
This indicates that, given a category ωi, if p (x|ωi) is “large”, then ωi is the “likely” class of x.
Prior Probability
It is the known probability of a given class.
Remark: because we often lack information about this class, we tend to use the uniform distribution.
However: we can use other tricks to estimate it.
Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
The most important term in all this
The factor
likelihood × prior information   (3)
Example
We have the likelihoods of two classes:
[Figure: the class-conditional densities p (x|ω1) and p (x|ω2).]
Example
We have the posteriors of two classes when P (ω1) = 2/3 and P (ω2) = 1/3:
[Figure: the posteriors P (ω1|x) and P (ω2|x).]
Naive Bayes Model
In the case of two classes
P (x) = Σ_{i=1}^{2} p (x, ωi) = Σ_{i=1}^{2} p (x|ωi) P (ωi)   (4)
Error in this rule
We have that
P (error|x) = { P (ω1|x) if we decide ω2 ; P (ω2|x) if we decide ω1 }   (5)
Thus, we have that
P (error) = ∫_{−∞}^{∞} P (error, x) dx = ∫_{−∞}^{∞} P (error|x) p (x) dx   (6)
Classification Rule
Thus, we have the Bayes Classification Rule:
1 If P (ω1|x) > P (ω2|x), x is classified to ω1.
2 If P (ω1|x) < P (ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember
P (ω1|x) + P (ω2|x) = 1   (7)
We are able to obtain the new Bayes Classification Rule:
1 If P (x|ω1) P (ω1) > P (x|ω2) P (ω2), x is classified to ω1.
2 If P (x|ω1) P (ω1) < P (x|ω2) P (ω2), x is classified to ω2.
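The equivalence of the normalized and unnormalized rules can be checked numerically. The values below are illustrative, not from the slides:

```python
import numpy as np

# Illustrative priors and likelihoods evaluated at one observation x.
priors = np.array([0.5, 0.5])
likelihoods = np.array([0.3, 0.1])   # p(x|w1), p(x|w2)

scores = likelihoods * priors        # P(x|wi) P(wi): no evidence needed
posteriors = scores / scores.sum()   # full Bayes posteriors

# Both forms pick the same class, since P(x) is a common positive factor.
assert np.argmax(scores) == np.argmax(posteriors)
print(int(np.argmax(scores)) + 1)  # class 1
```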
We have several cases
If for some x we have P (x|ω1) = P (x|ω2)
The final decision relies completely on the prior probabilities.
On the other hand, if P (ω1) = P (ω2), the “states” are equally probable
In this case the decision is based entirely on the likelihoods P (x|ωi).
What the Rule looks like
If P (ω1) = P (ω2), the Rule depends only on the term p (x|ωi):
[Figure: the two likelihoods and the decision threshold x0.]
The Error in the Second Case of Naive Bayes
Error in equiprobable classes
P (error) = (1/2) ∫_{−∞}^{x0} p (x|ω2) dx + (1/2) ∫_{x0}^{∞} p (x|ω1) dx   (8)
Remark: P (ω1) = P (ω2) = 1/2.
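Eq. (8) can be evaluated in closed form for an assumed pair of Gaussian likelihoods, say p (x|ω1) = N(0, 1) and p (x|ω2) = N(2, 1) (invented for illustration), whose densities cross at the threshold x0 = 1:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Assumed densities: p(x|w1) = N(0,1), p(x|w2) = N(2,1), priors 1/2 each.
# Their likelihoods cross at x0 = 1, the Bayes decision threshold.
x0 = 1.0
p_err = 0.5 * norm_cdf((x0 - 2.0) / 1.0) + 0.5 * (1.0 - norm_cdf((x0 - 0.0) / 1.0))
print(round(p_err, 4))  # 0.1587
```

Both error integrals contribute the same tail mass Φ(−1) here, so P(error) ≈ 0.1587.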
What do we want to prove?
Something Notable
The Bayesian classifier is optimal with respect to minimizing the classification error probability.
Proof
Step 1
Let R1 be the region of the feature space in which we decide in favor of ω1, and R2 the region in which we decide in favor of ω2.
Step 2
Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2)   (9)
Thus
Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)
   = P (ω1) ∫_{R2} p (x|ω1) dx + P (ω2) ∫_{R1} p (x|ω2) dx
Proof
It is more
Pe = P (ω1) ∫_{R2} [p (ω1, x) / P (ω1)] dx + P (ω2) ∫_{R1} [p (ω2, x) / P (ω2)] dx   (10)
Finally
Pe = ∫_{R2} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx   (11)
Now, we choose the Bayes Classification Rule:
R1 : P (ω1|x) > P (ω2|x)
R2 : P (ω2|x) > P (ω1|x)
Proof
Thus
P (ω1) = ∫_{R1} p (ω1|x) p (x) dx + ∫_{R2} p (ω1|x) p (x) dx   (12)
Now, we have...
P (ω1) − ∫_{R1} p (ω1|x) p (x) dx = ∫_{R2} p (ω1|x) p (x) dx   (13)
Then
Pe = P (ω1) − ∫_{R1} p (ω1|x) p (x) dx + ∫_{R1} p (ω2|x) p (x) dx   (14)
Graphically P (ω1): Thanks Edith 2013 Class!!!
In red:
[Figure: the probability mass P (ω1) highlighted in red.]
Thus we have
∫_{R1} p (ω1|x) p (x) dx = ∫_{R1} p (ω1, x) dx = P_{R1} (ω1)
[Figure: the mass of class ω1 falling inside region R1.]
Finally
Finally
Pe = P (ω1) − ∫_{R1} [p (ω1|x) − p (ω2|x)] p (x) dx   (15)
Thus, we have
Pe = [ P (ω1) − ∫_{R1} p (ω1|x) p (x) dx ] + ∫_{R1} p (ω2|x) p (x) dx
Pe for a non-optimal rule
A great idea Edith!!!
[Figure: Pe under a non-optimal decision threshold.]
Which decision function for minimizing the error
A single number in this case:
[Figure: the scalar threshold that minimizes the error.]
Error is minimized by the Bayesian Naive Rule
Thus
The probability of error is minimized at the regions of space in which:
R1 : P (ω1|x) > P (ω2|x)
R2 : P (ω2|x) > P (ω1|x)
Pe for an optimal rule
A great idea Edith!!!
[Figure: Pe under the Bayes-optimal threshold.]
For M classes ω1, ω2, ..., ωM
We have that vector x is in ωi if
P (ωi|x) > P (ωj|x) ∀j ≠ i   (16)
Something Notable
It turns out that such a choice also minimizes the classification error probability.
Minimizing the Risk
Something Notable
The classification error probability is not always the best criterion to be adopted for minimization: it gives all errors the same importance.
However
Certain errors are more important than others.
For example
Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign.
Not so serious: a benign tumor gets classified as malignant.
It is based on the following idea
In order to measure the predictive performance of a function f : X → Y,
we use a loss function
ℓ : Y × Y → R+   (17)
a non-negative function that quantifies how bad the prediction f (x) is given the true label y.
Thus, we can say that
ℓ (f (x) , y) is the loss incurred by f on the pair (x, y).
Example
In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f (x):
ℓ (y′, y) = 1 [y′ ≠ y]   (18)
Furthermore
For regression problems, some natural choices
1 Squared loss: ℓ (y′, y) = (y′ − y)²
2 Absolute loss: ℓ (y′, y) = |y′ − y|
Thus, given the loss function, we can define the risk as
R (f ) = E_{(X,Y )} [ℓ (f (X) , Y )]   (19)
Although we cannot see the expected risk, we can use the sample to estimate the following
R̂ (f ) = (1/N) Σ_{i=1}^{N} ℓ (f (xi) , yi)   (20)
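The empirical risk of Eq. (20) with the 0-1 loss is just the fraction of mistakes over the sample. A toy sketch (the sample and the threshold classifier are made up for illustration):

```python
def zero_one_loss(y_pred, y_true):
    # 0-1 loss of Eq. (18): 1 for a mistake, 0 for a correct prediction.
    return 1.0 if y_pred != y_true else 0.0

def empirical_risk(f, xs, ys):
    # R_hat(f): the average loss of f over the N sample pairs.
    return sum(zero_one_loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

# Toy labeled sample and a simple threshold classifier.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
f = lambda x: 1 if x > 0.5 else 0

print(empirical_risk(f, xs, ys))  # 0.0
```

A worse classifier, e.g. one that always predicts 0, would score 0.5 on this sample.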
Thus
Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):
f̂N = arg min_{f ∈ F} R̂ (f )   (21)
If we knew the distribution P and did not restrict ourselves to F, the best function would be
f ∗ = arg min_{f} R (f )   (22)
Thus
For classification with 0-1 loss, f ∗ is called the Bayes Classifier:
f ∗ (x) = arg max_{y∈Y} P (Y = y|X = x)   (23)
The New Risk Function
Now, we have the following loss values for two classes
λ12 is the loss value if the sample is in class 1, but the function classifies it as in class 2.
λ21 is the loss value if the sample is in class 2, but the function classifies it as in class 1.
We can generate the new risk function
r = λ12 Pe (x, ω1) + λ21 Pe (x, ω2)
  = λ12 ∫_{R2} p (x, ω1) dx + λ21 ∫_{R1} p (x, ω2) dx
  = λ12 ∫_{R2} p (x|ω1) P (ω1) dx + λ21 ∫_{R1} p (x|ω2) P (ω2) dx
The New Risk Function
The new risk function to be minimized
r = λ12 P (ω1) ∫_{R2} p (x|ω1) dx + λ21 P (ω2) ∫_{R1} p (x|ω2) dx   (24)
Where the λ terms work as weight factors for each error
Thus, we could have something like λ12 > λ21!!!
Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
Now, Consider an M-class problem
We have
Rj, j = 1, ..., M, the regions where the classes ωj live.
Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.
Now, for this error we associate the term λki (loss)
With that we have a loss matrix L whose (k, i) entry corresponds to such a loss.
Thus, we have a general loss associated to each class ωk
Definition
rk = Σ_{i=1}^{M} λki ∫_{Ri} p (x|ωk) dx   (25)
We want to minimize the global risk
r = Σ_{k=1}^{M} rk P (ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx   (26)
For this
We want to select the set of partition regions Rj.
How do we do that?
We minimize each integral
∫_{Ri} Σ_{k=1}^{M} λki p (x|ωk) P (ωk) dx   (27)
We need to minimize
Σ_{k=1}^{M} λki p (x|ωk) P (ωk)   (28)
We can do the following
If x ∈ Ri, then
li = Σ_{k=1}^{M} λki p (x|ωk) P (ωk) < lj = Σ_{k=1}^{M} λkj p (x|ωk) P (ωk)   (29)
for all j ≠ i.
Remarks
When we use Kronecker’s delta
δki = { 1 if k = i ; 0 if k ≠ i }   (30)
We can do the following
λki = 1 − δki   (31)
We finish with
Σ_{k=1, k≠i}^{M} p (x|ωk) P (ωk) < Σ_{k=1, k≠j}^{M} p (x|ωk) P (ωk)   (32)
Then
We have that
p (x|ωj) P (ωj) < p (x|ωi) P (ωi)   (33)
for all j ≠ i.
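A sketch of the multi-class decision: with the 0-1 loss matrix λki = 1 − δki of Eq. (31), minimizing li in Eq. (29) reduces to maximizing p (x|ωi) P (ωi). The likelihoods, priors, and loss matrix below are illustrative values:

```python
import numpy as np

# Assumed 3-class setup at one point x: likelihoods p(x|wk), priors P(wk),
# and a loss matrix L with L[k, i] = lambda_ki (zeros on the diagonal).
likelihoods = np.array([0.10, 0.30, 0.05])
priors = np.array([0.5, 0.3, 0.2])
L = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])  # 0-1 losses: lambda_ki = 1 - delta_ki

joint = likelihoods * priors      # p(x|wk) P(wk)
l = L.T @ joint                   # l_i = sum_k lambda_ki * p(x|wk) P(wk)
decision = int(np.argmin(l)) + 1  # pick the class with the smallest risk

print(decision)  # 2
```

Here class 2 wins: its joint p (x|ω2) P (ω2) = 0.09 is the largest, so its risk li is the smallest, exactly the MAP choice.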
Special Case: The Two-Class Case
We get
l1 = λ11 p (x|ω1) P (ω1) + λ21 p (x|ω2) P (ω2)
l2 = λ12 p (x|ω1) P (ω1) + λ22 p (x|ω2) P (ω2)
We assign x to ω1 if l1 < l2, i.e.
(λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1)   (34)
If we assume that λii < λij (correct decisions are penalized much less than wrong ones), then
x ∈ ω1 (ω2) if l12 ≡ p (x|ω1) / p (x|ω2) > (<) [P (ω2) / P (ω1)] · [(λ21 − λ22) / (λ12 − λ11)]   (35)
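Eq. (35) written as a small helper; the default 0-1 losses and the example numbers are invented for illustration:

```python
def likelihood_ratio_test(px_w1, px_w2, p1, p2,
                          lam11=0.0, lam22=0.0, lam12=1.0, lam21=1.0):
    # Assign to w1 iff l12 = p(x|w1)/p(x|w2) exceeds the threshold
    # theta = (P(w2)/P(w1)) * (lam21 - lam22) / (lam12 - lam11), per Eq. (35).
    l12 = px_w1 / px_w2
    theta = (p2 / p1) * (lam21 - lam22) / (lam12 - lam11)
    return 1 if l12 > theta else 2

# Equal priors and 0-1 losses: the test is a plain likelihood comparison.
print(likelihood_ratio_test(0.3, 0.1, 0.5, 0.5))  # 1
```

Shifting the priors changes the threshold: with P (ω1) = 0.1, P (ω2) = 0.9 the same likelihood ratio of 3 no longer clears θ = 9, so the decision flips to ω2.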
Special Case: The Two-Class Case
Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
Decision Surface
Because the regions R1 and R2 are contiguous
The separating surface between them is described by
P (ω1|x) − P (ω2|x) = 0   (36)
Thus, we define the decision function as
g12 (x) = P (ω1|x) − P (ω2|x) = 0   (37)
Which decision function for the Naive Bayes
A single number in this case:
[Figure: the scalar decision threshold for the Naive Bayes rule.]
In general
First
Instead of working with probabilities, we work with an equivalent function of them, gi (x) = f (P (ωi|x)).
Classic Example: a monotonically increasing function
f (P (ωi|x)) = ln P (ωi|x).
The decision test is now
Classify x in ωi if gi (x) > gj (x) ∀j ≠ i.
The decision surfaces, separating contiguous regions, are described by
gij (x) = gi (x) − gj (x) = 0, for i, j = 1, 2, ..., M with i ≠ j.
Gaussian Distribution
We can use the Gaussian distribution
p (x|ωi) = [1 / ((2π)^{l/2} |Σi|^{1/2})] exp{ −(1/2) (x − µi)^T Σi^{−1} (x − µi) }   (38)
Example
[Figure: a two-dimensional Gaussian density.]
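Eq. (38) can be evaluated directly; as a sanity check, at the mean of a 2-D unit Gaussian (l = 2, Σ = I) the density is 1/(2π):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Multivariate normal density of Eq. (38); l is the dimension of x.
    l = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (l / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
density = gaussian_pdf(x, mu, Sigma)
print(density)  # 1/(2*pi), about 0.15915
```

For repeated evaluations one would factor Σ once (e.g. via Cholesky) rather than invert it per call; the direct form above just mirrors the formula.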
Some Properties
About Σ
It is the covariance matrix between variables.
Thus
It is positive definite.
It is symmetric.
Its inverse exists.
Influence of the Covariance Σ
Look at the following covariance
Σ = [1 0; 0 1]
It is simply the unit Gaussian with mean µ.
The Covariance Σ as a Rotation
Look at the following covariance
Σ = [4 0; 0 1]
Actually, it elongates the circle along the x-axis.
Influence of the Covariance Σ
Look at the following covariance
Σa = R Σb R^T with R = [cos θ  − sin θ; sin θ  cos θ]
It allows us to rotate the axes.
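The rotation Σa = R Σb R^T can be verified numerically; a small sketch with an assumed angle θ = π/2, which swaps the two axes:

```python
import numpy as np

# Rotating a covariance: Σ_a = R Σ_b R^T with the rotation matrix
# R = [cos θ  −sin θ; sin θ  cos θ].
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sigma_b = np.array([[4.0, 0.0],   # stretched along the x-axis
                    [0.0, 1.0]])
sigma_a = R @ sigma_b @ R.T       # a 90-degree rotation: stretched along y
```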
Now For Two Classes
Then, we use the following trick for two classes i = 1, 2
We know that the pdf of correct classification is
p(x, ωi) = p(x|ωi)P(ωi)!!!
Thus
It is possible to generate the following decision function:
gi(x) = ln [p(x|ωi)P(ωi)] = ln p(x|ωi) + ln P(ωi)   (39)
Thus
gi(x) = −(1/2)(x − µi)^T Σi^{-1} (x − µi) + ln P(ωi) + ci   (40)
59 / 71
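The discriminant of Eq. (40) is straightforward to evaluate; the sketch below writes the constant ci out explicitly, and the means, covariances, and priors are assumptions chosen only for illustration:

```python
import numpy as np

# Discriminant of Eq. (40) for a Gaussian class, with the constant
# c_i = -(l/2) ln 2π - (1/2) ln|Σ_i| written out rather than dropped.
def discriminant(x, mu, sigma, prior):
    l = len(mu)
    diff = x - mu
    quad = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    c = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(sigma))
    return quad + np.log(prior) + c

x = np.array([0.1, -0.2])
g1 = discriminant(x, np.zeros(2), np.eye(2), 0.5)
g2 = discriminant(x, np.array([3.0, 3.0]), np.eye(2), 0.5)
# x lies near the first mean, so g1 > g2 and x is assigned to ω1.
```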
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
60 / 71
Given a series of classes ω1, ω2, ..., ωM
We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj)
We call those samples
i.i.d. — independent identically distributed random variables.
We assume in addition
p(x|ωj) has a known parametric form with vector θj of parameters.
61 / 71
Given a series of classes ω1, ω2, ..., ωM
For example
p(x|ωj) ∼ N(µj, Σj)   (41)
In our case
We will assume that there is no dependence between classes!!!
62 / 71
Now
Suppose that ωj contains n samples x1, x2, ..., xn
p(x1, x2, ..., xn|θj) = ∏_{k=1}^{n} p(xk|θj)   (42)
We can then see the function p(x1, x2, ..., xn|θj) as a function of θj:
L(θj) = ∏_{k=1}^{n} p(xk|θj)   (43)
63 / 71
Example
ln L(θj) = log ∏_{k=1}^{n} p(xk|θj) = ∑_{k=1}^{n} log p(xk|θj)
64 / 71
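Because the log turns the product of Eq. (43) into a sum, the log-likelihood is easy to compute and compare; a sketch for a 1-D Gaussian with known unit variance (the sample values are assumptions):

```python
import numpy as np

# Log-likelihood of i.i.d. samples under N(mu, 1): the log of the
# product in Eq. (43) becomes a sum of log-densities.
def log_likelihood(samples, mu):
    return float(np.sum(-0.5 * np.log(2 * np.pi)
                        - 0.5 * (samples - mu) ** 2))

samples = np.array([0.5, 1.5, 1.0])
ll_at_mean = log_likelihood(samples, samples.mean())  # mu = 1.0
ll_off = log_likelihood(samples, 2.0)
# The sample mean gives the larger (maximum) log-likelihood.
```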
Outline
1 Introduction
Supervised Learning
Naive Bayes
The Naive Bayes Model
The Multi-Class Case
Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
Introduction
Gaussian Distribution
Influence of the Covariance Σ
Maximum Likelihood Principle
Maximum Likelihood on a Gaussian
65 / 71
Maximum Likelihood on a Gaussian
Then, using the log!!!
ln L(θi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} (xj − µi)^T Σi^{-1} (xj − µi) + c2   (44)
We know that
d(x^T A x)/dx = Ax + A^T x,   d(Ax)/dx = A   (45)
Thus, we expand equation (44)
−(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} [xj^T Σi^{-1} xj − 2 xj^T Σi^{-1} µi + µi^T Σi^{-1} µi] + c2   (46)
66 / 71
Maximum Likelihood
Then
∂ ln L(θi)/∂µi = ∑_{j=1}^{n} Σi^{-1} (xj − µi) = 0
nΣi^{-1} [−µi + (1/n) ∑_{j=1}^{n} xj] = 0
µ̂i = (1/n) ∑_{j=1}^{n} xj
67 / 71
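The closed form µ̂i = (1/n) ∑ xj is just the per-coordinate sample average; a minimal sketch with made-up two-dimensional samples:

```python
import numpy as np

# ML estimate of the class mean: the per-coordinate sample average.
samples = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
mu_hat = samples.mean(axis=0)   # the ML estimate of the mean
```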
Maximum Likelihood
Then, we derive with respect to Σi
For this we use the following tricks:
1 ∂ log |Σ| / ∂Σ^{-1} = −Σ^T = −Σ (for symmetric Σ)
2 ∂Tr[AB]/∂A = ∂Tr[BA]/∂A = B^T
3 Trace(of a number) = the number
4 Tr(A^T B) = Tr(B A^T)
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} (xj − µi)^T Σi^{-1} (xj − µi) + c1   (47)
68 / 71
Maximum Likelihood
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} Trace[(xj − µi)^T Σi^{-1} (xj − µi)] + c1   (48)
Tricks!!!
f(Σi) = −(n/2) ln |Σi| − (1/2) ∑_{j=1}^{n} Trace[Σi^{-1} (xj − µi)(xj − µi)^T] + c1   (49)
69 / 71
Maximum Likelihood
Derivative with respect to Σi^{-1} (using tricks 1 and 2)
∂f(Σi)/∂Σi^{-1} = (n/2) Σi − (1/2) ∑_{j=1}^{n} [(xj − µi)(xj − µi)^T]^T   (50)
Thus, when making it equal to zero
Σ̂i = (1/n) ∑_{j=1}^{n} (xj − µi)(xj − µi)^T   (51)
70 / 71
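Note that Eq. (51) divides by n (not n − 1), i.e., it is the biased sample covariance, with the ML mean estimate plugged in for µi. A minimal sketch with assumed samples:

```python
import numpy as np

# ML covariance estimate: Σ̂ = (1/n) Σ_j (x_j − µ̂)(x_j − µ̂)^T.
samples = np.array([[0.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 2.0],
                    [2.0, 2.0]])
mu_hat = samples.mean(axis=0)             # µ̂ = [1, 1]
diff = samples - mu_hat
sigma_hat = diff.T @ diff / len(samples)  # same as np.cov(samples.T, bias=True)
```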
Exercises
Duda and Hart
Chapter 3
3.1, 3.2, 3.3, 3.13
Theodoridis
Chapter 2
2.5, 2.7, 2.10, 2.12, 2.14, 2.17
71 / 71

More Related Content

PPSX
Perceptron (neural network)
PDF
Python Anaconda Tutorial | Edureka
PPTX
Operating system critical section
PDF
Python Programming Tutorial | Edureka
PDF
Daa notes 1
PPTX
An introduction to Prolog language slide
PPT
Distributed systems scheduling
PPT
Os module 2 c
Perceptron (neural network)
Python Anaconda Tutorial | Edureka
Operating system critical section
Python Programming Tutorial | Edureka
Daa notes 1
An introduction to Prolog language slide
Distributed systems scheduling
Os module 2 c

What's hot (20)

PDF
27 NP Completness
DOCX
Python unit 3 and Unit 4
PPTX
Print input-presentation
PPTX
Introduction to R programming
PPTX
INHERITANCE IN JAVA.pptx
PPTX
Butterfly optimization algorithm
DOCX
Exceptions handling notes in JAVA
PPTX
Solutions to byzantine agreement problem
PPT
Recurrences
PPTX
Tic tac toe simple ai game
PPT
PPTX
Python variables and data types.pptx
PPTX
Type checking compiler construction Chapter #6
PPTX
Dynamic method dispatch
DOC
Branch and bound
PPT
Dbms ii mca-ch5-ch6-relational algebra-2013
PPTX
System call (Fork +Exec)
PPTX
Moore Mealy Machine Conversion
27 NP Completness
Python unit 3 and Unit 4
Print input-presentation
Introduction to R programming
INHERITANCE IN JAVA.pptx
Butterfly optimization algorithm
Exceptions handling notes in JAVA
Solutions to byzantine agreement problem
Recurrences
Tic tac toe simple ai game
Python variables and data types.pptx
Type checking compiler construction Chapter #6
Dynamic method dispatch
Branch and bound
Dbms ii mca-ch5-ch6-relational algebra-2013
System call (Fork +Exec)
Moore Mealy Machine Conversion
Ad

Viewers also liked (16)

PDF
07 Machine Learning - Expectation Maximization
PDF
08 Machine Learning Maximum Aposteriori
PDF
A Semi-naive Bayes Classifier with Grouping of Cases
PDF
Wikipedia, Dead Authors, Naive Bayes and Python
PDF
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
PPT
Modified naive bayes model for improved web page classification
PDF
02. naive bayes classifier revision
PPTX
"Naive Bayes Classifier" @ Papers We Love Bucharest
PDF
Naive Bayes Classifier
PDF
Naive Bayes
PPTX
Sentiment analysis using naive bayes classifier
PDF
Lecture10 - Naïve Bayes
PPTX
Naive Bayes Presentation
PPTX
Naive bayes
PDF
Modeling Social Data, Lecture 6: Classification with Naive Bayes
PDF
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
07 Machine Learning - Expectation Maximization
08 Machine Learning Maximum Aposteriori
A Semi-naive Bayes Classifier with Grouping of Cases
Wikipedia, Dead Authors, Naive Bayes and Python
Bayesian Machine Learning & Python – Naïve Bayes (PyData SV 2013)
Modified naive bayes model for improved web page classification
02. naive bayes classifier revision
"Naive Bayes Classifier" @ Papers We Love Bucharest
Naive Bayes Classifier
Naive Bayes
Sentiment analysis using naive bayes classifier
Lecture10 - Naïve Bayes
Naive Bayes Presentation
Naive bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
2013-1 Machine Learning Lecture 03 - Naïve Bayes Classifiers
Ad

Similar to 06 Machine Learning - Naive Bayes (20)

PDF
Bayesian data analysis1
PPT
Bayes Classification
PDF
Bayesian Learning - Naive Bayes Algorithm
PPT
bayes answer jejisiowwoowwksknejejrjejej
PPT
bayesNaive.ppt
PPT
bayesNaive.ppt
PPT
bayesNaive algorithm in machine learning
PDF
Bayesian Learning- part of machine learning
PPTX
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
PPTX
UNIT II (7).pptx
PDF
Module - 4 Machine Learning -22ISE62.pdf
PPT
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
PPTX
Naive Bayes.pptx
PPT
BAYESIAN theorem and implementation of i
PDF
NBaysian classifier, Naive Bayes classifier
PPT
2.3 bayesian classification
PPTX
Unit 2 Machine Learning it's most important topic of basic
PDF
Bayes Theorem.pdf
PDF
19BayesTheoremClassification19BayesTheoremClassification.ppt
PPT
UNIT2_NaiveBayes algorithms used in machine learning
Bayesian data analysis1
Bayes Classification
Bayesian Learning - Naive Bayes Algorithm
bayes answer jejisiowwoowwksknejejrjejej
bayesNaive.ppt
bayesNaive.ppt
bayesNaive algorithm in machine learning
Bayesian Learning- part of machine learning
Bayesian Learning by Dr.C.R.Dhivyaa Kongu Engineering College
UNIT II (7).pptx
Module - 4 Machine Learning -22ISE62.pdf
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Naive Bayes.pptx
BAYESIAN theorem and implementation of i
NBaysian classifier, Naive Bayes classifier
2.3 bayesian classification
Unit 2 Machine Learning it's most important topic of basic
Bayes Theorem.pdf
19BayesTheoremClassification19BayesTheoremClassification.ppt
UNIT2_NaiveBayes algorithms used in machine learning

More from Andres Mendez-Vazquez (20)

PDF
2.03 bayesian estimation
PDF
05 linear transformations
PDF
01.04 orthonormal basis_eigen_vectors
PDF
01.03 squared matrices_and_other_issues
PDF
01.02 linear equations
PDF
01.01 vector spaces
PDF
06 recurrent neural_networks
PDF
05 backpropagation automatic_differentiation
PDF
Zetta global
PDF
01 Introduction to Neural Networks and Deep Learning
PDF
25 introduction reinforcement_learning
PDF
Neural Networks and Deep Learning Syllabus
PDF
Introduction to artificial_intelligence_syllabus
PDF
Ideas 09 22_2018
PDF
Ideas about a Bachelor in Machine Learning/Data Sciences
PDF
Analysis of Algorithms Syllabus
PDF
20 k-means, k-center, k-meoids and variations
PDF
18.1 combining models
PDF
17 vapnik chervonenkis dimension
PDF
A basic introduction to learning
2.03 bayesian estimation
05 linear transformations
01.04 orthonormal basis_eigen_vectors
01.03 squared matrices_and_other_issues
01.02 linear equations
01.01 vector spaces
06 recurrent neural_networks
05 backpropagation automatic_differentiation
Zetta global
01 Introduction to Neural Networks and Deep Learning
25 introduction reinforcement_learning
Neural Networks and Deep Learning Syllabus
Introduction to artificial_intelligence_syllabus
Ideas 09 22_2018
Ideas about a Bachelor in Machine Learning/Data Sciences
Analysis of Algorithms Syllabus
20 k-means, k-center, k-meoids and variations
18.1 combining models
17 vapnik chervonenkis dimension
A basic introduction to learning

Recently uploaded (20)

DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
composite construction of structures.pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Geodesy 1.pptx...............................................
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
PPT on Performance Review to get promotions
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
DOCX
573137875-Attendance-Management-System-original
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
R24 SURVEYING LAB MANUAL for civil enggi
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
composite construction of structures.pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Mechanical Engineering MATERIALS Selection
Geodesy 1.pptx...............................................
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPT on Performance Review to get promotions
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
bas. eng. economics group 4 presentation 1.pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
573137875-Attendance-Management-System-original
CYBER-CRIMES AND SECURITY A guide to understanding
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Model Code of Practice - Construction Work - 21102022 .pdf

06 Machine Learning - Naive Bayes

  • 1. Machine Learning for Data Mining Introduction to Bayesian Classifiers Andres Mendez-Vazquez August 3, 2015 1 / 71
  • 2. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 2 / 71
  • 3. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 4. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 5. Classification problem Training Data Samples of the form (d, h(d)) d Where d are the data objects to classify (inputs) h (d) h(d) are the correct class info for d, h(d) ∈ 1, . . . K 3 / 71
  • 6. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 4 / 71
  • 7. Classification Problem Goal Given dnew, provide h(dnew) The Machinery in General looks... Supervised Learning Training Info: Desired/Trget Output INPUT OUTPUT 5 / 71
  • 8. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 6 / 71
  • 9. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 10. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 11. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 12. Naive Bayes Model Task for two classes Let ω1, ω2 be the two classes in which our samples belong. There is a prior probability of belonging to that class P (ω1) for class 1. P (ω2) for class 2. The Rule for classification is the following one P (ωi|x) = P (x|ωi) P (ωi) P (x) (1) Remark: Bayes to the next level. 7 / 71
  • 13. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 14. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 15. In Informal English We have that posterior = likelihood × prior − information evidence (2) Basically One: If we can observe x. Two: we can convert the prior-information to the posterior information. 8 / 71
  • 16. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 17. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 18. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 19. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 20. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 21. We have the following terms... Likelihood We call p (x|ωi) the likelihood of ωi given x: This indicates that given a category ωi: If p (x|ωi) is “large”, then ωi is the “likely” class of x. Prior Probability It is the known probability of a given class. Remark: Because, we lack information about this class, we tend to use the uniform distribution. However: We can use other tricks for it. Evidence The evidence factor can be seen as a scale factor that guarantees that the posterior probability sum to one. 9 / 71
  • 22. The most important term in all this The factor likelihood × prior − information (3) 10 / 71
  • 23. Example We have the likelihood of two classes 11 / 71
  • 24. Example We have the posterior of two classes when P (ω1) = 2 3 and P (ω2) = 1 3 12 / 71
  • 25. Naive Bayes Model In the case of two classes P (x) = 2 i=1 p (x, ωi) = 2 i=1 p (x|ωi) P (ωi) (4) 13 / 71
  • 26. Error in this rule We have that P (error|x) = P (ω1|x) if we decide ω2 P (ω2|x) if we decide ω1 (5) Thus, we have that P (error) = ˆ ∞ −∞ P (error, x) dx = ˆ ∞ −∞ P (error|x) p (x) dx (6) 14 / 71
  • 27. Error in this rule We have that P (error|x) = P (ω1|x) if we decide ω2 P (ω2|x) if we decide ω1 (5) Thus, we have that P (error) = ˆ ∞ −∞ P (error, x) dx = ˆ ∞ −∞ P (error|x) p (x) dx (6) 14 / 71
  • 28. Classification Rule Thus, we have the Bayes Classification Rule 1 If P (ω1|x) > P (ω2|x) x is classified to ω1 2 If P (ω1|x) < P (ω2|x) x is classified to ω2 15 / 71
  • 29. Classification Rule Thus, we have the Bayes Classification Rule 1 If P (ω1|x) > P (ω2|x) x is classified to ω1 2 If P (ω1|x) < P (ω2|x) x is classified to ω2 15 / 71
  • 30. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 31. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 32. What if we remove the normalization factor? Remember P (ω1|x) + P (ω2|x) = 1 (7) We are able to obtain the new Bayes Classification Rule 1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1 2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2 16 / 71
  • 33. We have several cases If for some x we have P (x|ω1) = P (x|ω2) The final decision relies completely from the prior probability. On the Other hand if P (ω1) = P (ω2), the “state” is equally probable In this case the decision is based entirely on the likelihoods P (x|ωi). 17 / 71
  • 34. We have several cases If for some x we have P (x|ω1) = P (x|ω2) The final decision relies completely from the prior probability. On the Other hand if P (ω1) = P (ω2), the “state” is equally probable In this case the decision is based entirely on the likelihoods P (x|ωi). 17 / 71
  • 35. How the Rule looks like If P (ω1) = P (ω2) the Rule depends on the term p (x|ωi) 18 / 71
  • 36. The Error in the Second Case of Naive Bayes Error in equiprobable classes P (error) = 1 2 x0ˆ −∞ p (x|ω2) dx + 1 2 ∞ˆ x0 p (x|ω1) dx (8) Remark: P (ω1) = P (ω2) = 1 2 19 / 71
  • 37. What do we want to prove? Something Notable Bayesian classifier is optimal with respect to minimizing the classification error probability. 20 / 71
  • 38. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 39. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 40. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 41. Proof Step 1 R1 be the region of the feature space in which we decide in favor of ω1 R2 be the region of the feature space in which we decide in favor of ω2 Step 2 Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9) Thus Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2) = P (ω1) ˆ R2 p (x|ω1) dx + P (ω2) ˆ R1 p (x|ω2) dx 21 / 71
  • 42. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 43. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 44. Proof It is more Pe = P (ω1) ˆ R2 p (ω1, x) P (ω1) dx + P (ω2) ˆ R1 p (ω2, x) P (ω2) dx (10) Finally Pe = ˆ R2 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (11) Now, we choose the Bayes Classification Rule R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 22 / 71
  • 45. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 46. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 47. Proof Thus P (ω1) = ˆ R1 p (ω1|x) p (x) dx + ˆ R2 p (ω1|x) p (x) dx (12) Now, we have... P (ω1) − ˆ R1 p (ω1|x) p (x) dx = ˆ R2 p (ω1|x) p (x) dx (13) Then Pe = P (ω1) − ˆ R1 p (ω1|x) p (x) dx + ˆ R1 p (ω2|x) p (x) dx (14) 23 / 71
  • 48. Graphically P (ω1): Thanks Edith 2013 Class!!! In Red 24 / 71
  • 49. Thus we have´ R1 p (ω1|x) p (x) dx = ´ R1 p (ω1, x) dx = PR1 (ω1) Thus 25 / 71
  • 50. Finally Finally Pe = P (ω1) − ˆ R1 [p (ω1|x) − p (ω2|x)] p (x) dx (15) Thus, we have Pe =   P (ω1) − ˆ R1 p (ω1|x) p (x) dx    + ˆ R1 p (ω2|x) p (x) dx 26 / 71
  • 51. Finally Finally Pe = P (ω1) − ˆ R1 [p (ω1|x) − p (ω2|x)] p (x) dx (15) Thus, we have Pe =   P (ω1) − ˆ R1 p (ω1|x) p (x) dx    + ˆ R1 p (ω2|x) p (x) dx 26 / 71
  • 52. Pe for a non optimal rule A great idea Edith!!! 27 / 71
  • 53. Which decision function for minimizing the error A single number in this case 28 / 71
  • 54. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 55. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 56. Error is minimized by the Bayesian Naive Rule Thus The probability of error is minimized at the region of space in which: R1 : P (ω1|x) > P (ω2|x) R2 : P (ω2|x) > P (ω1|x) 29 / 71
  • 57. Pe for an optimal rule A great idea Edith!!! 30 / 71
  • 58. For M classes ω1, ω2, ..., ωM We have that vector x is in ωi P (ωi|x) > P (ωj|x) ∀j = i (16) Something Notable It turns out that such a choice also minimizes the classification error probability. 31 / 71
  • 59. For M classes ω1, ω2, ..., ωM We have that vector x is in ωi P (ωi|x) > P (ωj|x) ∀j = i (16) Something Notable It turns out that such a choice also minimizes the classification error probability. 31 / 71
  • 60. Outline 1 Introduction Supervised Learning Naive Bayes The Naive Bayes Model The Multi-Class Case Minimizing the Average Risk 2 Discriminant Functions and Decision Surfaces Introduction Gaussian Distribution Influence of the Covariance Σ Maximum Likelihood Principle Maximum Likelihood on a Gaussian 32 / 71
  • 61. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 62. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 63. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 64. Minimizing the Risk Something Notable The classification error probability is not always the best criterion to be adopted for minimization. All the errors get the same importance However Certain errors are more important than others. For example Really serious, a doctor makes a wrong decision and a malign tumor gets classified as benign. Not so serious, a benign tumor gets classified as malign. 33 / 71
  • 65. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
  • 66. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
  • 67. It is based on the following idea In order to measure the predictive performance of a function f : X → Y We use a loss function : Y × Y → R+ (17) A non-negative function that quantifies how bad the prediction f (x) is the true label y. Thus, we can say that (f (x) , y) is the loss incurred by f on the pair (x, y). 34 / 71
Example
In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f(x):
ℓ(y′, y) = 1{y′ ≠ y} (18)
35 / 71
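The 0-1 loss of equation (18) can be sketched in a couple of lines; the helper name below is hypothetical, not from the slides:

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 when the prediction differs from the true label, else 0."""
    return 1 if y_pred != y_true else 0

# A correct prediction costs nothing; any mistake costs 1, regardless of kind.
print(zero_one_loss("spam", "spam"))  # 0
print(zero_one_loss("spam", "ham"))   # 1
```

Note that every kind of mistake costs the same, which is exactly the limitation the risk-weighting slides below address.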
Furthermore
For regression problems, some natural choices are:
1 Squared loss: ℓ(y′, y) = (y′ − y)²
2 Absolute loss: ℓ(y′, y) = |y′ − y|
Thus, given the loss function, we can define the risk as
R(f) = E_(X,Y) [ℓ(f(X), Y)] (19)
Although we cannot see the expected risk, we can use the sample to estimate it:
R̂(f) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i) (20)
36 / 71
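Equation (20) is just an average of per-sample losses. A minimal sketch, with hypothetical data and an identity predictor chosen purely for illustration:

```python
def empirical_risk(f, samples, loss):
    """R_hat(f) = (1/N) * sum of loss(f(x_i), y_i) over the sample (eq. 20)."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# The two regression losses from the slide.
squared = lambda yp, y: (yp - y) ** 2
absolute = lambda yp, y: abs(yp - y)

data = [(1.0, 1.5), (2.0, 1.8), (3.0, 3.4)]  # (x_i, y_i) pairs
f = lambda x: x  # identity predictor, just for the example

print(empirical_risk(f, data, squared))   # ≈ 0.15
print(empirical_risk(f, data, absolute))  # ≈ 0.37
```

Swapping the loss changes which predictor minimizes R̂, which is why the choice of ℓ matters before any learning happens.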
Thus
Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):
f̂_N = arg min_{f ∈ F} R̂(f) (21)
If we knew the distribution P and did not restrict ourselves to F, the best function would be
f* = arg min_f R(f) (22)
37 / 71
Thus
For classification with the 0-1 loss, f* is called the Bayesian Classifier:
f*(x) = arg max_{y ∈ Y} P(Y = y|X = x) (23)
38 / 71
The New Risk Function
Now, we have the following loss values for the two classes
λ12 is the loss incurred if the sample is in class 1, but the function classifies it as class 2.
λ21 is the loss incurred if the sample is in class 2, but the function classifies it as class 1.
We can generate the new risk function
r = λ12 Pe(x, ω1) + λ21 Pe(x, ω2)
  = λ12 ∫_{R2} p(x, ω1) dx + λ21 ∫_{R1} p(x, ω2) dx
  = λ12 ∫_{R2} p(x|ω1) P(ω1) dx + λ21 ∫_{R1} p(x|ω2) P(ω2) dx
39 / 71
The New Risk Function
The new risk function to be minimized
r = λ12 P(ω1) ∫_{R2} p(x|ω1) dx + λ21 P(ω2) ∫_{R1} p(x|ω2) dx (24)
The λ terms work as weight factors for each error
Thus, we could have something like λ12 > λ21.
Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
40 / 71
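To see equation (24) in numbers, here is a sketch for 1-D Gaussian class conditionals, where the two error integrals become Gaussian tail probabilities. All names, thresholds, and loss values below are hypothetical choices for illustration:

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def weighted_risk(t, lam12, lam21, p1, p2, mu1, mu2, sigma=1.0):
    """Equation (24) for R1 = (-inf, t) and R2 = [t, inf):
    r = lam12*P(w1)*int_{R2} p(x|w1)dx + lam21*P(w2)*int_{R1} p(x|w2)dx."""
    err1 = 1.0 - norm_cdf(t, mu1, sigma)  # class-1 mass falling in R2
    err2 = norm_cdf(t, mu2, sigma)        # class-2 mass falling in R1
    return lam12 * p1 * err1 + lam21 * p2 * err2

# Equal priors; misclassifying class 1 is three times as costly.
risks = {t: weighted_risk(t, 3.0, 1.0, 0.5, 0.5, 0.0, 2.0)
         for t in [0.5, 1.0, 1.5]}
print(risks)  # the best threshold shifts into class-2 territory when lam12 > lam21
```

With λ12 = λ21 the optimal threshold sits at the midpoint 1.0; making λ12 larger pushes it toward class 2, exactly the weighting effect described above.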
Now, Consider an M-class problem
We have
Rj, j = 1, ..., M, the regions where the classes ωj live.
Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.
Now, for this error we associate the term λki (loss)
With that we have a loss matrix L whose (k, i) location corresponds to such a loss.
41 / 71
Thus, we have a general loss associated to each class ωk
Definition
r_k = Σ_{i=1}^{M} λki ∫_{Ri} p(x|ωk) dx (25)
We want to minimize the global risk
r = Σ_{k=1}^{M} r_k P(ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (26)
For this
We want to select the set of partition regions Rj.
42 / 71
How do we do that?
We minimize each integral
∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (27)
We need to minimize
Σ_{k=1}^{M} λki p(x|ωk) P(ωk) (28)
We can do the following
If x ∈ Ri, then
l_i = Σ_{k=1}^{M} λki p(x|ωk) P(ωk) < l_j = Σ_{k=1}^{M} λkj p(x|ωk) P(ωk) (29)
for all j ≠ i.
43 / 71
Remarks
When we have Kronecker’s delta
δki = 1 if k = i, 0 if k ≠ i (30)
We can do the following
λki = 1 − δki (31)
We finish with
Σ_{k=1, k≠i}^{M} p(x|ωk) P(ωk) < Σ_{k=1, k≠j}^{M} p(x|ωk) P(ωk) (32)
44 / 71
Then
We have that
p(x|ωj) P(ωj) < p(x|ωi) P(ωi) (33)
for all j ≠ i.
45 / 71
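Equation (33) says: assign x to the class with the largest p(x|ωi)P(ωi). A minimal sketch of that rule, assuming known 1-D Gaussian class conditionals (the class names and parameters are hypothetical):

```python
import math

def gauss(mu, sigma):
    """Return the pdf of N(mu, sigma^2) as a callable."""
    return lambda x: (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                      / (sigma * math.sqrt(2 * math.pi)))

def classify(x, likelihoods, priors):
    """Assign x to the class maximizing p(x|w_i) * P(w_i) (eq. 33)."""
    scores = {c: likelihoods[c](x) * priors[c] for c in priors}
    return max(scores, key=scores.get)

likelihoods = {"w1": gauss(0.0, 1.0), "w2": gauss(2.0, 1.0)}
priors = {"w1": 0.5, "w2": 0.5}
print(classify(0.3, likelihoods, priors))  # w1
print(classify(1.7, likelihoods, priors))  # w2
```

With equal priors and equal variances, the boundary sits halfway between the means; unequal priors would shift it toward the less probable class.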
Special Case: The Two-Class Case
We get
l1 = λ11 p(x|ω1) P(ω1) + λ21 p(x|ω2) P(ω2)
l2 = λ12 p(x|ω1) P(ω1) + λ22 p(x|ω2) P(ω2)
We assign x to ω1 if l1 < l2, that is
(λ21 − λ22) p(x|ω2) P(ω2) < (λ12 − λ11) p(x|ω1) P(ω1) (34)
If we assume that λii < λij (correct decisions are penalized much less than wrong ones)
x ∈ ω1 (ω2) if l12 ≡ p(x|ω1)/p(x|ω2) > (<) [P(ω2)/P(ω1)] · [(λ21 − λ22)/(λ12 − λ11)] (35)
46 / 71
Special Case: The Two-Class Case
Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
47 / 71
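The likelihood ratio test of equation (35) can be sketched directly. Here the class conditionals are assumed to be N(0, 1) and N(2, 1), and the loss values are hypothetical, chosen so that class-1 errors cost three times as much:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, p1, p2, lam11, lam12, lam21, lam22):
    """Decide w1 iff l12 = p(x|w1)/p(x|w2) exceeds the threshold in eq. (35)."""
    l12 = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0)
    threshold = (p2 / p1) * (lam21 - lam22) / (lam12 - lam11)
    return "w1" if l12 > threshold else "w2"

# Zero loss for correct decisions; lam12 = 3, lam21 = 1.
print(likelihood_ratio_test(1.2, 0.5, 0.5, 0.0, 3.0, 1.0, 0.0))  # w1
```

The unweighted boundary would be at x = 1.0; with λ12 = 3 the threshold drops to 1/3, moving the boundary to about x = 1.55, so points such as 1.2 are still handed to ω1.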
Decision Surface
Because R1 and R2 are contiguous
The separating surface between them is described by
P(ω1|x) − P(ω2|x) = 0 (36)
Thus, we define the decision function as
g12(x) = P(ω1|x) − P(ω2|x) = 0 (37)
49 / 71
Which decision function for the Naive Bayes?
A single number in this case.
50 / 71
In general
First
Instead of working with probabilities, we work with an equivalent function of them, gi(x) = f(P(ωi|x)).
Classic Example
The monotonically increasing f(P(ωi|x)) = ln P(ωi|x).
The decision test is now
Classify x in ωi if gi(x) > gj(x) ∀j ≠ i.
The decision surfaces, separating contiguous regions, are described by
gij(x) = gi(x) − gj(x) = 0, i, j = 1, 2, ..., M, i ≠ j
51 / 71
Gaussian Distribution
We can use the Gaussian distribution
p(x|ωi) = (1/((2π)^{l/2} |Σi|^{1/2})) exp(−(1/2) (x − µi)^T Σi^{−1} (x − µi)) (38)
Example
53 / 71
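Equation (38) for the common l = 2 case can be evaluated by hand, inverting the 2x2 covariance explicitly; this is a sketch, not a production density routine:

```python
import math

def gaussian_pdf_2d(x, mu, Sigma):
    """p(x|w) for a 2-D Gaussian, eq. (38) with l = 2."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))

# At the mean with unit covariance the density is 1/(2*pi) ≈ 0.1592.
print(gaussian_pdf_2d([0, 0], [0, 0], [[1, 0], [0, 1]]))
```

The exponent is exactly the quadratic form that reappears in the discriminant functions and the likelihood derivations below.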
Some Properties
About Σ
It is the covariance matrix between variables.
Thus
It is positive definite.
It is symmetric.
The inverse exists.
54 / 71
Influence of the Covariance Σ
Look at the following covariance
Σ = [[1, 0], [0, 1]]
It is simply the unit Gaussian with mean µ.
56 / 71
The Covariance Σ as a Scaling
Look at the following covariance
Σ = [[4, 0], [0, 1]]
It stretches the circle along the x-axis into an ellipse.
57 / 71
Influence of the Covariance Σ
Look at the following covariance
Σa = R Σb R^T with R = [[cos θ, −sin θ], [sin θ, cos θ]]
It allows us to rotate the axes.
58 / 71
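The rotation Σa = R Σb R^T can be checked numerically with plain 2x2 list arithmetic; this is an illustrative sketch, not library code:

```python
import math

def rotate_cov(Sigma, theta):
    """Sigma_a = R * Sigma_b * R^T for the rotation matrix R(theta)."""
    ct, st = math.cos(theta), math.sin(theta)
    R = [[ct, -st], [st, ct]]
    Rt = [[ct, st], [-st, ct]]  # transpose of R

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return matmul(matmul(R, Sigma), Rt)

# Rotating the stretched covariance diag(4, 1) by 45 degrees
Sigma_b = [[4.0, 0.0], [0.0, 1.0]]
Sigma_a = rotate_cov(Sigma_b, math.pi / 4)
print(Sigma_a)  # off-diagonal terms appear: the ellipse axes are now tilted
```

Rotating by 45 degrees yields [[2.5, 1.5], [1.5, 2.5]]: the variance splits evenly between the axes and the nonzero covariance encodes the tilt.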
Now For Two Classes
Then, we use the following trick for the two classes i = 1, 2
We know that the pdf of correct classification is p(x, ωi) = p(x|ωi) P(ωi).
Thus
It is possible to generate the following decision function:
gi(x) = ln [p(x|ωi) P(ωi)] = ln p(x|ωi) + ln P(ωi) (39)
Thus
gi(x) = −(1/2) (x − µi)^T Σi^{−1} (x − µi) + ln P(ωi) + ci (40)
59 / 71
Given a series of classes ω1, ω2, ..., ωM
We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj).
We call those samples i.i.d.: independent, identically distributed random variables.
We assume in addition
p(x|ωj) has a known parametric form with a vector θj of parameters.
61 / 71
Given a series of classes ω1, ω2, ..., ωM
For example
p(x|ωj) ∼ N(µj, Σj) (41)
In our case
We will assume that there is no dependence between classes.
62 / 71
Now
Suppose that ωj contains n samples x1, x2, ..., xn
p(x1, x2, ..., xn|θj) = Π_{k=1}^{n} p(xk|θj) (42)
We can then see p(x1, x2, ..., xn|θj) as a function of θj, the likelihood
L(θj) = Π_{k=1}^{n} p(xk|θj) (43)
63 / 71
Example
L(θj) = log Π_{k=1}^{n} p(xk|θj) = Σ_{k=1}^{n} log p(xk|θj)
64 / 71
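The log-likelihood above can be sketched for a 1-D Gaussian, where θ = (µ, σ); the data values are hypothetical:

```python
import math

def log_likelihood(samples, mu, sigma):
    """L(theta) = sum_k log p(x_k | theta) for a 1-D Gaussian N(mu, sigma^2)."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for x in samples)

data = [1.2, 0.8, 1.1, 0.9]
# The likelihood is largest for parameters near the sample mean (1.0 here).
print(log_likelihood(data, 1.0, 1.0), log_likelihood(data, 3.0, 1.0))
```

Maximum likelihood, developed next, picks the θ where this sum peaks; for the Gaussian that maximizer has the closed form derived on the following slides.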
Maximum Likelihood on a Gaussian
Then, using the log
ln L(θi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c2 (44)
We know that
d(x^T A x)/dx = Ax + A^T x, d(Ax)/dx = A (45)
Thus, we expand equation (44)
−(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} [xj^T Σi^{−1} xj − 2 xj^T Σi^{−1} µi + µi^T Σi^{−1} µi] + c2 (46)
66 / 71
Maximum Likelihood
Then
∂ ln L(θi)/∂µi = Σ_{j=1}^{n} Σi^{−1} (xj − µi) = 0
n Σi^{−1} (−µi + (1/n) Σ_{j=1}^{n} xj) = 0
µ̂i = (1/n) Σ_{j=1}^{n} xj
67 / 71
Maximum Likelihood
Then, we derive with respect to Σi
For this we use the following tricks:
1 ∂ log |Σ|/∂Σ^{−1} = −Σ (since log |Σ| = −log |Σ^{−1}| and ∂ log |A|/∂A = (A^{−1})^T)
2 ∂Tr[AB]/∂A = ∂Tr[BA]/∂A = B^T
3 The trace of a number is the number itself.
4 Tr(A^T B) = Tr(B A^T)
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} (xj − µi)^T Σi^{−1} (xj − µi) + c1 (47)
68 / 71
Maximum Likelihood
Thus
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[(xj − µi)^T Σi^{−1} (xj − µi)] + c1 (48)
Using the tricks
f(Σi) = −(n/2) ln |Σi| − (1/2) Σ_{j=1}^{n} Trace[Σi^{−1} (xj − µi)(xj − µi)^T] + c1 (49)
69 / 71
Maximum Likelihood
Derivative with respect to Σi^{−1}
∂f(Σi)/∂Σi^{−1} = (n/2) Σi − (1/2) Σ_{j=1}^{n} (xj − µi)(xj − µi)^T (50)
Thus, when making it equal to zero
Σ̂i = (1/n) Σ_{j=1}^{n} (xj − µ̂i)(xj − µ̂i)^T (51)
70 / 71
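The closed-form estimators µ̂ and Σ̂ from the derivation can be sketched with plain lists (the data points are hypothetical). Note that, as in equation (51), this is the biased MLE dividing by n, not the n − 1 of the unbiased sample covariance:

```python
def ml_estimates(X):
    """Closed-form ML estimates for a Gaussian:
    mu_hat = sample mean;
    Sigma_hat = (1/n) * sum (x_j - mu_hat)(x_j - mu_hat)^T  (eq. 51)."""
    n, d = len(X), len(X[0])
    mu = [sum(x[k] for x in X) / n for k in range(d)]
    Sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in X) / n
              for b in range(d)] for a in range(d)]
    return mu, Sigma

# Four points at the corners of a square centered at (1, 1).
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat, Sigma_hat = ml_estimates(X)
print(mu_hat)     # [1.0, 1.0]
print(Sigma_hat)  # [[1.0, 0.0], [0.0, 1.0]]
```

Plugging µ̂i and Σ̂i back into the discriminant gi(x) of equation (40) is exactly how the Gaussian classifier is trained from labeled samples.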
Exercises
Duda and Hart, Chapter 3
3.1, 3.2, 3.3, 3.13
Theodoridis, Chapter 2
2.5, 2.7, 2.10, 2.12, 2.14, 2.17
71 / 71