LECTURE 7: Kernel Density Estimation
Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna
Texas A&M University
g Non-parametric Density Estimation
g Histograms
g Parzen Windows
g Smooth Kernels
g Product Kernel Density Estimation
g The Naïve Bayes Classifier
Non-parametric density estimation
g In the previous two lectures we have assumed that either
n The likelihoods p(x|ωi) were known (Likelihood Ratio Test), or
n At least the parametric form of the likelihoods was known (Parameter Estimation)
g The methods that will be presented in the next two lectures do not afford such luxuries
n Instead, they attempt to estimate the density directly from the data without making any parametric assumptions about the underlying distribution
n Sounds challenging? You bet!
[Figure: labeled training data in the (x1, x2) plane and the corresponding non-parametric estimate of the class-conditional density P(x1, x2|ωi)]
The histogram
g The simplest form of non-parametric D.E. is the familiar histogram
n Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin

$$p_H(x) = \frac{1}{N}\;\frac{\left[\text{number of } x^{(k)} \text{ in the same bin as } x\right]}{\left[\text{width of the bin containing } x\right]}$$

g The histogram requires two “parameters” to be defined: the bin width and the starting position of the first bin
g The histogram is a very simple form of D.E., but it has several drawbacks
n The final shape of the density estimate depends on the starting position of the bins
g For multivariate data, the final shape of the density is also affected by the orientation of the bins
n The discontinuities of the estimate are not due to the underlying density; they are only an artifact of the chosen bin locations
g These discontinuities make it very difficult, without experience, to grasp the structure of the data
n A much more serious problem is the curse of dimensionality, since the number of bins grows exponentially with the number of dimensions
g In high dimensions we would require a very large number of examples or else most of the bins would be empty
g All these drawbacks make the histogram unsuitable for most practical applications except for rapid visualization of results in one or two dimensions
n Therefore, we will not spend more time looking at the histogram
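The slides contain no code, but the histogram estimate above is easy to spell out. The following Python/NumPy sketch is my own illustration (the function name, arguments and sample data are hypothetical, not from the lecture); it computes p_H(x) by counting the training points in the bin that contains x:

```python
import numpy as np

def histogram_density(data, bin_width, start, x):
    """Histogram density estimate p_H(x): fraction of training points that
    fall in the bin containing x, divided by the bin width."""
    data = np.asarray(data, dtype=float)
    bin_idx = int(np.floor((x - start) / bin_width))   # index of the bin holding x
    lo = start + bin_idx * bin_width
    hi = lo + bin_width
    k = np.sum((data >= lo) & (data < hi))             # points in the same bin as x
    return k / (len(data) * bin_width)

# Illustrative call: bins of width 0.5 starting at 0, estimate at x = 0.3
samples = np.array([0.1, 0.2, 0.25, 0.4, 0.7, 0.9, 1.1])
print(histogram_density(samples, bin_width=0.5, start=0.0, x=0.3))
```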
Non-parametric density estimation, general formulation (1)
g Let us return to the basic definition of probability to get a solid idea of what we are trying to accomplish
n The probability that a vector x, drawn from a distribution p(x), will fall in a given region ℜ of the sample space is

$$P = \int_{\mathcal{R}} p(x')\,dx'$$

n Suppose now that N vectors {x(1, x(2, …, x(N} are drawn from the distribution. The probability that k of these N vectors fall in ℜ is given by the binomial distribution

$$P(k) = \binom{N}{k}\,P^{k}\,(1-P)^{N-k}$$

n It can be shown (from the properties of the binomial p.m.f.) that the mean and variance of the ratio k/N are

$$E\!\left[\frac{k}{N}\right] = P \qquad \text{and} \qquad \mathrm{Var}\!\left[\frac{k}{N}\right] = E\!\left[\left(\frac{k}{N} - P\right)^{2}\right] = \frac{P\,(1-P)}{N}$$

n Therefore, as N→∞, the distribution becomes sharper (the variance gets smaller), so we can expect that a good estimate of the probability P can be obtained from the mean fraction of the points that fall within ℜ

$$P \cong \frac{k}{N}$$
From [Bishop, 1995]
Non-parametric density estimation, general formulation (2)
n On the other hand, if we assume that ℜ is so small that p(x) does not vary appreciably within it, then

$$P = \int_{\mathcal{R}} p(x')\,dx' \;\cong\; p(x)\,V$$

g where V is the volume enclosed by region ℜ
n Merging with the previous result we obtain

$$\left.\begin{aligned} P &\cong \frac{k}{N} \\ P &\cong p(x)\,V \end{aligned}\right\}\;\Rightarrow\; p(x) \cong \frac{k}{NV}$$

n This estimate becomes more accurate as we increase the number of sample points N and shrink the volume V
g In practice the value of N (the total number of examples) is fixed
n In order to improve the accuracy of the estimate p(x) we could let V approach zero, but then the region ℜ would become so small that it would enclose no examples
n This means that, in practice, we will have to find a compromise value for the volume V
g Large enough to include enough examples within ℜ
g Small enough to support the assumption that p(x) is constant within ℜ
From [Bishop, 1995]
Non-parametric density estimation, general formulation (3)
g In conclusion, the general expression for non-parametric density estimation becomes

$$p(x) \cong \frac{k}{NV} \qquad \text{where} \quad \begin{cases} V \text{ is the volume surrounding } x \\ N \text{ is the total number of examples} \\ k \text{ is the number of examples inside } V \end{cases}$$

g When applying this result to practical density estimation problems, two basic approaches can be adopted
n We can choose a fixed value of the volume V and determine k from the data. This leads to methods commonly referred to as Kernel Density Estimation (KDE), which are the subject of this lecture
n We can choose a fixed value of k and determine the corresponding volume V from the data. This gives rise to the k Nearest Neighbor (kNN) approach, which will be covered in the next lecture
g It can be shown that both kNN and KDE converge to the true probability density as N→∞, provided that V shrinks with N, and k grows with N appropriately
From [Bishop, 1995]
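To make the expression p(x) ≅ k/(NV) concrete, here is a small Python sketch (my own illustration, not from the slides) that fixes V as a ball of radius r around x and counts the k samples that fall inside; the helper name and the choice of a spherical region are assumptions:

```python
import numpy as np
from math import gamma, pi

def density_fixed_volume(data, x, r):
    """p(x) ~= k / (N V): count the k samples inside a ball of radius r
    centered at x and divide by N times the ball's volume."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    k = np.sum(np.linalg.norm(data - x, axis=1) < r)      # examples inside V
    V = pi ** (D / 2) / gamma(D / 2 + 1) * r ** D         # volume of a D-ball
    return k / (N * V)
```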
Parzen windows (1)
g Suppose that the region ℜ that encloses the k examples is a hypercube with sides of length h centered at the estimation point x
n Then its volume is given by V = h^D, where D is the number of dimensions

[Figure: a hypercube of side h centered at the estimation point x, enclosing the examples that fall within the region ℜ]

g To find the number of examples that fall within this region we define a kernel function K(u)

$$K(u) = \begin{cases} 1 & |u_j| < 1/2 \quad \forall j = 1, \dots, D \\ 0 & \text{otherwise} \end{cases}$$

n This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator
n The quantity K((x-x(n)/h) is then equal to unity if the point x(n is inside a hypercube of side h centered on x, and zero otherwise
From [Bishop, 1995]
Parzen windows (2)
g The total number of points inside the hypercube is then

$$k = \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

g Substituting back into the expression for the density estimate

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

n Notice that the Parzen window density estimate resembles the histogram, with the exception that the bin locations are determined by the data points

[Figure: a one-dimensional example with four points x(1, x(2, x(3, x(4 around the estimation point x; K(x-x(1)=K(x-x(2)=K(x-x(3)=1 and K(x-x(4)=0, so each point inside the window contributes 1/V to the estimate]
From [Bishop, 1995]
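A direct Python/NumPy transcription of the two formulas above; this is a sketch (names and array shapes are my own, not from the slides):

```python
import numpy as np

def parzen_window_kde(x, data, h):
    """Parzen-window estimate p_KDE(x) = 1/(N h^D) * sum_n K((x - x^(n))/h),
    with the unit-hypercube kernel K(u) = 1 iff every |u_j| < 1/2."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    u = (x - data) / h                                     # kernel arguments, shape (N, D)
    k = np.sum(np.all(np.abs(u) < 0.5, axis=1))            # points inside the hypercube
    return k / (N * h ** D)
```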
Parzen windows (3)
g To understand the role of the kernel function we compute the expectation of the probability estimate p(x)

$$E\!\left[p_{KDE}(x)\right] = \frac{1}{N h^{D}} \sum_{n=1}^{N} E\!\left[K\!\left(\frac{x - x^{(n)}}{h}\right)\right] = \frac{1}{h^{D}}\, E\!\left[K\!\left(\frac{x - x^{(n)}}{h}\right)\right] = \frac{1}{h^{D}} \int K\!\left(\frac{x - x'}{h}\right) p(x')\,dx'$$

n where we have assumed that the vectors x(n are drawn independently from the true density p(x)
g We can see that the expectation of the estimated density pKDE(x) is a convolution of the true density p(x) with the kernel function
n The width h of the kernel plays the role of a smoothing parameter: the wider the kernel function, the smoother the estimate pKDE(x)
g For h→0, the kernel approaches a Dirac delta function and pKDE(x) approaches the true density
n However, in practice we have a finite number of points, so h cannot be made arbitrarily small, since the density estimate pKDE(x) would then degenerate to a set of impulses located at the training data points
From [Bishop, 1995]
Numeric exercise
g Given the dataset below, use Parzen windows to estimate the density
p(x) at y=3,10,15. Use a bandwidth of h=4
n X = {x(1, x(2,…x(N} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}
g Solution
n Let’s first draw the dataset to get an idea of what numerical results we should expect

[Figure: the ten data points plotted on the x axis, with the estimation points y=3, y=10 and y=15 marked]

n Let’s now estimate p(y=3):

$$p_{KDE}(y=3) = \frac{1}{Nh}\sum_{n=1}^{N} K\!\left(\frac{y - x^{(n)}}{h}\right) = \frac{1}{10\cdot 4}\left[K\!\left(\tfrac{3-4}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-6}{4}\right) + \dots + K\!\left(\tfrac{3-17}{4}\right)\right] = \frac{1}{10\cdot 4}\left[1+0+0+0+0+0+0+0+0+0\right] = 0.025$$

n Similarly

$$p_{KDE}(y=10) = \frac{1}{10\cdot 4}\left[0+0+0+0+0+0+0+0+0+0\right] = 0$$

$$p_{KDE}(y=15) = \frac{1}{10\cdot 4}\left[0+0+0+0+0+1+1+1+1+0\right] = \frac{4}{10\cdot 4} = 0.1$$
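The arithmetic above can be checked numerically; a one-dimensional sketch (assuming the same box kernel and the dataset from the exercise; the function name is mine):

```python
import numpy as np

def parzen_1d(y, data, h):
    """1-D Parzen-window estimate with the unit-box kernel and bandwidth h."""
    data = np.asarray(data, dtype=float)
    k = np.sum(np.abs((y - data) / h) < 0.5)   # samples inside the window around y
    return k / (len(data) * h)

X = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
for y in (3, 10, 15):
    print(y, parzen_1d(y, X, h=4))   # prints 0.025, 0.0 and 0.1, matching the slide
```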
Smooth kernels (1)
g The Parzen window has several drawbacks
n Yields density estimates that have discontinuities
n Weights all the points x(n equally, regardless of their distance to the estimation point x
g It is easy to overcome some of these difficulties by generalizing the Parzen window with a smooth kernel function K(u) which satisfies the condition

$$\int_{\mathbb{R}^{D}} K(x)\,dx = 1$$

n Usually, but not always, K(u) will be a radially symmetric and unimodal probability density function, such as the multivariate Gaussian density function

$$K(x) = \frac{1}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\,x^{T}x\right)$$

n where the expression of the density estimate remains the same as with Parzen windows

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$
From [Bishop, 1995]
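A sketch of the same estimator with the multivariate Gaussian kernel above (illustrative code, not part of the lecture; names are assumptions):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Smooth-kernel estimate with K(u) = (2*pi)^(-D/2) * exp(-u'u/2);
    the outer formula is identical to the Parzen-window case."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    u = (x - data) / h
    K = (2 * np.pi) ** (-D / 2) * np.exp(-0.5 * np.sum(u ** 2, axis=1))
    return np.sum(K) / (N * h ** D)
```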
Smooth kernels (2)
g Just as the Parzen window estimate can be considered a sum of boxes
centered at the observations, the smooth kernel estimate is a sum of
“bumps” placed at the data points
n The kernel function determines the shape of the bumps
n The parameter h, also called the smoothing parameter or bandwidth,
determines their width
[Figure: p_KDE(x) for h=3, showing the density estimate as the sum of Gaussian kernel "bumps" placed at the data points]
Choosing the bandwidth: univariate case (1)
[Figure: kernel density estimates of the same dataset for h=1.0, h=2.5, h=5.0 and h=10.0, illustrating under- and over-smoothing]
g The problem of choosing the bandwidth is crucial in density estimation
n A large bandwidth will over-smooth the density and mask the structure in the data
n A small bandwidth will yield a density estimate that is spiky and very hard to interpret
Choosing the bandwidth: univariate case (2)
g We would like to find a value of the smoothing parameter that minimizes the error between the estimated density and the true density
n A natural measure is the mean square error at the estimation point x, defined by

$$\mathrm{MSE}_{x}\!\left(p_{KDE}\right) = E\!\left[\left(p_{KDE}(x) - p(x)\right)^{2}\right] = \underbrace{\left(E\!\left[p_{KDE}(x)\right] - p(x)\right)^{2}}_{\text{bias}^{2}} \;+\; \underbrace{\mathrm{var}\!\left[p_{KDE}(x)\right]}_{\text{variance}}$$

g This expression is an example of the bias-variance tradeoff that we saw earlier in the course: the bias can be reduced at the expense of the variance, and vice versa
n The bias of an estimate is the systematic error incurred in the estimation
n The variance of an estimate is the random error incurred in the estimation
g The bias-variance dilemma applied to bandwidth selection simply means that
n A large bandwidth will reduce the differences among the estimates of pKDE(x) for different data sets (the variance), but it will increase the bias of pKDE(x) with respect to the true density p(x)
n A small bandwidth will reduce the bias of pKDE(x), at the expense of a larger variance in the estimates pKDE(x)

[Figure: multiple kernel density estimates of the same true density; a small bandwidth (h=0.1) yields large variance across data sets, while a large bandwidth (h=2.0) yields large bias with respect to the true density]
Bandwidth selection methods, univariate case (3)
g Subjective choice
n The natural way for choosing the smoothing parameter is to plot out several curves and
choose the estimate that is most in accordance with one’s prior (subjective) ideas
n However, this method is not practical in pattern recognition since we typically have high-
dimensional data
g Reference to a standard distribution
n Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated square error (MISE)

$$h_{opt} = \arg\min_{h}\left\{\mathrm{MISE}\!\left(p_{KDE}\right)\right\} = \arg\min_{h}\left\{E\!\left[\int \left(p_{KDE}(x) - p(x)\right)^{2}\,dx\right]\right\}$$

n If we assume that the true distribution is a Gaussian density and we use a Gaussian kernel, it can be shown that the optimal value of the bandwidth becomes

$$h_{opt} = 1.06\,\sigma\,N^{-1/5}$$

g where σ is the sample standard deviation and N is the number of training examples
From [Silverman, 1986]
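The rule of thumb above is one line of code; a univariate sketch (the function name is mine):

```python
import numpy as np

def silverman_bandwidth(data):
    """h_opt = 1.06 * sigma * N^(-1/5), the Gaussian-reference rule of thumb."""
    data = np.asarray(data, dtype=float)
    return 1.06 * np.std(data, ddof=1) * len(data) ** (-1.0 / 5.0)
```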
Bandwidth selection methods, univariate case (4)
n Better results can be obtained if we use a robust measure of the spread instead of the sample standard deviation and we reduce the coefficient 1.06 to better cope with multimodal densities. The optimal bandwidth then becomes

$$h_{opt} = 0.9\,A\,N^{-1/5} \qquad \text{where} \qquad A = \min\!\left(\sigma,\ \frac{\mathrm{IQR}}{1.34}\right)$$

g IQR is the interquartile range, a robust estimate of the spread, computed as the difference between the 75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 − Q1 (one half of this quantity is known as the semi-interquartile range)
n A percentile rank is the proportion of examples in a distribution that a specific example is greater than or equal to
g Likelihood cross-validation
n The ML estimate of h is degenerate, since it yields hML=0, a density estimate with Dirac delta functions at each training data point
n A practical alternative is to maximize the “pseudo-likelihood” computed using leave-one-out cross-validation

$$h_{MLCV} = \arg\max_{h}\left\{\frac{1}{N}\sum_{n=1}^{N}\log p_{-n}\!\left(x^{(n)}\right)\right\} \qquad \text{where} \qquad p_{-n}\!\left(x^{(n)}\right) = \frac{1}{(N-1)\,h}\sum_{\substack{m=1\\ m\neq n}}^{N} K\!\left(\frac{x^{(n)}-x^{(m)}}{h}\right)$$

From [Silverman, 1986]
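Both the robust rule of thumb and the leave-one-out pseudo-likelihood can be sketched in a few lines of Python; the Gaussian kernel and the grid search shown in the trailing comment are my own choices, not prescribed by the slides:

```python
import numpy as np

def robust_bandwidth(data):
    """h_opt = 0.9 * A * N^(-1/5), with A = min(sigma, IQR / 1.34)."""
    data = np.asarray(data, dtype=float)
    sigma = np.std(data, ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))      # Q3 - Q1
    A = min(sigma, iqr / 1.34)
    return 0.9 * A * len(data) ** (-1.0 / 5.0)

def pseudo_log_likelihood(data, h):
    """(1/N) * sum_n log p_{-n}(x^(n)), the leave-one-out pseudo-likelihood
    with a Gaussian kernel; h_MLCV is the h that maximizes this quantity."""
    data = np.asarray(data, dtype=float)
    N = len(data)
    total = 0.0
    for n in range(N):
        others = np.delete(data, n)                        # leave x^(n) out
        K = np.exp(-0.5 * ((data[n] - others) / h) ** 2) / np.sqrt(2 * np.pi)
        total += np.log(np.sum(K) / ((N - 1) * h))
    return total / N

# h_MLCV by grid search (hypothetical usage):
#   grid = np.linspace(0.05, 2.0, 50)
#   h_mlcv = grid[np.argmax([pseudo_log_likelihood(X, h) for h in grid])]
```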
Multivariate density estimation
g The derived expression of the estimate pKDE(x) for multiple dimensions was

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

n Notice that the bandwidth h is the same for all the axes, so this density estimate will weight all the axes equally
g However, if the spread of the data is much greater in one of the coordinate directions than the others, we should use a vector of smoothing parameters or even a full covariance matrix, which complicates the procedure
g There are two basic alternatives to solve the scaling problem without having to use a more general kernel density estimate
n Pre-scale each axis (normalize to unit variance, for instance)
n Pre-whiten the data (linearly transform it to have unit covariance matrix), estimate the density, and then transform back [Fukunaga]
g The whitening transform is simply y = Λ^(-1/2) M^T x, where Λ and M are the eigenvalue and eigenvector matrices of the sample covariance of x
g Fukunaga’s method is equivalent to using a hyper-ellipsoidal kernel
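A sketch of the pre-whitening step described above (assumes a nonsingular sample covariance; the function name is mine):

```python
import numpy as np

def whiten(X):
    """y = Lambda^(-1/2) M' x, where Lambda and M hold the eigenvalues and
    eigenvectors of the sample covariance of x; the output has unit covariance."""
    X = np.asarray(X, dtype=float)             # shape (N, D)
    Xc = X - X.mean(axis=0)                    # center the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = np.diag(eigvals ** -0.5) @ eigvecs.T   # whitening matrix Lambda^(-1/2) M'
    return Xc @ W.T                            # KDE can now use a single bandwidth
```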
Product kernels
g A very popular method for performing multivariate density estimation is the product kernel, defined as

$$p_{PKDE}(x) = \frac{1}{N}\sum_{n=1}^{N} K\!\left(x, x^{(n)}, h_1, \dots, h_D\right) \qquad \text{where} \qquad K\!\left(x, x^{(n)}, h_1, \dots, h_D\right) = \frac{1}{h_1 \cdots h_D}\prod_{d=1}^{D} K_d\!\left(\frac{x(d) - x^{(n)}(d)}{h_d}\right)$$

n The product kernel consists of the product of one-dimensional kernels
g Typically the same kernel function is used in each dimension ( Kd(x)=K(x) ), and only the bandwidths are allowed to differ
n Bandwidth selection can then be performed with any of the methods presented for univariate density estimation
g It is important to notice that although the expression of K(x,x(n,h1,…hD) uses kernel independence, this does not imply that any type of feature independence is being assumed
n A density estimation method that assumed feature independence would have the following expression

$$p_{FEAT\text{-}IND}(x) = \prod_{d=1}^{D}\left[\frac{1}{N\,h_d}\sum_{n=1}^{N} K_d\!\left(\frac{x(d) - x^{(n)}(d)}{h_d}\right)\right]$$

n Notice how the order of the summation and product is reversed compared to the product kernel
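A product-kernel sketch with a 1-D Gaussian kernel on every axis and one bandwidth per dimension (illustrative code, not the lecture's):

```python
import numpy as np

def product_kernel_kde(x, data, h):
    """p_PKDE(x) = (1/N) * sum_n prod_d K((x(d) - x_n(d)) / h_d) / h_d."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)                         # shape (D,)
    h = np.asarray(h, dtype=float)                         # one bandwidth per axis
    u = (x - data) / h                                     # shape (N, D)
    K1d = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)       # 1-D Gaussian kernels
    return np.mean(np.prod(K1d / h, axis=1))               # product over d, mean over n
```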
Product kernel, example 1
[Figure: contour plots over (x1, x2) of the true density (left) and the two product kernel density estimates (middle, right)]
g This example shows the product kernel density estimate of a bivariate unimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^(-1/5) (middle) and h=0.9AN^(-1/5) (right)
Product kernel, example 2
[Figure: contour plots over (x1, x2) of the true bimodal density (left) and the two product kernel density estimates (middle, right)]
g This example shows the product kernel density estimate of a bivariate bimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^(-1/5) (middle) and h=0.9AN^(-1/5) (right)
Naïve Bayes classifier
g Recall that the Bayes classifier is given by the following family of discriminant functions

$$\text{choose } \omega_i \;\text{ if }\; g_i(x) > g_j(x) \;\;\forall j \neq i, \qquad \text{where } g_i(x) = P(\omega_i \mid x)$$

g Using Bayes rule, these discriminant functions can be expressed as

$$g_i(x) = P(\omega_i \mid x) \;\propto\; P(x \mid \omega_i)\,P(\omega_i)$$

n where P(ωi) is our prior knowledge and P(x|ωi) is obtained through density estimation
g Although we have presented density estimation methods that allow us to estimate the multivariate likelihood P(x|ωi), the curse of dimensionality makes it a very tough problem!
g One highly practical simplification of the Bayes classifier is the so-called Naïve Bayes classifier
n The Naïve Bayes classifier makes the assumption that the features are class-conditionally independent

$$P(x \mid \omega_i) = \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_i\right)$$

g It is important to notice that this assumption is not as rigid as assuming independent features, which would instead imply

$$P(x) = \prod_{d=1}^{D} P\!\left(x(d)\right)$$

n Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier

$$g_{i,NB}(x) = P(\omega_i) \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_i\right)$$

g The main advantage of the Naïve Bayes classifier is that we only need to compute the univariate densities P(x(d)|ωi), which is a much easier problem than estimating the multivariate density P(x|ωi)
n Despite its simplicity, the Naïve Bayes classifier has been shown to have comparable performance to artificial neural networks and decision tree learning in some domains
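Putting the last two ideas together, a minimal Naïve Bayes sketch that estimates each univariate likelihood P(x(d)|ωi) with a 1-D Gaussian KDE (all names, and the single shared bandwidth h, are assumptions of this illustration, not the lecture's implementation):

```python
import numpy as np

def kde_1d(y, data, h):
    """Univariate Gaussian-kernel density estimate, one per feature and class."""
    K = np.exp(-0.5 * ((y - data) / h) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(K) / (len(data) * h)

def naive_bayes_classify(x, class_data, priors, h):
    """g_i(x) = P(omega_i) * prod_d P(x(d)|omega_i); returns the winning class
    and all discriminant values. class_data[i] holds class i's training matrix."""
    scores = []
    for X_i, prior in zip(class_data, priors):              # X_i has shape (N_i, D)
        likelihood = 1.0
        for d in range(X_i.shape[1]):
            likelihood *= kde_1d(x[d], X_i[:, d], h)         # univariate KDE per feature
        scores.append(prior * likelihood)
    return int(np.argmax(scores)), scores
```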
