LECTURE 7: Kernel Density Estimation
Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna
Texas A&M University
g Non-parametric Density Estimation
g Histograms
g Parzen Windows
g Smooth Kernels
g Product Kernel Density Estimation
g The Naïve Bayes Classifier
Non-parametric density estimation
g In the previous two lectures we have assumed that either
n The likelihoods p(x|ωi) were known (Likelihood Ratio Test), or
n At least the parametric form of the likelihoods was known (Parameter Estimation)
g The methods that will be presented in the next two lectures do not afford such luxuries
n Instead, they attempt to estimate the density directly from the data without making any parametric assumptions about the underlying distribution
n Sounds challenging? You bet!
[Figure: labeled training data in the (x1, x2) plane and the corresponding non-parametric estimate of the class-conditional density P(x1, x2|ωi)]
The histogram
g The simplest form of non-parametric D.E. is the familiar histogram
n Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin

$$p_H(x) = \frac{1}{N}\;\frac{\left[\text{number of } x^{(k)} \text{ in the same bin as } x\right]}{\left[\text{width of the bin containing } x\right]}$$

g The histogram requires two “parameters” to be defined: the bin width and the starting position of the first bin
g The histogram is a very simple form of D.E., but it has several drawbacks
n The final shape of the density estimate depends on the starting position of the bins
g For multivariate data, the final shape of the density is also affected by the orientation of the bins
n The discontinuities of the estimate are not due to the underlying density; they are only an artifact of the chosen bin locations
g These discontinuities make it very difficult, without experience, to grasp the structure of the data
n A much more serious problem is the curse of dimensionality, since the number of bins grows exponentially with the number of dimensions
g In high dimensions we would require a very large number of examples or else most of the bins would be empty
g All these drawbacks make the histogram unsuitable for most practical applications except for rapid visualization of results in one or two dimensions
n Therefore, we will not spend more time looking at the histogram
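The slides contain no code, but the histogram estimate above is easy to spell out. The following Python/NumPy sketch is my own illustration (the function name, arguments and sample data are hypothetical, not from the lecture); it computes p_H(x) by counting the training points in the bin that contains x:

```python
import numpy as np

def histogram_density(data, bin_width, start, x):
    """Histogram density estimate p_H(x): fraction of training points that
    fall in the bin containing x, divided by the bin width."""
    data = np.asarray(data, dtype=float)
    bin_idx = int(np.floor((x - start) / bin_width))   # index of the bin holding x
    lo = start + bin_idx * bin_width
    hi = lo + bin_width
    k = np.sum((data >= lo) & (data < hi))             # points in the same bin as x
    return k / (len(data) * bin_width)

# Illustrative call: bins of width 0.5 starting at 0, estimate at x = 0.3
samples = np.array([0.1, 0.2, 0.25, 0.4, 0.7, 0.9, 1.1])
print(histogram_density(samples, bin_width=0.5, start=0.0, x=0.3))
```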
Non-parametric density estimation, general formulation (1)
g Let us return to the basic definition of probability to get a solid idea of what we are trying to accomplish
n The probability that a vector x, drawn from a distribution p(x), will fall in a given region ℜ of the sample space is

$$P = \int_{\mathcal{R}} p(x')\,dx'$$

n Suppose now that N vectors {x(1, x(2, …, x(N} are drawn from the distribution. The probability that k of these N vectors fall in ℜ is given by the binomial distribution

$$P(k) = \binom{N}{k}\,P^{k}\,(1-P)^{N-k}$$

n It can be shown (from the properties of the binomial p.m.f.) that the mean and variance of the ratio k/N are

$$E\!\left[\frac{k}{N}\right] = P \qquad \text{and} \qquad \mathrm{Var}\!\left[\frac{k}{N}\right] = E\!\left[\left(\frac{k}{N} - P\right)^{2}\right] = \frac{P\,(1-P)}{N}$$

n Therefore, as N→∞, the distribution becomes sharper (the variance gets smaller), so we can expect that a good estimate of the probability P can be obtained from the mean fraction of the points that fall within ℜ

$$P \cong \frac{k}{N}$$
From [Bishop, 1995]
Non-parametric density estimation, general formulation (2)
n On the other hand, if we assume that ℜ is so small that p(x) does not vary appreciably within it, then

$$P = \int_{\mathcal{R}} p(x')\,dx' \;\cong\; p(x)\,V$$

g where V is the volume enclosed by region ℜ
n Merging with the previous result we obtain

$$\left.\begin{aligned} P &\cong \frac{k}{N} \\ P &\cong p(x)\,V \end{aligned}\right\}\;\Rightarrow\; p(x) \cong \frac{k}{NV}$$

n This estimate becomes more accurate as we increase the number of sample points N and shrink the volume V
g In practice the value of N (the total number of examples) is fixed
n In order to improve the accuracy of the estimate p(x) we could let V approach zero, but then the region ℜ would become so small that it would enclose no examples
n This means that, in practice, we will have to find a compromise value for the volume V
g Large enough to include enough examples within ℜ
g Small enough to support the assumption that p(x) is constant within ℜ
From [Bishop, 1995]
Non-parametric density estimation, general formulation (3)
g In conclusion, the general expression for non-parametric density estimation becomes

$$p(x) \cong \frac{k}{NV} \qquad \text{where} \quad \begin{cases} V \text{ is the volume surrounding } x \\ N \text{ is the total number of examples} \\ k \text{ is the number of examples inside } V \end{cases}$$

g When applying this result to practical density estimation problems, two basic approaches can be adopted
n We can choose a fixed value of the volume V and determine k from the data. This leads to methods commonly referred to as Kernel Density Estimation (KDE), which are the subject of this lecture
n We can choose a fixed value of k and determine the corresponding volume V from the data. This gives rise to the k Nearest Neighbor (kNN) approach, which will be covered in the next lecture
g It can be shown that both kNN and KDE converge to the true probability density as N→∞, provided that V shrinks with N, and k grows with N appropriately
From [Bishop, 1995]
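To make the expression p(x) ≅ k/(NV) concrete, here is a small Python sketch (my own illustration, not from the slides) that fixes V as a ball of radius r around x and counts the k samples that fall inside; the helper name and the choice of a spherical region are assumptions:

```python
import numpy as np
from math import gamma, pi

def density_fixed_volume(data, x, r):
    """p(x) ~= k / (N V): count the k samples inside a ball of radius r
    centered at x and divide by N times the ball's volume."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    k = np.sum(np.linalg.norm(data - x, axis=1) < r)      # examples inside V
    V = pi ** (D / 2) / gamma(D / 2 + 1) * r ** D         # volume of a D-ball
    return k / (N * V)
```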
Parzen windows (1)
g Suppose that the region ℜ that encloses the k examples is a hypercube with sides of length h centered at the estimation point x
n Then its volume is given by V = h^D, where D is the number of dimensions

[Figure: a hypercube of side h centered at the estimation point x, enclosing the examples that fall within the region ℜ]

g To find the number of examples that fall within this region we define a kernel function K(u)

$$K(u) = \begin{cases} 1 & |u_j| < 1/2 \quad \forall j = 1, \dots, D \\ 0 & \text{otherwise} \end{cases}$$

n This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator
n The quantity K((x-x(n)/h) is then equal to unity if the point x(n is inside a hypercube of side h centered on x, and zero otherwise
From [Bishop, 1995]
Parzen windows (2)
g The total number of points inside the hypercube is then

$$k = \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

g Substituting back into the expression for the density estimate

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

n Notice that the Parzen window density estimate resembles the histogram, with the exception that the bin locations are determined by the data points

[Figure: a one-dimensional example with four points x(1, x(2, x(3, x(4 around the estimation point x; K(x-x(1)=K(x-x(2)=K(x-x(3)=1 and K(x-x(4)=0, so each point inside the window contributes 1/V to the estimate]
From [Bishop, 1995]
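A direct Python/NumPy transcription of the two formulas above; this is a sketch (names and array shapes are my own, not from the slides):

```python
import numpy as np

def parzen_window_kde(x, data, h):
    """Parzen-window estimate p_KDE(x) = 1/(N h^D) * sum_n K((x - x^(n))/h),
    with the unit-hypercube kernel K(u) = 1 iff every |u_j| < 1/2."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    u = (x - data) / h                                     # kernel arguments, shape (N, D)
    k = np.sum(np.all(np.abs(u) < 0.5, axis=1))            # points inside the hypercube
    return k / (N * h ** D)
```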
Parzen windows (3)
g To understand the role of the kernel function we compute the expectation of the probability estimate p(x)

$$E\!\left[p_{KDE}(x)\right] = \frac{1}{N h^{D}} \sum_{n=1}^{N} E\!\left[K\!\left(\frac{x - x^{(n)}}{h}\right)\right] = \frac{1}{h^{D}}\, E\!\left[K\!\left(\frac{x - x^{(n)}}{h}\right)\right] = \frac{1}{h^{D}} \int K\!\left(\frac{x - x'}{h}\right) p(x')\,dx'$$

n where we have assumed that the vectors x(n are drawn independently from the true density p(x)
g We can see that the expectation of the estimated density pKDE(x) is a convolution of the true density p(x) with the kernel function
n The width h of the kernel plays the role of a smoothing parameter: the wider the kernel function, the smoother the estimate pKDE(x)
g For h→0, the kernel approaches a Dirac delta function and pKDE(x) approaches the true density
n However, in practice we have a finite number of points, so h cannot be made arbitrarily small, since the density estimate pKDE(x) would then degenerate to a set of impulses located at the training data points
From [Bishop, 1995]
Numeric exercise
g Given the dataset below, use Parzen windows to estimate the density
p(x) at y=3,10,15. Use a bandwidth of h=4
n X = {x(1, x(2,…x(N} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}
g Solution
n Let’s first draw the dataset to get an idea of what numerical results we should expect

[Figure: the ten data points plotted on the x axis, with the estimation points y=3, y=10 and y=15 marked]

n Let’s now estimate p(y=3):

$$p_{KDE}(y=3) = \frac{1}{Nh}\sum_{n=1}^{N} K\!\left(\frac{y - x^{(n)}}{h}\right) = \frac{1}{10\cdot 4}\left[K\!\left(\tfrac{3-4}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-6}{4}\right) + \dots + K\!\left(\tfrac{3-17}{4}\right)\right] = \frac{1}{10\cdot 4}\left[1+0+0+0+0+0+0+0+0+0\right] = 0.025$$

n Similarly

$$p_{KDE}(y=10) = \frac{1}{10\cdot 4}\left[0+0+0+0+0+0+0+0+0+0\right] = 0$$

$$p_{KDE}(y=15) = \frac{1}{10\cdot 4}\left[0+0+0+0+0+1+1+1+1+0\right] = \frac{4}{10\cdot 4} = 0.1$$
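The arithmetic above can be checked numerically; a one-dimensional sketch (assuming the same box kernel and the dataset from the exercise; the function name is mine):

```python
import numpy as np

def parzen_1d(y, data, h):
    """1-D Parzen-window estimate with the unit-box kernel and bandwidth h."""
    data = np.asarray(data, dtype=float)
    k = np.sum(np.abs((y - data) / h) < 0.5)   # samples inside the window around y
    return k / (len(data) * h)

X = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
for y in (3, 10, 15):
    print(y, parzen_1d(y, X, h=4))   # prints 0.025, 0.0 and 0.1, matching the slide
```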
Smooth kernels (1)
g The Parzen window has several drawbacks
n Yields density estimates that have discontinuities
n Weights all the points x(n equally, regardless of their distance to the estimation point x
g It is easy to overcome some of these difficulties by generalizing the Parzen window with a smooth kernel function K(u) which satisfies the condition

$$\int_{\mathbb{R}^{D}} K(x)\,dx = 1$$

n Usually, but not always, K(u) will be a radially symmetric and unimodal probability density function, such as the multivariate Gaussian density function

$$K(x) = \frac{1}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\,x^{T}x\right)$$

n where the expression of the density estimate remains the same as with Parzen windows

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$
From [Bishop, 1995]
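A sketch of the same estimator with the multivariate Gaussian kernel above (illustrative code, not part of the lecture; names are assumptions):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Smooth-kernel estimate with K(u) = (2*pi)^(-D/2) * exp(-u'u/2);
    the outer formula is identical to the Parzen-window case."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)
    N, D = data.shape
    u = (x - data) / h
    K = (2 * np.pi) ** (-D / 2) * np.exp(-0.5 * np.sum(u ** 2, axis=1))
    return np.sum(K) / (N * h ** D)
```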
Smooth kernels (2)
g Just as the Parzen window estimate can be considered a sum of boxes
centered at the observations, the smooth kernel estimate is a sum of
“bumps” placed at the data points
n The kernel function determines the shape of the bumps
n The parameter h, also called the smoothing parameter or bandwidth,
determines their width
[Figure: p_KDE(x) for h=3, showing the density estimate as the sum of Gaussian kernel "bumps" placed at the data points]
Choosing the bandwidth: univariate case (1)
[Figure: kernel density estimates of the same dataset for h=1.0, h=2.5, h=5.0 and h=10.0, illustrating under- and over-smoothing]
g The problem of choosing the bandwidth is crucial in density estimation
n A large bandwidth will over-smooth the density and mask the structure in the data
n A small bandwidth will yield a density estimate that is spiky and very hard to interpret
Choosing the bandwidth: univariate case (2)
g We would like to find a value of the smoothing parameter that minimizes the error between the estimated density and the true density
n A natural measure is the mean square error at the estimation point x, defined by

$$\mathrm{MSE}_{x}\!\left(p_{KDE}\right) = E\!\left[\left(p_{KDE}(x) - p(x)\right)^{2}\right] = \underbrace{\left(E\!\left[p_{KDE}(x)\right] - p(x)\right)^{2}}_{\text{bias}^{2}} \;+\; \underbrace{\mathrm{var}\!\left[p_{KDE}(x)\right]}_{\text{variance}}$$

g This expression is an example of the bias-variance tradeoff that we saw earlier in the course: the bias can be reduced at the expense of the variance, and vice versa
n The bias of an estimate is the systematic error incurred in the estimation
n The variance of an estimate is the random error incurred in the estimation
g The bias-variance dilemma applied to bandwidth selection simply means that
n A large bandwidth will reduce the differences among the estimates of pKDE(x) for different data sets (the variance), but it will increase the bias of pKDE(x) with respect to the true density p(x)
n A small bandwidth will reduce the bias of pKDE(x), at the expense of a larger variance in the estimates pKDE(x)

[Figure: multiple kernel density estimates of the same true density; a small bandwidth (h=0.1) yields large variance across data sets, while a large bandwidth (h=2.0) yields large bias with respect to the true density]
Bandwidth selection methods, univariate case (3)
g Subjective choice
n The natural way for choosing the smoothing parameter is to plot out several curves and
choose the estimate that is most in accordance with one’s prior (subjective) ideas
n However, this method is not practical in pattern recognition since we typically have high-
dimensional data
g Reference to a standard distribution
n Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated square error (MISE)

$$h_{opt} = \arg\min_{h}\left\{\mathrm{MISE}\!\left(p_{KDE}\right)\right\} = \arg\min_{h}\left\{E\!\left[\int \left(p_{KDE}(x) - p(x)\right)^{2}\,dx\right]\right\}$$

n If we assume that the true distribution is a Gaussian density and we use a Gaussian kernel, it can be shown that the optimal value of the bandwidth becomes

$$h_{opt} = 1.06\,\sigma\,N^{-1/5}$$

g where σ is the sample standard deviation and N is the number of training examples
From [Silverman, 1986]
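The rule of thumb above is one line of code; a univariate sketch (the function name is mine):

```python
import numpy as np

def silverman_bandwidth(data):
    """h_opt = 1.06 * sigma * N^(-1/5), the Gaussian-reference rule of thumb."""
    data = np.asarray(data, dtype=float)
    return 1.06 * np.std(data, ddof=1) * len(data) ** (-1.0 / 5.0)
```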
Bandwidth selection methods, univariate case (4)
n Better results can be obtained if we use a robust measure of the spread instead of the sample standard deviation and we reduce the coefficient 1.06 to better cope with multimodal densities. The optimal bandwidth then becomes

$$h_{opt} = 0.9\,A\,N^{-1/5} \qquad \text{where} \qquad A = \min\!\left(\sigma,\ \frac{\mathrm{IQR}}{1.34}\right)$$

g IQR is the interquartile range, a robust estimate of the spread, computed as the difference between the 75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 − Q1 (one half of this quantity is known as the semi-interquartile range)
n A percentile rank is the proportion of examples in a distribution that a specific example is greater than or equal to
g Likelihood cross-validation
n The ML estimate of h is degenerate, since it yields hML=0, a density estimate with Dirac delta functions at each training data point
n A practical alternative is to maximize the “pseudo-likelihood” computed using leave-one-out cross-validation

$$h_{MLCV} = \arg\max_{h}\left\{\frac{1}{N}\sum_{n=1}^{N}\log p_{-n}\!\left(x^{(n)}\right)\right\} \qquad \text{where} \qquad p_{-n}\!\left(x^{(n)}\right) = \frac{1}{(N-1)\,h}\sum_{\substack{m=1\\ m\neq n}}^{N} K\!\left(\frac{x^{(n)}-x^{(m)}}{h}\right)$$

From [Silverman, 1986]
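Both the robust rule of thumb and the leave-one-out pseudo-likelihood can be sketched in a few lines of Python; the Gaussian kernel and the grid search shown in the trailing comment are my own choices, not prescribed by the slides:

```python
import numpy as np

def robust_bandwidth(data):
    """h_opt = 0.9 * A * N^(-1/5), with A = min(sigma, IQR / 1.34)."""
    data = np.asarray(data, dtype=float)
    sigma = np.std(data, ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))      # Q3 - Q1
    A = min(sigma, iqr / 1.34)
    return 0.9 * A * len(data) ** (-1.0 / 5.0)

def pseudo_log_likelihood(data, h):
    """(1/N) * sum_n log p_{-n}(x^(n)), the leave-one-out pseudo-likelihood
    with a Gaussian kernel; h_MLCV is the h that maximizes this quantity."""
    data = np.asarray(data, dtype=float)
    N = len(data)
    total = 0.0
    for n in range(N):
        others = np.delete(data, n)                        # leave x^(n) out
        K = np.exp(-0.5 * ((data[n] - others) / h) ** 2) / np.sqrt(2 * np.pi)
        total += np.log(np.sum(K) / ((N - 1) * h))
    return total / N

# h_MLCV by grid search (hypothetical usage):
#   grid = np.linspace(0.05, 2.0, 50)
#   h_mlcv = grid[np.argmax([pseudo_log_likelihood(X, h) for h in grid])]
```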
Multivariate density estimation
g The derived expression of the estimate pKDE(x) for multiple dimensions was

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n)}}{h}\right)$$

n Notice that the bandwidth h is the same for all the axes, so this density estimate will weight all the axes equally
g However, if the spread of the data is much greater in one of the coordinate directions than the others, we should use a vector of smoothing parameters or even a full covariance matrix, which complicates the procedure
g There are two basic alternatives to solve the scaling problem without having to use a more general kernel density estimate
n Pre-scale each axis (normalize to unit variance, for instance)
n Pre-whiten the data (linearly transform it to have unit covariance matrix), estimate the density, and then transform back [Fukunaga]
g The whitening transform is simply y = Λ^(-1/2) M^T x, where Λ and M are the eigenvalue and eigenvector matrices of the sample covariance of x
g Fukunaga’s method is equivalent to using a hyper-ellipsoidal kernel
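A sketch of the pre-whitening step described above (assumes a nonsingular sample covariance; the function name is mine):

```python
import numpy as np

def whiten(X):
    """y = Lambda^(-1/2) M' x, where Lambda and M hold the eigenvalues and
    eigenvectors of the sample covariance of x; the output has unit covariance."""
    X = np.asarray(X, dtype=float)             # shape (N, D)
    Xc = X - X.mean(axis=0)                    # center the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = np.diag(eigvals ** -0.5) @ eigvecs.T   # whitening matrix Lambda^(-1/2) M'
    return Xc @ W.T                            # KDE can now use a single bandwidth
```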
Product kernels
g A very popular method for performing multivariate density estimation is the product kernel, defined as

$$p_{PKDE}(x) = \frac{1}{N}\sum_{n=1}^{N} K\!\left(x, x^{(n)}, h_1, \dots, h_D\right) \qquad \text{where} \qquad K\!\left(x, x^{(n)}, h_1, \dots, h_D\right) = \frac{1}{h_1 \cdots h_D}\prod_{d=1}^{D} K_d\!\left(\frac{x(d) - x^{(n)}(d)}{h_d}\right)$$

n The product kernel consists of the product of one-dimensional kernels
g Typically the same kernel function is used in each dimension ( Kd(x)=K(x) ), and only the bandwidths are allowed to differ
n Bandwidth selection can then be performed with any of the methods presented for univariate density estimation
g It is important to notice that although the expression of K(x,x(n,h1,…hD) uses kernel independence, this does not imply that any type of feature independence is being assumed
n A density estimation method that assumed feature independence would have the following expression

$$p_{FEAT\text{-}IND}(x) = \prod_{d=1}^{D}\left[\frac{1}{N\,h_d}\sum_{n=1}^{N} K_d\!\left(\frac{x(d) - x^{(n)}(d)}{h_d}\right)\right]$$

n Notice how the order of the summation and product is reversed compared to the product kernel
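A product-kernel sketch with a 1-D Gaussian kernel on every axis and one bandwidth per dimension (illustrative code, not the lecture's):

```python
import numpy as np

def product_kernel_kde(x, data, h):
    """p_PKDE(x) = (1/N) * sum_n prod_d K((x(d) - x_n(d)) / h_d) / h_d."""
    data = np.atleast_2d(np.asarray(data, dtype=float))   # shape (N, D)
    x = np.asarray(x, dtype=float)                         # shape (D,)
    h = np.asarray(h, dtype=float)                         # one bandwidth per axis
    u = (x - data) / h                                     # shape (N, D)
    K1d = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)       # 1-D Gaussian kernels
    return np.mean(np.prod(K1d / h, axis=1))               # product over d, mean over n
```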
Product kernel, example 1
[Figure: contour plots over (x1, x2) of the true density (left) and the two product kernel density estimates (middle, right)]
g This example shows the product kernel density estimate of a bivariate unimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^(-1/5) (middle) and h=0.9AN^(-1/5) (right)
Product kernel, example 2
[Figure: contour plots over (x1, x2) of the true bimodal density (left) and the two product kernel density estimates (middle, right)]
g This example shows the product kernel density estimate of a bivariate bimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^(-1/5) (middle) and h=0.9AN^(-1/5) (right)
Naïve Bayes classifier
g Recall that the Bayes classifier is given by the following family of discriminant functions

$$\text{choose } \omega_i \;\text{ if }\; g_i(x) > g_j(x) \;\;\forall j \neq i, \qquad \text{where } g_i(x) = P(\omega_i \mid x)$$

g Using Bayes rule, these discriminant functions can be expressed as

$$g_i(x) = P(\omega_i \mid x) \;\propto\; P(x \mid \omega_i)\,P(\omega_i)$$

n where P(ωi) is our prior knowledge and P(x|ωi) is obtained through density estimation
g Although we have presented density estimation methods that allow us to estimate the multivariate likelihood P(x|ωi), the curse of dimensionality makes it a very tough problem!
g One highly practical simplification of the Bayes classifier is the so-called Naïve Bayes classifier
n The Naïve Bayes classifier makes the assumption that the features are class-conditionally independent

$$P(x \mid \omega_i) = \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_i\right)$$

g It is important to notice that this assumption is not as rigid as assuming independent features, which would instead imply

$$P(x) = \prod_{d=1}^{D} P\!\left(x(d)\right)$$

n Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier

$$g_{i,NB}(x) = P(\omega_i) \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_i\right)$$

g The main advantage of the Naïve Bayes classifier is that we only need to compute the univariate densities P(x(d)|ωi), which is a much easier problem than estimating the multivariate density P(x|ωi)
n Despite its simplicity, the Naïve Bayes classifier has been shown to have comparable performance to artificial neural networks and decision tree learning in some domains
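Putting the last two ideas together, a minimal Naïve Bayes sketch that estimates each univariate likelihood P(x(d)|ωi) with a 1-D Gaussian KDE (all names, and the single shared bandwidth h, are assumptions of this illustration, not the lecture's implementation):

```python
import numpy as np

def kde_1d(y, data, h):
    """Univariate Gaussian-kernel density estimate, one per feature and class."""
    K = np.exp(-0.5 * ((y - data) / h) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(K) / (len(data) * h)

def naive_bayes_classify(x, class_data, priors, h):
    """g_i(x) = P(omega_i) * prod_d P(x(d)|omega_i); returns the winning class
    and all discriminant values. class_data[i] holds class i's training matrix."""
    scores = []
    for X_i, prior in zip(class_data, priors):              # X_i has shape (N_i, D)
        likelihood = 1.0
        for d in range(X_i.shape[1]):
            likelihood *= kde_1d(x[d], X_i[:, d], h)         # univariate KDE per feature
        scores.append(prior * likelihood)
    return int(np.argmax(scores)), scores
```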
