Dynamic mixing kernels in Gaussian Mixture Classifier for
Hyperspectral Classification
Vikram Jayaram & Bryan Usevitch
Dept. of Electrical & Computer Engineering

The University of Texas at El Paso
500 W. University Ave., El Paso, TX 79968-0523
ABSTRACT
In this paper, new Gaussian mixture classifiers are designed to deal with the case of an unknown number of mixing
kernels. Not knowing the true number of mixing components is a major learning problem for a mixture classifier
using expectation-maximization (EM). To overcome this problem, the training algorithm uses a combination of
covariance constraints, dynamic pruning, splitting and merging of mixture kernels of the Gaussian mixture to
correctly automate the learning process. This structural learning of Gaussian mixtures is employed to model
and classify Hyperspectral imagery (HSI) data. The results from the HSI experiments suggest that this new
methodology is a potential alternative to traditional mixture-based modeling and classification using general
EM.
Keywords: Hyperspectral imagery (HSI), Gaussian mixture model (GMM), EM, Classification, Kurtosis, PCA.

1. INTRODUCTION
The complexity involved in collecting, storing, analyzing and processing voluminous, multi-dimensional
remote sensing data for an array of “Earth System Science” applications is a well known problem to the remote
sensing community. The recent HSI technology has evolved from its earlier form, multispectral imaging
(MSI).1, 2 In HSI, images are acquired using hundreds of spectral channels, compared to the fewer channels used
in MSI. However, over the years there has not been much significant development of processing algorithms for
these ever growing (in the spectral direction) electro-optical (EO) data sets. The need to come up with reduced
dimensionality processing algorithms3 is even stronger due to the increased spectral dimensionality of the remote
sensing data. In most remote sensing EO imagery, each spatial pixel is treated as a column vector that contains
the spectral information from each channel. On several occasions, mixture-model-based approaches have been justified
for processing voluminous data. In this paper, we show an instance of training a Gaussian mixture classifier
using dynamic kernel carpentry to model voluminous data such as HSI.
The Gaussian mixture model (GMM) is a standard modeling technique for estimating unknown probability density
functions (PDFs). Even though the merit of the GMM lies in closely approximating most naturally occurring random
processes,4, 5 there is a learning difficulty when using EM to estimate its model parameters. In order to
ensure proper learning by the EM, dynamic allocation of Gaussian kernels is used to fit the HSI data. This
model estimates an unknown PDF based on the assumption that the unknown density can be expressed as a
weighted finite sum of Gaussian kernels. These Gaussian kernels have different mixing weights and parameters
(means and covariance matrices). Updating the mixture parameters is carried out by the EM algorithm while
also monitoring the total kurtosis, which serves as the criterion for kernel splitting (an increase in the number of
kernels). Therefore, this technique ensures not only likelihood maximization but also kurtosis minimization.
The kernel splitting comes to a halt when no further improvement in the minimization of kurtosis seems
possible. Similarly, the other steps of this training methodology - pruning (destroying the weak kernels), merging
of kernels, and determining whether the algorithm has converged - are carried out in a step-by-step fashion.
Further author information: (Send correspondence to Vikram Jayaram)
V. Jayaram: E-mail: jayaram@ece.utep.edu, Telephone: 1 915 747 5869

The results of this training are reported by means of receiver operating characteristic (ROC) curves. This structural learning6
based training technique uses relatively few kernels to estimate the model parameters of the GMM and has a fast
convergence property.

2. GAUSSIAN MIXTURE MODELS AND EM ALGORITHM
Multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM). The
GM PDF for z ∈ R^P is given by
p(z) = \sum_{i=1}^{L} \alpha_i \, N(z; \mu_i, \Sigma_i),

where

N(z; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (z - \mu_i)^\top \Sigma_i^{-1} (z - \mu_i) \right\}.

Here L is the number of mixture components and P the number of spectral channels (bands). The GM parameters
are denoted by λ = {α_i, µ_i, Σ_i}, where α_i, µ_i, Σ_i are the mixing weight, mean and covariance of the individual
components of the mixture model. The parameters of the GM are estimated using maximum likelihood (ML)
by means of the EM algorithm.7
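To make the mixture density concrete, the following is a minimal numerical sketch (not taken from the paper; the function names gaussian_pdf and gmm_pdf are illustrative) of evaluating p(z) for a batch of P-dimensional pixel vectors:

```python
# Minimal sketch: evaluate the GM PDF p(z) = sum_i alpha_i N(z; mu_i, Sigma_i)
# for every row of Z (a T x P array of pixel vectors).
import numpy as np


def gaussian_pdf(Z, mu, Sigma):
    """N(z; mu, Sigma) evaluated for each row z of Z."""
    P = mu.size
    diff = Z - mu                                            # (T, P)
    mahal = np.einsum('tp,pq,tq->t', diff, np.linalg.inv(Sigma), diff)
    norm = (2.0 * np.pi) ** (P / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahal) / norm


def gmm_pdf(Z, alphas, mus, Sigmas):
    """Weighted finite sum of Gaussian kernels, i.e. the mixture density p(z)."""
    return sum(a * gaussian_pdf(Z, m, S) for a, m, S in zip(alphas, mus, Sigmas))
```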
Given sample vectors Z = {z_1, z_2, · · ·, z_T} and parameters λ, the likelihood of the GMM is given by

p(Z|\lambda) = \prod_{t=1}^{T} p(z_t|\lambda).   (1)

ML estimation then finds a new parameter set \hat{\lambda} such that p(Z|\hat{\lambda}) ≥ p(Z|\lambda). Because of the nonlinear
behavior of the likelihood in λ given in (1), a straightforward maximization of the function is not viable. The
maximization instead takes place on an iterative basis using EM.7 In the EM algorithm, we use the auxiliary function Q
given by
Q(\lambda, \hat{\lambda}) = \sum_{t=1}^{T} \sum_{i=1}^{L} p(i|z_t, \lambda) \, \log\big[\hat{\alpha}_i \, N(z_t; \hat{\mu}_i, \hat{\Sigma}_i)\big],   (2)

where p(i|z_t, λ) is the a posteriori probability of mixture component i of the image class, with i = 1, · · ·, L,
and satisfies
p(i|z_t, \lambda) = \frac{\alpha_i \, N(z_t; \mu_i, \Sigma_i)}{\sum_{k=1}^{L} \alpha_k \, N(z_t; \mu_k, \Sigma_k)}.   (3)

The EM algorithm is such that if Q(\lambda, \hat{\lambda}) ≥ Q(\lambda, \lambda), then p(Z|\hat{\lambda}) ≥ p(Z|\lambda).8 Setting the derivatives of the
Q function with respect to \hat{\lambda} to zero gives the re-estimation formulas8
\hat{\alpha}_i = \frac{1}{T} \sum_{t=1}^{T} p(i|z_t, \lambda),   (4)

\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|z_t, \lambda)\, z_t}{\sum_{t=1}^{T} p(i|z_t, \lambda)},   (5)

\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} p(i|z_t, \lambda)\,(z_t - \hat{\mu}_i)(z_t - \hat{\mu}_i)^\top}{\sum_{t=1}^{T} p(i|z_t, \lambda)}.   (6)

The algorithm for training the GMM is summarized as follows:
• Generate the a posteriori probabilities p(i|z_t, λ) based on the proposed method (explained further), satisfying (4).
• Compute the mixing weight, mean vector and covariance matrix by means of (4), (5) & (6).
• Update the a posteriori probabilities p(i|z_t, λ) according to (3), followed by computation of the Q
function using (2).
• Stop if the increase in the value of the Q function at the current iteration relative to its value at
the previous iteration is less than a chosen threshold; otherwise go to item 2. (A compact sketch of these updates is given below.)
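As a hedged illustration of the loop above (a sketch under assumed variable and function names, not the authors' implementation), one EM iteration implementing (3)-(6) might look as follows:

```python
# Hedged sketch of one EM iteration implementing the re-estimation formulas (3)-(6).
import numpy as np
from scipy.stats import multivariate_normal


def em_step(Z, alphas, mus, Sigmas):
    """One EM update of an L-component GMM on data Z of shape (T, P)."""
    T, _ = Z.shape
    L = len(alphas)
    # E-step: a posteriori probabilities p(i|z_t, lambda), eq. (3)
    resp = np.column_stack([alphas[i] * multivariate_normal.pdf(Z, mean=mus[i], cov=Sigmas[i])
                            for i in range(L)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimation formulas (4), (5), (6)
    Nk = resp.sum(axis=0)                          # sum_t p(i|z_t, lambda)
    new_alphas = Nk / T                            # (4)
    new_mus = (resp.T @ Z) / Nk[:, None]           # (5)
    new_Sigmas = []
    for i in range(L):
        diff = Z - new_mus[i]
        new_Sigmas.append((resp[:, i, None] * diff).T @ diff / Nk[i])   # (6)
    return new_alphas, new_mus, new_Sigmas
```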

3. TRAINING THE MIXTURE MODEL
In spite of the robust design of the GMM, there are challenges in trying to train a GM with a local algorithm like EM.
First of all, the true number of mixture components is usually unknown. Not knowing the true number of mixing
components is a major learning problem for a mixture classifier using EM.5, 9
The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can
dynamically add or remove kernel components to adequately model the input data. This methodology also increases
the chances of escaping the many local maxima of the likelihood function.10 In a method
proposed by N. Vlassis and A. Likas11 called the greedy EM algorithm, GM training begins with a single-
component mixture. Components or modes are then added in a sequential manner until the likelihood stops
increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental
mixture density function uses a combination of global11 and local search11 techniques each time a new kernel
component is added to the mixture.
In case the number of mixture components becomes too high, components are pruned out depending upon the value of
the mixing weight α_i. This procedure ensures removal of weak modes from the mixture. A weak mode is identified
by checking α_i against a certain threshold. Once identified, the weak modes are obliterated. A further
re-normalization of the α_i takes place, such that \sum_i α_i = 1.
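A minimal sketch of this pruning step is given below; the threshold value and the function name are assumptions rather than values from the paper:

```python
# Sketch of the pruning step: drop modes whose mixing weight is below a threshold
# and re-normalize the remaining weights so they sum to one.
import numpy as np


def prune_weak_modes(alphas, mus, Sigmas, prune_tol=1e-3):
    alphas = np.asarray(alphas, dtype=float)
    keep = alphas > prune_tol                      # weak modes are obliterated
    alphas = alphas[keep] / alphas[keep].sum()     # re-normalize: sum_i alpha_i = 1
    mus = [m for m, k in zip(mus, keep) if k]
    Sigmas = [S for S, k in zip(Sigmas, keep) if k]
    return alphas, mus, Sigmas
```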
Merging of kernel components is another process in this training scheme, wherein a single mode is created
from two nearly identical ones. The similarity between mixture modes is measured by a metric d. For example,
consider two PDFs p_1(x) and p_2(x). Let there be a collection of points near the central peak of p_1(x), represented
by x_i ∈ X_1, and another set of points near the central peak of p_2(x), denoted by x_i ∈ X_2. The closeness
metric d is then given by

d = \log \frac{\left(\sum_{x_i \in X_1} p_2(x_i)\right)\left(\sum_{x_i \in X_2} p_1(x_i)\right)}{\left(\sum_{x_i \in X_1} p_1(x_i)\right)\left(\sum_{x_i \in X_2} p_2(x_i)\right)}.   (7)
Notice that d = 0 when p_1(x) = p_2(x) and d < 0 when p_1(x) ≠ p_2(x). A pre-determined threshold is set to
determine if two modes are too close. If two modes fall below this threshold, they are merged by
forming a weighted sum of the two modes. The mean of the newly merged kernel component is computed as10

\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}.
A similar weighted analogy cannot be applied when merging covariances, as it does not take into account
the separation of the means. Instead of computing Σ_i directly, we consider its Cholesky factors,7 which are multiplied
by the respective weights \alpha_1/(\alpha_1+\alpha_2) and \alpha_2/(\alpha_1+\alpha_2) to obtain the merged covariance.11
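The following sketch illustrates the merge step as described above: the closeness metric d of (7), the weighted mean, and a covariance rebuilt from weighted Cholesky factors. How the weighted factors are recombined is an interpretation of the text, not the authors' verified code, and the names are illustrative:

```python
# Sketch of the merge step: closeness metric d, weighted mean, and a covariance
# built from Cholesky factors weighted by alpha_1/(alpha_1+alpha_2), alpha_2/(alpha_1+alpha_2).
import numpy as np


def closeness(p1, p2, X1, X2):
    """Closeness metric d; p1, p2 are density callables, X1/X2 points near their peaks."""
    num = p2(X1).sum() * p1(X2).sum()
    den = p1(X1).sum() * p2(X2).sum()
    return np.log(num / den)                       # d = 0 when p1 == p2, d < 0 otherwise


def merge_modes(a1, mu1, S1, a2, mu2, S2):
    """Merge two kernel components into a single mode."""
    w1, w2 = a1 / (a1 + a2), a2 / (a1 + a2)
    mu = w1 * mu1 + w2 * mu2                       # weighted mean of the two modes
    C = w1 * np.linalg.cholesky(S1) + w2 * np.linalg.cholesky(S2)
    return a1 + a2, mu, C @ C.T                    # C C^T is a valid covariance
```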
On the other hand, if the number of mixture components is insufficient, components are split in order to increase the
total number of components. Vlassis et al.12 define a method that monitors the weighted kurtosis of each mode, which
directly determines the number of mixture components. This kurtosis measure is given by
K_i = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

where

w_{n,i} = \frac{N(z_n; \mu_i, \Sigma_i)}{\sum_{n=1}^{N} N(z_n; \mu_i, \Sigma_i)}.

Figure 1. The scene on the left is a 1995 AVIRIS image of the Cuprite field in Nevada. The figure on the right is the 2D scatter
plot of the first two components after PCA rotation.

If |K_i| is too high for any mode i, then the mode is split into two. This criterion can be extended to higher dimensions
by considering skew in addition to the kurtosis, where each data sample z_n is projected onto the j-th principal
axis of Σ_i in turn. Let z_{n,i}^{j} = (z_n − \mu_i)^\top v_{ij}, where v_{ij} is the j-th column of V obtained from the SVD of Σ_i
(this step is necessary in order to condition the covariances). Conditioning of the covariances is necessary in order
to prevent the covariance matrices from becoming singular. Therefore, for each j,
K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i}\,\big(z_{n,i}^{j}/s_i\big)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

\psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i}\,\big(z_{n,i}^{j}/s_i\big)^3}{\sum_{n=1}^{N} w_{n,i}},

m_{i,j} = |K_{i,j}| + |\psi_{i,j}|,

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i}\,\big(z_{n,i}^{j}\big)^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} > τ (a threshold) for any j, mode i is split. The mode is split by creating mixture components
at \mu = \mu_i + v_{i,j} S_{i,j} and \mu = \mu_i − v_{i,j} S_{i,j}. Here S_{i,j} is the j-th singular value of Σ_i. The same covariance Σ_i is
used for each new mode. As mentioned earlier, the decision to split or not also depends upon the mixing weight α_i;
the splitting does not take place if the value of α_i is too small. Finally, once the number of modes settles out,
the likelihood stops increasing and convergence is achieved. This combination of covariance constraints,
mode pruning, merging and splitting results in a good PDF approximation by the mixture models.
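A sketch of this split test is given below. It follows the text's use of the SVD of Σ_i, the weighted kurtosis and skew, and the split locations µ_i ± v_{i,j} S_{i,j}; the threshold τ and the function name are assumptions:

```python
# Sketch of the kurtosis/skew split test along the principal axes of Sigma_i.
import numpy as np


def split_candidates(Z, mu, Sigma, w, tau=1.0):
    """Return mean offsets along axes whose combined kurtosis/skew exceeds tau."""
    U, S, _ = np.linalg.svd(Sigma)                 # principal axes of Sigma_i
    offsets = []
    for j in range(Sigma.shape[0]):
        zj = (Z - mu) @ U[:, j]                    # projection z^j_{n,i}
        s2 = np.sum(w * zj ** 2) / np.sum(w)       # s_i^2
        t = zj / np.sqrt(s2)
        K = np.sum(w * t ** 4) / np.sum(w) - 3.0   # weighted kurtosis K_{i,j}
        psi = np.sum(w * t ** 3) / np.sum(w)       # weighted skew psi_{i,j}
        if abs(K) + abs(psi) > tau:                # m_{i,j} > tau: split along axis j
            offsets.append(U[:, j] * S[j])         # new means at mu +/- v_{i,j} S_{i,j}
    return offsets
```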

4. EXPERIMENTS
To demonstrate the robustness of this learning scheme, we run the model experiments on high-dimensional HSI
data. The remote sensing data sets used in our experiments come from imagery derived from the Airborne Visible/Infrared
Imaging Spectrometer (AVIRIS) sensor. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (also called bands) with
wavelengths from 0.4 to 2.5 µm. AVIRIS is flown all across the US, Canada and Europe. Figure 1 shows the data sets
used in our experiments, which belong to the 1995 Cuprite field scene in Nevada.

Figure 2. 2D scatter plot of the data and the Gaussian mixture model after the convergence achieved by the EM algorithm.

Since HSI imagery is highly correlated in the spectral direction, using a principal component analysis (PCA) rotation is an obvious choice for decorrelating
the bands.13, 14 The 2D “scatter” plot of the first two principal components of the data is shown in
Figure 1. The scatter plots used in this paper are essentially marginalized PDFs on a 2D plane. Marginalization
can easily be depicted for Gaussian mixtures.10 Let z = [z_1, z_2, z_3, z_4]. For example, to visualize on the (z_2, z_4)
plane, we would need to compute
p(z_2, z_4) = \int_{z_1} \int_{z_3} p(z_1, z_2, z_3, z_4)\, dz_1\, dz_3.

This utility is very useful when visualizing high-dimensional PDFs (a marginalization sketch is given at the end of this paragraph). With the given HSI data, the next step is to
train the mixture model. Training consists of five operations, as mentioned earlier: beginning the EM algorithm, pruning
and merging the components, splitting the components if necessary, and finally determining whether the algorithm has
converged based on the likelihood estimates of the parameters at the end of each iteration. Some of the aspects that
are critical for training are the covariance constraints, the minimum individual component weight used in pruning,
the thresholds that determine whether two components should be merged or split, and the criterion for determining whether
convergence has taken place. Figure 3 shows one-dimensional PDF plots of the two PCA components. Notice
the marginal PDFs being compared to the histograms.
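As a side note on the marginalization used for these scatter plots: for a Gaussian mixture, marginalizing onto a coordinate plane simply keeps the corresponding sub-vectors of the means and sub-blocks of the covariances. A brief sketch (illustrative names, not from the paper):

```python
# Sketch: marginalize a Gaussian mixture onto selected coordinates, e.g.
# dims=(1, 3) gives the (z2, z4) plane discussed above.
import numpy as np


def marginalize_gmm(alphas, mus, Sigmas, dims=(1, 3)):
    idx = np.asarray(dims)
    mus_m = [m[idx] for m in mus]                  # sub-vectors of the means
    Sigmas_m = [S[np.ix_(idx, idx)] for S in Sigmas]   # sub-blocks of the covariances
    return alphas, mus_m, Sigmas_m
```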
During the process of training the log-likelihood would monotonically increase, if not for the pruning, splitting,
and merging operations. Figure 2 shows the Gaussian mixture approximation after convergence. Approximately
ten components were derived by the EM to characterize the GMM. The GM model parameters obtained as a
result of the structural learning are now used to build a classifier. Figure 4 shows a synthetically simulated
second class of data added to the already existing input data. We now build a classifier using Gaussian
mixtures by training a second parameter set on the newly added class, using the same learning scheme.
Figure 5 shows the result after the model converges to obtain the parameters for the second class. This is
followed by computing the log-likelihood of the test data. The performance of the classifier is evaluated using
the ROC curve shown in Figure 6. The response of the ROC curve clearly supports the robustness of the proposed
classification-learning scheme.
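The two-class evaluation can be sketched as follows: score each test pixel by the difference of class log-likelihoods under the two trained mixtures and sweep a threshold to trace Pd versus Pfa. This is a hedged reconstruction of the procedure described above, with illustrative function names:

```python
# Sketch of the two-class evaluation: log-likelihood-ratio scores and an ROC sweep.
import numpy as np
from scipy.stats import multivariate_normal


def gmm_loglik(Z, alphas, mus, Sigmas):
    """log p(z_t | lambda) for each row of Z under a Gaussian mixture."""
    dens = sum(a * multivariate_normal.pdf(Z, mean=m, cov=S)
               for a, m, S in zip(alphas, mus, Sigmas))
    return np.log(dens + 1e-300)


def roc_points(Z1, Z0, gmm1, gmm0, thresholds):
    """Pd/Pfa pairs: Z1 are true class-1 pixels, Z0 are true class-0 pixels."""
    s1 = gmm_loglik(Z1, *gmm1) - gmm_loglik(Z1, *gmm0)   # scores on class-1 data
    s0 = gmm_loglik(Z0, *gmm1) - gmm_loglik(Z0, *gmm0)   # scores on class-0 data
    return [(np.mean(s0 > th), np.mean(s1 > th)) for th in thresholds]
```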

Figure 3. One dimensional PDF plots. The marginal PDF’s are compared to the histograms.

Figure 4. Addition of second (yellow) class to the original data.

Figure 5. Trained GM approximation of the second class.

Figure 6. ROC curve for the two-class problem.

5. CONCLUSIONS AND FUTURE WORKS
In this paper, we proposed the use of Gaussian mixture models that employ a structural learning scheme for
the modeling and classification of Hyperspectral imagery. The traditional learning technique employing general EM
has serious drawbacks: there is no generally accepted method for parameter initialization, it is unclear how many mixture
components should be employed to adequately model the input data, and the model may get stuck in
one of the many local maxima of the likelihood function. These drawbacks have been well addressed by the proposed
structural learning scheme. The ROC curve in our experiments is used as a general diagnostic tool to evaluate
classification. Clearly, the ROC depicted a high probability of detection at low false alarm rates. The GMM in
conjunction with structural learning is well equipped to model and classify HSI data. These models adjust
well to a variety of distributions while keeping variances low. This trait of the GMM is particularly appreciated
in image classification applications, since it reduces misclassification. As future work, we intend to explore and
equip current state-of-the-art statistical classifiers with better training schemes for practical HSI applications.

ACKNOWLEDGMENTS
This work was supported by a NASA Earth System Science Fellowship grant. The authors would also like to thank
the Department of Geological Sciences at UTEP for providing access to the ENVI software.

REFERENCES
[1] Landgrebe, D. A., [Signal Theory Methods in Multispectral Remote Sensing], Wiley Inter-Science, Hoboken,
NJ, second ed. (2003).
[2] Schowengerdt, R. A., [Remote Sensing Models & Methods for Image Processing], Academic Press, Burlington, MA, seventh ed. (1997).
[3] Keshava, N., “Distance metrics & band selection in hyperspectral processing with applications to material
identification & spectral libraries,” IEEE Transactions on Geoscience and Remote Sensing 42, 1552–1565
(2004).
[4] Redner, R. and Walker, H., “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Review 26, 195–239 (1984).
[5] Duda, R. O., Hart, P. E., and Stork, D. G., [Pattern Classification ], John-Wiley and Sons, New York, NY,
seventh ed. (2001).
[6] Baggenstoss, P. M., “Structural learning for classification of high dimensional data,” Proc. of International
Conference on Intelligent Systems and Semantics NIST, 124–129 (1997).
[7] Moon, T. and Stirling, W., [Mathematical Methods and Algorithms for Signal Processing], Prentice Hall,
Upper Saddle River, NJ (2000).
[8] Rabiner, L., “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. of
IEEE 77, 257–286 (1989).
[9] McLachlan, G. and Peel, D., [Finite Mixture Models], Wiley Series in Probability and Statistics, New York,
NY, second ed. (2000).
[10] Hastie, T., Tibshirani, R., and Friedman, J., [The Elements of Statistical Learning], Springer-Verlag, New
York, NY (1994).
[11] Vlassis, N. and Likas, A., “A greedy EM for gaussian mixture learning,” Neural Processing Letters 15,
77–87 (2002).
[12] Vlassis, N. and Likas, A., “A kurtosis-based dynamic approach to gaussian mixture modeling,” IEEE
Transactions on Systems, Man and Cybernetics 29, 393–399 (1999).
[13] Jayaram, V., Usevitch, B., and Kosheleva, O., “Detection from Hyperspectral images compressed using
rate distortion and optimization techniques under JPEG2000 part 2,” IEEE 11th DSP Workshop and 3rd
SPE Workshop cdrom, 195–239 (2004).
[14] Shaw, G. and Manolakis, D., “Signal processing for Hyperspectral image exploitation,” IEEE Signal Processing Magazine 19, 12–16 (2002).

