Chapter 9
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 9. Mixture models and EM
2
Use of latent variables & clustering
Suppose we have a joint distribution 𝑝(𝑥, 𝑧).
We can obtain the marginal distribution 𝑝(𝑥) by marginalizing the latent variable 𝑧 out of the full distribution.
It is often useful and more convenient to introduce a latent variable 𝒛 here.
In this chapter, we are going to cover mixture models built on discrete latent variables.
Keywords are…
1. K-Means Clustering
2. Gaussian Mixture model
3. Expectation Maximization
In fact, we optimize the parameters of a Gaussian mixture by using the EM algorithm.
Details will be covered soon!
We are all familiar with the idea of clustering, so let's go straight to K-means!
Chapter 9.1. K-Means Clustering
3
Theoretical Idea
The K-Means procedure is already familiar to us; it is covered in multivariate analysis, data mining, etc.
Here, let's look at K-Means from the perspective of optimization.
Let 𝑋𝑛 denote the data points and 𝑘 index the clusters, with 𝜇𝑘 the center of cluster 𝑘.
If 𝑋𝑛 belongs to the 𝑘-th cluster, then 𝑟𝑛𝑘 = 1; otherwise 𝑟𝑛𝑗 = 0 for 𝑗 ≠ 𝑘.
Here, the sum-of-squares distortion is 𝐽 = ∑𝑛 ∑𝑘 𝑟𝑛𝑘 ||𝑋𝑛 − 𝜇𝑘||².
We are trying to minimize the overall distortion 𝐽. For fixed assignments 𝑟𝑛𝑘, 𝐽 is a quadratic function of 𝜇𝑘, so setting its derivative to zero gives the global minimum in closed form.
In fact, we know neither 𝑟𝑛𝑘 nor 𝜇𝑘, so we estimate both by minimizing the sum-of-squares distortion 𝑱.
Finding 𝑟𝑛𝑘 is easy: for each 𝑛, the 𝒌 that yields the minimum value of ||𝑿𝒏 − 𝝁𝒌||² is the optimal assignment.
Now let's find the optimal 𝜇𝑘. Taking the derivative of 𝐽 with respect to 𝜇𝑘 and setting it to zero gives 𝜇𝑘 = ∑𝑛 𝑟𝑛𝑘 𝑋𝑛 / ∑𝑛 𝑟𝑛𝑘, i.e., the mean of the points assigned to cluster 𝑘.
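As a minimal sketch of these two alternating updates (assuming NumPy; the function and variable names here are illustrative, not from the book), we assign each point to its nearest center and then recompute each center as the mean of its assigned points:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch: alternately update r_nk and mu_k to reduce J."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]            # initial centers
    for _ in range(n_iter):
        # assignment step: r_nk = 1 for the nearest center, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        r = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        mu = np.stack([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    J = ((X - mu[r]) ** 2).sum()                             # final distortion
    return r, mu, J
```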
Chapter 9.1. K-Means Clustering
4
Implementation
1. It is important to choose appropriate initial values for 𝜇𝑘.
2. We can also use a sequential (online) update: 𝜇𝑘_new = 𝜇𝑘_old + 𝜂𝑛(𝑋𝑛 − 𝜇𝑘_old).
3. There is a more general version of K-Means, the K-medoids algorithm.
4. We assign each data point to exactly one specific cluster. This is called a ‘hard assignment’!
5. Furthermore, we can apply K-Means to the image segmentation task (see the sketch below)!
** Image segmentation : the process of partitioning a digital image into multiple
segments, in order to simplify the image into something more meaningful and easier to analyze.
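As a rough sketch of item 5 (assuming scikit-learn is available; the helper name quantize_image is made up for illustration), color quantization clusters the RGB pixel values and replaces each pixel by its cluster center:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; any K-means routine works

def quantize_image(img, K=3):
    """Toy color quantization: cluster RGB pixels with K-means and replace each
    pixel by its cluster center, giving a simple K-means-based segmentation."""
    H, W, C = img.shape
    pixels = img.reshape(-1, C).astype(float)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
    segmented = km.cluster_centers_[km.labels_].reshape(H, W, C)
    return segmented.astype(img.dtype)

# usage (hypothetical): segmented = quantize_image(np.asarray(some_rgb_image), K=3)
```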
Chapter 9.2. Mixtures of Gaussians
5
Implementation
https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95
Consider a multi-modal Gaussian mixture distribution!
We have already seen this distribution in Chapter 2.
Here, let's focus on parameter optimization!
𝒑(𝑿) = ∑𝒁 𝒑(𝑿|𝒁) 𝒑(𝒁)
Chapter 9.2. Mixtures of Gaussians
6
Implementation
Then how can we use these probabilities for clustering?
We have to assign each data point to a specific cluster once the data are given. By Bayes' theorem, the posterior probability of component 𝑘 becomes 𝛾(𝑧𝑘) = 𝜋𝑘 𝑁(𝑋|𝜇𝑘, Σ𝑘) / ∑𝑗 𝜋𝑗 𝑁(𝑋|𝜇𝑗, Σ𝑗).
This 𝛾(𝑧𝑘) can be viewed as the responsibility that component 𝑘 takes for explaining 𝑋!
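A minimal sketch of this computation (assuming NumPy and SciPy; the function name responsibilities is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pis)
    dens = np.stack([pis[k] * multivariate_normal(mus[k], covs[k]).pdf(X)
                     for k in range(K)], axis=1)        # (N, K) weighted densities
    return dens / dens.sum(axis=1, keepdims=True)       # normalize over components
```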
Furthermore, we can generate random samples from this distribution (ancestral sampling).
Details will be covered in Chapter 11!
The accompanying figure illustrates how the estimated clusters fit the original ground truth.
Chapter 9.2. Mixtures of Gaussians
7
Maximum likelihood
Suppose we have a data set of observations {𝑋1, 𝑋2, … , 𝑋𝑁}.
Then, from 𝑝(𝑋), the likelihood of the dataset is a product over the data points. To turn the product into a summation, we take the log!
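Written out, this is the standard Gaussian-mixture log likelihood:

$$
\ln p(X \mid \pi, \mu, \Sigma) \;=\; \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.
$$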
In this case, we have to think about the collapse of a mixture component onto a single data point, which produces a singularity in the likelihood.
Suppose a component's mean exactly matches one data point, i.e. 𝜇𝑗 = 𝑋𝑛. Then the corresponding density term becomes 𝑁(𝑋𝑛|𝑋𝑛, 𝜎𝑗²𝐼).
When only that one data point effectively belongs to the component, 𝜎𝑗 → 0, and 𝑁(𝑋𝑛|𝑋𝑛, 𝜎𝑗²𝐼) goes to infinity.
The log likelihood therefore also goes to infinity, and we cannot derive an appropriate maximum-likelihood solution!
This does not occur for a single Gaussian (the usual uni-modal case), because if the variance collapses onto one data point, the likelihood factors from all the other data points shrink fast enough to drive the overall likelihood to zero rather than infinity.
To overcome this issue, we can use heuristics such as…
1. Resetting the collapsing component's mean to a randomly chosen value when the issue occurs.
2. Resetting its covariance to some larger value, i.e., keeping it bounded away from zero.
Chapter 9.2. Mixtures of Gaussians
8
EM for Gaussian mixtures
We have defined the parameters and the likelihood. Now we have to figure out how to optimize those parameters!
Here we use EM (Expectation Maximization) to obtain the desired parameters. The general version of EM will be covered soon; for now, let's look at how it is applied to the Gaussian mixture.
First, let's find the estimate of 𝜇𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕𝜇𝑘 = 0.
Second, let's find the estimate of Σ𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕Σ𝑘 = 0.
Last, let's find the estimate of 𝜋𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕𝜋𝑘 = 0, subject to ∑𝑘 𝜋𝑘 = 1 (handled with a Lagrange multiplier).
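Setting each derivative to zero yields the standard closed-form re-estimation equations, with 𝑁𝑘 the effective number of points assigned to component 𝑘:

$$
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,x_n,
$$
$$
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf T}, \qquad
\pi_k = \frac{N_k}{N}.
$$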
Chapter 9.2. Mixtures of Gaussians
9
EM for Gaussian mixtures
Okay! We've got the solution of the Gaussian mixture model! Finished! (Is it???)
Not really, because the right-hand side of each equation contains the parameters themselves!
Here, 𝜸(𝒛𝒏𝒌) itself contains 𝝁𝒌. It is like an implicit equation of the form 𝜽𝒊 = (𝟏/𝑵) ∑𝒋 𝜽𝒋𝒙𝒋.
So the EM algorithm proceeds in an iterative way!
EM consists of an expectation step (E-Step) and a maximization step (M-Step).
Suppose we are now at iteration 𝑡, so we have 𝜇𝑘^(𝑡), Σ𝑘^(𝑡), 𝜋𝑘^(𝑡).
In the E-Step, we calculate each 𝛾(𝑧𝑛𝑘)^(𝑡) and the other distributional quantities by plugging in the 𝜇𝑘^(𝑡), Σ𝑘^(𝑡), 𝜋𝑘^(𝑡) values.
In the M-Step, we update each parameter by the aforementioned equations. For example, 𝜇𝑘^(𝑡+1) = (1/𝑁𝑘^(𝑡)) ∑𝑛 𝛾(𝑧𝑛𝑘)^(𝑡) 𝑋𝑛.
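Putting the E-step and M-step together, a minimal sketch of the whole loop might look like this (assuming NumPy and SciPy; names are illustrative, random initialization is used here although K-Means initialization is common, and a small ridge is added to each covariance as a crude guard against the singularities discussed earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Minimal EM sketch for a Gaussian mixture model."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, K, replace=False)]                  # means
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)     # initial covariances
    pi = np.full(K, 1.0 / K)                                 # mixing coefficients
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk) under the current parameters
        dens = np.stack([pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
                         for k in range(K)], axis=1)         # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                               # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    log_lik = np.log(dens.sum(axis=1)).sum()                 # ln p(X | pi, mu, Sigma) from the last E-step
    return pi, mu, cov, gamma, log_lik
```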
Chapter 9.2. Mixtures of Gaussians
10
EM for Gaussian mixtures
The overall process can be written as…
Initial values 𝜇𝑘^(0), Σ𝑘^(0), 𝜋𝑘^(0) can be obtained by running K-Means first.
In practice, the GMM usually needs many more iterations to converge than K-Means does.
Now, let's focus on the more fundamental notion behind the EM algorithm!
Chapter 9.3. An Alternative View of EM
11
General equation of likelihood
Here, the summation appears inside the log, which makes the calculation much harder than when the summation is outside the log.
Now, there are two terms: ‘complete dataset’ and ‘incomplete dataset’.
If we observe 𝑿 and 𝒁 together, we have the ‘complete dataset’. That is, we would know exactly which data point belongs to which cluster!
In practice this is not available; we only observe 𝑿, which is called the ‘incomplete dataset’.
For now, suppose we did observe both 𝑍 and 𝑋. Then we would not need to infer which cluster each data point belongs to, and maximizing the complete-data log likelihood would be straightforward.
Our goal is to estimate a general parameter 𝜃; we do not assume any particular distribution or model form, so the treatment here is fully general.
Here again we obtain the estimate in an iterative way!
Since 𝜃 affects the latent variable 𝑍, we take the expectation of ln 𝑝(𝑋, 𝑍|𝜃) under the posterior of 𝑍 evaluated at the old parameters.
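In standard notation, this expected complete-data log likelihood is the Q function that the M-step maximizes:

$$
Q(\theta, \theta^{\text{old}}) \;=\; \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \, \ln p(X, Z \mid \theta),
\qquad
\theta^{\text{new}} = \arg\max_{\theta} \, Q(\theta, \theta^{\text{old}}).
$$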
Chapter 9.3. An Alternative View of EM
12
General equation of likelihood
In fact, we do not observe 𝑍, so we work with its posterior expectation instead.
Here, we can incorporate a prior and obtain a MAP estimate simply by replacing the expected complete-data log likelihood with 𝑄(𝜃, 𝜃𝑜𝑙𝑑) + ln 𝑝(𝜃) in the M-step.
Now, let's move on to examples of the EM algorithm applied to various models.
First, take a look at the Gaussian mixture!
The complete data can be expressed as the pairs {𝑋𝑛, 𝑍𝑛}.
Chapter 9.3. An Alternative View of EM
13
Gaussian mixtures revisited
The joint pdf of the Gaussian mixture over the complete data can be expressed as shown below.
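In the standard form (with 𝑧𝑛𝑘 the one-of-K indicator), the complete-data likelihood and its log are:

$$
p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}},
$$
$$
\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.
$$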
Compared with the incomplete-data log likelihood, the summation and the log have been interchanged!
Since the summation now sits outside the log, estimating the parameters is extremely easy.
Now, let's think about the expectation of the latent variable.
Here, the posterior of the latent variables can be expressed in a much simpler, factorized form.
Chapter 9.3. An Alternative View of EM
14
Gaussian mixtures revisited
That is, 𝑍𝑛 depends only on 𝑋𝑛; the other data points 𝑋1, …, 𝑋𝑛−1, 𝑋𝑛+1, … have no influence on it.
Using Bayes' theorem, 𝑝(𝑍|𝑋) = 𝑝(𝑍)𝑝(𝑋|𝑍)/𝑝(𝑋), together with this conditional independence, the expectation of the latent variable is the responsibility
E[𝑧𝑛𝑘] = 𝛾(𝑧𝑛𝑘) = 𝜋𝑘 𝑁(𝑋𝑛|𝜇𝑘, Σ𝑘) / ∑𝑗 𝜋𝑗 𝑁(𝑋𝑛|𝜇𝑗, Σ𝑗).
Furthermore, we can estimate the other parameters iteratively by maximizing the expectation of the complete-data log likelihood above!
Hard-assignment K-Means can also be derived in this manner.
Assume each component is a Gaussian with shared covariance 𝜖𝐼. If we let the variance 𝜖 → 0, then 𝛾(𝑧𝑛𝑘) → 1 for the closest cluster and → 0 for all the others.
Thus, the assignment becomes the hard assignment 𝑟𝑛𝑘 used by K-Means, as shown below.
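Concretely, with shared covariances 𝜖𝐼 the responsibility takes the form

$$
\gamma(z_{nk}) \;=\; \frac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \}}
{\sum_j \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \}}
\;\xrightarrow{\;\epsilon \to 0\;}\; r_{nk},
$$

so each point's responsibility concentrates entirely on its nearest center in the limit.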
Chapter 9.3. An Alternative View of EM
15
Bernoulli distribution
We can apply latent variables and the EM algorithm to the Bernoulli distribution as well.
This is also called ‘latent class analysis’, and it provides a foundation for hidden Markov models over discrete variables.
Previously we considered a single set of Bernoulli parameters; 𝑥 is a 𝐷-dimensional binary vector, e.g. 𝑥 = (1, 0, 0, 0, …, 0).
For that single multivariate Bernoulli, the probability can be defined as shown below.
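In standard notation, the single 𝐷-dimensional Bernoulli and the corresponding mixture of Bernoullis are:

$$
p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},
\qquad
p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
$$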
It is important to capture the general idea of the Bernoulli mixture.
Here, we are going to introduce a latent variable into the Bernoulli model!
Only one variable survives, but it is expressed as a linear sum of the values in a specific column!
Chapter 9.3. An Alternative View of EM
16
Bernoulli distribution
As you can see from the covariance term, the Bernoulli mixture can capture relationships between variables, unlike a single multivariate Bernoulli whose covariance matrix is diagonal!
That is, the correlation among the components of 𝑿𝟏 = (𝑿𝟏𝟏, 𝑿𝟏𝟐, 𝑿𝟏𝟑, … , 𝑿𝟏𝒑) can be captured by this model! Here also we consider the complete data.
Previous case vs. now (single Bernoulli vs. Bernoulli mixture).
Chapter 9.3. An Alternative View of EM
17
Bernoulli distribution
Here again we define the expectation of the latent variable (the responsibilities) as 𝛾(𝑧𝑛𝑘) = 𝜋𝑘 𝑝(𝑋𝑛|𝜇𝑘) / ∑𝑗 𝜋𝑗 𝑝(𝑋𝑛|𝜇𝑗).
Using it, we can again derive update equations for the parameters.
The parameters can be estimated by 𝜇𝑘 = (1/𝑁𝑘) ∑𝑛 𝛾(𝑧𝑛𝑘) 𝑋𝑛 and 𝜋𝑘 = 𝑁𝑘/𝑁, where 𝑁𝑘 = ∑𝑛 𝛾(𝑧𝑛𝑘).
Before moving on, let's get this straight!
What we did in Section 9.3 is to obtain the estimates by maximizing the expectation of the complete-data log likelihood.
Here, the E-step, which finds that expectation, amounts to computing the responsibilities 𝛾(𝑧𝑛𝑘), as in the sketch below.
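A minimal sketch of EM for the Bernoulli mixture under these equations (assuming NumPy; names are illustrative), working in log space for numerical stability:

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=50, seed=0, eps=1e-10):
    """Minimal EM sketch for a mixture of Bernoullis (X is an N x D binary matrix)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))     # component probabilities mu_ki
    pi = np.full(K, 1.0 / K)                      # mixing coefficients
    for _ in range(n_iter):
        # E-step: log p(x_n | mu_k) = sum_i [x_ni ln mu_ki + (1 - x_ni) ln(1 - mu_ki)]
        log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
        log_w = log_p + np.log(pi + eps)
        log_w -= log_w.max(axis=1, keepdims=True)            # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)             # responsibilities gamma(z_nk)
        # M-step: mu_k = responsibility-weighted mean of the data, pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        pi = Nk / N
    return pi, mu, gamma
```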
Chapter 9.3. An Alternative View of EM
18
EM for Bayesian linear regression
We can apply the EM algorithm to Bayesian linear regression as well. We use it for the evidence approximation, where we estimate the hyperparameters.
We marginalized out the parameter 𝒘 to obtain the desired marginal likelihood; thus, we treat the parameter 𝒘 as the latent variable.
Here, 𝑀 denotes the number of parameters (basis functions) in the regression.
The EM re-estimation equations are not identical to the ones obtained earlier from the evidence approximation, but they converge to the same solution!
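As a sketch of one such update (using the posterior mean 𝑚𝑁 and covariance 𝑆𝑁 of 𝒘 from the E-step), maximizing the expected complete-data log likelihood over 𝛼 gives

$$
\alpha^{\text{new}} \;=\; \frac{M}{\mathbb{E}[\mathbf{w}^{\mathsf T}\mathbf{w}]}
\;=\; \frac{M}{\,m_N^{\mathsf T} m_N + \operatorname{Tr}(S_N)\,},
$$

with 𝑀 the number of parameters, matching the remark above about the two re-estimation schemes sharing the same fixed point.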
Chapter 9.4. The EM Algorithm in General
19
General idea of EM
What exactly is EM?
EM is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables!
Please focus on our goal.
Our goal is to maximize the likelihood function 𝑝(𝑋|𝜃) = ∑𝑍 𝑝(𝑋, 𝑍|𝜃).
Here, let 𝑞(𝑍) be any distribution over 𝑍 itself. Regardless of the choice of 𝑞(𝑍), the following decomposition holds true.
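Explicitly, the standard decomposition is

$$
\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p),
$$
$$
\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)},
\qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}.
$$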
Now, from this equation, we are going to study why the EM algorithm increases the likelihood.
Please keep in mind that 𝑍 is a latent variable and 𝜃 is the parameter we are trying to estimate.
Chapter 9.4. The EM Algorithm in General
20
1st : KL-Divergence and E-Step
As you all know, the KL divergence is non-negative: 𝐾𝐿(𝑞||𝑝) ≥ 0.
If we move ℒ to the left-hand side of the decomposition, we simply get ln 𝑝(𝑋|𝜃) − ℒ(𝑞, 𝜃) = 𝐾𝐿(𝑞||𝑝) ≥ 0.
So ln 𝑝(𝑋|𝜃) ≥ ℒ(𝑞, 𝜃), which means ℒ(𝑞, 𝜃) gives a lower bound on our objective function.
As we've seen in previous sections, we hold 𝜃(𝑜𝑙𝑑) fixed and find the expectation over 𝑍.
In the E-step, we maximize ℒ(𝑞, 𝜃(𝑜𝑙𝑑)) with respect to 𝑞(𝑍). Since ln 𝑝(𝑋|𝜃(𝑜𝑙𝑑)) is fixed, the only way to maximize ℒ(𝑞, 𝜃(𝑜𝑙𝑑)) is to make the KL term equal to zero.
This means 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃(𝑜𝑙𝑑)): we set 𝑞(𝑍) equal to the current posterior over the latent variables!
(Figure panels: general framework of EM; E-Step of EM.)
Chapter 9.4. The EM Algorithm in General
21
2nd : Finding new 𝜽 in M-Step
As you can see, in the M-Step we are interested in 𝜃. Substituting 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃𝑜𝑙𝑑) into ℒ gives 𝒬(𝜃, 𝜃𝑜𝑙𝑑) plus a constant, which is the function we used in previous sections.
Now we maximize this function to derive a new 𝜽 that gives a bigger likelihood! This is the M-Step.
The figure explains the overall behaviour well: we compute the blue curve (the lower bound for the current 𝑞), move to the 𝜃 value that maximizes it, and then the green curve (the bound after the next E-step) lies above the previous blue one.
Likewise, we sequentially update 𝜽 and the corresponding distribution to reach the desired maximized value.
Chapter 9.4. The EM Algorithm in General
22
Related examples
For the particular case of an i.i.d. dataset, we can rewrite 𝑝(𝑍|𝑋, 𝜃) = ∏𝑛 𝑝(𝑍𝑛|𝑋𝑛, 𝜃).
This means the responsibility for each data point depends only on 𝑋𝑛.
The other points in the dataset have no influence when computing that responsibility,
so it is fine to compute it from the related variables only.
The EM algorithm can also be applied to maximize a posterior distribution 𝑝(𝜃|𝑋) (MAP estimation)!
EM is efficient for many optimization problems, but there remain difficulties in some tasks. Possible breakthroughs include
Generalized Expectation Maximization (GEM),
a good choice of the family 𝑞𝜃(𝑍), incremental schemes that use only a single data point to update the corresponding parameters, etc.
There are many extensions of expectation maximization!