Chapter 9
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 9. Mixture models and EM
2
Use of latent variables & clustering
Suppose we have a joint distribution 𝑝(𝑥, 𝑧).
We can obtain the marginal distribution 𝑝(𝑥) by marginalizing the latent variable 𝑧 out of the full distribution.
It is often useful and more convenient to introduce a latent variable 𝒛 here.
In this chapter, we are going to cover mixture models built on discrete latent variables.
Keywords are…
1. K-Means Clustering
2. Gaussian Mixture model
3. Expectation Maximization
In fact, we optimize the parameters of a Gaussian mixture by using the EM algorithm.
Details will be covered soon!
We are all familiar with the idea of clustering, so let's go straight to K-means!
Chapter 9.1. K-Means Clustering
3
Theoretical Idea
The K-Means procedure is already familiar to us; it is covered in multivariate analysis, data mining, etc.
Here, let's look at K-Means from the perspective of optimization.
Let 𝑋𝑛 denote the data points and 𝑘 index the clusters, with 𝜇𝑘 the center of cluster 𝑘.
If 𝑋𝑛 belongs to the 𝑘-th cluster, then 𝑟𝑛𝑘 = 1; otherwise 𝑟𝑛𝑗 = 0 for 𝑗 ≠ 𝑘.
Here, the sum-of-squares distortion is 𝐽 = ∑𝑛 ∑𝑘 𝑟𝑛𝑘 ||𝑋𝑛 − 𝜇𝑘||².
We are trying to minimize the overall distortion 𝐽. For fixed assignments 𝑟𝑛𝑘, 𝐽 is a quadratic function of 𝜇𝑘, so setting its derivative to zero gives the global minimum in closed form.
In fact, we know neither 𝑟𝑛𝑘 nor 𝜇𝑘, so we estimate both by minimizing the sum-of-squares distortion 𝑱.
Finding 𝑟𝑛𝑘 is easy: for each 𝑛, the 𝒌 that yields the minimum value of ||𝑿𝒏 − 𝝁𝒌||² is the optimal assignment.
Now let's find the optimal 𝜇𝑘. Taking the derivative of 𝐽 with respect to 𝜇𝑘 and setting it to zero gives 𝜇𝑘 = ∑𝑛 𝑟𝑛𝑘 𝑋𝑛 / ∑𝑛 𝑟𝑛𝑘, i.e., the mean of the points assigned to cluster 𝑘.
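As a minimal sketch of these two alternating updates (assuming NumPy; the function and variable names here are illustrative, not from the book), we assign each point to its nearest center and then recompute each center as the mean of its assigned points:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means sketch: alternately update r_nk and mu_k to reduce J."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]            # initial centers
    for _ in range(n_iter):
        # assignment step: r_nk = 1 for the nearest center, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        r = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        mu = np.stack([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    J = ((X - mu[r]) ** 2).sum()                             # final distortion
    return r, mu, J
```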
Chapter 9.1. K-Means Clustering
4
Implementation
1. It is important to choose appropriate initial values for 𝜇𝑘.
2. We can also use a sequential (online) update: 𝜇𝑘_new = 𝜇𝑘_old + 𝜂𝑛(𝑋𝑛 − 𝜇𝑘_old).
3. There is a more general version of K-Means, the K-medoids algorithm.
4. We assign each data point to exactly one specific cluster. This is called a ‘hard assignment’!
5. Furthermore, we can apply K-Means to the image segmentation task (see the sketch below)!
** Image segmentation : the process of partitioning a digital image into multiple
segments, in order to simplify the image into something more meaningful and easier to analyze.
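As a rough sketch of item 5 (assuming scikit-learn is available; the helper name quantize_image is made up for illustration), color quantization clusters the RGB pixel values and replaces each pixel by its cluster center:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; any K-means routine works

def quantize_image(img, K=3):
    """Toy color quantization: cluster RGB pixels with K-means and replace each
    pixel by its cluster center, giving a simple K-means-based segmentation."""
    H, W, C = img.shape
    pixels = img.reshape(-1, C).astype(float)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
    segmented = km.cluster_centers_[km.labels_].reshape(H, W, C)
    return segmented.astype(img.dtype)

# usage (hypothetical): segmented = quantize_image(np.asarray(some_rgb_image), K=3)
```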
Chapter 9.2. Mixtures of Gaussians
5
Implementation
https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95
Consider a multi-modal Gaussian mixture distribution!
We have already seen this distribution in Chapter 2.
Here, let's focus on parameter optimization!
𝒑(𝑿) = ∑𝒁 𝒑(𝑿|𝒁) 𝒑(𝒁)
Chapter 9.2. Mixtures of Gaussians
6
Implementation
Then how can we use these probabilities for clustering?
We have to assign each data point to a specific cluster once the data are given. By Bayes' theorem, the posterior probability of component 𝑘 becomes 𝛾(𝑧𝑘) = 𝜋𝑘 𝑁(𝑋|𝜇𝑘, Σ𝑘) / ∑𝑗 𝜋𝑗 𝑁(𝑋|𝜇𝑗, Σ𝑗).
This 𝛾(𝑧𝑘) can be viewed as the responsibility that component 𝑘 takes for explaining 𝑋!
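A minimal sketch of this computation (assuming NumPy and SciPy; the function name responsibilities is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pis)
    dens = np.stack([pis[k] * multivariate_normal(mus[k], covs[k]).pdf(X)
                     for k in range(K)], axis=1)        # (N, K) weighted densities
    return dens / dens.sum(axis=1, keepdims=True)       # normalize over components
```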
Furthermore, we can generate random samples from this distribution (ancestral sampling).
Details will be covered in Chapter 11!
The accompanying figure illustrates how the estimated clusters fit the original ground truth.
Chapter 9.2. Mixtures of Gaussians
7
Maximum likelihood
Suppose we have a data set of observations {𝑋1, 𝑋2, … , 𝑋𝑁}.
Then, from 𝑝(𝑋), the likelihood of the dataset is a product over the data points. To turn the product into a summation, we take the log!
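Written out, this is the standard Gaussian-mixture log likelihood:

$$
\ln p(X \mid \pi, \mu, \Sigma) \;=\; \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.
$$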
In this case, we have to think about the collapse of a mixture component onto a single data point, which produces a singularity in the likelihood.
Suppose a component's mean exactly matches one data point, i.e. 𝜇𝑗 = 𝑋𝑛. Then the corresponding density term becomes 𝑁(𝑋𝑛|𝑋𝑛, 𝜎𝑗²𝐼).
When only that one data point effectively belongs to the component, 𝜎𝑗 → 0, and 𝑁(𝑋𝑛|𝑋𝑛, 𝜎𝑗²𝐼) goes to infinity.
The log likelihood therefore also goes to infinity, and we cannot derive an appropriate maximum-likelihood solution!
This does not occur for a single Gaussian (the usual uni-modal case), because if the variance collapses onto one data point, the likelihood factors from all the other data points shrink fast enough to drive the overall likelihood to zero rather than infinity.
To overcome this issue, we can use heuristics such as…
1. Resetting the collapsing component's mean to a randomly chosen value when the issue occurs.
2. Resetting its covariance to some larger value, i.e., keeping it bounded away from zero.
Chapter 9.2. Mixtures of Gaussians
8
EM for Gaussian mixtures
We have defined the parameters and the likelihood. Now we have to figure out how to optimize those parameters!
Here we use EM (Expectation Maximization) to obtain the desired parameters. The general version of EM will be covered soon; for now, let's look at how it is applied to the Gaussian mixture.
First, let's find the estimate of 𝜇𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕𝜇𝑘 = 0.
Second, let's find the estimate of Σ𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕Σ𝑘 = 0.
Last, let's find the estimate of 𝜋𝑘 by setting 𝜕 ln 𝑝(𝑋|𝜋, 𝜇, Σ) / 𝜕𝜋𝑘 = 0, subject to ∑𝑘 𝜋𝑘 = 1 (handled with a Lagrange multiplier).
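Setting each derivative to zero yields the standard closed-form re-estimation equations, with 𝑁𝑘 the effective number of points assigned to component 𝑘:

$$
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,x_n,
$$
$$
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf T}, \qquad
\pi_k = \frac{N_k}{N}.
$$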
Chapter 9.2. Mixtures of Gaussians
9
EM for Gaussian mixtures
Okay! We've got the solution of the Gaussian mixture model! Finished! (Is it???)
Not really, because the right-hand side of each equation contains the parameters themselves!
Here, 𝜸(𝒛𝒏𝒌) itself contains 𝝁𝒌. It is like an implicit equation of the form 𝜽𝒊 = (𝟏/𝑵) ∑𝒋 𝜽𝒋𝒙𝒋.
So the EM algorithm proceeds in an iterative way!
EM consists of an expectation step (E-Step) and a maximization step (M-Step).
Suppose we are now at iteration 𝑡, so we have 𝜇𝑘^(𝑡), Σ𝑘^(𝑡), 𝜋𝑘^(𝑡).
In the E-Step, we calculate each 𝛾(𝑧𝑛𝑘)^(𝑡) and the other distributional quantities by plugging in the 𝜇𝑘^(𝑡), Σ𝑘^(𝑡), 𝜋𝑘^(𝑡) values.
In the M-Step, we update each parameter by the aforementioned equations. For example, 𝜇𝑘^(𝑡+1) = (1/𝑁𝑘^(𝑡)) ∑𝑛 𝛾(𝑧𝑛𝑘)^(𝑡) 𝑋𝑛.
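Putting the E-step and M-step together, a minimal sketch of the whole loop might look like this (assuming NumPy and SciPy; names are illustrative, random initialization is used here although K-Means initialization is common, and a small ridge is added to each covariance as a crude guard against the singularities discussed earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Minimal EM sketch for a Gaussian mixture model."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, K, replace=False)]                  # means
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)     # initial covariances
    pi = np.full(K, 1.0 / K)                                 # mixing coefficients
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk) under the current parameters
        dens = np.stack([pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
                         for k in range(K)], axis=1)         # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                               # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    log_lik = np.log(dens.sum(axis=1)).sum()                 # ln p(X | pi, mu, Sigma) from the last E-step
    return pi, mu, cov, gamma, log_lik
```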
Chapter 9.2. Mixtures of Gaussians
10
EM for Gaussian mixtures
The overall process can be written as…
Initial values 𝜇𝑘^(0), Σ𝑘^(0), 𝜋𝑘^(0) can be obtained by running K-Means first.
In practice, the GMM usually needs many more iterations to converge than K-Means does.
Now, let's focus on the more fundamental notion behind the EM algorithm!
Chapter 9.3. An Alternative View of EM
11
General equation of likelihood
Here, the summation appears inside the log, which makes the calculation much harder than when the summation is outside the log.
Now, there are two terms: ‘complete dataset’ and ‘incomplete dataset’.
If we observe 𝑿 and 𝒁 together, we have the ‘complete dataset’. That is, we would know exactly which data point belongs to which cluster!
In practice this is not available; we only observe 𝑿, which is called the ‘incomplete dataset’.
For now, suppose we did observe both 𝑍 and 𝑋. Then we would not need to infer which cluster each data point belongs to, and maximizing the complete-data log likelihood would be straightforward.
Our goal is to estimate a general parameter 𝜃; we do not assume any particular distribution or model form, so the treatment here is fully general.
Here again we obtain the estimate in an iterative way!
Since 𝜃 affects the latent variable 𝑍, we take the expectation of ln 𝑝(𝑋, 𝑍|𝜃) under the posterior of 𝑍 evaluated at the old parameters.
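In standard notation, this expected complete-data log likelihood is the Q function that the M-step maximizes:

$$
Q(\theta, \theta^{\text{old}}) \;=\; \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \, \ln p(X, Z \mid \theta),
\qquad
\theta^{\text{new}} = \arg\max_{\theta} \, Q(\theta, \theta^{\text{old}}).
$$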
Chapter 9.3. An Alternative View of EM
12
General equation of likelihood
In fact, we do not observe 𝑍, so we work with its posterior expectation instead.
Here, we can incorporate a prior and obtain a MAP estimate simply by replacing the expected complete-data log likelihood with 𝑄(𝜃, 𝜃𝑜𝑙𝑑) + ln 𝑝(𝜃) in the M-step.
Now, let's move on to examples of the EM algorithm applied to various models.
First, take a look at the Gaussian mixture!
The complete data can be expressed as the pairs {𝑋𝑛, 𝑍𝑛}.
Chapter 9.3. An Alternative View of EM
13
Gaussian mixtures revisited
The joint pdf of the Gaussian mixture over the complete data can be expressed as shown below.
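In the standard form (with 𝑧𝑛𝑘 the one-of-K indicator), the complete-data likelihood and its log are:

$$
p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}},
$$
$$
\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.
$$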
Compared with the incomplete-data log likelihood, the summation and the log have been interchanged!
Since the summation now sits outside the log, estimating the parameters is extremely easy.
Now, let's think about the expectation of the latent variable.
Here, the posterior of the latent variables can be expressed in a much simpler, factorized form.
Chapter 9.3. An Alternative View of EM
14
Gaussian mixtures revisited
That is, 𝑍𝑛 depends only on 𝑋𝑛; the other data points 𝑋1, …, 𝑋𝑛−1, 𝑋𝑛+1, … have no influence on it.
Using Bayes' theorem, 𝑝(𝑍|𝑋) = 𝑝(𝑍)𝑝(𝑋|𝑍)/𝑝(𝑋), together with this conditional independence, the expectation of the latent variable is the responsibility
E[𝑧𝑛𝑘] = 𝛾(𝑧𝑛𝑘) = 𝜋𝑘 𝑁(𝑋𝑛|𝜇𝑘, Σ𝑘) / ∑𝑗 𝜋𝑗 𝑁(𝑋𝑛|𝜇𝑗, Σ𝑗).
Furthermore, we can estimate the other parameters iteratively by maximizing the expectation of the complete-data log likelihood above!
Hard-assignment K-Means can also be derived in this manner.
Assume each component is a Gaussian with shared covariance 𝜖𝐼. If we let the variance 𝜖 → 0, then 𝛾(𝑧𝑛𝑘) → 1 for the closest cluster and → 0 for all the others.
Thus, the assignment becomes the hard assignment 𝑟𝑛𝑘 used by K-Means, as shown below.
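Concretely, with shared covariances 𝜖𝐼 the responsibility takes the form

$$
\gamma(z_{nk}) \;=\; \frac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \}}
{\sum_j \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \}}
\;\xrightarrow{\;\epsilon \to 0\;}\; r_{nk},
$$

so each point's responsibility concentrates entirely on its nearest center in the limit.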
Chapter 9.3. An Alternative View of EM
15
Bernoulli distribution
We can apply latent variables and the EM algorithm to the Bernoulli distribution as well.
This is also called ‘latent class analysis’, and it provides a foundation for hidden Markov models over discrete variables.
Previously we considered a single set of Bernoulli parameters; 𝑥 is a 𝐷-dimensional binary vector, e.g. 𝑥 = (1, 0, 0, 0, …, 0).
For that single multivariate Bernoulli, the probability can be defined as shown below.
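In standard notation, the single 𝐷-dimensional Bernoulli and the corresponding mixture of Bernoullis are:

$$
p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},
\qquad
p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
$$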
It is important to capture the general idea of the Bernoulli mixture.
Here, we are going to introduce a latent variable into the Bernoulli model!
Only one variable survives, but it is expressed as a linear sum of the values in a specific column!
Chapter 9.3. An Alternative View of EM
16
Bernoulli distribution
As you can see from the covariance term, the Bernoulli mixture can capture relationships between variables, unlike a single multivariate Bernoulli whose covariance matrix is diagonal!
That is, the correlation among the components of 𝑿𝟏 = (𝑿𝟏𝟏, 𝑿𝟏𝟐, 𝑿𝟏𝟑, … , 𝑿𝟏𝒑) can be captured by this model! Here also we consider the complete data.
Previous case vs. now (single Bernoulli vs. Bernoulli mixture).
Chapter 9.3. An Alternative View of EM
17
Bernoulli distribution
Here again we define the expectation of the latent variable (the responsibilities) as 𝛾(𝑧𝑛𝑘) = 𝜋𝑘 𝑝(𝑋𝑛|𝜇𝑘) / ∑𝑗 𝜋𝑗 𝑝(𝑋𝑛|𝜇𝑗).
Using it, we can again derive update equations for the parameters.
The parameters can be estimated by 𝜇𝑘 = (1/𝑁𝑘) ∑𝑛 𝛾(𝑧𝑛𝑘) 𝑋𝑛 and 𝜋𝑘 = 𝑁𝑘/𝑁, where 𝑁𝑘 = ∑𝑛 𝛾(𝑧𝑛𝑘).
Before moving on, let's get this straight!
What we did in Section 9.3 is to obtain the estimates by maximizing the expectation of the complete-data log likelihood.
Here, the E-step, which finds that expectation, amounts to computing the responsibilities 𝛾(𝑧𝑛𝑘), as in the sketch below.
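A minimal sketch of EM for the Bernoulli mixture under these equations (assuming NumPy; names are illustrative), working in log space for numerical stability:

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=50, seed=0, eps=1e-10):
    """Minimal EM sketch for a mixture of Bernoullis (X is an N x D binary matrix)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))     # component probabilities mu_ki
    pi = np.full(K, 1.0 / K)                      # mixing coefficients
    for _ in range(n_iter):
        # E-step: log p(x_n | mu_k) = sum_i [x_ni ln mu_ki + (1 - x_ni) ln(1 - mu_ki)]
        log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
        log_w = log_p + np.log(pi + eps)
        log_w -= log_w.max(axis=1, keepdims=True)            # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)             # responsibilities gamma(z_nk)
        # M-step: mu_k = responsibility-weighted mean of the data, pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        pi = Nk / N
    return pi, mu, gamma
```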
Chapter 9.3. An Alternative View of EM
18
EM for Bayesian linear regression
We can apply the EM algorithm to Bayesian linear regression as well. We use it for the evidence approximation, where we estimate the hyperparameters.
We marginalized out the parameter 𝒘 to obtain the desired marginal likelihood; thus, we treat the parameter 𝒘 as the latent variable.
Here, 𝑀 denotes the number of parameters (basis functions) in the regression.
The EM re-estimation equations are not identical to the ones obtained earlier from the evidence approximation, but they converge to the same solution!
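As a sketch of one such update (using the posterior mean 𝑚𝑁 and covariance 𝑆𝑁 of 𝒘 from the E-step), maximizing the expected complete-data log likelihood over 𝛼 gives

$$
\alpha^{\text{new}} \;=\; \frac{M}{\mathbb{E}[\mathbf{w}^{\mathsf T}\mathbf{w}]}
\;=\; \frac{M}{\,m_N^{\mathsf T} m_N + \operatorname{Tr}(S_N)\,},
$$

with 𝑀 the number of parameters, matching the remark above about the two re-estimation schemes sharing the same fixed point.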
Chapter 9.4. The EM Algorithm in General
19
General idea of EM
What exactly is EM?
EM is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables!
Please focus on our goal.
Our goal is to maximize the likelihood function 𝑝(𝑋|𝜃) = ∑𝑍 𝑝(𝑋, 𝑍|𝜃).
Here, let 𝑞(𝑍) be any distribution over 𝑍 itself. Regardless of the choice of 𝑞(𝑍), the following decomposition holds true.
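Explicitly, the standard decomposition is

$$
\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p),
$$
$$
\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)},
\qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}.
$$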
Now, from this equation, we are going to study why the EM algorithm increases the likelihood.
Please keep in mind that 𝑍 is a latent variable and 𝜃 is the parameter we are trying to estimate.
Chapter 9.4. The EM Algorithm in General
20
1st : KL-Divergence and E-Step
As you all know, the KL divergence is non-negative: 𝐾𝐿(𝑞||𝑝) ≥ 0.
If we move ℒ to the left-hand side of the decomposition, we simply get ln 𝑝(𝑋|𝜃) − ℒ(𝑞, 𝜃) = 𝐾𝐿(𝑞||𝑝) ≥ 0.
So ln 𝑝(𝑋|𝜃) ≥ ℒ(𝑞, 𝜃), which means ℒ(𝑞, 𝜃) gives a lower bound on our objective function.
As we've seen in previous sections, we hold 𝜃(𝑜𝑙𝑑) fixed and find the expectation over 𝑍.
In the E-step, we maximize ℒ(𝑞, 𝜃(𝑜𝑙𝑑)) with respect to 𝑞(𝑍). Since ln 𝑝(𝑋|𝜃(𝑜𝑙𝑑)) is fixed, the only way to maximize ℒ(𝑞, 𝜃(𝑜𝑙𝑑)) is to make the KL term equal to zero.
This means 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃(𝑜𝑙𝑑)): we set 𝑞(𝑍) equal to the current posterior over the latent variables!
(Figure panels: general framework of EM; E-Step of EM.)
Chapter 9.4. The EM Algorithm in General
21
2nd : Finding new 𝜽 in M-Step
As you can see, in the M-Step we are interested in 𝜃. Substituting 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃𝑜𝑙𝑑) into ℒ gives 𝒬(𝜃, 𝜃𝑜𝑙𝑑) plus a constant, which is the function we used in previous sections.
Now we maximize this function to derive a new 𝜽 that gives a bigger likelihood! This is the M-Step.
The figure explains the overall behaviour well: we compute the blue curve (the lower bound for the current 𝑞), move to the 𝜃 value that maximizes it, and then the green curve (the bound after the next E-step) lies above the previous blue one.
Likewise, we sequentially update 𝜽 and the corresponding distribution to reach the desired maximized value.
Chapter 9.4. The EM Algorithm in General
22
Related examples
For the particular case of an i.i.d. dataset, we can rewrite 𝑝(𝑍|𝑋, 𝜃) = ∏𝑛 𝑝(𝑍𝑛|𝑋𝑛, 𝜃).
This means the responsibility for each data point depends only on 𝑋𝑛.
The other points in the dataset have no influence when computing that responsibility,
so it is fine to compute it from the related variables only.
The EM algorithm can also be applied to maximize a posterior distribution 𝑝(𝜃|𝑋) (MAP estimation)!
EM is efficient for many optimization problems, but there remain difficulties in some tasks. Possible breakthroughs include
Generalized Expectation Maximization (GEM),
a good choice of the family 𝑞𝜃(𝑍), incremental schemes that use only a single data point to update the corresponding parameters, etc.
There are many extensions of expectation maximization!