The Expectation Maximization or EM algorithm
Carl Edward Rasmussen
November 15th, 2017
Contents
• notation, objective
• the lower bound functional, F(q(z), θ)
• the EM algorithm
• example: Gaussian mixture model
• Appendix: KL divergence
Notation
Probabilistic models may have visible (or observed) variables y, latent variables
(also called hidden or unobserved variables, or missing data) z, and parameters θ.
Example: in a Gaussian mixture model, the visible variables are the observations,
the latent variables are the assignments of data points to mixture components and
the parameters are the means, variances, and weights of the mixture components.
The likelihood, p(y|θ), is the probability of the visible variables given the
parameters. The goal of the EM algorithm is to find parameters θ which
maximize the likelihood. The EM algorithm is iterative and converges to a local
maximum.
Throughout, q(z) will be used to denote an arbitrary distribution over the latent
variables z. The exposition will assume that the latent variables are continuous,
but an analogous derivation for discrete z can be obtained by replacing integrals
with sums.
The lower bound
Bayes’ rule:
\[
p(z|y,\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{p(y|\theta)}
\quad\Leftrightarrow\quad
p(y|\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{p(z|y,\theta)}.
\]
Multiply and divide by an arbitrary (non-zero) distribution q(z):
\[
p(y|\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{q(z)}\;\frac{q(z)}{p(z|y,\theta)},
\]
take logarithms:
\[
\log p(y|\theta) = \log\frac{p(y|z,\theta)\,p(z|\theta)}{q(z)} + \log\frac{q(z)}{p(z|y,\theta)},
\]
and average both sides wrt q(z):
\[
\log p(y|\theta)
= \underbrace{\int q(z)\,\log\frac{p(y|z,\theta)\,p(z|\theta)}{q(z)}\,dz}_{\text{lower bound functional } F(q(z),\theta)}
+ \underbrace{\int q(z)\,\log\frac{q(z)}{p(z|y,\theta)}\,dz}_{\text{non-negative } \mathrm{KL}(q(z)\,\|\,p(z|y,\theta))}.
\]
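For a discrete latent variable the integrals become sums, and the decomposition can be checked numerically. Below is a minimal sketch, assuming numpy is available; the toy model (one observation from a two-component Gaussian mixture) and all variable names are illustrative, not part of the lecture.

import numpy as np

# Toy setting: one observation y, discrete latent z in {0, 1},
# so the integrals of the derivation become sums over z.
y = 1.3
pi = np.array([0.4, 0.6])                    # p(z|theta)
mu = np.array([-1.0, 2.0])                   # component means
var = np.array([0.6, 1.5])                   # component variances

lik = np.exp(-(y - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # p(y|z,theta)
evidence = np.sum(pi * lik)                  # p(y|theta)
posterior = pi * lik / evidence              # p(z|y,theta)

q = np.array([0.8, 0.2])                     # an arbitrary distribution q(z)

F = np.sum(q * np.log(lik * pi / q))         # lower bound functional F(q, theta)
KL = np.sum(q * np.log(q / posterior))       # KL(q || p(z|y,theta)), non-negative

print(np.log(evidence))                      # log p(y|theta)
print(F + KL)                                # the same number: F + KL recovers it
print(F <= np.log(evidence), KL >= 0)        # F is indeed a lower bound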
The EM algorithm
From initial (random) parameters θ^{t=0}, iterate the following two steps for t = 1, . . . , T:
E step: for fixed θ^{t−1}, maximize the lower bound F(q(z), θ^{t−1}) wrt q(z). Since
the log likelihood log p(y|θ) is independent of q(z), maximizing the lower bound
is equivalent to minimizing KL(q(z)||p(z|y, θ^{t−1})), so q^t(z) = p(z|y, θ^{t−1}).
M step: for fixed q^t(z), maximize the lower bound F(q^t(z), θ) wrt θ. We have:
\[
F(q(z),\theta) = \int q(z)\,\log\big(p(y|z,\theta)\,p(z|\theta)\big)\,dz - \int q(z)\,\log q(z)\,dz,
\]
whose second term is the entropy of q(z), independent of θ, so the M step is
\[
\theta^{t} = \operatorname*{argmax}_{\theta} \int q^{t}(z)\,\log\big(p(y|z,\theta)\,p(z|\theta)\big)\,dz.
\]
Although the steps work with the lower bound, each iteration cannot decrease the
log likelihood, as
\[
\log p(y|\theta^{t-1})
\overset{\text{E step}}{=} F(q^{t}(z),\theta^{t-1})
\overset{\text{M step}}{\leqslant} F(q^{t}(z),\theta^{t})
\overset{\text{lower bound}}{\leqslant} \log p(y|\theta^{t}).
\]
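As a concrete illustration of the two steps and of the inequality chain above, here is a minimal runnable sketch, assuming numpy, of EM for the one-dimensional Gaussian mixture treated on the following slides; the synthetic data, initial parameters and helper function are illustrative choices, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two Gaussians
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 100)])
n, k = len(y), 2

# initial (random-ish) parameters theta^0
pi, mu, var = np.full(k, 1.0 / k), np.array([-1.0, 1.0]), np.full(k, 1.0)

def log_likelihood(y, pi, mu, var):
    # log p(y|theta) = sum_i log sum_j pi_j N(y_i; mu_j, var_j)
    dens = np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens @ pi))

prev = -np.inf
for t in range(50):
    # E step: q^t(z_i = j) = p(z_i = j | y_i, theta^{t-1}) (the responsibilities)
    dens = np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    u = pi * dens
    r = u / u.sum(axis=1, keepdims=True)

    # M step: maximize F(q^t, theta) wrt theta (weighted-average updates)
    Nj = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / Nj
    var = (r * (y[:, None] - mu)**2).sum(axis=0) / Nj
    pi = Nj / n

    ll = log_likelihood(y, pi, mu, var)
    assert ll >= prev - 1e-9     # each iteration cannot decrease log p(y|theta)
    prev = ll

print(mu, np.sqrt(var), pi)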
EM as Coordinate Ascent in F
The E step maximizes F(q(z), θ) with respect to q(z) for fixed θ, and the M step
maximizes it with respect to θ for fixed q(z): EM is coordinate ascent on the
lower bound functional F.
Example: Mixture of Gaussians
In a Gaussian mixture model, the parameters are θ = {µ_j, σ_j^2, π_j}_{j=1...k}, the
mixture means, variances and mixing proportions for each of the k components.
There is one latent variable per data point, z_i, i = 1, . . . , n, taking on values 1, . . . , k.
The probability of the observations given the latent variables and the parameters,
and the prior on latent variables are
\[
p(y_i|z_i=j,\theta) = \exp\Big(-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big)\Big/\sqrt{2\pi\sigma_j^2},
\qquad
p(z_i=j|\theta) = \pi_j,
\]
so the E step becomes:
\[
q(z_i=j) \propto u_{ij} = \pi_j \exp\big(-(y_i-\mu_j)^2/2\sigma_j^2\big)\big/\sqrt{2\pi\sigma_j^2}
\;\;\Rightarrow\;\;
q(z_i=j) = r_{ij} = \frac{u_{ij}}{u_i},
\]
where u_i = \sum_{j=1}^{k} u_{ij}. This shows that the posterior for each latent variable z_i
follows a discrete distribution, with probabilities given by the product of the prior
and the likelihood, renormalized. Here, r_{ij} is called the responsibility that component
j takes for data point i.
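A minimal sketch of this E step, assuming numpy; the observations and parameter values are illustrative, and the array names mirror the slide's u_{ij} and r_{ij}.

import numpy as np

y = np.array([-2.1, -1.7, 0.2, 2.9, 3.4])    # observations y_i (illustrative)
pi = np.array([0.5, 0.5])                    # mixing proportions pi_j
mu = np.array([-2.0, 3.0])                   # means mu_j
var = np.array([1.0, 0.25])                  # variances sigma_j^2

# u_ij = pi_j * N(y_i; mu_j, sigma_j^2), an (n, k) array
u = pi * np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# r_ij = u_ij / u_i: each row is a discrete posterior over the k components
r = u / u.sum(axis=1, keepdims=True)

print(r)                                     # responsibilities
print(r.sum(axis=1))                         # each row sums to 1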
Example: Mixture of Gaussians continued
The lower bound is
\[
F(q(z),\theta) = \sum_{i=1}^{n}\sum_{j=1}^{k} q(z_i=j)\Big[\log \pi_j - \tfrac{1}{2}(y_i-\mu_j)^2/\sigma_j^2 - \tfrac{1}{2}\log\sigma_j^2\Big] + \text{const}.
\]
The M step optimizes F(q(z), θ) wrt the parameters θ:
\[
\frac{\partial F}{\partial \mu_j} = \sum_{i=1}^{n} q(z_i=j)\,\frac{y_i-\mu_j}{\sigma_j^2} = 0
\;\Rightarrow\;
\mu_j = \frac{\sum_{i=1}^{n} q(z_i=j)\,y_i}{\sum_{i=1}^{n} q(z_i=j)},
\]
\[
\frac{\partial F}{\partial \sigma_j^2} = \sum_{i=1}^{n} q(z_i=j)\Big[\frac{(y_i-\mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\Big] = 0
\;\Rightarrow\;
\sigma_j^2 = \frac{\sum_{i=1}^{n} q(z_i=j)\,(y_i-\mu_j)^2}{\sum_{i=1}^{n} q(z_i=j)},
\]
\[
\frac{\partial\big[F + \lambda\big(1-\sum_{j=1}^{k}\pi_j\big)\big]}{\partial \pi_j} = 0
\;\Rightarrow\;
\pi_j = \frac{1}{n}\sum_{i=1}^{n} q(z_i=j),
\]
which have nice interpretations in terms of weighted averages.
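A minimal sketch of these M step updates, assuming numpy; the data and starting parameters are illustrative, and the responsibilities are recomputed here exactly as on the previous slide so the snippet stands on its own.

import numpy as np

y = np.array([-2.1, -1.7, 0.2, 2.9, 3.4])    # observations (illustrative)
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 3.0])
var = np.array([1.0, 0.25])

# responsibilities r_ij from the E step of the previous slide
u = pi * np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
r = u / u.sum(axis=1, keepdims=True)

# M step: weighted averages with weights q(z_i = j) = r_ij
Nj = r.sum(axis=0)                                          # effective counts
mu_new = (r * y[:, None]).sum(axis=0) / Nj                  # weighted mean
var_new = (r * (y[:, None] - mu_new)**2).sum(axis=0) / Nj   # weighted variance
pi_new = Nj / len(y)                                        # average responsibility

print(mu_new, var_new, pi_new)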
Clustering with MoG
Appendix: some properties of KL divergence
The (asymmetric) Kullback–Leibler divergence (or relative entropy)
KL(q(x)||p(x)) is non-negative. To minimize it, add a Lagrange multiplier enforcing
proper normalization and take variational derivatives:
\[
\frac{\delta}{\delta q(x)}\Big[\int q(x)\,\log\frac{q(x)}{p(x)}\,dx + \lambda\Big(1-\int q(x)\,dx\Big)\Big]
= \log\frac{q(x)}{p(x)} + 1 - \lambda.
\]
Find the stationary point by setting the derivative to zero:
\[
q(x) = \exp(\lambda-1)\,p(x);
\]
the normalization condition gives λ = 1, so q(x) = p(x),
which corresponds to a minimum, since the second derivative is positive:
\[
\frac{\delta^2}{\delta q(x)\,\delta q(x)}\,\mathrm{KL}(q(x)\|p(x)) = \frac{1}{q(x)} > 0.
\]
The minimum value attained at q(x) = p(x) is KL(p(x)||p(x)) = 0, showing that
KL(q(x)||p(x))
• is non-negative
• attains its minimum 0 when p(x) and q(x) are equal (almost everywhere).
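Both properties are easy to check numerically for discrete distributions; a minimal sketch assuming numpy, with illustrative probability vectors.

import numpy as np

def kl(q, p):
    # KL(q || p) = sum_x q(x) log(q(x)/p(x)) for discrete distributions
    return np.sum(q * np.log(q / p))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(kl(q, p), kl(p, q))     # both non-negative, and generally unequal (asymmetric)
print(kl(p, p))               # 0.0 when the two distributions coincide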