The Expectation Maximization or EM algorithm
Carl Edward Rasmussen
November 15th, 2017
Contents
• notation, objective
• the lower bound functional, F(q(z), θ)
• the EM algorithm
• example: Gaussian mixture model
• Appendix: KL divergence
Notation
Probabilistic models may have visible (or observed) variables y, latent variables
(also called hidden or unobserved variables, or missing data) z, and parameters θ.
Example: in a Gaussian mixture model, the visible variables are the observations,
the latent variables are the assignments of data points to mixture components and
the parameters are the means, variances, and weights of the mixture components.
The likelihood, p(y|θ), is the probability of the visible variables given the
parameters. The goal of the EM algorithm is to find parameters θ which
maximize the likelihood. The EM algorithm is iterative and converges to a local
maximum.
Throughout, q(z) will be used to denote an arbitrary distribution over the latent
variables z. The exposition will assume that the latent variables are continuous,
but an analogous derivation for discrete z can be obtained by replacing integrals
with sums.
The lower bound
Bayes’ rule:
\[
p(z|y,\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{p(y|\theta)}
\quad\Leftrightarrow\quad
p(y|\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{p(z|y,\theta)}.
\]
Multiply and divide by an arbitrary (non-zero) distribution q(z):
\[
p(y|\theta) = \frac{p(y|z,\theta)\,p(z|\theta)}{q(z)}\;\frac{q(z)}{p(z|y,\theta)},
\]
take logarithms:
\[
\log p(y|\theta) = \log\frac{p(y|z,\theta)\,p(z|\theta)}{q(z)} + \log\frac{q(z)}{p(z|y,\theta)},
\]
and average both sides wrt q(z):
\[
\log p(y|\theta)
= \underbrace{\int q(z)\,\log\frac{p(y|z,\theta)\,p(z|\theta)}{q(z)}\,dz}_{\text{lower bound functional } F(q(z),\theta)}
+ \underbrace{\int q(z)\,\log\frac{q(z)}{p(z|y,\theta)}\,dz}_{\text{non-negative } \mathrm{KL}(q(z)\,\|\,p(z|y,\theta))}.
\]
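For a discrete latent variable the integrals become sums, and the decomposition can be checked numerically. Below is a minimal sketch, assuming numpy is available; the toy model (one observation from a two-component Gaussian mixture) and all variable names are illustrative, not part of the lecture.

import numpy as np

# Toy setting: one observation y, discrete latent z in {0, 1},
# so the integrals of the derivation become sums over z.
y = 1.3
pi = np.array([0.4, 0.6])                    # p(z|theta)
mu = np.array([-1.0, 2.0])                   # component means
var = np.array([0.6, 1.5])                   # component variances

lik = np.exp(-(y - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # p(y|z,theta)
evidence = np.sum(pi * lik)                  # p(y|theta)
posterior = pi * lik / evidence              # p(z|y,theta)

q = np.array([0.8, 0.2])                     # an arbitrary distribution q(z)

F = np.sum(q * np.log(lik * pi / q))         # lower bound functional F(q, theta)
KL = np.sum(q * np.log(q / posterior))       # KL(q || p(z|y,theta)), non-negative

print(np.log(evidence))                      # log p(y|theta)
print(F + KL)                                # the same number: F + KL recovers it
print(F <= np.log(evidence), KL >= 0)        # F is indeed a lower bound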
The EM algorithm
From initial (random) parameters θ^{t=0}, iterate the following two steps for t = 1, . . . , T:
E step: for fixed θ^{t−1}, maximize the lower bound F(q(z), θ^{t−1}) wrt q(z). Since
the log likelihood log p(y|θ) is independent of q(z), maximizing the lower bound
is equivalent to minimizing KL(q(z)||p(z|y, θ^{t−1})), so q^t(z) = p(z|y, θ^{t−1}).
M step: for fixed q^t(z), maximize the lower bound F(q^t(z), θ) wrt θ. We have:
\[
F(q(z),\theta) = \int q(z)\,\log\big(p(y|z,\theta)\,p(z|\theta)\big)\,dz - \int q(z)\,\log q(z)\,dz,
\]
whose second term is the entropy of q(z), independent of θ, so the M step is
\[
\theta^{t} = \operatorname*{argmax}_{\theta} \int q^{t}(z)\,\log\big(p(y|z,\theta)\,p(z|\theta)\big)\,dz.
\]
Although the steps work with the lower bound, each iteration cannot decrease the
log likelihood, as
\[
\log p(y|\theta^{t-1})
\overset{\text{E step}}{=} F(q^{t}(z),\theta^{t-1})
\overset{\text{M step}}{\leqslant} F(q^{t}(z),\theta^{t})
\overset{\text{lower bound}}{\leqslant} \log p(y|\theta^{t}).
\]
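As a concrete illustration of the two steps and of the inequality chain above, here is a minimal runnable sketch, assuming numpy, of EM for the one-dimensional Gaussian mixture treated on the following slides; the synthetic data, initial parameters and helper function are illustrative choices, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two Gaussians
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 100)])
n, k = len(y), 2

# initial (random-ish) parameters theta^0
pi, mu, var = np.full(k, 1.0 / k), np.array([-1.0, 1.0]), np.full(k, 1.0)

def log_likelihood(y, pi, mu, var):
    # log p(y|theta) = sum_i log sum_j pi_j N(y_i; mu_j, var_j)
    dens = np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens @ pi))

prev = -np.inf
for t in range(50):
    # E step: q^t(z_i = j) = p(z_i = j | y_i, theta^{t-1}) (the responsibilities)
    dens = np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    u = pi * dens
    r = u / u.sum(axis=1, keepdims=True)

    # M step: maximize F(q^t, theta) wrt theta (weighted-average updates)
    Nj = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / Nj
    var = (r * (y[:, None] - mu)**2).sum(axis=0) / Nj
    pi = Nj / n

    ll = log_likelihood(y, pi, mu, var)
    assert ll >= prev - 1e-9     # each iteration cannot decrease log p(y|theta)
    prev = ll

print(mu, np.sqrt(var), pi)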
EM as Coordinate Ascent in F
The E step maximizes F(q(z), θ) with respect to q(z) for fixed θ, and the M step
maximizes it with respect to θ for fixed q(z): EM is coordinate ascent on the
lower bound functional F.
Example: Mixture of Gaussians
In a Gaussian mixture model, the parameters are θ = {µ_j, σ_j^2, π_j}_{j=1...k}, the
mixture means, variances and mixing proportions for each of the k components.
There is one latent variable per data point, z_i, i = 1, . . . , n, taking on values 1, . . . , k.
The probability of the observations given the latent variables and the parameters,
and the prior on latent variables are
\[
p(y_i|z_i=j,\theta) = \exp\Big(-\frac{(y_i-\mu_j)^2}{2\sigma_j^2}\Big)\Big/\sqrt{2\pi\sigma_j^2},
\qquad
p(z_i=j|\theta) = \pi_j,
\]
so the E step becomes:
\[
q(z_i=j) \propto u_{ij} = \pi_j \exp\big(-(y_i-\mu_j)^2/2\sigma_j^2\big)\big/\sqrt{2\pi\sigma_j^2}
\;\;\Rightarrow\;\;
q(z_i=j) = r_{ij} = \frac{u_{ij}}{u_i},
\]
where u_i = \sum_{j=1}^{k} u_{ij}. This shows that the posterior for each latent variable z_i
follows a discrete distribution, with probabilities given by the product of the prior
and the likelihood, renormalized. Here, r_{ij} is called the responsibility that component
j takes for data point i.
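A minimal sketch of this E step, assuming numpy; the observations and parameter values are illustrative, and the array names mirror the slide's u_{ij} and r_{ij}.

import numpy as np

y = np.array([-2.1, -1.7, 0.2, 2.9, 3.4])    # observations y_i (illustrative)
pi = np.array([0.5, 0.5])                    # mixing proportions pi_j
mu = np.array([-2.0, 3.0])                   # means mu_j
var = np.array([1.0, 0.25])                  # variances sigma_j^2

# u_ij = pi_j * N(y_i; mu_j, sigma_j^2), an (n, k) array
u = pi * np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# r_ij = u_ij / u_i: each row is a discrete posterior over the k components
r = u / u.sum(axis=1, keepdims=True)

print(r)                                     # responsibilities
print(r.sum(axis=1))                         # each row sums to 1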
Example: Mixture of Gaussians continued
The lower bound is
\[
F(q(z),\theta) = \sum_{i=1}^{n}\sum_{j=1}^{k} q(z_i=j)\Big[\log \pi_j - \tfrac{1}{2}(y_i-\mu_j)^2/\sigma_j^2 - \tfrac{1}{2}\log\sigma_j^2\Big] + \text{const}.
\]
The M step optimizes F(q(z), θ) wrt the parameters θ:
\[
\frac{\partial F}{\partial \mu_j} = \sum_{i=1}^{n} q(z_i=j)\,\frac{y_i-\mu_j}{\sigma_j^2} = 0
\;\Rightarrow\;
\mu_j = \frac{\sum_{i=1}^{n} q(z_i=j)\,y_i}{\sum_{i=1}^{n} q(z_i=j)},
\]
\[
\frac{\partial F}{\partial \sigma_j^2} = \sum_{i=1}^{n} q(z_i=j)\Big[\frac{(y_i-\mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\Big] = 0
\;\Rightarrow\;
\sigma_j^2 = \frac{\sum_{i=1}^{n} q(z_i=j)\,(y_i-\mu_j)^2}{\sum_{i=1}^{n} q(z_i=j)},
\]
\[
\frac{\partial\big[F + \lambda\big(1-\sum_{j=1}^{k}\pi_j\big)\big]}{\partial \pi_j} = 0
\;\Rightarrow\;
\pi_j = \frac{1}{n}\sum_{i=1}^{n} q(z_i=j),
\]
which have nice interpretations in terms of weighted averages.
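A minimal sketch of these M step updates, assuming numpy; the data and starting parameters are illustrative, and the responsibilities are recomputed here exactly as on the previous slide so the snippet stands on its own.

import numpy as np

y = np.array([-2.1, -1.7, 0.2, 2.9, 3.4])    # observations (illustrative)
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 3.0])
var = np.array([1.0, 0.25])

# responsibilities r_ij from the E step of the previous slide
u = pi * np.exp(-(y[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
r = u / u.sum(axis=1, keepdims=True)

# M step: weighted averages with weights q(z_i = j) = r_ij
Nj = r.sum(axis=0)                                          # effective counts
mu_new = (r * y[:, None]).sum(axis=0) / Nj                  # weighted mean
var_new = (r * (y[:, None] - mu_new)**2).sum(axis=0) / Nj   # weighted variance
pi_new = Nj / len(y)                                        # average responsibility

print(mu_new, var_new, pi_new)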
Clustering with MoG
Appendix: some properties of KL divergence
The (asymmetric) Kullback–Leibler divergence (or relative entropy)
KL(q(x)||p(x)) is non-negative. To minimize it, add a Lagrange multiplier enforcing
proper normalization and take variational derivatives:
\[
\frac{\delta}{\delta q(x)}\Big[\int q(x)\,\log\frac{q(x)}{p(x)}\,dx + \lambda\Big(1-\int q(x)\,dx\Big)\Big]
= \log\frac{q(x)}{p(x)} + 1 - \lambda.
\]
Find the stationary point by setting the derivative to zero:
\[
q(x) = \exp(\lambda-1)\,p(x);
\]
the normalization condition gives λ = 1, so q(x) = p(x),
which corresponds to a minimum, since the second derivative is positive:
\[
\frac{\delta^2}{\delta q(x)\,\delta q(x)}\,\mathrm{KL}(q(x)\|p(x)) = \frac{1}{q(x)} > 0.
\]
The minimum value attained at q(x) = p(x) is KL(p(x)||p(x)) = 0, showing that
KL(q(x)||p(x))
• is non-negative
• attains its minimum 0 when p(x) and q(x) are equal (almost everywhere).
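Both properties are easy to check numerically for discrete distributions; a minimal sketch assuming numpy, with illustrative probability vectors.

import numpy as np

def kl(q, p):
    # KL(q || p) = sum_x q(x) log(q(x)/p(x)) for discrete distributions
    return np.sum(q * np.log(q / p))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(kl(q, p), kl(p, q))     # both non-negative, and generally unequal (asymmetric)
print(kl(p, p))               # 0.0 when the two distributions coincide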