Eric Xing © Eric Xing @ CMU, 2006-2010 1
Machine Learning
Mixture Model, HMM, and
Expectation Maximization
Eric Xing
Lecture 9, August 14, 2010
Reading:
Eric Xing © Eric Xing @ CMU, 2006-2010 2
 Data log-likelihood:
$$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$$
$$= \sum_n \log \prod_k \pi_k^{z_n^k} \;+\; \sum_n \log \prod_k \mathcal{N}(x_n; \mu_k, \sigma)^{z_n^k}$$
$$= \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$$
 MLE:
$$\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D), \quad \hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D), \quad \hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)$$
$$\Rightarrow \quad \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$
 What if we do not know z_n?
Gaussian Discriminative Analysis
(plate model: z_i → x_i, repeated for i = 1..N)
Class posterior:
$$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{ \frac{1}{(2\pi\sigma^2)^{1/2}}\, \pi_k \exp\big\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \big\} }{ \sum_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}}\, \pi_{k'} \exp\big\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \big\} }$$
Eric Xing © Eric Xing @ CMU, 2006-2010 3
Clustering
Eric Xing © Eric Xing @ CMU, 2006-2010 4
Unobserved Variables
 A variable can be unobserved (latent) because:
 it is an imaginary quantity meant to provide some simplified and abstractive view
of the data generation process
 e.g., speech recognition models, mixture models …
 it is a real-world object and/or phenomena, but difficult or impossible to measure
 e.g., the temperature of a star, causes of a disease, evolutionary ancestors …
 it is a real-world object and/or phenomenon, but it sometimes wasn't measured
because of faulty sensors, or was measured with a noisy channel, etc.
 e.g., traffic radio, aircraft signal on a radar screen, …
 Discrete latent variables can be used to partition/cluster data
into sub-groups (mixture models, forthcoming).
 Continuous latent variables (factors) can be used for
dimensionality reduction (factor analysis, etc., later lectures).
Eric Xing © Eric Xing @ CMU, 2006-2010 5
Mixture Models
 A density model p(x) may be multi-modal.
 We may be able to model it as a mixture of uni-modal
distributions (e.g., Gaussians).
 Each mode may correspond to a different sub-population
(e.g., male and female).
Eric Xing © Eric Xing @ CMU, 2006-2010 6
Gaussian Mixture Models (GMMs)
 Consider a mixture of K Gaussian components:
 Z is a latent class indicator vector:
 X is a conditional Gaussian variable with a class-specific mean/covariance
 The likelihood of a sample:
$$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k \pi_k^{z_n^k}$$
$$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \Big\}$$
$$p(x_n \mid \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k\, \mathcal{N}(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
Here $\pi_k$ is the mixture proportion and $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ is the mixture component.
(Graphical model: Z → X)
Eric Xing © Eric Xing @ CMU, 2006-2010 7
Gaussian Mixture Models (GMMs)
 Consider a mixture of K Gaussian components:
 This model can be used for unsupervised clustering.
 This model (fit by AutoClass) has been used to discover new kinds of stars in
astronomical data, etc.
$$p(x_n \mid \mu, \Sigma) = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixture proportion and $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ is the mixture component.
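As a concrete illustration (not part of the lecture), here is a minimal sketch that evaluates this mixture density numerically; the function and variable names are made up for the example:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a K-component Gaussian mixture."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

# Example: a two-component 1-D mixture (e.g., two sub-populations)
pis = [0.4, 0.6]
mus = [np.array([165.0]), np.array([178.0])]
Sigmas = [np.array([[36.0]]), np.array([[49.0]])]
print(gmm_density(np.array([170.0]), pis, mus, Sigmas))
```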
Eric Xing © Eric Xing @ CMU, 2006-2010 8
Learning mixture models
 Given data
 Likelihood:
$$L(\pi, \mu, \Sigma; D) = \prod_n p(x_n \mid \pi, \mu, \Sigma) = \prod_n \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
$$\{\pi^*, \mu^*, \Sigma^*\} = \arg\max L(\pi, \mu, \Sigma; D)$$
Eric Xing © Eric Xing @ CMU, 2006-2010 9
Why is Learning Harder?
 In fully observed iid settings, the log likelihood decomposes
into a sum of local terms.
 With latent variables, all the parameters become coupled
together via marginalization
$$\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$
$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$
Eric Xing © Eric Xing @ CMU, 2006-2010 10
Toward the EM algorithm
 Recall MLE for completely observed data
 Data log-likelihood:
$$\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n \prod_k \big( \pi_k\, \mathcal{N}(x_n; \mu_k, \sigma) \big)^{z_n^k} = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$$
 MLE:
$$\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D), \quad \hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D), \quad \hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)$$
$$\Rightarrow \quad \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$
 What if we do not know z_n?
(plate model: z_i → x_i, repeated for i = 1..N)
Eric Xing © Eric Xing @ CMU, 2006-2010 11
Recall K-means
 Start:
 "Guess" the centroid µk and coveriance Σk of each of the K clusters
 Loop
 For each point n=1 to N,
compute its cluster label:
 For each cluster k=1:K
$$z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \big(\Sigma_k^{(t)}\big)^{-1} (x_n - \mu_k^{(t)})$$
$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}, \qquad \Sigma_k^{(t+1)} = \ldots
$$
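For reference, a minimal K-means sketch following the two steps above, assuming identity covariance so the assignment reduces to squared Euclidean distance; all names are illustrative:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # "guess" the centroids
    for _ in range(n_iters):
        # hard assignment: z_n = argmin_k ||x_n - mu_k||^2
        z = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # update: each centroid becomes the mean of its assigned points
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return z, mu
```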
Eric Xing © Eric Xing @ CMU, 2006-2010 12
Expectation-Maximization
 Start:
 "Guess" the centroid µk and coveriance Σk of each of the K clusters
 Loop
Eric Xing © Eric Xing @ CMU, 2006-2010 13
─ Expectation step: computing the expected value of the
sufficient statistics of the hidden variables (i.e., z) given
current est. of the parameters (i.e., π and µ).
 Here we are essentially doing inference
E-step:
$$\tau_n^{k(t)} = \big\langle z_n^k \big\rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}) = \frac{\pi_k^{(t)}\, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}$$
(plate model: Z_n → X_n, repeated N times)
Eric Xing © Eric Xing @ CMU, 2006-2010 14
─ Maximization step: compute the parameters given the current estimates of the
expected values of the hidden variables
 This is isomorphic to MLE, except that the hidden variables are replaced by their
expectations (in general they will be replaced by their corresponding "sufficient
statistics")
M-step:
$$\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle \;\; \text{s.t.} \;\; \sum_k \pi_k = 1, \; \frac{\partial}{\partial \pi_k}\langle \ell_c(\theta) \rangle = 0 \;\; \forall k \;\;\Rightarrow\;\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}$$
$$\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\;\Rightarrow\;\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$$
$$\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\;\Rightarrow\;\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$$
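Putting the E-step and M-step together, here is a minimal EM-for-GMM sketch (illustrative names, not from the slides); the hidden indicators are replaced by their expectations τ exactly as described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.stack([np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: tau[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_i pi_i N(x_n | mu_i, Sigma_i)
        tau = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                        for k in range(K)], axis=1)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with hidden indicators replaced by their expectations
        Nk = tau.sum(axis=0)
        pis = Nk / N
        mus = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, tau
```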
Eric Xing © Eric Xing @ CMU, 2006-2010 15
How is EM derived?
 A mixture of K Gaussians:
 Z is a latent class indicator vector
 X is a conditional Gaussian variable with a class-specific mean/covariance
 The likelihood of a sample:
 The “complete” likelihood
(plate model: Z_n → X_n, repeated N times)
$$p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k \pi_k^{z_n^k}$$
$$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \Big\}$$
$$p(x_n \mid \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k\, \mathcal{N}(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
$$p(x_n, z_n^k = 1 \mid \mu, \Sigma) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
$$p(x_n, z_n \mid \mu, \Sigma) = \prod_k \big[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]^{z_n^k}$$
But the complete likelihood is itself a random variable (a function of the unobserved z_n)! Not good as an objective function.
Eric Xing © Eric Xing @ CMU, 2006-2010 16
How is EM derived?
 The complete log likelihood:
$$\ell_c(\theta; D) = \log \prod_n p(z_n, x_n) = \sum_n \sum_k z_n^k \log \pi_k \;-\; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C$$
 The expected complete log likelihood:
$$\big\langle \ell_c(\theta; x, z) \big\rangle = \sum_n \big\langle \log p(z_n \mid \pi) \big\rangle_{p(z \mid x)} + \sum_n \big\langle \log p(x_n \mid z_n, \mu, \Sigma) \big\rangle_{p(z \mid x)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \Big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| + C \Big)$$
 We maximize $\langle \ell_c(\theta) \rangle$ iteratively using the above E-step/M-step procedure.
(plate model: Z_n → X_n, repeated N times)
Eric Xing © Eric Xing @ CMU, 2006-2010 17
Compare: K-means
 The EM algorithm for mixtures of Gaussians is like a "soft
version" of the K-means algorithm.
 In the K-means “E-step” we do hard assignment:
 In the K-means “M-step” we update the means as the
weighted sum of the data, but now the weights are 0 or 1:
K-means hard assignment:
$$z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \big(\Sigma_k^{(t)}\big)^{-1} (x_n - \mu_k^{(t)})$$
K-means mean update (weights are 0 or 1):
$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$
EM soft version (weights are the responsibilities $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}$):
$$\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$$
Eric Xing © Eric Xing @ CMU, 2006-2010 18
Theory underlying EM
 What are we doing?
 Recall that according to MLE, we intend to learn the model
parameters that maximize the likelihood of the data.
 But we do not observe z, so computing
$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$$
is difficult!
 What shall we do?
Eric Xing © Eric Xing @ CMU, 2006-2010 19
Complete & Incomplete Log
Likelihoods
 Complete log likelihood
Let X denote the observable variable(s), and Z denote the latent variable(s).
If Z could be observed, then
$$\ell_c(\theta; x, z) \;\stackrel{\mathrm{def}}{=}\; \log p(x, z \mid \theta)$$
 Usually, optimizing lc() given both z and x is straightforward (c.f. MLE for fully
observed models).
 Recall that in this case the objective, e.g., for MLE, decomposes into a sum of
factors, and the parameter for each factor can be estimated separately.
 But given that Z is not observed, lc() is a random quantity and cannot be
maximized directly.
 Incomplete log likelihood
With z unobserved, our objective becomes the log of a marginal probability:
$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$
 This objective won't decouple.
Eric Xing © Eric Xing @ CMU, 2006-2010 20
Expected Complete Log
Likelihood
 For any distribution q(z), define the expected complete log likelihood:
$$\big\langle \ell_c(\theta; x, z) \big\rangle_q \;\stackrel{\mathrm{def}}{=}\; \sum_z q(z \mid x) \log p(x, z \mid \theta)$$
 A deterministic function of θ
 Linear in lc() --- inherits its factorizability
 Does maximizing this surrogate yield a maximizer of the likelihood?
 Jensen's inequality:
$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\ge\; \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$$
$$\Rightarrow \quad \ell(\theta; x) \;\ge\; \big\langle \ell_c(\theta; x, z) \big\rangle_q + H_q$$
Eric Xing © Eric Xing @ CMU, 2006-2010 21
Lower Bounds and Free Energy
 For fixed data x, define a functional called the free energy:
$$F(q, \theta) \;\stackrel{\mathrm{def}}{=}\; \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\le\; \ell(\theta; x)$$
 The EM algorithm is coordinate-ascent on F:
 E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
 M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
Eric Xing © Eric Xing @ CMU, 2006-2010 22
E-step: maximization of expected
lc w.r.t. q
 Claim:
 This is the posterior distribution over the latent variables given the data and the
parameters. Often we need this at test time anyway (e.g. to perform
classification).
 Proof (easy): this setting attains the bound l(θ;x)≥F(q,θ )
 Can also show this result using variational calculus or the fact
that
$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$
$$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$$
$$\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big( q \,\|\, p(z \mid x, \theta) \big)$$
Eric Xing © Eric Xing @ CMU, 2006-2010 23
E-step ≡ plug in posterior
expectation of latent variables
 Without loss of generality: assume that p(x, z | θ) is a
generalized exponential family distribution:
$$p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z) \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}$$
 Special cases: if p(X|Z) are GLIMs, then $f_i(x, z) = \eta_i^T(z)\, \xi_i(x)$
 The expected complete log likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is
$$\big\langle \ell_c(\theta; x, z) \big\rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) - A(\theta) = \sum_i \theta_i \big\langle f_i(x, z) \big\rangle_{q(z \mid x, \theta^t)} - A(\theta)$$
$$\stackrel{\mathrm{GLIM}}{=} \; \sum_i \theta_i \big\langle \eta_i(z) \big\rangle_{q(z \mid x, \theta^t)}\, \xi_i(x) - A(\theta)$$
Eric Xing © Eric Xing @ CMU, 2006-2010 24
M-step: maximization of expected
lc w.r.t. θ
 Note that the free energy breaks into two terms:
 The first term is the expected complete log likelihood (energy) and the second
term, which does not depend on θ, is the entropy.
 Thus, in the M-step, maximizing with respect to θ for fixed q
we only need to consider the first term:
 Under optimal qt+1, this is equivalent to solving a standard MLE of fully observed
model p(x,z|θ), with the sufficient statistics involving z replaced by their
expectations w.r.t. p(z|x,θ).
$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \big\langle \ell_c(\theta; x, z) \big\rangle_q + H_q$$
$$\theta^{t+1} = \arg\max_\theta \big\langle \ell_c(\theta; x, z) \big\rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)$$
Eric Xing © Eric Xing @ CMU, 2006-2010 25
Summary: EM Algorithm
 A way of maximizing likelihood function for latent variable
models. Finds MLE of parameters when the original (hard)
problem can be broken up into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from observed data and current
parameters.
2. Using this “complete” data, find the maximum likelihood parameter estimates.
 Alternate between filling in the latent variables using the best
guess (posterior) and updating the parameters based on this
guess:
 E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
 M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
 In the M-step we optimize a lower bound on the likelihood. In
the E-step we close the gap, making bound = likelihood.
Eric Xing © Eric Xing @ CMU, 2006-2010 26
From static to dynamic mixture
models
Static mixture: a single latent variable Y1 generates an observation X1 (plate repeated N times).
Dynamic mixture: a chain of latent states Y1 → Y2 → Y3 → … → YT, each Yt emitting an observation Xt.
The underlying source: phonemes, dice (fair vs. loaded), …
The sequence: speech signal, sequence of rolls, …
Eric Xing © Eric Xing @ CMU, 2006-2010 27
Chromosomes of tumor cell:
Predicting Tumor Cell States
Eric Xing © Eric Xing @ CMU, 2006-2010 28
Copy number profile for chromosome
1 from 600 MPE cell line
Copy number profile for chromosome
8 from COLO320 cell line
60-70 fold amplification of CMYC region
Copy number profile for chromosome 8
in MDA-MB-231 cell line
deletion
DNA Copy number aberration
types in breast cancer
Eric Xing © Eric Xing @ CMU, 2006-2010 29
A real CGH run
Eric Xing © Eric Xing @ CMU, 2006-2010 30
Hidden Markov Model
 Observation space
Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$
Euclidean space: $\mathbb{R}^d$
 Index set of hidden states: $\mathcal{I} = \{1, 2, \ldots, M\}$
 Transition probabilities between any two states:
$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \; \forall i \in \mathcal{I}$
 Start probabilities: $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
 Emission probabilities associated with each state:
$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \; \forall i \in \mathcal{I}$
or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \; \forall i \in \mathcal{I}$
(Graphical model: a chain y1 → y2 → … → yT with emissions yt → xt; equivalently drawn as a state automaton over the hidden states.)
Eric Xing © Eric Xing @ CMU, 2006-2010 31
The Dishonest Casino
A casino has two dice:
 Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
 Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back-&-forth
between fair and loaded die once every
20 turns
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die,
maybe with loaded die)
4. Highest number wins $2
Eric Xing © Eric Xing @ CMU, 2006-2010 32
(State diagram: FAIR ⇄ LOADED, with switch probability 0.05 and stay probability 0.95.)
P(1|F) = 1/6
P(2|F) = 1/6
P(3|F) = 1/6
P(4|F) = 1/6
P(5|F) = 1/6
P(6|F) = 1/6
P(1|L) = 1/10
P(2|L) = 1/10
P(3|L) = 1/10
P(4|L) = 1/10
P(5|L) = 1/10
P(6|L) = 1/2
The Dishonest Casino Model
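As a minimal sketch (illustrative, not from the slides), the casino model written as arrays, with state 0 = FAIR and state 1 = LOADED; the ½–½ start probabilities are the ones used in the worked examples on the following slides:

```python
import numpy as np

pi = np.array([0.5, 0.5])                        # start probabilities
A = np.array([[0.95, 0.05],                      # A[i, j] = P(state j at t | state i at t-1)
              [0.05, 0.95]])
B = np.array([[1/6] * 6,                         # fair die: P(1..6 | F) = 1/6
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # loaded die: P(1..5 | L) = 1/10, P(6 | L) = 1/2
x = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1 # example roll sequence as 0-based symbol indices
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```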
Eric Xing © Eric Xing @ CMU, 2006-2010 33
Puzzles Regarding the Dishonest
Casino
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
 How likely is this sequence, given our model of how the casino
works?
 This is the EVALUATION problem in HMMs
 What portion of the sequence was generated with the fair die, and
what portion with the loaded die?
 This is the DECODING question in HMMs
 How “loaded” is the loaded die? How “fair” is the fair die? How often
does the casino player change from fair to loaded, and back?
 This is the LEARNING question in HMMs
Eric Xing © Eric Xing @ CMU, 2006-2010 34
Joint Probability
1245526462146146136136661664661636616366163616515615115146123562344
Eric Xing © Eric Xing @ CMU, 2006-2010 35
Probability of a Parse
 Given a sequence x = x1……xT
and a parse y = y1, ……, yT,
 To find how likely is the parse:
(given our HMM and the sequence)
p(x, y) = p(x1……xT, y1, ……, yT) (Joint probability)
= p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) … p(yT | yT-1) p(xT | yT)
= p(y1) P(y2 | y1) … p(yT | yT-1) × p(x1 | y1) p(x2 | y2) … p(xT | yT)
 Marginal probability:
$$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$$
 Posterior probability:
$$p(\mathbf{y} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{y}) / p(\mathbf{x})$$
Eric Xing © Eric Xing @ CMU, 2006-2010 36
Example: the Dishonest Casino
 Let the sequence of rolls be:
 x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
 Then, what is the likelihood of
 y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs a0Fair = ½, aoLoaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) =
½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9
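A one-line numeric check (not on the slide) of this computation:

```python
p_fair_path = 0.5 * (1 / 6) ** 10 * 0.95 ** 9
print(p_fair_path)   # ~5.2116e-09, i.e. about 5.21 x 10^-9
```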
Eric Xing © Eric Xing @ CMU, 2006-2010 37
Example: the Dishonest Casino
 So, the likelihood the die is fair in all this run
is just 5.21 × 10-9
 OK, but what is the likelihood of
 π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded,
Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) =
½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 0.79 × 10^-9
 Therefore, it is after all 6.59 times more likely that the die is fair
all the way, than that it is loaded all the way
Eric Xing © Eric Xing @ CMU, 2006-2010 38
Example: the Dishonest Casino
 Let the sequence of rolls be:
 x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
 Now, what is the likelihood π = F, F, …, F?
 ½ × (1/6)10 × (0.95)9 = 0.5 × 10-9, same as before
 What is the likelihood y = L, L, …, L?
½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 5 × 10^-7
 So, it is 100 times more likely the die is loaded
Eric Xing © Eric Xing @ CMU, 2006-2010 39
Three Main Questions on HMMs
1. Evaluation
GIVEN an HMM M, and a sequence x,
FIND Prob (x | M)
ALGO. Forward
2. Decoding
GIVEN an HMM M, and a sequence x ,
FIND the sequence y of states that maximizes, e.g., P(y | x , M),
or the most probable subsequence of states
ALGO. Viterbi, Forward-backward
3. Learning
GIVEN an HMM M, with unspecified transition/emission probs.,
and a sequence x,
FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ)
ALGO. Baum-Welch (EM)
Eric Xing © Eric Xing @ CMU, 2006-2010 40
Applications of HMMs
 Some early applications of HMMs
 finance, but we never saw them
 speech recognition
 modelling ion channels
 In the mid-late 1980s HMMs entered genetics and molecular
biology, and they are now firmly entrenched.
 Some current applications of HMMs to biology
 mapping chromosomes
 aligning biological sequences
 predicting sequence structure
 inferring evolutionary relationships
 finding genes in DNA sequence
Eric Xing © Eric Xing @ CMU, 2006-2010 41
The Forward Algorithm
 We want to calculate P(x), the likelihood of x, given the HMM
 Sum over all possible ways of generating x:
$$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$$
 To avoid summing over an exponential number of paths y, define the forward probability
$$\alpha_t^k \;\stackrel{\mathrm{def}}{=}\; P(x_1, \ldots, x_t, y_t^k = 1)$$
 The recursion:
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}, \qquad P(\mathbf{x}) = \sum_k \alpha_T^k$$
Eric Xing © Eric Xing @ CMU, 2006-2010 42
The Forward Algorithm –
derivation
 Compute the forward probability:
$$\alpha_t^k = P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1)$$
$$= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1}, x_1, \ldots, x_{t-1})\, P(x_t \mid y_t^k = 1, x_1, \ldots, x_{t-1}, y_{t-1})$$
$$= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1})\, P(x_t \mid y_t^k = 1)$$
$$= P(x_t \mid y_t^k = 1) \sum_i P(x_1, \ldots, x_{t-1}, y_{t-1}^i = 1)\, P(y_t^k = 1 \mid y_{t-1}^i = 1)$$
$$= P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$$
(Chain rule: $P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B)$)
Eric Xing © Eric Xing @ CMU, 2006-2010 43
The Forward Algorithm
 We can compute $\alpha_t^k$ for all k, t, using dynamic programming!
Initialization:
$$\alpha_1^k = P(x_1, y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, P(y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, \pi_k$$
Iteration:
$$\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$$
Termination:
$$P(\mathbf{x}) = \sum_k \alpha_T^k$$
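A minimal forward-pass sketch consistent with the recursion above; it assumes the array conventions (pi, A, B, integer observations x) from the casino sketch earlier, and all names are illustrative:

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_t, y_t = k); returns (alpha, P(x))."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = B[:, x[0]] * pi                       # initialization
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)   # b_k(x_t) * sum_i alpha_{t-1,i} a_{i,k}
    return alpha, alpha[-1].sum()                    # termination: P(x) = sum_k alpha_{T,k}
```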
Eric Xing © Eric Xing @ CMU, 2006-2010 44
The Backward Algorithm
 We want to compute $P(y_t^k = 1 \mid \mathbf{x})$,
the posterior probability distribution on the
t-th position, given x
 We start by computing the joint:
$$P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1) = \alpha_t^k\, \beta_t^k$$
Forward probability: $\alpha_t^k$.  Backward probability: $\beta_t^k \stackrel{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
 The recursion:
$$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
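The companion backward-pass sketch, under the same conventions and caveats as the forward sketch above:

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, M = len(x), A.shape[0]
    beta = np.zeros((T, M))
    beta[-1] = 1.0                                    # beta_T^k = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # sum_i a_{k,i} b_i(x_{t+1}) beta_{t+1,i}
    return beta
```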
Eric Xing © Eric Xing @ CMU, 2006-2010 45
Example:
(State diagram: FAIR ⇄ LOADED, with switch probability 0.05 and stay probability 0.95.)
P(1|F) = 1/6
P(2|F) = 1/6
P(3|F) = 1/6
P(4|F) = 1/6
P(5|F) = 1/6
P(6|F) = 1/6
P(1|L) = 1/10
P(2|L) = 1/10
P(3|L) = 1/10
P(4|L) = 1/10
P(5|L) = 1/10
P(6|L) = 1/2
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
$$\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
Eric Xing © Eric Xing @ CMU, 2006-2010 46
Alpha (actual)
0.0833 0.0500
0.0136 0.0052
0.0022 0.0006
0.0004 0.0001
0.0001 0.0000
0.0000 0.0000
0.0000 0.0000
0.0000 0.0000
0.0000 0.0000
0.0000 0.0000
Beta (actual)
0.0000 0.0000
0.0000 0.0000
0.0000 0.0000
0.0000 0.0000
0.0001 0.0001
0.0007 0.0006
0.0045 0.0055
0.0264 0.0112
0.1633 0.1033
1.0000 1.0000
(State diagram: FAIR ⇄ LOADED, with switch probability 0.05 and stay probability 0.95.)
P(1|F) = 1/6
P(2|F) = 1/6
P(3|F) = 1/6
P(4|F) = 1/6
P(5|F) = 1/6
P(6|F) = 1/6
P(1|L) = 1/10
P(2|L) = 1/10
P(3|L) = 1/10
P(4|L) = 1/10
P(5|L) = 1/10
P(6|L) = 1/2
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
$$\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
Eric Xing © Eric Xing @ CMU, 2006-2010 47
Alpha (logs)
-2.4849 -2.9957
-4.2969 -5.2655
-6.1201 -7.4896
-7.9499 -9.6553
-9.7834 -10.1454
-11.5905 -12.4264
-13.4110 -14.6657
-15.2391 -15.2407
-17.0310 -17.5432
-18.8430 -19.8129
Beta (logs)
-16.2439 -17.2014
-14.4185 -14.9922
-12.6028 -12.7337
-10.8042 -10.4389
-9.0373 -9.7289
-7.2181 -7.4833
-5.4135 -5.1977
-3.6352 -4.4938
-1.8120 -2.2698
0 0
(State diagram: FAIR ⇄ LOADED, with switch probability 0.05 and stay probability 0.95.)
P(1|F) = 1/6
P(2|F) = 1/6
P(3|F) = 1/6
P(4|F) = 1/6
P(5|F) = 1/6
P(6|F) = 1/6
P(1|L) = 1/10
P(2|L) = 1/10
P(3|L) = 1/10
P(4|L) = 1/10
P(5|L) = 1/10
P(6|L) = 1/2
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
$$\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, P(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
Eric Xing © Eric Xing @ CMU, 2006-2010 48
What is the probability of a
hidden state prediction?
Eric Xing © Eric Xing @ CMU, 2006-2010 49
Posterior decoding
 We can now calculate
 Then, we can ask
 What is the most likely state at position t of sequence x:
 Note that this is an MPA of a single hidden state;
what if we want an MPA of a whole hidden state sequence?
 Posterior Decoding:
 This is different from MPA of a whole sequence of hidden
states
 This can be understood as bit error rate
vs. word error rate
$$P(y_t^k = 1 \mid \mathbf{x}) = \frac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \frac{\alpha_t^k\, \beta_t^k}{P(\mathbf{x})}$$
$$k_t^* = \arg\max_k P(y_t^k = 1 \mid \mathbf{x}), \qquad \hat{\mathbf{y}} = \{\, y_t^{k_t^*} = 1 : t = 1, \ldots, T \,\}$$
Example:
MPA of X ?
MPA of (X, Y) ?
x y P(x,y)
0 0 0.35
0 1 0.05
1 0 0.3
1 1 0.3
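A minimal posterior-decoding sketch, assuming the forward() and backward() sketches defined earlier in these notes:

```python
import numpy as np

def posterior_decode(x, pi, A, B):
    alpha, px = forward(x, pi, A, B)
    beta = backward(x, A, B)
    gamma = alpha * beta / px            # gamma[t, k] = P(y_t = k | x)
    return gamma.argmax(axis=1), gamma   # most likely single state at each position t
```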
Eric Xing © Eric Xing @ CMU, 2006-2010 50
Viterbi decoding
 GIVEN x = x1, …, xT, we want to find y = y1, …, yT, such that
P(y|x) is maximized:
y* = argmax_y P(y | x) = argmax_y P(y, x)
 Let
$$V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$$
= Probability of the most likely sequence of states ending at state y_t = k
 The recursion:
$$V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k} V_{t-1}^i$$
using the factorization $p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1}\, a_{y_1, y_2} \cdots a_{y_{t-1}, y_t}\, b_{y_1, x_1} \cdots b_{y_t, x_t}$.
(The computation fills a K × T trellis of values $V_t^k$: one column per observation x_1, x_2, …, x_T and one row per state 1, …, K.)
 Underflows are a significant problem
 These numbers become extremely small – underflow
 Solution: take the logs of all values:
$$V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + V_{t-1}^i \big)$$
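A minimal log-space Viterbi sketch, under the same array conventions as the forward/backward sketches; names are illustrative:

```python
import numpy as np

def viterbi(x, pi, A, B):
    T, M = len(x), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    V = np.zeros((T, M))                  # V[t, k] = best log-prob of a state path ending in k at time t
    ptr = np.zeros((T, M), dtype=int)     # back-pointers for traceback
    V[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA           # scores[i, k] = V[t-1, i] + log a_{i,k}
        ptr[t] = scores.argmax(axis=0)
        V[t] = logB[:, x[t]] + scores.max(axis=0)
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # trace the best path backwards
        path.append(int(ptr[t][path[-1]]))
    return path[::-1]
```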
Eric Xing © Eric Xing @ CMU, 2006-2010 51
Computational Complexity and
implementation details
 What is the running time, and space required, for Forward,
and Backward?
Time: O(K²N); Space: O(KN).
 Useful implementation technique to avoid underflows
 Viterbi: sum of logs
 Forward/Backward: rescaling at each position by multiplying by a constant
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i, \qquad V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$$
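One common way to implement the rescaling trick for the forward pass, sketched under the same conventions (illustrative, not the lecture's code):

```python
import numpy as np

def forward_scaled(x, pi, A, B):
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    log_px = 0.0
    for t in range(T):
        a = B[:, x[t]] * (pi if t == 0 else alpha[t - 1] @ A)
        c = a.sum()              # scaling constant at position t
        alpha[t] = a / c         # rescaled forward variable (sums to 1 over states)
        log_px += np.log(c)      # log P(x) = sum_t log c_t
    return alpha, log_px
```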
Eric Xing © Eric Xing @ CMU, 2006-2010 52
(Homework!)
Learning HMM
 Given x = x1…xN for which the true state path y = y1…yN is
known,
 Define:
Aij = # times state transition i→j occurs in y
Bik = # times state i in y emits k in x
 We can show that the maximum likelihood parameters θ are:
$$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}, \qquad b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$
where the complete log likelihood is
$$\ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big) \qquad \text{(Homework!)}$$
 What if y is continuous? We can treat $\{ (x_{n,t}, y_{n,t}) : t = 1 \ldots T,\; n = 1 \ldots N \}$ as N×T
observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian … (Homework!)
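A minimal counting sketch of this supervised estimator (illustrative names; the optional pseudocounts are my addition to avoid zero rows):

```python
import numpy as np

def supervised_mle(xs, ys, M, K, pseudocount=0.0):
    """xs, ys: lists of equal-length integer sequences (observations and their true state paths)."""
    A = np.full((M, M), pseudocount)   # A[i, j] = # times state transition i -> j occurs in y
    B = np.full((M, K), pseudocount)   # B[i, k] = # times state i emits symbol k in x
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```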
Eric Xing © Eric Xing @ CMU, 2006-2010 53
Unsupervised ML estimation
 Given x = x1…xN for which the true state path y = y1…yN is
unknown,
 EXPECTATION MAXIMIZATION
0. Starting with our best guess of a model M, parameters θ:
1. Estimate Aij , Bik in the training data
 How? $A_{ij} = \sum_{n,t} \big\langle y_{n,t-1}^i\, y_{n,t}^j \big\rangle$ and $B_{ik} = \sum_{n,t} \big\langle y_{n,t}^i \big\rangle\, x_{n,t}^k$ — but how do we compute these expectations? (homework)
2. Update θ according to Aij , Bik
 Now a "supervised learning" problem
3. Repeat 1 & 2, until convergence
This is called the Baum-Welch Algorithm
We can get to a provably more (or equally) likely parameter set θ each iteration
Eric Xing © Eric Xing @ CMU, 2006-2010 54
The Baum Welch algorithm
 The complete log likelihood
 The expected complete log likelihood
 EM
 The E step
 The M step ("symbolically" identical to MLE)
The complete log likelihood:
$$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$$
The expected complete log likelihood:
$$\big\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \big\rangle = \sum_n \Big( \big\langle y_{n,1}^i \big\rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \Big) + \sum_n \sum_{t=2}^{T} \Big( \big\langle y_{n,t-1}^i\, y_{n,t}^j \big\rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \Big) + \sum_n \sum_{t=1}^{T} \Big( x_{n,t}^k \big\langle y_{n,t}^i \big\rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \Big)$$
The E step:
$$\gamma_{n,t}^i = \big\langle y_{n,t}^i \big\rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n), \qquad \xi_{n,t}^{i,j} = \big\langle y_{n,t-1}^i\, y_{n,t}^j \big\rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$$
The M step:
$$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$$
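A minimal single-sequence Baum-Welch iteration sketch built on the forward()/backward() sketches above; all names are assumptions, not the lecture's code:

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    T, M = len(x), len(pi)
    alpha, px = forward(x, pi, A, B)
    beta = backward(x, A, B)
    gamma = alpha * beta / px                      # gamma[t, i] = p(y_t = i | x)
    xi = np.zeros((T - 1, M, M))                   # xi[t, i, j] = p(y_t = i, y_{t+1} = j | x)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / px
    # M step: "symbolically" identical to supervised MLE, with counts replaced by expectations
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.asarray(x) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new   # iterate to convergence: pi, A, B = baum_welch_step(x, pi, A, B)
```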
Eric Xing © Eric Xing @ CMU, 2006-2010 55
Summary
 Modeling hidden transitional trajectories (in discrete state
space, such as cluster label, DNA copy number, dice-choice,
etc.) underlying observed sequence data (discrete, such as
dice outcomes; or continuous, such as CGH signals)
 Useful for parsing, segmenting sequential data
 Important HMM computations:
 The joint likelihood of a parse and data can be written as a product of local terms
(i.e., initial prob., transition prob., emission prob.)
 Computing the marginal likelihood of the observed sequence: forward algorithm
 Predicting a single hidden state: forward-backward
 Predicting an entire sequence of hidden states: Viterbi
 Learning HMM parameters: an EM algorithm known as Baum-Welch