Markov Chains
© Ydo Wexler & Dan Geiger
Modified by Longin Jan Latecki
Temple University, Philadelphia
latecki@temple.edu
Statistical Parameter Estimation
Reminder
• The basic paradigm: MLE / Bayesian approach
• Input data: a series of observations X1, X2, …, Xt
• We assumed the observations were i.i.d. (independent and identically distributed)
[Figure: data set → model with parameters Θ; coin-toss example: Heads with probability P(H), Tails with probability 1 - P(H)]
Markov Process
• Markov Property: The state of the system at time t+1 depends only
on the state of the system at time t
[Figure: chain X1 → X2 → X3 → X4 → X5]

$$\Pr(X_{t+1} = x_{t+1} \mid X_t = x_t, \dots, X_1 = x_1) = \Pr(X_{t+1} = x_{t+1} \mid X_t = x_t)$$
• Stationary Assumption: Transition probabilities are independent of
time (t)
$$\Pr(X_{t+1} = b \mid X_t = a) = p_{ab}$$
Bounded memory transition model
Markov Process
Simple Example

Weather:
• raining today → 40% rain tomorrow, 60% no rain tomorrow
• not raining today → 20% rain tomorrow, 80% no rain tomorrow

Stochastic FSM:
[Figure: two-state FSM with states rain and no rain; rain→rain 0.4, rain→no rain 0.6, no rain→rain 0.2, no rain→no rain 0.8]
The transition matrix:

$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$

(rows and columns ordered rain, no rain)

• Stochastic matrix: rows sum up to 1
• Doubly stochastic matrix: rows and columns sum up to 1
Markov Process
Gambler’s Example

– Gambler starts with $10
– At each play we have one of the following:
  • Gambler wins $1 with probability p
  • Gambler loses $1 with probability 1-p
– Game ends when the gambler goes broke, or reaches a fortune of $100
  (both 0 and 100 are absorbing states)

[Figure: chain 0 ↔ 1 ↔ 2 ↔ … ↔ 99 ↔ 100, with win probability p and loss probability 1-p; start at state 10 ($10)]
Markov Process

• Markov process - described by a stochastic FSM
• Markov chain - a random walk on this graph (distribution over paths)
• Edge weights give us $\Pr(X_{t+1} = b \mid X_t = a) = p_{ab}$
• We can ask more complex questions, like $\Pr(X_{t+2} = b \mid X_t = a) = \,?$

[Figure: the gambler's chain 0 ↔ 1 ↔ 2 ↔ … ↔ 99 ↔ 100 with win probability p and loss probability 1-p, starting at $10]
Markov Process
Coke vs. Pepsi Example

• Given that a person’s last cola purchase was Coke, there is a 90% chance that his next cola purchase will also be Coke.
• If a person’s last cola purchase was Pepsi, there is an 80% chance that his next cola purchase will also be Pepsi.

[Figure: two-state FSM with states coke and pepsi; coke→coke 0.9, coke→pepsi 0.1, pepsi→pepsi 0.8, pepsi→coke 0.2]
The transition matrix:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

(rows and columns ordered coke, pepsi)
Markov Process
Coke vs. Pepsi Example (cont)

Given that a person is currently a Pepsi purchaser, what is the probability that he will purchase Coke two purchases from now?

Pepsi → ? → Coke

Pr[Pepsi → ? → Coke] =
Pr[Pepsi → Coke → Coke] + Pr[Pepsi → Pepsi → Coke] =
0.2 · 0.9 + 0.8 · 0.2 = 0.34

Equivalently, using the transition matrix:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}\begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} = \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}$$

The (Pepsi, Coke) entry of $P^2$ is 0.34, matching the direct calculation.
Markov Process
Coke vs. Pepsi Example (cont)

Given that a person is currently a Coke purchaser, what is the probability that he will purchase Pepsi three purchases from now?

$$P^3 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}\begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix} = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$

The answer is the (Coke, Pepsi) entry of $P^3$: 0.219.
Markov Process
Coke vs. Pepsi Example (cont)

• Assume each person makes one cola purchase per week
• Suppose 60% of all people now drink Coke, and 40% drink Pepsi
• What fraction of people will be drinking Coke three weeks from now?
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
P^3 = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$
Pr[X3 = Coke] = 0.6 · 0.781 + 0.4 · 0.438 = 0.6438

Qi - the distribution in week i
Q0 = (0.6, 0.4) - the initial distribution
Q3 = Q0 · P³ = (0.6438, 0.3562)
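These matrix powers and the week-3 distribution are easy to check numerically; below is a small Python/NumPy sketch (variable names are just illustrative):

    import numpy as np

    # Coke/Pepsi transition matrix, rows and columns ordered (Coke, Pepsi)
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    print(np.linalg.matrix_power(P, 2))  # [[0.83 0.17] [0.34 0.66]]
    print(np.linalg.matrix_power(P, 3))  # [[0.781 0.219] [0.438 0.562]]

    Q0 = np.array([0.6, 0.4])                 # initial distribution (Coke, Pepsi)
    Q3 = Q0 @ np.linalg.matrix_power(P, 3)    # distribution three weeks from now
    print(Q3)                                 # [0.6438 0.3562]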
Markov Process
Coke vs. Pepsi Example (cont)

Simulation:

[Figure: plot of Pr[Xi = Coke] versus week i; the curve levels off at 2/3]

$$\begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}
= \begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \end{pmatrix}$$

stationary distribution

[Figure: two-state FSM with states coke and pepsi; coke→coke 0.9, coke→pepsi 0.1, pepsi→pepsi 0.8, pepsi→coke 0.2]
An Introduction to
Markov Chain Monte Carlo
Teg Grenager
July 1, 2004
Modified by Longin Jan Latecki
Temple University, Philadelphia
latecki@temple.edu
Agenda
 Motivation
 The Monte Carlo Principle
 Markov Chain Monte Carlo
 Metropolis-Hastings
 Gibbs Sampling
 Advanced Topics
Monte Carlo principle
 Consider the game of solitaire:
what’s the chance of winning with a
properly shuffled deck?
 Hard to compute analytically
because winning or losing depends
on a complex procedure of
reorganizing cards
 Insight: why not just play a few
hands, and see empirically how
many do in fact win?
 More generally, can approximate a
probability density function using
only samples from that density
[Figure: four simulated hands (Lose, Lose, Win, Lose), so the estimated chance of winning is 1 in 4!]
Monte Carlo principle
 Given a very large set X and a distribution p(x) over it
 We draw an i.i.d. set of N samples $x^{(1)}, \dots, x^{(N)}$
 We can then approximate the distribution using these samples:

$$p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{x^{(i)}}(x) \;\to\; p(x) \quad \text{as } N \to \infty$$

[Figure: the set X with density p(x)]
Monte Carlo principle
 We can also use these samples to compute expectations:

$$E_N(f) = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) \;\to\; E(f) = \sum_{x} f(x)\,p(x)$$

 And even use them to find a maximum:

$$\hat{x} = \arg\max_{x^{(i)}} \; p(x^{(i)})$$
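As a concrete illustration of the Monte Carlo principle, the sketch below draws N i.i.d. samples from a small, made-up discrete distribution and uses them to approximate both p(x) and an expectation E[f] (the distribution and f are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # An illustrative discrete distribution p(x) over X = {0, 1, 2}
    X = np.array([0, 1, 2])
    p = np.array([0.2, 0.5, 0.3])

    N = 100_000
    samples = rng.choice(X, size=N, p=p)               # i.i.d. samples x^(1), ..., x^(N)

    p_N = np.bincount(samples, minlength=len(X)) / N   # empirical approximation of p(x)
    E_N = np.mean(samples.astype(float) ** 2)          # Monte Carlo estimate of E[f], f(x) = x^2

    print(p_N)   # close to [0.2, 0.5, 0.3]
    print(E_N)   # close to 1*0.5 + 4*0.3 = 1.7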
Example: Bayes net inference
 Suppose we have a Bayesian
network with variables X
 Our state space is the set of all
possible assignments of values to
variables
 Computing the joint distribution is
in the worst case NP-hard
 However, note that you can draw a
sample in time that is linear in the
size of the network
 Draw N samples, use them to
approximate the joint
Sample 1: FTFTTTFFT
Sample 2: FTFFTTTFF
etc.
[Figure: the Bayesian network drawn twice, with the sampled truth values (T/F) filled in at each node for Sample 1 and Sample 2]
Rejection sampling
 Suppose we have a Bayesian
network with variables X
 We wish to condition on some evidence Z ⊆ X and compute the posterior over Y = X − Z
 Draw samples, rejecting them
when they contradict the evidence
in Z
 Very inefficient if the evidence is
itself improbable, because we
must reject a large number of
samples
Sample 1: FTFTTTFFT → reject
Sample 2: FTFFTTTFF → accept
etc.
[Figure: the same Bayesian network with two evidence nodes clamped (one to F, one to T); samples that disagree with the evidence are rejected]
Rejection sampling
 More generally, we would like to sample from p(x), but it’s easier to sample from a proposal distribution q(x)
 q(x) satisfies p(x) ≤ M q(x) for some M < ∞
 Procedure:
 Sample x^{(i)} from q(x)
 Accept with probability p(x^{(i)}) / (M q(x^{(i)}))
 Reject otherwise
 The accepted x^{(i)} are sampled from p(x)!
 Problem: if M is too large, we will rarely accept samples
 In the Bayes network, if the evidence Z is very unlikely then we will reject almost all samples
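A minimal rejection-sampling sketch in Python; the target density, the uniform proposal, and the bound M are illustrative choices, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p(x):
        # Target density on [0, 1]: p(x) = 6 x (1 - x), i.e. a Beta(2, 2) density
        return 6.0 * x * (1.0 - x)

    M = 1.5  # p(x) <= M * q(x) with q = Uniform(0, 1), since max p(x) = 1.5 at x = 0.5

    def rejection_sample(n):
        samples = []
        while len(samples) < n:
            x = rng.uniform(0.0, 1.0)             # sample x from the proposal q
            if rng.uniform(0.0, 1.0) < p(x) / M:  # accept with probability p(x) / (M q(x))
                samples.append(x)                 # reject otherwise
        return np.array(samples)

    xs = rejection_sample(50_000)
    print(xs.mean())  # close to 0.5, the mean of Beta(2, 2)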
Markov chain Monte Carlo
 Recall again the set X and the distribution p(x) we
wish to sample from
 Suppose that it is hard to sample p(x) but that it is
possible to “walk around” in X using only local state
transitions
 Insight: we can use a “random walk” to help us
draw random samples from p(x)
[Figure: the set X with density p(x), explored by local moves]
Markov chains
 A Markov chain on a space X with transitions T is a random process (an infinite sequence of random variables) $(x^{(0)}, x^{(1)}, \dots, x^{(t)}, \dots)$ in $X^{\infty}$ that satisfies

$$p(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}) = T(x^{(t-1)}, x^{(t)})$$

 That is, the probability of being in a particular state at time t given the state history depends only on the state at time t-1
 If the transition probabilities are fixed for all t, the chain is considered homogeneous

$$T = \begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$

[Figure: three-state FSM x1, x2, x3 with self-loops 0.7, 0.4, 0.7 and transitions x1↔x2 and x2↔x3 each with probability 0.3]
Markov Chains for sampling
 In order for a Markov chain to be useful for sampling p(x), we require that for any starting state x^{(1)}

$$p^{(t)}(x) \;\to\; p(x) \quad \text{as } t \to \infty$$

 Equivalently, the stationary distribution of the Markov chain must be p(x):

$$[p\,T](x) = p(x)$$

 If this is the case, we can start in an arbitrary state, use the Markov chain to do a random walk for a while, and stop and output the current state x^{(t)}
 The resulting state will be sampled from p(x)!
Stationary distribution
 Consider the Markov chain given above:

$$T = \begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$

 The stationary distribution is

$$\begin{pmatrix} 0.33 & 0.33 & 0.33 \end{pmatrix}
\begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}
= \begin{pmatrix} 0.33 & 0.33 & 0.33 \end{pmatrix}$$

 Some samples:
1,1,2,3,2,1,2,3,3,2
1,2,2,1,1,2,3,3,3,3
1,1,1,2,3,2,2,1,1,1
1,2,3,3,3,2,1,2,2,3
1,1,2,2,2,3,3,2,1,1
1,2,2,2,3,3,3,2,2,2

Empirical distribution: (0.33, 0.33, 0.33)
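The samples above can be reproduced in code: the sketch below simulates a long random walk on this chain and compares the empirical state frequencies with the uniform stationary distribution (a sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(0)

    T = np.array([[0.7, 0.3, 0.0],
                  [0.3, 0.4, 0.3],
                  [0.0, 0.3, 0.7]])

    steps = 200_000
    state = 0                      # start in state x1
    counts = np.zeros(3)
    for _ in range(steps):
        state = rng.choice(3, p=T[state])   # one random-walk step
        counts[state] += 1

    print(counts / steps)          # empirical distribution, close to [1/3, 1/3, 1/3]
    print((np.ones(3) / 3) @ T)    # exactly [1/3, 1/3, 1/3]: the uniform distribution is stationary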
Ergodicity
 Claim: To ensure that the chain converges to a unique stationary distribution the following conditions are sufficient:
 Irreducibility: every state is eventually reachable from any start state; for all x, y in X there exists a t such that $p^{(t)}(y \mid x) > 0$
 Aperiodicity: the chain doesn’t get caught in cycles; for all x, y in X it is the case that $\gcd\{\, t : p^{(t)}(y \mid x) > 0 \,\} = 1$
 The process is ergodic if it is both irreducible and aperiodic
 This claim is easy to prove, but involves eigenstuff!
Markov Chains for sampling
 Claim: To ensure that the stationary distribution of the Markov chain is p(x) it is sufficient for p and T to satisfy the detailed balance (reversibility) condition:

$$p(x)\,T(x, y) = p(y)\,T(y, x)$$

 Proof: for all y we have

$$[p\,T](y) = \sum_x p(x)\,T(x, y) = \sum_x p(y)\,T(y, x) = p(y)\sum_x T(y, x) = p(y)$$

 And thus p must be a stationary distribution of T
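For the three-state chain used earlier, T is symmetric, so the uniform distribution satisfies detailed balance; a quick numerical check (only a sketch for that particular example):

    import numpy as np

    T = np.array([[0.7, 0.3, 0.0],
                  [0.3, 0.4, 0.3],
                  [0.0, 0.3, 0.7]])
    p = np.ones(3) / 3                    # candidate stationary distribution

    lhs = p[:, None] * T                  # lhs[x, y] = p(x) T(x, y)
    rhs = lhs.T                           # rhs[x, y] = p(y) T(y, x)
    print(np.allclose(lhs, rhs))          # True: detailed balance holds
    print(np.allclose(p @ T, p))          # True: p is a stationary distribution of T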
Metropolis algorithm
 How to pick a suitable Markov chain for our distribution?
 Suppose our distribution p(x) is easy to sample, and easy to compute up to a normalization constant, but hard to compute exactly
 e.g. a Bayesian posterior P(M|D) ∝ P(D|M)P(M)
 We define a Markov chain with the following process:
 Sample a candidate point x* from a proposal distribution q(x* | x^{(t)}) which is symmetric: q(x|y) = q(y|x)
 Compute the importance ratio (this is easy since the normalization constants cancel):

$$r = \frac{p(x^*)}{p(x^{(t)})}$$

 With probability min(r, 1) transition to x*, otherwise stay in the same state
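A minimal random-walk Metropolis sketch in Python; the unnormalized target (a two-component Gaussian mixture) and the Gaussian step size are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(x):
        # Unnormalized target: we never need its normalization constant
        return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

    def metropolis(n_steps, step=1.0, x0=0.0):
        x = x0
        samples = []
        for _ in range(n_steps):
            x_star = x + rng.normal(0.0, step)   # symmetric proposal q(x* | x)
            r = p_tilde(x_star) / p_tilde(x)     # importance ratio
            if rng.uniform() < min(r, 1.0):      # accept with probability min(r, 1)
                x = x_star
            samples.append(x)                    # if rejected, we stay in the same state
        return np.array(samples)

    xs = metropolis(100_000)
    print(xs.mean())  # roughly 2/3 for this mixture (weights 2:1 on means +2 and -2)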
Metropolis intuition
 Why does the Metropolis algorithm work?
 Proposal distribution can propose anything it
likes (as long as it can jump back with the
same probability)
 Proposal is always accepted if it’s jumping
to a more likely state
 Proposal accepted with the importance ratio
if it’s jumping to a less likely state
 The acceptance policy, combined with the
reversibility of the proposal distribution,
makes sure that the algorithm explores
states in proportion to p(x)!
[Figure: from the current state x_t, a proposal x* into a higher-density region has r = 1.0 (always accepted); a proposal x* into a lower-density region has r = p(x*)/p(x_t)]
Metropolis convergence
 Claim: The Metropolis algorithm converges to the target distribution p(x).
 Proof: It satisfies detailed balance.
For all x, y in X, w.l.o.g. assume p(x) < p(y). Then the transition probabilities are

$$T(x, y) = q(y \mid x), \qquad T(y, x) = q(x \mid y)\,\frac{p(x)}{p(y)}$$

(the candidate y is always accepted from x, since r ≥ 1; from y we generate x with probability q(x|y) and accept it with probability r = p(x)/p(y) < 1). Hence:

$$p(x)\,T(x, y) = p(x)\,q(y \mid x) = p(x)\,q(x \mid y) = p(y)\,\frac{p(x)}{p(y)}\,q(x \mid y) = p(y)\,T(y, x)$$

where the second equality uses that q is symmetric.
Metropolis-Hastings
 The symmetry requirement of the Metropolis proposal distribution can be hard to satisfy
 Metropolis-Hastings is the natural generalization of the Metropolis algorithm, and the most popular MCMC algorithm
 We define a Markov chain with the following process:
 Sample a candidate point x* from a proposal distribution q(x* | x^{(t)}) which is not necessarily symmetric
 Compute the importance ratio:

$$r = \frac{p(x^*)\,q(x^{(t)} \mid x^*)}{p(x^{(t)})\,q(x^* \mid x^{(t)})}$$

 With probability min(r, 1) transition to x*, otherwise stay in the same state x^{(t)}
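A Metropolis-Hastings sketch with a genuinely asymmetric proposal (a multiplicative log-normal random walk on x > 0); the target and proposal below are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(x):
        # Unnormalized target on x > 0: x^2 exp(-x), the shape of a Gamma(3, 1) density
        return x ** 2 * np.exp(-x)

    def metropolis_hastings(n_steps, sigma=0.5, x0=1.0):
        x = x0
        samples = []
        for _ in range(n_steps):
            x_star = x * np.exp(sigma * rng.normal())   # asymmetric proposal q(x* | x)
            # For this log-normal random walk, q(x | x*) / q(x* | x) = x_star / x
            r = (p_tilde(x_star) / p_tilde(x)) * (x_star / x)
            if rng.uniform() < min(r, 1.0):
                x = x_star
            samples.append(x)
        return np.array(samples)

    xs = metropolis_hastings(200_000)
    print(xs.mean())  # close to 3, the mean of Gamma(3, 1)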
MH convergence
 Claim: The Metropolis-Hastings algorithm converges to the target distribution p(x).
 Proof: It satisfies detailed balance.
For all x, y in X, w.l.o.g. assume p(x) q(y|x) < p(y) q(x|y). Then the transition probabilities are

$$T(x, y) = q(y \mid x), \qquad T(y, x) = q(x \mid y)\,\frac{p(x)\,q(y \mid x)}{p(y)\,q(x \mid y)}$$

(the candidate y is always accepted from x, since r ≥ 1; from y we generate x with probability q(x|y) and accept it with probability r = the ratio < 1). Hence:

$$p(x)\,T(x, y) = p(x)\,q(y \mid x) = p(y)\,q(x \mid y)\,\frac{p(x)\,q(y \mid x)}{p(y)\,q(x \mid y)} = p(y)\,T(y, x)$$
Gibbs sampling
 A special case of Metropolis-Hastings, applicable when we have a factored state space and access to the full conditionals:

$$p(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n)$$

 Perfect for Bayesian networks!
 Idea: To transition from one state (variable assignment) to another,
 Pick a variable,
 Sample its value from the conditional distribution
 That’s it!
 We’ll show in a minute why this is an instance of MH and thus must be sampling from the full joint
Markov blanket
 Recall that Bayesian networks encode a factored representation of the joint distribution
 Variables are independent of their non-descendants given their parents
 Variables are independent of everything else in the network given their Markov blanket!
 So, to sample each node, we only need to condition on its Markov blanket:

$$p(x_j \mid \mathrm{MB}(x_j))$$
Gibbs sampling
 More formally, the proposal distribution is

$$q(x^* \mid x^{(t)}) = \begin{cases} p(x^*_j \mid x^{(t)}_{-j}) & \text{if } x^*_{-j} = x^{(t)}_{-j} \\ 0 & \text{otherwise} \end{cases}$$

 The importance ratio is

$$r = \frac{p(x^*)\,q(x^{(t)} \mid x^*)}{p(x^{(t)})\,q(x^* \mid x^{(t)})}
    = \frac{p(x^*)\,p(x^{(t)}_j \mid x^*_{-j})}{p(x^{(t)})\,p(x^*_j \mid x^{(t)}_{-j})}
    = \frac{p(x^*_j \mid x^*_{-j})\,p(x^*_{-j})\,p(x^{(t)}_j \mid x^*_{-j})}{p(x^{(t)}_j \mid x^{(t)}_{-j})\,p(x^{(t)}_{-j})\,p(x^*_j \mid x^{(t)}_{-j})} = 1$$

(the second equality is the definition of the proposal distribution, the third is the definition of conditional probability, and the final step follows because we didn’t change the other variables: $x^*_{-j} = x^{(t)}_{-j}$)

 So we always accept!
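A minimal Gibbs-sampling sketch in Python for a bivariate standard Gaussian with correlation rho, where each full conditional is a one-dimensional Gaussian; the target is an illustrative assumption, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    rho = 0.8   # correlation of the target bivariate standard Gaussian

    def gibbs(n_steps, x0=0.0, y0=0.0):
        x, y = x0, y0
        samples = []
        for _ in range(n_steps):
            # Sample each variable from its full conditional, holding the other fixed
            x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))   # p(x | y)
            y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))   # p(y | x)
            samples.append((x, y))
        return np.array(samples)

    s = gibbs(100_000)
    print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to 0.8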
Practical issues
 How many iterations?
 How to know when to stop?
 What’s a good proposal function?
Advanced Topics
 Simulated annealing, for global optimization, is a
form of MCMC
 Mixtures of MCMC transition functions
 Monte Carlo EM (stochastic E-step)
 Reversible jump MCMC for model selection
 Adaptive proposal distributions