Markov Chains
© Ydo Wexler & Dan Geiger
Modified by Longin Jan Latecki
Temple University, Philadelphia
latecki@temple.edu
Statistical Parameter Estimation
Reminder
• The basic paradigm: MLE / Bayesian approach
• Input data: a series of observations X1, X2, …, Xt
• We assumed the observations were i.i.d. (independent and identically distributed)
[Figure: data set → model with parameters Θ; coin-toss example: Heads with probability P(H), Tails with probability 1 - P(H)]
Markov Process
• Markov Property: The state of the system at time t+1 depends only
on the state of the system at time t
[Figure: chain X1 → X2 → X3 → X4 → X5]

$$\Pr(X_{t+1} = x_{t+1} \mid X_t = x_t, \dots, X_1 = x_1) = \Pr(X_{t+1} = x_{t+1} \mid X_t = x_t)$$
• Stationary Assumption: Transition probabilities are independent of
time (t)
$$\Pr(X_{t+1} = b \mid X_t = a) = p_{ab}$$
Bounded memory transition model
Markov Process
Simple Example

Weather:
• raining today → 40% rain tomorrow, 60% no rain tomorrow
• not raining today → 20% rain tomorrow, 80% no rain tomorrow

Stochastic FSM:
[Figure: two-state FSM with states rain and no rain; rain→rain 0.4, rain→no rain 0.6, no rain→rain 0.2, no rain→no rain 0.8]
The transition matrix:

$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$

(rows and columns ordered rain, no rain)

• Stochastic matrix: rows sum up to 1
• Doubly stochastic matrix: rows and columns sum up to 1
Markov Process
Gambler’s Example

– Gambler starts with $10
– At each play we have one of the following:
  • Gambler wins $1 with probability p
  • Gambler loses $1 with probability 1-p
– Game ends when the gambler goes broke, or reaches a fortune of $100
  (both 0 and 100 are absorbing states)

[Figure: chain 0 ↔ 1 ↔ 2 ↔ … ↔ 99 ↔ 100, with win probability p and loss probability 1-p; start at state 10 ($10)]
Markov Process

• Markov process - described by a stochastic FSM
• Markov chain - a random walk on this graph (distribution over paths)
• Edge weights give us $\Pr(X_{t+1} = b \mid X_t = a) = p_{ab}$
• We can ask more complex questions, like $\Pr(X_{t+2} = b \mid X_t = a) = \,?$

[Figure: the gambler's chain 0 ↔ 1 ↔ 2 ↔ … ↔ 99 ↔ 100 with win probability p and loss probability 1-p, starting at $10]
Markov Process
Coke vs. Pepsi Example

• Given that a person’s last cola purchase was Coke, there is a 90% chance that his next cola purchase will also be Coke.
• If a person’s last cola purchase was Pepsi, there is an 80% chance that his next cola purchase will also be Pepsi.

[Figure: two-state FSM with states coke and pepsi; coke→coke 0.9, coke→pepsi 0.1, pepsi→pepsi 0.8, pepsi→coke 0.2]
The transition matrix:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$

(rows and columns ordered coke, pepsi)
Markov Process
Coke vs. Pepsi Example (cont)

Given that a person is currently a Pepsi purchaser, what is the probability that he will purchase Coke two purchases from now?

Pepsi → ? → Coke

Pr[Pepsi → ? → Coke] =
Pr[Pepsi → Coke → Coke] + Pr[Pepsi → Pepsi → Coke] =
0.2 · 0.9 + 0.8 · 0.2 = 0.34

Equivalently, using the transition matrix:

$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
P^2 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}\begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix} = \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}$$

The (Pepsi, Coke) entry of $P^2$ is 0.34, matching the direct calculation.
Markov Process
Coke vs. Pepsi Example (cont)

Given that a person is currently a Coke purchaser, what is the probability that he will purchase Pepsi three purchases from now?

$$P^3 = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}\begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix} = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$

The answer is the (Coke, Pepsi) entry of $P^3$: 0.219.
Markov Process
Coke vs. Pepsi Example (cont)

• Assume each person makes one cola purchase per week
• Suppose 60% of all people now drink Coke, and 40% drink Pepsi
• What fraction of people will be drinking Coke three weeks from now?
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
P^3 = \begin{pmatrix} 0.781 & 0.219 \\ 0.438 & 0.562 \end{pmatrix}$$
Pr[X3 = Coke] = 0.6 · 0.781 + 0.4 · 0.438 = 0.6438

Qi - the distribution in week i
Q0 = (0.6, 0.4) - the initial distribution
Q3 = Q0 · P³ = (0.6438, 0.3562)
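These matrix powers and the week-3 distribution are easy to check numerically; below is a small Python/NumPy sketch (variable names are just illustrative):

    import numpy as np

    # Coke/Pepsi transition matrix, rows and columns ordered (Coke, Pepsi)
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    print(np.linalg.matrix_power(P, 2))  # [[0.83 0.17] [0.34 0.66]]
    print(np.linalg.matrix_power(P, 3))  # [[0.781 0.219] [0.438 0.562]]

    Q0 = np.array([0.6, 0.4])                 # initial distribution (Coke, Pepsi)
    Q3 = Q0 @ np.linalg.matrix_power(P, 3)    # distribution three weeks from now
    print(Q3)                                 # [0.6438 0.3562]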
Markov Process
Coke vs. Pepsi Example (cont)

Simulation:

[Figure: plot of Pr[Xi = Coke] versus week i; the curve levels off at 2/3]

$$\begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}
= \begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} \end{pmatrix}$$

stationary distribution

[Figure: two-state FSM with states coke and pepsi; coke→coke 0.9, coke→pepsi 0.1, pepsi→pepsi 0.8, pepsi→coke 0.2]
An Introduction to
Markov Chain Monte Carlo
Teg Grenager
July 1, 2004
Modified by Longin Jan Latecki
Temple University, Philadelphia
latecki@temple.edu
Agenda
 Motivation
 The Monte Carlo Principle
 Markov Chain Monte Carlo
 Metropolis-Hastings
 Gibbs Sampling
 Advanced Topics
Monte Carlo principle
 Consider the game of solitaire:
what’s the chance of winning with a
properly shuffled deck?
 Hard to compute analytically
because winning or losing depends
on a complex procedure of
reorganizing cards
 Insight: why not just play a few
hands, and see empirically how
many do in fact win?
 More generally, can approximate a
probability density function using
only samples from that density
[Figure: four simulated hands (Lose, Lose, Win, Lose), so the estimated chance of winning is 1 in 4!]
Monte Carlo principle
 Given a very large set X and a distribution p(x) over it
 We draw an i.i.d. set of N samples $x^{(1)}, \dots, x^{(N)}$
 We can then approximate the distribution using these samples:

$$p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{x^{(i)}}(x) \;\to\; p(x) \quad \text{as } N \to \infty$$

[Figure: the set X with density p(x)]
Monte Carlo principle
 We can also use these samples to compute expectations:

$$E_N(f) = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)}) \;\to\; E(f) = \sum_{x} f(x)\,p(x)$$

 And even use them to find a maximum:

$$\hat{x} = \arg\max_{x^{(i)}} \; p(x^{(i)})$$
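As a concrete illustration of the Monte Carlo principle, the sketch below draws N i.i.d. samples from a small, made-up discrete distribution and uses them to approximate both p(x) and an expectation E[f] (the distribution and f are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # An illustrative discrete distribution p(x) over X = {0, 1, 2}
    X = np.array([0, 1, 2])
    p = np.array([0.2, 0.5, 0.3])

    N = 100_000
    samples = rng.choice(X, size=N, p=p)               # i.i.d. samples x^(1), ..., x^(N)

    p_N = np.bincount(samples, minlength=len(X)) / N   # empirical approximation of p(x)
    E_N = np.mean(samples.astype(float) ** 2)          # Monte Carlo estimate of E[f], f(x) = x^2

    print(p_N)   # close to [0.2, 0.5, 0.3]
    print(E_N)   # close to 1*0.5 + 4*0.3 = 1.7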
Example: Bayes net inference
 Suppose we have a Bayesian
network with variables X
 Our state space is the set of all
possible assignments of values to
variables
 Computing the joint distribution is
in the worst case NP-hard
 However, note that you can draw a
sample in time that is linear in the
size of the network
 Draw N samples, use them to
approximate the joint
Sample 1: FTFTTTFFT
Sample 2: FTFFTTTFF
etc.
[Figure: the Bayesian network drawn twice, with the sampled truth values (T/F) filled in at each node for Sample 1 and Sample 2]
Rejection sampling
 Suppose we have a Bayesian
network with variables X
 We wish to condition on some evidence Z ⊆ X and compute the posterior over Y = X − Z
 Draw samples, rejecting them
when they contradict the evidence
in Z
 Very inefficient if the evidence is
itself improbable, because we
must reject a large number of
samples
Sample 1: FTFTTTFFT → reject
Sample 2: FTFFTTTFF → accept
etc.
[Figure: the same Bayesian network with two evidence nodes clamped (one to F, one to T); samples that disagree with the evidence are rejected]
Rejection sampling
 More generally, we would like to sample from p(x), but it’s easier to sample from a proposal distribution q(x)
 q(x) satisfies p(x) ≤ M q(x) for some M < ∞
 Procedure:
 Sample x^{(i)} from q(x)
 Accept with probability p(x^{(i)}) / (M q(x^{(i)}))
 Reject otherwise
 The accepted x^{(i)} are sampled from p(x)!
 Problem: if M is too large, we will rarely accept samples
 In the Bayes network, if the evidence Z is very unlikely then we will reject almost all samples
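A minimal rejection-sampling sketch in Python; the target density, the uniform proposal, and the bound M are illustrative choices, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p(x):
        # Target density on [0, 1]: p(x) = 6 x (1 - x), i.e. a Beta(2, 2) density
        return 6.0 * x * (1.0 - x)

    M = 1.5  # p(x) <= M * q(x) with q = Uniform(0, 1), since max p(x) = 1.5 at x = 0.5

    def rejection_sample(n):
        samples = []
        while len(samples) < n:
            x = rng.uniform(0.0, 1.0)             # sample x from the proposal q
            if rng.uniform(0.0, 1.0) < p(x) / M:  # accept with probability p(x) / (M q(x))
                samples.append(x)                 # reject otherwise
        return np.array(samples)

    xs = rejection_sample(50_000)
    print(xs.mean())  # close to 0.5, the mean of Beta(2, 2)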
Markov chain Monte Carlo
 Recall again the set X and the distribution p(x) we
wish to sample from
 Suppose that it is hard to sample p(x) but that it is
possible to “walk around” in X using only local state
transitions
 Insight: we can use a “random walk” to help us
draw random samples from p(x)
[Figure: the set X with density p(x), explored by local moves]
Markov chains
 A Markov chain on a space X with transitions T is a random process (an infinite sequence of random variables) $(x^{(0)}, x^{(1)}, \dots, x^{(t)}, \dots)$ in $X^{\infty}$ that satisfies

$$p(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}) = T(x^{(t-1)}, x^{(t)})$$

 That is, the probability of being in a particular state at time t given the state history depends only on the state at time t-1
 If the transition probabilities are fixed for all t, the chain is considered homogeneous

$$T = \begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$

[Figure: three-state FSM x1, x2, x3 with self-loops 0.7, 0.4, 0.7 and transitions x1↔x2 and x2↔x3 each with probability 0.3]
Markov Chains for sampling
 In order for a Markov chain to be useful for sampling p(x), we require that for any starting state x^{(1)}

$$p^{(t)}(x) \;\to\; p(x) \quad \text{as } t \to \infty$$

 Equivalently, the stationary distribution of the Markov chain must be p(x):

$$[p\,T](x) = p(x)$$

 If this is the case, we can start in an arbitrary state, use the Markov chain to do a random walk for a while, and stop and output the current state x^{(t)}
 The resulting state will be sampled from p(x)!
Stationary distribution
 Consider the Markov chain given above:

$$T = \begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}$$

 The stationary distribution is

$$\begin{pmatrix} 0.33 & 0.33 & 0.33 \end{pmatrix}
\begin{pmatrix} 0.7 & 0.3 & 0 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{pmatrix}
= \begin{pmatrix} 0.33 & 0.33 & 0.33 \end{pmatrix}$$

 Some samples:
1,1,2,3,2,1,2,3,3,2
1,2,2,1,1,2,3,3,3,3
1,1,1,2,3,2,2,1,1,1
1,2,3,3,3,2,1,2,2,3
1,1,2,2,2,3,3,2,1,1
1,2,2,2,3,3,3,2,2,2

Empirical distribution: (0.33, 0.33, 0.33)
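The samples above can be reproduced in code: the sketch below simulates a long random walk on this chain and compares the empirical state frequencies with the uniform stationary distribution (a sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(0)

    T = np.array([[0.7, 0.3, 0.0],
                  [0.3, 0.4, 0.3],
                  [0.0, 0.3, 0.7]])

    steps = 200_000
    state = 0                      # start in state x1
    counts = np.zeros(3)
    for _ in range(steps):
        state = rng.choice(3, p=T[state])   # one random-walk step
        counts[state] += 1

    print(counts / steps)          # empirical distribution, close to [1/3, 1/3, 1/3]
    print((np.ones(3) / 3) @ T)    # exactly [1/3, 1/3, 1/3]: the uniform distribution is stationary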
Ergodicity
 Claim: To ensure that the chain converges to a unique stationary distribution the following conditions are sufficient:
 Irreducibility: every state is eventually reachable from any start state; for all x, y in X there exists a t such that $p^{(t)}(y \mid x) > 0$
 Aperiodicity: the chain doesn’t get caught in cycles; for all x, y in X it is the case that $\gcd\{\, t : p^{(t)}(y \mid x) > 0 \,\} = 1$
 The process is ergodic if it is both irreducible and aperiodic
 This claim is easy to prove, but involves eigenstuff!
Markov Chains for sampling
 Claim: To ensure that the stationary distribution of the Markov chain is p(x) it is sufficient for p and T to satisfy the detailed balance (reversibility) condition:

$$p(x)\,T(x, y) = p(y)\,T(y, x)$$

 Proof: for all y we have

$$[p\,T](y) = \sum_x p(x)\,T(x, y) = \sum_x p(y)\,T(y, x) = p(y)\sum_x T(y, x) = p(y)$$

 And thus p must be a stationary distribution of T
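For the three-state chain used earlier, T is symmetric, so the uniform distribution satisfies detailed balance; a quick numerical check (only a sketch for that particular example):

    import numpy as np

    T = np.array([[0.7, 0.3, 0.0],
                  [0.3, 0.4, 0.3],
                  [0.0, 0.3, 0.7]])
    p = np.ones(3) / 3                    # candidate stationary distribution

    lhs = p[:, None] * T                  # lhs[x, y] = p(x) T(x, y)
    rhs = lhs.T                           # rhs[x, y] = p(y) T(y, x)
    print(np.allclose(lhs, rhs))          # True: detailed balance holds
    print(np.allclose(p @ T, p))          # True: p is a stationary distribution of T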
Metropolis algorithm
 How to pick a suitable Markov chain for our distribution?
 Suppose our distribution p(x) is easy to sample, and easy to compute up to a normalization constant, but hard to compute exactly
 e.g. a Bayesian posterior P(M|D) ∝ P(D|M)P(M)
 We define a Markov chain with the following process:
 Sample a candidate point x* from a proposal distribution q(x* | x^{(t)}) which is symmetric: q(x|y) = q(y|x)
 Compute the importance ratio (this is easy since the normalization constants cancel):

$$r = \frac{p(x^*)}{p(x^{(t)})}$$

 With probability min(r, 1) transition to x*, otherwise stay in the same state
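A minimal random-walk Metropolis sketch in Python; the unnormalized target (a two-component Gaussian mixture) and the Gaussian step size are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(x):
        # Unnormalized target: we never need its normalization constant
        return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

    def metropolis(n_steps, step=1.0, x0=0.0):
        x = x0
        samples = []
        for _ in range(n_steps):
            x_star = x + rng.normal(0.0, step)   # symmetric proposal q(x* | x)
            r = p_tilde(x_star) / p_tilde(x)     # importance ratio
            if rng.uniform() < min(r, 1.0):      # accept with probability min(r, 1)
                x = x_star
            samples.append(x)                    # if rejected, we stay in the same state
        return np.array(samples)

    xs = metropolis(100_000)
    print(xs.mean())  # roughly 2/3 for this mixture (weights 2:1 on means +2 and -2)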
Metropolis intuition
 Why does the Metropolis algorithm work?
 Proposal distribution can propose anything it
likes (as long as it can jump back with the
same probability)
 Proposal is always accepted if it’s jumping
to a more likely state
 Proposal accepted with the importance ratio
if it’s jumping to a less likely state
 The acceptance policy, combined with the
reversibility of the proposal distribution,
makes sure that the algorithm explores
states in proportion to p(x)!
[Figure: from the current state x_t, a proposal x* into a higher-density region has r = 1.0 (always accepted); a proposal x* into a lower-density region has r = p(x*)/p(x_t)]
Metropolis convergence
 Claim: The Metropolis algorithm converges to the target distribution p(x).
 Proof: It satisfies detailed balance.
For all x, y in X, w.l.o.g. assume p(x) < p(y). Then the transition probabilities are

$$T(x, y) = q(y \mid x), \qquad T(y, x) = q(x \mid y)\,\frac{p(x)}{p(y)}$$

(the candidate y is always accepted from x, since r ≥ 1; from y we generate x with probability q(x|y) and accept it with probability r = p(x)/p(y) < 1). Hence:

$$p(x)\,T(x, y) = p(x)\,q(y \mid x) = p(x)\,q(x \mid y) = p(y)\,\frac{p(x)}{p(y)}\,q(x \mid y) = p(y)\,T(y, x)$$

where the second equality uses that q is symmetric.
Metropolis-Hastings
 The symmetry requirement of the Metropolis proposal distribution can be hard to satisfy
 Metropolis-Hastings is the natural generalization of the Metropolis algorithm, and the most popular MCMC algorithm
 We define a Markov chain with the following process:
 Sample a candidate point x* from a proposal distribution q(x* | x^{(t)}) which is not necessarily symmetric
 Compute the importance ratio:

$$r = \frac{p(x^*)\,q(x^{(t)} \mid x^*)}{p(x^{(t)})\,q(x^* \mid x^{(t)})}$$

 With probability min(r, 1) transition to x*, otherwise stay in the same state x^{(t)}
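A Metropolis-Hastings sketch with a genuinely asymmetric proposal (a multiplicative log-normal random walk on x > 0); the target and proposal below are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(x):
        # Unnormalized target on x > 0: x^2 exp(-x), the shape of a Gamma(3, 1) density
        return x ** 2 * np.exp(-x)

    def metropolis_hastings(n_steps, sigma=0.5, x0=1.0):
        x = x0
        samples = []
        for _ in range(n_steps):
            x_star = x * np.exp(sigma * rng.normal())   # asymmetric proposal q(x* | x)
            # For this log-normal random walk, q(x | x*) / q(x* | x) = x_star / x
            r = (p_tilde(x_star) / p_tilde(x)) * (x_star / x)
            if rng.uniform() < min(r, 1.0):
                x = x_star
            samples.append(x)
        return np.array(samples)

    xs = metropolis_hastings(200_000)
    print(xs.mean())  # close to 3, the mean of Gamma(3, 1)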
MH convergence
 Claim: The Metropolis-Hastings algorithm converges to the target distribution p(x).
 Proof: It satisfies detailed balance.
For all x, y in X, w.l.o.g. assume p(x) q(y|x) < p(y) q(x|y). Then the transition probabilities are

$$T(x, y) = q(y \mid x), \qquad T(y, x) = q(x \mid y)\,\frac{p(x)\,q(y \mid x)}{p(y)\,q(x \mid y)}$$

(the candidate y is always accepted from x, since r ≥ 1; from y we generate x with probability q(x|y) and accept it with probability r = the ratio < 1). Hence:

$$p(x)\,T(x, y) = p(x)\,q(y \mid x) = p(y)\,q(x \mid y)\,\frac{p(x)\,q(y \mid x)}{p(y)\,q(x \mid y)} = p(y)\,T(y, x)$$
Gibbs sampling
 A special case of Metropolis-Hastings, applicable when we have a factored state space and access to the full conditionals:

$$p(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n)$$

 Perfect for Bayesian networks!
 Idea: To transition from one state (variable assignment) to another,
 Pick a variable,
 Sample its value from the conditional distribution
 That’s it!
 We’ll show in a minute why this is an instance of MH and thus must be sampling from the full joint
Markov blanket
 Recall that Bayesian networks encode a factored representation of the joint distribution
 Variables are independent of their non-descendants given their parents
 Variables are independent of everything else in the network given their Markov blanket!
 So, to sample each node, we only need to condition on its Markov blanket:

$$p(x_j \mid \mathrm{MB}(x_j))$$
Gibbs sampling
 More formally, the proposal distribution is

$$q(x^* \mid x^{(t)}) = \begin{cases} p(x^*_j \mid x^{(t)}_{-j}) & \text{if } x^*_{-j} = x^{(t)}_{-j} \\ 0 & \text{otherwise} \end{cases}$$

 The importance ratio is

$$r = \frac{p(x^*)\,q(x^{(t)} \mid x^*)}{p(x^{(t)})\,q(x^* \mid x^{(t)})}
    = \frac{p(x^*)\,p(x^{(t)}_j \mid x^*_{-j})}{p(x^{(t)})\,p(x^*_j \mid x^{(t)}_{-j})}
    = \frac{p(x^*_j \mid x^*_{-j})\,p(x^*_{-j})\,p(x^{(t)}_j \mid x^*_{-j})}{p(x^{(t)}_j \mid x^{(t)}_{-j})\,p(x^{(t)}_{-j})\,p(x^*_j \mid x^{(t)}_{-j})} = 1$$

(the second equality is the definition of the proposal distribution, the third is the definition of conditional probability, and the final step follows because we didn’t change the other variables: $x^*_{-j} = x^{(t)}_{-j}$)

 So we always accept!
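A minimal Gibbs-sampling sketch in Python for a bivariate standard Gaussian with correlation rho, where each full conditional is a one-dimensional Gaussian; the target is an illustrative assumption, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    rho = 0.8   # correlation of the target bivariate standard Gaussian

    def gibbs(n_steps, x0=0.0, y0=0.0):
        x, y = x0, y0
        samples = []
        for _ in range(n_steps):
            # Sample each variable from its full conditional, holding the other fixed
            x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))   # p(x | y)
            y = rng.normal(rho * x, np.sqrt(1.0 - rho ** 2))   # p(y | x)
            samples.append((x, y))
        return np.array(samples)

    s = gibbs(100_000)
    print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to 0.8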
Practical issues
 How many iterations?
 How to know when to stop?
 What’s a good proposal function?
Advanced Topics
 Simulated annealing, for global optimization, is a
form of MCMC
 Mixtures of MCMC transition functions
 Monte Carlo EM (stochastic E-step)
 Reversible jump MCMC for model selection
 Adaptive proposal distributions