Appendix for Lecture 2: Monte Carlo Methods (Basics)
Dahua Lin
1 Justification of Basic Sampling Methods
Proposition 1. Let F be the cdf of a real-valued random variable with distribution D. Let
U ∼ Uniform([0, 1]); then F⁻¹(U) ∼ D.
Proof. Let X = F⁻¹(U). It suffices to show that the cdf of X is F. For any t ∈ ℝ,
P(X ≤ t) = P(F⁻¹(U) ≤ t) = P(U ≤ F(t)) = F(t). (1)
Here, we utilize the fact that F is non-decreasing.
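As a concrete illustration of Proposition 1 (a minimal Python sketch, not part of the original appendix): for the Exponential(λ) distribution, F(t) = 1 − e^{−λt}, so F⁻¹(u) = −ln(1 − u)/λ, and applying this map to uniform draws yields exponential samples. The function name and parameter values are illustrative.

```python
import math
import random

def sample_via_inverse_cdf(inv_cdf, n, seed=0):
    """Draw n samples by applying F^{-1} to U ~ Uniform([0, 1])."""
    rng = random.Random(seed)
    return [inv_cdf(rng.random()) for _ in range(n)]

# Exponential(lam): F(t) = 1 - exp(-lam*t), so F^{-1}(u) = -ln(1 - u)/lam
lam = 2.0
samples = sample_via_inverse_cdf(lambda u: -math.log(1.0 - u) / lam, n=100_000)
mean = sum(samples) / len(samples)  # should be close to E[X] = 1/lam = 0.5
```

The empirical mean approaches 1/λ as n grows, consistent with F⁻¹(U) having the exponential distribution.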
Proposition 2. Samples produced using rejection sampling have the desired distribution.
Proof. Each iteration actually generates two random variables: x and u, where u ∈ {0, 1} is the
indicator of acceptance. The joint distribution of x and u is given by
p̃(dx, u = 1) = a(u = 1|x) q(dx) = (p(x) / (M q(x))) · q(x)µ(dx) = (p(x)/M) µ(dx). (2)
Here, a(u|x) is the conditional distribution of u given x, and µ is the base measure. On the other
hand, we have
Pr(u = 1) = ∫ p̃(dx, u = 1) = ∫ (p(x)/M) µ(dx) = 1/M. (3)
Thus, the resultant distribution is
p̃(dx | u = 1) = p̃(dx, u = 1) / Pr(u = 1) = p(x)µ(dx). (4)
This completes the proof.
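The two-variable view in the proof translates directly into code. The following minimal sketch (target and proposal chosen for illustration) applies rejection sampling to p(x) = 2x on [0, 1] with proposal q = Uniform([0, 1]) and envelope constant M = 2, so the acceptance probability is p(x)/(M q(x)) = x; the overall acceptance rate should match Pr(u = 1) = 1/M = 1/2 from Eq. (3).

```python
import random

def rejection_sample(n, seed=0):
    """Rejection sampling for p(x) = 2x on [0, 1]: propose x ~ Uniform([0, 1]),
    accept with probability p(x) / (M q(x)) = x, where M = 2."""
    rng = random.Random(seed)
    accepted, trials = [], 0
    while len(accepted) < n:
        x = rng.random()                     # x ~ q
        u = 1 if rng.random() < x else 0     # acceptance indicator u
        trials += 1
        if u == 1:
            accepted.append(x)
    return accepted, trials

samples, trials = rejection_sample(50_000)
mean = sum(samples) / len(samples)       # E[X] = 2/3 under p(x) = 2x
accept_rate = len(samples) / trials      # should be near 1/M = 1/2
```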
2 Markov Chain Theory
Proposition 3. When the state space Ω is countable, we have
∥µ − ν∥_TV = (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (5)
Proof. Let A = {x ∈ Ω : µ(x) ≥ ν(x)}. By definition, we have
∥µ − ν∥_TV ≥ |µ(A) − ν(A)| = µ(A) − ν(A), (6)
∥µ − ν∥_TV ≥ |µ(Aᶜ) − ν(Aᶜ)| = ν(Aᶜ) − µ(Aᶜ). (7)
We also have
µ(A) − ν(A) = ∑_{x∈A} (µ(x) − ν(x)) = ∑_{x∈A} |µ(x) − ν(x)|, (8)
ν(Aᶜ) − µ(Aᶜ) = ∑_{x∈Aᶜ} (ν(x) − µ(x)) = ∑_{x∈Aᶜ} |µ(x) − ν(x)|. (9)
Combining the equations above results in
∥µ − ν∥_TV ≥ (1/2) (µ(A) − ν(A) + ν(Aᶜ) − µ(Aᶜ))
= (1/2) (∑_{x∈A} |µ(x) − ν(x)| + ∑_{x∈Aᶜ} |µ(x) − ν(x)|)
= (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (10)
Next we show the inequality in the other direction. For any A ⊂ Ω, we have
|µ(Aᶜ) − ν(Aᶜ)| = |(µ(Ω) − µ(A)) − (ν(Ω) − ν(A))| = |µ(A) − ν(A)|. (11)
Hence,
|µ(A) − ν(A)| = (1/2) (|µ(A) − ν(A)| + |µ(Aᶜ) − ν(Aᶜ)|)
≤ (1/2) (∑_{x∈A} |µ(x) − ν(x)| + ∑_{x∈Aᶜ} |µ(x) − ν(x)|)
= (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (12)
As A is arbitrary, we can conclude that
∥µ − ν∥_TV = sup_A |µ(A) − ν(A)| ≤ (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|. (13)
This completes the proof.
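On a countable space, Eq. (5) gives a direct way to compute the TV distance. The sketch below (helper names illustrative) also checks it against the defining supremum, which on a finite space is attained at the set A = {x : µ(x) ≥ ν(x)} used in the proof.

```python
def tv_distance(mu, nu):
    """Total variation distance via Eq. (5): half the L1 distance
    between the two probability mass functions (given as dicts)."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

def tv_via_sup(mu, nu):
    """The defining form sup_A |mu(A) - nu(A)|, attained at
    A = {x : mu(x) >= nu(x)} on a finite space."""
    support = set(mu) | set(nu)
    A = [x for x in support if mu.get(x, 0.0) >= nu.get(x, 0.0)]
    return abs(sum(mu.get(x, 0.0) for x in A) - sum(nu.get(x, 0.0) for x in A))

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
d = tv_distance(mu, nu)  # 0.5 * (0.3 + 0.0 + 0.3) = 0.3
```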
Proposition 4. The total variation distance (µ, ν) ↦ ∥µ − ν∥_TV is a metric.
Proof. To show that it is a metric, we verify the four properties that a metric needs to satisfy
one by one.
1. ∥µ − ν∥_TV is non-negative, as |µ(A) − ν(A)| is always non-negative.
2. When µ = ν, |µ(A) − ν(A)| is always zero, and hence ∥µ − ν∥_TV = 0. On the other
hand, when µ ≠ ν, there exists A ∈ S such that |µ(A) − ν(A)| > 0, and therefore
∥µ − ν∥_TV ≥ |µ(A) − ν(A)| > 0. Together we can conclude that ∥µ − ν∥_TV = 0 iff µ = ν.
3. ∥µ − ν∥_TV = ∥ν − µ∥_TV, as |µ(A) − ν(A)| = |ν(A) − µ(A)| holds for any measurable
subset A.
4. Next, we show that the total variation distance satisfies the triangle inequality, as below.
Let µ, ν, η be three probability measures over Ω:
∥µ − ν∥_TV = sup_{A∈S} |µ(A) − ν(A)|
= sup_{A∈S} |µ(A) − η(A) + η(A) − ν(A)|
≤ sup_{A∈S} (|µ(A) − η(A)| + |η(A) − ν(A)|)
≤ sup_{A∈S} |µ(A) − η(A)| + sup_{A∈S} |η(A) − ν(A)|
= ∥µ − η∥_TV + ∥η − ν∥_TV. (14)
The proof is completed.
Proposition 5. Consider a Markov chain over a countable space Ω with transition probability
matrix P. Let π be a probability measure over Ω that is in detailed balance with P,
i.e. π(x)P(x, y) = π(y)P(y, x), ∀x, y ∈ Ω. Then π is invariant to P, i.e. π = πP.
Proof. With the assumption of detailed balance, we have
(πP)(y) = ∑_{x∈Ω} π(x)P(x, y) = ∑_{x∈Ω} π(y)P(y, x) = π(y) ∑_{x∈Ω} P(y, x) = π(y). (15)
Hence, π = πP, or in other words, π is invariant to P.
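Proposition 5 is easy to verify numerically. The sketch below (names and the particular π are illustrative) builds a small transition matrix that is in detailed balance with a given π by construction, in the style of a Metropolis chain with a uniform proposal, and checks that πP = π.

```python
def detailed_balance_kernel(pi):
    """Transition matrix of a Metropolis-style chain with a uniform
    proposal over all states; it satisfies the detailed-balance
    condition pi(x) P(x, y) = pi(y) P(y, x) by construction."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                P[x][y] = (1.0 / n) * min(1.0, pi[y] / pi[x])
        P[x][x] = 1.0 - sum(P[x])  # remaining mass stays at x
    return P

pi = [0.25, 0.25, 0.5]
P = detailed_balance_kernel(pi)
# one step from pi: (pi P)(y) = sum_x pi(x) P(x, y); expect pi itself
piP = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]
```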
Proposition 6. Let (Xt) be an ergodic Markov chain Markov(π, P), where π is in detailed
balance with P. Then for an arbitrary sequence x0, . . . , xn ∈ Ω, we have
Pr(X0 = x0, . . . , Xn = xn) = Pr(X0 = xn, . . . , Xn = x0). (16)
Proof. First, we have
Pr(X0 = x0, . . . , Xn = xn) = π(x0)P(x0, x1) · · · P(xn−1, xn). (17)
On the other hand, by detailed balance, we have P(x, y) = π(y)P(y, x)/π(x), and thus
Pr(X0 = xn, . . . , Xn = x0) = π(xn)P(xn, xn−1) · · · P(x1, x0)
= π(xn) · (π(xn−1)P(xn−1, xn)/π(xn)) · · · (π(x0)P(x0, x1)/π(x1))
= π(x0)P(x0, x1) · · · P(xn−1, xn). (18)
Comparing Eq.(17) and Eq.(18) results in the equality that we intend to prove.
Proposition 7. Over a measurable space (Ω, S), if a stochastic kernel P is reversible w.r.t. π,
then π is invariant to P.
Proof. Let π′ = πP; it suffices to show that π′(A) = π(A) for every A ∈ S under the reversibility
assumption. Given any A ∈ S, let fA(x, y) := 1(y ∈ A); then we have
π′(A) = ∫ π(dx)P(x, A)
= ∫ π(dx) ∫ fA(x, y)P(x, dy)
= ∫∫ fA(x, y)π(dx)P(x, dy)
= ∫∫ fA(y, x)π(dx)P(x, dy) ...[reversibility]
= ∫∫ 1(x ∈ A)π(dx)P(x, dy)
= ∫ 1(x ∈ A)π(dx) ∫ P(x, dy)
= ∫ 1(x ∈ A)π(dx) = π(A). (19)
This completes the proof.
Proposition 8. Given a stochastic kernel P and a probability measure π over (Ω, S), suppose
both Px and π are absolutely continuous w.r.t. a base measure µ, that is, π(dx) = π(x)µ(dx)
and P(x, dy) = Px(dy) = px(y)µ(dy). Then P is reversible w.r.t. π if and only if
π(x)px(y) = π(y)py(x), a.e. (20)
Proof. First, assuming detailed balance, i.e. π(x)px(y) = π(y)py(x) a.e., we show reversibility:
∫∫ f(x, y)π(dx)P(x, dy) = ∫∫ f(x, y)π(x)px(y)µ(dx)µ(dy)
= ∫∫ f(x, y)π(y)py(x)µ(dx)µ(dy) ...[detailed balance]
= ∫∫ f(y, x)π(x)px(y)µ(dx)µ(dy) ...[exchange variables]
= ∫∫ f(y, x)π(dx)P(x, dy). (21)
Next, we show the converse. The definition of reversibility implies that
∫∫ f(x, y)π(dx)P(x, dy) = ∫∫ f(x, y)π(dy)P(y, dx). (22)
Hence,
∫∫ f(x, y)π(x)px(y)µ(dx)µ(dy) = ∫∫ f(x, y)π(y)py(x)µ(dx)µ(dy). (23)
Since this equality holds for an arbitrary integrable function f, it implies that π(x)px(y) =
π(y)py(x) a.e.
Proposition 9. Given a stochastic kernel P and a probability measure π over (Ω, S). If
P(x, dy) = m(x)Ix(dy) + px(y)µ(dy), where Ix is the point mass at x, and π(x)px(y) =
π(y)py(x) a.e., then P is reversible w.r.t. π.
Proof. Under the given conditions, we have
∫∫ f(x, y)π(dx)P(x, dy) = ∫∫ f(x, y)π(dx) (m(x)Ix(dy) + px(y)µ(dy))
= ∫∫ f(x, y)m(x)π(dx)Ix(dy) + ∫∫ f(x, y)px(y)π(dx)µ(dy)
= ∫ f(x, x)m(x)π(dx) + ∫∫ f(x, y)px(y)π(dx)µ(dy)
= ∫ f(x, x)m(x)π(dx) + ∫∫ f(x, y)px(y)π(x)µ(dx)µ(dy). (24)
For the right hand side, we have
∫∫ f(y, x)π(dx)P(x, dy) = ∫∫ f(y, x)π(dx) (m(x)Ix(dy) + px(y)µ(dy))
= ∫∫ f(y, x)m(x)π(dx)Ix(dy) + ∫∫ f(y, x)px(y)π(dx)µ(dy)
= ∫ f(x, x)m(x)π(dx) + ∫∫ f(y, x)px(y)π(dx)µ(dy)
= ∫ f(x, x)m(x)π(dx) + ∫∫ f(y, x)px(y)π(x)µ(dx)µ(dy)
= ∫ f(x, x)m(x)π(dx) + ∫∫ f(x, y)py(x)π(y)µ(dx)µ(dy). (25)
With π(x)px(y) = π(y)py(x), we can see that the left and right hand sides are equal. This
completes the proof.
3 Justification of MCMC Methods
Proposition 10. Samples produced using the Metropolis-Hastings algorithm have the desired
distribution, and the resultant chain is reversible.
Proof. It suffices to show that the M-H update is reversible w.r.t. π, which implies that π is
invariant. The stochastic kernel of the M-H update is given by
P(x, dy) = m(x)I(x, dy) + q(x, dy)a(x, y) = m(x)I(x, dy) + qx(y)a(x, y)µ(dy). (26)
Here, µ is the base measure, I(x, dy) is the identity measure given by I(x, A) = 1(x ∈ A),
and m(x) is the probability that the proposal is rejected, which is given by
m(x) = 1 − ∫_Ω q(x, dy)a(x, y). (27)
Let g(x, y) = h(x)qx(y)a(x, y). With Proposition 9, it suffices to show that g(x, y) = g(y, x).
Here, a(x, y) = min{r(x, y), 1}. Also, from the definition r(x, y) = h(y)qy(x) / (h(x)qx(y)),
it is easy to see that r(x, y) = 1/r(y, x). We first consider the case where r(x, y) ≤ 1 (thus
r(y, x) ≥ 1); then
g(x, y) = h(x)qx(y)a(x, y) = h(x)qx(y) · h(y)qy(x) / (h(x)qx(y)) = h(y)qy(x), (28)
and
g(y, x) = h(y)qy(x)a(y, x) = h(y)qy(x). (29)
Hence, g(x, y) = g(y, x) when r(x, y) ≤ 1. Similarly, we can show that the equality holds when
r(x, y) ≥ 1. This completes the proof.
Proposition 11. The Metropolis algorithm is a special case of the Metropolis-Hastings algorithm.
Proof. It suffices to show that when q is symmetric, i.e. qx(y) = qy(x), the acceptance rate
reduces to the form given in the Metropolis algorithm. Particularly, when qx(y) = qy(x), the
acceptance rate of the M-H algorithm is
a(x, y) = min{r(x, y), 1} = min{h(y)qy(x) / (h(x)qx(y)), 1} = min{h(y)/h(x), 1}. (30)
This completes the proof.
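As an illustrative sketch (not part of the original appendix), here is a random-walk Metropolis sampler in Python using the symmetric-proposal acceptance probability min{h(y)/h(x), 1} of Eq. (30). The target is a standard normal, specified only through the unnormalized density h(x) = e^{−x²/2}; function and parameter names are assumptions.

```python
import math
import random

def metropolis(h, x0, steps, scale=1.0, seed=0):
    """Random-walk Metropolis: a symmetric Gaussian proposal, so a
    candidate y is accepted with probability min{h(y)/h(x), 1}.
    h is the target density, possibly unnormalized."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(steps):
        y = x + rng.gauss(0.0, scale)        # symmetric proposal: qx(y) = qy(x)
        if rng.random() < min(1.0, h(y) / h(x)):
            x = y                             # accept the candidate
        chain.append(x)                       # otherwise keep the current state
    return chain

# target: standard normal, known only up to its normalizing constant
chain = metropolis(lambda x: math.exp(-0.5 * x * x), x0=0.0, steps=100_000)
mean = sum(chain) / len(chain)                         # should be near 0
second_moment = sum(c * c for c in chain) / len(chain)  # should be near 1
```

Note that only ratios h(y)/h(x) are ever evaluated, which is why the normalizing constant of the target is not needed.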
Proposition 12. The Gibbs sampling update is a special case of the Metropolis-Hastings update.
Proof. Without loss of generality, we assume the sample comprises two components: x =
(x1, x2). Consider a proposal that redraws the first component from its conditional distribution,
qx(dy) = π(dy1 | x2)I(dx2). In this case, we have
r((x1, x2), (y1, x2)) = (π(y1, x2) π(x1 | x2)) / (π(x1, x2) π(y1 | x2))
= (π(y1, x2) π(x1, x2)) / (π(x1, x2) π(y1, x2)) = 1. (31)
This implies that the candidate is always accepted. Also, generating a sample from qx is
equivalent to drawing y1 from the conditional distribution π(· | x2). This completes the argument.
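The two-component case above translates directly into a systematic-scan Gibbs sampler. The sketch below (the joint target is chosen purely for illustration) alternately redraws x1 ∼ π(· | x2) and x2 ∼ π(· | x1); since each move is an always-accepted M-H update, the empirical frequencies converge to π.

```python
import random

# illustrative joint target over {0,1}^2, as a dict (x1, x2) -> probability
pi = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def gibbs(pi, steps, seed=0):
    """Systematic-scan Gibbs sampler: each half-step redraws one
    component from its conditional under pi, holding the other fixed."""
    rng = random.Random(seed)
    x1, x2 = 0, 0
    counts = {s: 0 for s in pi}
    for _ in range(steps):
        # resample x1 ~ pi(x1 | x2)
        p0 = pi[(0, x2)] / (pi[(0, x2)] + pi[(1, x2)])
        x1 = 0 if rng.random() < p0 else 1
        # resample x2 ~ pi(x2 | x1)
        p0 = pi[(x1, 0)] / (pi[(x1, 0)] + pi[(x1, 1)])
        x2 = 0 if rng.random() < p0 else 1
        counts[(x1, x2)] += 1
    return {s: c / steps for s, c in counts.items()}

freq = gibbs(pi, steps=200_000)  # empirical frequencies, close to pi
```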
Proposition 13. Let K1, . . . , Km be stochastic kernels with invariant measure π, and let q ∈ ℝᵐ
be a probability vector. Then K = ∑_{i=1}^m qi Ki is also a stochastic kernel with invariant
measure π. Moreover, if K1, . . . , Km are all reversible, then K is reversible.
Proof. First, it is easy to see that convex combinations of probability measures remain probability
measures. As an immediate consequence, Kx, a convex combination of the Ki(x, ·), is also a
probability measure. Given a measurable subset A, Ki(·, A) is measurable for each i, and so are
their convex combinations. Hence, we can conclude that K remains a stochastic kernel. Next,
we show that π is invariant to K, as
πK = π ∑_{i=1}^m qi Ki = ∑_{i=1}^m qi (πKi) = ∑_{i=1}^m qi π = π. (32)
This proves the first statement. Next, assume that K1, . . . , Km are reversible. Then for K,
we have
∫∫ f(x, y)π(dx)K(x, dy) = ∑_{i=1}^m qi ∫∫ f(x, y)π(dx)Ki(x, dy)
= ∑_{i=1}^m qi ∫∫ f(y, x)π(dx)Ki(x, dy)
= ∫∫ f(y, x)π(dx)K(x, dy). (33)
This implies that K is also reversible, thus completing the proof.
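For finite Ω, a stochastic kernel is just a row-stochastic matrix, so Proposition 13 can be checked numerically. In this sketch (matrices chosen for illustration) both kernels leave the uniform distribution invariant, and so does their mixture:

```python
# two 2-state kernels (row-stochastic matrices), both leaving the
# uniform distribution pi = [0.5, 0.5] invariant
K1 = [[0.9, 0.1], [0.1, 0.9]]
K2 = [[0.3, 0.7], [0.7, 0.3]]
q = [0.25, 0.75]  # mixing weights (a probability vector)

# mixture kernel K = q1*K1 + q2*K2, as in Proposition 13
K = [[q[0] * K1[x][y] + q[1] * K2[x][y] for y in range(2)] for x in range(2)]

# one step of K from pi: (pi K)(y) = sum_x pi(x) K(x, y)
pi = [0.5, 0.5]
piK = [sum(pi[x] * K[x][y] for x in range(2)) for y in range(2)]  # expect [0.5, 0.5]
```

Operationally, sampling from K amounts to first drawing an index i ∼ q and then applying one step of Ki, which is how mixtures of MCMC updates are usually implemented.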
Proposition 14. Let K1, . . . , Km be stochastic kernels with invariant measure π. Then K =
Km ◦ · · · ◦ K1 is also a stochastic kernel with invariant measure π.
Proof. Consider K = K2 ◦ K1. To show that K is a stochastic kernel, we first show that
Kx(dy) = K(x, dy) is a probability measure. Given an arbitrary measurable subset A, we have
K(x, A) = ∫ K1(x, dy)K2(y, A). (34)
As this is the integral of a bounded non-negative function and K2(·, A) is measurable, A ↦
K(x, A) constitutes a measure. Also,
K(x, Ω) = ∫ K1(x, dy)K2(y, Ω) = ∫ K1(x, dy) = 1. (35)
Hence, K(x, ·) is a probability measure, and thus K is a stochastic kernel. Next, we show π is
invariant to K:
πK = π(K2 ◦ K1) = (πK1)K2 = πK2 = π. (36)
We have proved the statement for a composition of two kernels K2 ◦ K1. By induction, we can
further extend it to any finite composition, thus completing the proof.
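For finite Ω, the composed kernel of Proposition 14 corresponds to a matrix product: run one step of K1, then one step of K2. A small numerical check (matrices chosen for illustration):

```python
def compose(K1, K2):
    """One step of K1 followed by one step of K2: the kernel K2 ∘ K1
    of Proposition 14. Under the row-vector convention pi -> pi K,
    this is the matrix product K1 · K2."""
    n = len(K1)
    return [[sum(K1[x][z] * K2[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)]

# both kernels are doubly stochastic, so pi = [0.5, 0.5] is invariant for each
K1 = [[0.9, 0.1], [0.1, 0.9]]
K2 = [[0.3, 0.7], [0.7, 0.3]]
K = compose(K1, K2)

pi = [0.5, 0.5]
piK = [sum(pi[x] * K[x][y] for x in range(2)) for y in range(2)]  # expect [0.5, 0.5]
```

This is exactly the structure of a systematic-scan Gibbs sampler: each coordinate update is a kernel with invariant measure π, and the full sweep is their composition.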