Markov Chain Monte Carlo Methods
Applications in Machine Learning
Andres Mendez-Vazquez
June 1, 2017
Outline
1 Introduction
The Main Reason
Examples of Application
Basically
2 The Monte Carlo Method
FERMIAC and ENIAC Computers
Immediate Applications
3 Markov Chains
Introduction
Enters the Perron–Frobenius Theorem
Enter Google’s Page Rank
4 Markov Chain Monte Carlo Methods
Combining the Power of the two Methods
5 Metropolis Hastings
Introduction
A General Idea
Applications in Machine Learning
6 The Gibbs Sampler
Introduction
The Simplest Algorithm
Applications in Machine Learning
Chance
There are many phenomena that introduce chance into their models
Therefore
Why not use SAMPLING to understand those phenomena?
Thus
Markov Chain Monte Carlo (MCMC) Methods
Algorithms that use Markov Chains to obtain samples from a target phenomenon!!!
Thus
They are computer-based simulations able to obtain samples from π (x).
The Reason
There are several high-dimensional problems
For example, computing the volume of a convex body in d dimensions.
MCMC is the only known general approach that provides a solution within a
reasonable time, O(d^k).
Therefore
MCMC plays a significant role in statistics, econometrics, physics and
computer science.
Examples of Application
Bayesian Inference and Learning
Given some unknown variables X1, X2, ..., XK and data Y , we want to
infer properties of the unknowns given the data (i.e., the posterior p (X1:K |Y ))
Graphically
Deep Learning
Restricted Boltzmann Machines use MCMC to optimize their weights
E (v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_{i,j} h_j
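As a concrete illustration (a minimal sketch, not from the original slides): the MCMC step used when training a binary RBM is block Gibbs sampling of h given v and v given h. The parameter names (W, a, b) and shapes below are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One block Gibbs sweep of a binary RBM: sample h | v, then v | h."""
    p_h = sigmoid(b + v @ W)             # P(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + h @ W.T)           # P(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# Usage sketch: a tiny RBM with 6 visible and 3 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 3))
a, b = np.zeros(6), np.zeros(3)
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(10):                      # a short Gibbs chain
    v, h = gibbs_step(v, W, a, b, rng)
```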
What do we want?
Given a probability distribution of interest
π (x) , x ∈ R^N
Which has the following structure
π (x) = (1/Z) h (x)
where h (x) is an unnormalized density we can evaluate and Z is an unknown normalization constant.
Thus,
We want to understand such distribution!!!
The beginning of Monte Carlo Methods
In 1945, two events changed the world forever
The successful nuclear test at Alamogordo.
The building of the first electronic computer, ENIAC.
They pushed for the creation of the Monte Carlo Methods
The original idea came from Stan Ulam... He loved relaxing by playing
poker and solitaire!!!
Stan had an uncle who borrowed money from relatives because he
“just had to go to Monte Carlo.”
Together with Von Neumann
The Guy Behind the Minimax Theorem
They started to develop an idea to trace the path of neutrons in a
spherical reactor.
Thus
We have then
At each stage a sequence of decisions has to be made based on statistical
probabilities appropriate to the physical and geometric factors.
For this, we only need a source of uniform random numbers!!!
Because it is possible to use the inverse of the cumulative distribution function
of the target distribution to obtain the necessary samples!!!
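For instance (an illustrative sketch, not part of the original slides): to sample from an exponential distribution with rate λ, invert its CDF F(x) = 1 − e^{−λx} and feed it uniform random numbers.

```python
import numpy as np

def sample_exponential(lam, n, rng):
    """Inverse-transform sampling: X = F^{-1}(U) with U ~ Uniform(0, 1)."""
    u = rng.random(n)
    return -np.log(1.0 - u) / lam    # inverse of F(x) = 1 - exp(-lam * x)

rng = np.random.default_rng(0)
samples = sample_exponential(lam=2.0, n=10_000, rng=rng)
print(samples.mean())                # should be close to 1 / lam = 0.5
```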
Thus, the FERMIAC was born!!!
When the ENIAC was moved and it was necessary to keep generating
target statistics
However
Once the ENIAC went on-line again
It took two months to have the basic controls for the Monte-Carlo
One fortnight for the last phases of the implementation.
Then, the tests were run
And Monte Carlo was born!!!
Look At This, Von Neumann’s Programs
Design First and Programming After
Monte Carlo Integration
We can compute integrals of complex functions, for example

I = ∫∫_Ω sin( ln (x + y + 1) ) dx dy

Where Ω is the disk

{ (x, y) | (x − 1/2)^2 + (y − 1/2)^2 ≤ 1/4 }

We only need a source of uniformly random points on that area

I ≈ Area of Ω × Average value of the integrand over Ω
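A minimal sketch of this estimate (illustrative, not from the slides): draw uniform points in the bounding square of the disk, keep those that fall inside, and average the integrand over them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Uniform points in the bounding square [0, 1] x [0, 1] of the disk
x = rng.random(n)
y = rng.random(n)
inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25   # disk of radius 1/2

f = np.sin(np.log(x[inside] + y[inside] + 1.0))    # integrand on the disk
area = np.pi * 0.5 ** 2                            # area of the disk
I_hat = area * f.mean()
print(I_hat)
```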
Getting the First Moment!!!
The goal is to compute the following expectation
E [f] = ∫ f (z) p (z) dz
Solution
Obtain a set of samples z^(i), i = 1, ..., N, drawn independently from p (z)
Approximate the expectation as
E [f] ≈ Ê [f] = (1/N) Σ_{i=1}^N f( z^(i) )
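As a quick illustration (an assumed example, not from the slides): estimate E[z^2] for z ~ N(0, 1), whose true value is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
z = rng.standard_normal(N)     # samples z^(i) drawn from p(z) = N(0, 1)
f = z ** 2                     # f(z^(i))
print(f.mean())                # Monte Carlo estimate of E[f], close to 1.0
```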
Clustering Using a Stochastic Process
We use the following process
Imagine a Chinese restaurant with an infinite number of circular tables, each with
infinite capacity!!!
Customer ONE sits at the first table
The next customer either sits at the same table as customer ONE
Or at a new table
Something like this
Thus, it is possible to build an entire Random Process
Simply Asking

p (customer i assigned to table j|D, α) = { f (d_ij)   if j ≠ i
                                          { α          if i = j

Where
D is the set of pairwise distances between customers, d_ij = d (c_i, c_j)
Let's see the code
There we have a series of nice ideas.
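The code from the slide is not reproduced in this text; here is a minimal sketch of the assignment rule above (illustrative; names such as `ddcrp_assign` and the decay f(d) = exp(−d) are assumptions).

```python
import numpy as np

def ddcrp_assign(D, alpha, rng, decay=lambda d: np.exp(-d)):
    """Distance-dependent CRP: customer i links to customer j with probability
    proportional to f(d_ij), or to itself (a new table) with weight alpha."""
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        w = decay(D[i])          # f(d_ij) for every j
        w[i] = alpha             # self-link weight
        links[i] = rng.choice(n, p=w / w.sum())
    return links                 # connected components of the link graph form the clusters

rng = np.random.default_rng(0)
X = rng.random((8, 2))                                   # 8 points in 2-D
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)     # pairwise distances
print(ddcrp_assign(D, alpha=1.0, rng=rng))
```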
Markov Chains
The random process Xt ∈ S for t = 1, 2, ..., T has the Markov property
If and only if
p (XT |XT−1, XT−2, ..., X1) = p (XT |XT−1)
Finite-State Discrete-Time Markov Chains
The chain can be completely specified by its transition matrix
P = [pij] with pij = P [Xt = j|Xt−1 = i]
Example
We have the following transition matrix

P =
    0.2  0.1  0.1  0.1  0.5
    0.4  0.1  0.1  0.2  0.1
    0.1  0.3  0.2  0.1  0.1
    0.2  0.4  0.2  0.3  0.2
    0.1  0.1  0.4  0.3  0.1
Graphically: a state-transition diagram over the five states 1–5
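A small sketch (not from the slides) of simulating this chain, assuming the matrix is read column-by-column, i.e., column i holds the probabilities of moving out of state i (the columns above sum to one).

```python
import numpy as np

# Transition matrix from the example; column i gives the distribution
# of the next state given that the current state is i.
P = np.array([
    [0.2, 0.1, 0.1, 0.1, 0.5],
    [0.4, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.3, 0.2, 0.1, 0.1],
    [0.2, 0.4, 0.2, 0.3, 0.2],
    [0.1, 0.1, 0.4, 0.3, 0.1],
])

def simulate(P, x0, steps, rng):
    """Simulate a finite-state Markov chain for a given number of steps."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = rng.choice(P.shape[0], p=P[:, x])   # next state from the current column
        path.append(x)
    return path

rng = np.random.default_rng(0)
print(simulate(P, x0=0, steps=20, rng=rng))     # states are 0-indexed here
```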
What kind of Markov Chains would we like to study?
Ergodic
A Markov chain is called an ergodic chain if it is possible to go from every
state to every state (not necessarily in one move).
Aperiodic
A state i has period k if any return to state i must occur in multiples of k
time steps.
k = gcd {n > 0|P (Xn = i|X0 = i) > 0}
Therefore
Thus
If k = 1, then the state is said to be aperiodic.
Otherwise (k > 1), the state is said to be periodic with period k.
A Markov chain is aperiodic if every state is aperiodic
The Theorem
Perron–Frobenius Theorem
Let A be a positive square matrix. Then:
a. ρ(A) is an eigenvalue, and it has a positive eigenvector.
b. ρ(A) is the only eigenvalue on the circle |λ| = ρ(A).
This and other theorems allow us to calculate something
quite interesting
Using the Power Method
The method is described by the recurrence relation

w^(i+1) = T w^(i) / ‖T w^(i)‖ , where ‖T w^(i)‖ = sqrt( (w^(i))^T T^T T w^(i) )

Then
The sub-sequence {w^(k_i)}_{i=1}^∞ converges to an eigenvector associated with
the dominant eigenvalue.
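A minimal sketch of the power method (illustrative, not from the slides), applied here to the example transition matrix P from the earlier slide; since P is column-stochastic its dominant eigenvalue is 1, so the iteration approximates the stationary distribution.

```python
import numpy as np

def power_method(T, w0, iters=1000):
    """Power iteration: w <- T w / ||T w||, converging to the dominant eigenvector."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(iters):
        w = T @ w
        w = w / np.linalg.norm(w)
    return w

P = np.array([
    [0.2, 0.1, 0.1, 0.1, 0.5],
    [0.4, 0.1, 0.1, 0.2, 0.1],
    [0.1, 0.3, 0.2, 0.1, 0.1],
    [0.2, 0.4, 0.2, 0.3, 0.2],
    [0.1, 0.1, 0.4, 0.3, 0.1],
])
w = power_method(P, np.ones(5))
print(w / w.sum())    # normalized to a probability vector: the stationary distribution
```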
Long Ago in a Long Forgotten Land
Dozens of Companies fought for the Search Landscape
America Online
Netscape
Yahoo
Infoseek
Lycos
Altavista
This is Old
For Example
Enter Larry Page and Sergey Brin (circa 1996)
They invented the Google Matrix (its name a misspelling of Googol = 10^100)

G = αS + (1 − α) 1v

Where
S is a modified version of the adjacency matrix, obtained by converting the
number of links into probabilities
In addition
Also
1 is the column vector of ones.
v is a row vector of probabilities, v = (1/n, 1/n, ..., 1/n) (at the initial experiments)
The n × n matrix (1 − α) 1v then has every entry equal to (1 − α)/n:

(1 − α) 1v =
    (1 − α)/n  (1 − α)/n  · · ·  (1 − α)/n
    (1 − α)/n  (1 − α)/n  · · ·  (1 − α)/n
        ⋮          ⋮        ⋱        ⋮
    (1 − α)/n  (1 − α)/n  · · ·  (1 − α)/n
The Damping Factor
Finally, α
The damping factor in the Google matrix indicates that, with probability 1 − α,
a random web surfer moves to a different web page by some means other than
selecting a link
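A small illustrative sketch (an assumed example, not from the slides): build G = αS + (1 − α) 1v for a tiny 4-page web and compute PageRank by power iteration, using a column-stochastic convention for S.

```python
import numpy as np

# Link structure of a tiny 4-page web: A[i, j] = 1 if page j links to page i
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

n = A.shape[0]
S = A / A.sum(axis=0)                  # columns normalized: link counts -> probabilities
alpha = 0.85
v = np.full(n, 1.0 / n)                # uniform teleportation vector
G = alpha * S + (1 - alpha) * np.outer(np.ones(n), v)   # Google matrix

r = np.full(n, 1.0 / n)                # start from the uniform distribution
for _ in range(100):                   # power iteration: r <- G r
    r = G @ r
print(r / r.sum())                     # PageRank scores
```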
The Algorithm was the Edge!!!
First Google Server
After the introduction of the algorithm,
THE END for everybody else!!!
Now Imagine the following
You have a target distribution π that you want to sample from
You can use a proposal distribution q, which you know how to sample from,
to generate the necessary samples.
Then, you have a process like this
1 Sample x ∼ q (x).
2 Use the functional form of the target distribution π to
Accept or Reject the sample x as being generated by π (x)
Basically
Markov Condition
The generation of the next sample depends only on the current state, not on earlier states!!!
Further!!!
Monte Carlo Method
The algebraic and geometric properties of the target distribution help us
to accept or reject the sample!!!
History - The not so great remarks...
Metropolis
Generalized −→ Metropolis-Hastings
Special Case −→ Gibbs Sampling
All these developments were done in Computational Physics.
The landmark 1953 paper by N. Metropolis, A. Rosenbluth, M.
Rosenbluth, A. Teller, and E. Teller:
“Equation of State Calculations by Fast Computing Machines”, Journal of
Chemical Physics.
There is a quote by A. Rosenbluth
“Metropolis played no role in its development other than providing
computer time!”
After all, Metropolis was the supervisor at Los Alamos National Lab.
Steps of the Metropolis-Hastings
An M-H step uses
1 The Target/Invariant Distribution l (x)
2 The Proposal/Sampling Distribution q (x′|x)
Then
It involves sampling a candidate value x′ given the current value x
according to q (x′|x)
The Markov chain then moves towards x′
With acceptance probability

A (x, x′) = min{ 1, [ l (x′) q (x|x′) ] / [ l (x) q (x′|x) ] }
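A compact sketch of a generic M-H sampler (illustrative, not from the slides; it uses a symmetric Gaussian random-walk proposal, so the q terms cancel and only the target ratio remains).

```python
import numpy as np

def metropolis_hastings(log_l, x0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting a density known up to a constant."""
    rng = rng or np.random.default_rng()
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        x_new = x + step * rng.standard_normal(x.size)   # propose x' ~ q(x'|x)
        log_A = log_l(x_new) - log_l(x)                  # symmetric q: ratio of targets
        if np.log(rng.random()) < log_A:                 # accept with prob min(1, A)
            x = x_new
        samples[i] = x
    return samples

# Usage: sample from an unnormalized standard Gaussian target
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), x0=[0.0], n_samples=5000)
print(samples.mean(), samples.std())
```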
Logistic Regression
We know the likelihood

l (w|x, y) = Π_{i=1}^n [ exp{w^T x_i} / (1 + exp{w^T x_i}) ]^{y_i} [ 1 / (1 + exp{w^T x_i}) ]^{1−y_i}    (1)

In our case, we use as the proposal q a multivariate Gaussian over w, N (µ, Σ)
Logistic Regression
The Markov chain then moves towards x′
With acceptance probability (for a symmetric proposal the q terms cancel)

A (x, x′) = min{ 1, l (x′) / l (x) }
Thus, we trigger the process of Sampling
Then, we look at the modes in the distribution
Let’s Take a Look at the Program
It is not so complex
A little bit of work!!!
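The program from the slide is not reproduced in this text; here is a minimal sketch of what it could look like (an assumed implementation, not the author's original code): random-walk Metropolis over w using the log of Equation (1).

```python
import numpy as np

def log_likelihood(w, X, y):
    """Log of Equation (1): Bernoulli likelihood with sigmoid(w^T x_i)."""
    z = X @ w
    return np.sum(y * z - np.logaddexp(0.0, z))

def mh_logistic(X, y, n_samples=5000, step=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    chain = np.empty((n_samples, w.size))
    for i in range(n_samples):
        w_new = w + step * rng.standard_normal(w.size)   # Gaussian random-walk proposal
        if np.log(rng.random()) < log_likelihood(w_new, X, y) - log_likelihood(w, X, y):
            w = w_new
        chain[i] = w
    return chain

# Usage on synthetic data
rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.standard_normal((200, 2))]
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
chain = mh_logistic(X, y)
print(chain[1000:].mean(axis=0))   # mean of the sampled w after burn-in, roughly near w_true
```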
The Assumptions
Suppose
We have an n-dimensional vector x.
We have expressions for the full conditionals
p (xj|x1, ..., xj−1, xj+1, ..., xn)
Here, we have the following proposal distribution

q (x′|x^(i)) = { p (x′_j | x^(i)_{−j})   if x′_{−j} = x^(i)_{−j}
              { 0                        otherwise
The Gibbs Sampler
The Algorithm
1 Initialize x^(0)_{1:n}
2 For i = 0 to N − 1
  Sample x^(i+1)_1 ∼ p( x1 | x^(i)_2, ..., x^(i)_n )
  Sample x^(i+1)_2 ∼ p( x2 | x^(i+1)_1, x^(i)_3, ..., x^(i)_n )
  · · ·
  Sample x^(i+1)_j ∼ p( xj | x^(i+1)_1, ..., x^(i+1)_{j−1}, x^(i)_{j+1}, ..., x^(i)_n )
  · · ·
  Sample x^(i+1)_n ∼ p( xn | x^(i+1)_1, ..., x^(i+1)_{n−1} )
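A small sketch of this sweep (an assumed example, not from the slides): Gibbs sampling of a bivariate Gaussian with correlation ρ, whose full conditionals are themselves Gaussian.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples, rng=None):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically."""
    rng = rng or np.random.default_rng(0)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    s = np.sqrt(1.0 - rho ** 2)
    for i in range(n_samples):
        x1 = rho * x2 + s * rng.standard_normal()   # sample x1 | x2
        x2 = rho * x1 + s * rng.standard_normal()   # sample x2 | x1 (using the new x1)
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8, n_samples=20_000)
print(np.corrcoef(samples[5_000:].T))   # empirical correlation, close to 0.8
```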
Latent Dirichlet Allocation
It is an algorithm for finding topics, each composed of a set of words
You need a collection of documents!!!
The data consists
Of documents di, each consisting of a set of words wi,
In a universe of W = {w1, ..., wn} words.
Then
We want to find the mixture of topics in documents and of words in topics
Thus, we can do this by counting:
Counts of topic k in document d
→ the distribution of topics in a document
Counts of word v assigned to topic k
→ the distribution of words in a topic
Thus, we can compute the probability of the topic assignment Zi (the Gibbs Term)

p (Zi|Z−i, W)
For this
We need to introduce some extra terms
Ω_{d,k} - count of topic k in document d.
Ψ_{k,v} - count of word v assigned to topic k.
Thus the Gibbs Term

p (Zi|Z−i, W) = [ (Ψ^{−i}_{k,v} + β) / (Σ_v Ψ^{−i}_{k,v} + Nv · β) ] × [ (Ω^{−i}_{d,k} + α) / (Σ_k Ω^{−i}_{d,k} + K · α) ]

With
Nv = number of different words (the vocabulary size).
β = Dirichlet smoothing parameter for words
K = number of topics
α = Dirichlet smoothing parameter for topics
So, we have the following code
We have
The Following!!!
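The code from the original slide is not reproduced here; below is a minimal sketch of collapsed Gibbs sampling for LDA using the counts and Gibbs term above (an assumed implementation, with hypothetical names such as `docs` and `lda_gibbs`).

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA.
    docs: list of lists of word ids in [0, V). Returns topic assignments per token."""
    rng = rng or np.random.default_rng(0)
    Omega = np.zeros((len(docs), K))          # Omega[d, k]: count of topic k in document d
    Psi = np.zeros((K, V))                    # Psi[k, v]: count of word v assigned to topic k
    Nk = np.zeros(K)                          # total tokens per topic (= sum_v Psi[k, v])
    Z = [[rng.integers(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):            # initialize counts from random assignments
        for i, v in enumerate(doc):
            k = Z[d][i]
            Omega[d, k] += 1; Psi[k, v] += 1; Nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = Z[d][i]                   # remove current assignment (the "-i" counts)
                Omega[d, k] -= 1; Psi[k, v] -= 1; Nk[k] -= 1
                # Gibbs term: (Psi + beta)/(sum_v Psi + V*beta) * (Omega + alpha)
                p = (Psi[:, v] + beta) / (Nk + V * beta) * (Omega[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                Z[d][i] = k                   # add the new assignment back
                Omega[d, k] += 1; Psi[k, v] += 1; Nk[k] += 1
    return Z

# Usage on a toy corpus with vocabulary size V = 5 and K = 2 topics
docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1, 0], [4, 3, 4]]
print(lda_gibbs(docs, K=2, V=5))
```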