Learning to Discover Monte Carlo Algorithm on Spin Ice Manifold
Kai-Wen Zhao, Yin-Jer Kao
National Taiwan University
kelispinor@gmail.com
June 14, 2018
Overview
1 Introduction
2 Backgrounds
Markov Chain Monte Carlo
Reinforcement Learning
Deep Learning
Variance Reduction and Actor-Critic
3 Reinforcement Learning Loop Algorithm
Learn to Discover
Square Ice Model and Loop Algorithm
4 Experiments and Results
Learning Procedure
Statistics of Loop
Memory Effect
Correlation Time
5 Conclusion
6 Supplementary materials
Introduction
Introduction
What is physics?
Prediction
Purpose of machine learning:
Prediction
Generalization: |Ein(h) − Eout(h)| < ε
Introduction
What is physics?
More than prediction
Theoretical framework
Supervised learning scheme:
Fit the theory with neural network model
Weights cannot be interpreted
In Monte Carlo simulation:
Data: Configuration
Label: Physical observables
Usually implies we have already solved the problem
We fit the network at the algorithmic level, not at the theoretical level.
Reinforcement Learning
The machine learns how to behave in an environment by performing actions and
observing the results.
Figure: General-purpose framework for machine-world interaction.(Image courtesy
Notfruit/Wikimedia Commons)
Motivation
Can a machine invent or discover a Monte Carlo update method?
Efficient Monte Carlo methods are usually model-dependent, so
physicists have to design algorithms carefully for each model.
We want to build a general framework that can help us discover efficient
update methods.
The following questions will be addressed:
Can we establish a bridge between the machine and physics?
How do we parameterize a Monte Carlo simulation?
Can the machine realize the physical rules?
Can the machine discover efficient update strategies?
Backgrounds
Backgrounds
Markov Chain Monte Carlo
Markov Decision Process
Deep Learning
Neural Network
ConvNet
Backpropagation
Reinforcement Learning
Policy gradient
Value function
Actor-Critic algorithm
Markov Chain
Define a stochastic process for a system with a finite set of possible states
{s0, s1, ..., st, ..., sT }:
state st
transition operator T(st+1|st)
If the transition probability satisfies the Markov property, we call it a
Markov process.
Markov Property
p(st+1|st) = p(st+1|st, st−1, st−2, ...s0) (1)
The state captures all information about the history: st is a sufficient statistic of
the future.
p(s′) = ∫ ds T(s′|s) p(s) (2)
Markov Chain Monte Carlo
MCMC works by constructing and simulating a Markov chain whose
equilibrium distribution is the distribution of interest.
T(s′|s) = Q(s′|s) A(s′|s) (3)
If the transition operator satisfies the detailed balance condition, it guarantees
that the MCMC distribution will asymptotically converge to the target distribution.
p(s) T(s′|s) = p(s′) T(s|s′)
The simplest MCMC algorithm is the Metropolis-Hastings method, which
proceeds in two steps:
1 Draw a sample s′ from the proposal Q(s′|s)
2 Accept this proposal with probability A(s′|s) = min{1, [p(s′) Q(s|s′)] / [p(s) Q(s′|s)]}
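As a concrete illustration of the two steps above, here is a minimal random-walk Metropolis-Hastings sampler; the one-dimensional Gaussian target is an assumption chosen only for this example, not something used in the slides.

```python
import numpy as np

def metropolis_hastings(log_p, n_samples=10000, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1D target with log-density log_p."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step * rng.normal()          # symmetric proposal Q(s'|s)
        log_accept = log_p(x_prop) - log_p(x)     # A = min(1, p(s')/p(s)); Q cancels here
        if np.log(rng.uniform()) < log_accept:
            x = x_prop                            # accept the proposal
        samples[i] = x                            # otherwise keep the current state
    return samples

# Example: sample a standard normal target.
samples = metropolis_hastings(lambda x: -0.5 * x**2)
print(samples.mean(), samples.std())
```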
Markov Decision Process
Markov Decision Process M = {S, T, A, r}.
State Space, S: a set of states of the environment
Action Space, A: a set of actions from which the agent selects at
each timestep
Reward function, r: S × A → R: a scalar value that characterizes each state-action pair
Transition probability: p(r, st+1|st, at)
Reinforcement Learning
General-purpose framework for decision-making
Goal: select actions to maximise future reward
Reinforcement learning is for an agent with the capacity to act
Each action influences the agent's future state
Deep reinforcement learning uses a neural network as the function
approximator
Figure: Agent-environment loop: the agent's policy π(at|st) selects an action at;
the environment returns the new state st+1 and reward rt+1.
Neural Networks
A perceptron defines a mapping y = hθ(x) from input data to an output
target, where θ are the learnable parameters. The function f is called the
activation function; it provides the nonlinearity needed to produce rich feature
representations in a neural network.
hθ(x) = f (θx + b) (4)
Figure: Fully-Connected Neural Network
Convolutional Neural Networks
There are three major components of ConvNet:
Convolutional Layer
Pooling Layer
Fully-Connected Layer
Figure: ConvNet Architecture
Convolutional Neural Networks
Convolution:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) Kθ(i − m, j − n) (5)
where I is the input image indexed by m and n, K represents the
kernel, and S is the output at location (i, j).
There are three important characteristics that come with the use of
convolution operations:
Sparse interaction
Parameter sharing
Equivariance
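A minimal NumPy sketch of the discrete convolution in Eq. (5), computed only where the flipped kernel fits inside the image; the image and kernel values are arbitrary illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Direct implementation of S(i, j) = sum_{m,n} I(m, n) K(i - m, j - n),
    evaluated on the 'valid' region where the kernel fits inside the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]                 # convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])
print(conv2d(image, kernel))
```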
Demonstration of Convolution
Convolution:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) Kθ(i − m, j − n) (6)
Figure: Convolutional Layer Operation (figure courtesy: Deep Learning)
Demonstration of Convolution
Convolution:
Figure: Sparsity and Depth of Convolution (figure courtesy: Deep Learning)
Backpropagation
Update the weights with gradient descent (learning rate η):
θ ← θ − η ∇θJ (7)
It is not easy to compute the gradient with respect to all the weights of one giant
network:
y = net(x)
The efficient algorithm used to compute these gradients is backpropagation.
∂J/∂θ^(l)_ij = (∂J/∂y) (∂y/∂net^(l)) (∂net^(l)/∂θ^(l)_ij) (8)
Demonstration of Backpropagation
Two steps in computational graph (directed acyclic graph):
Forward pass: Inference
Backward pass: Learning
Figure: Gradient flows within computational graph
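To make the forward/backward picture concrete, here is a small NumPy sketch of a one-hidden-layer network trained by hand-written backpropagation on a toy regression problem; the architecture, data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                 # toy inputs
y = X @ np.array([[1.0], [-2.0], [0.5]])     # toy linear targets

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
eta = 0.1

for step in range(500):
    # forward pass (inference)
    h = np.maximum(0.0, X @ W1 + b1)         # ReLU hidden layer
    y_hat = h @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # backward pass (learning): chain rule dJ/dtheta = dJ/dy * dy/dnet * dnet/dtheta
    dy = (y_hat - y) / len(X)
    dW2, db2 = h.T @ dy, dy.sum(0)
    dh = dy @ W2.T * (h > 0)                 # gradient flows through the ReLU mask
    dW1, db1 = X.T @ dh, dh.sum(0)

    # gradient-descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print("final loss:", loss)
```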
Reinforcement Learning
The goal of reinforcement learning is to maximize the sum of future
rewards over trajectories τ = (s1, a1, s2, a2, ..., sT, aT):
J(π) = E_{τ∼π}[ Σ_t rt ] = ∫ dτ π(τ) R(τ) (9)
Figure: Integrate the path of events within MDP. (figure courtesy: Berkeley
cs294-112)
Score Function Gradient Estimator
To solve the optimization problem, we parameterize the policy (using
neural network) and compute the derivative of the objective function.
J(πθ) = E_{τ∼πθ}[ Σ_t rt ] (10)
Then apply gradient ascent to update the policy: θ ← θ + ∇θJ(πθ).
Score Function Gradient Estimator, Policy Gradient
Suppose that x is a random variable with probability density p(x|θ), f is a
scalar-valued function, and we want to calculate ∇θ Ex[f(x)].
∇θ Ex[f(x)] = Ex[ ∇θ log p(x|θ) f(x) ] (11)
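A quick numerical sanity check of Eq. (11), assuming p(x|θ) = N(θ, 1) and f(x) = x²; the analytic gradient d/dθ E[x²] = 2θ should match the score-function estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200000)   # x ~ p(x|theta) = N(theta, 1)

f = x ** 2                                          # scalar-valued f(x)
score = x - theta                                   # d/dtheta log N(x; theta, 1)

estimate = np.mean(score * f)                       # E[ grad_theta log p(x|theta) * f(x) ]
analytic = 2.0 * theta                              # d/dtheta E[x^2] = d/dtheta (theta^2 + 1)

print(estimate, analytic)                           # the two values should be close
```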
Policy Gradient: Online Model-free Method
The policy gradient does not require knowledge of the initial state distribution or
the environment dynamics; it is a model-free method.
πθ(τ) = µ(s1) Π_{t=1}^{T} πθ(at|st) p(st+1|st, at) (12)
∇θ log πθ(τ) = ∇θ [ log µ(s1) + Σ_{t=1}^{T} ( log πθ(at|st) + log p(st+1|st, at) ) ]
             = Σ_{t=1}^{T} ∇θ log πθ(at|st),
since the µ(s1) and p(st+1|st, at) terms do not depend on θ. (13)
The explicit computational form is:
∇θJ(πθ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=1}^{T} ∇θ log πθ(a^i_t|s^i_t) ] [ Σ_{t=1}^{T} r(s^i_t, a^i_t) ] (14)
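A minimal sketch of the estimator in Eq. (14) for a softmax policy on a toy three-armed bandit (length-one episodes); the bandit rewards and hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 2.0, 3.0])       # expected reward of each action (toy bandit)
theta = np.zeros(3)                           # softmax policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

eta, N = 0.1, 64
for it in range(300):
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(N):                        # N sampled trajectories (length-1 episodes)
        a = rng.choice(3, p=pi)
        r = true_reward[a] + rng.normal()     # noisy reward
        score = -pi.copy(); score[a] += 1.0   # grad_theta log pi(a) for a softmax policy
        grad += score * r                     # Eq. (14): (sum_t grad log pi) * (sum_t r)
    theta += eta * grad / N                   # gradient ascent on J(pi_theta)

print("learned policy:", softmax(theta))      # should put most mass on the best arm
```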
Variance Reduction
Causality: an action taken at time t cannot affect the reward at time t′ when
t′ < t.
∇θJ(πθ) = E_{πθ}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) Σ_{t′=t}^{T} r(st′, at′) ],
where Q̂t = Σ_{t′=t}^{T} r(st′, at′) is the reward-to-go. (15)
Baseline: remove the floating reference level for each trajectory.
∇θJ(πθ) = E_{πθ}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) ( Q̂t − b ) ]
(reward-to-go Q̂t minus baseline b)
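A small helper sketch for the two tricks above: per-timestep reward-to-go and a trajectory-averaged baseline; the array layout (N trajectories × T steps) and the toy numbers are assumptions.

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat_t = sum_{t' >= t} r_t' for one trajectory (causality trick)."""
    return np.cumsum(rewards[::-1])[::-1]

# rewards for N = 2 trajectories of length T = 3 (toy numbers)
R = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

Q_hat = np.stack([reward_to_go(r) for r in R])   # per-timestep reward-to-go
b = Q_hat.mean(axis=0)                           # baseline: average over trajectories
advantages = Q_hat - b                           # centered weights for the log-prob terms
print(Q_hat, advantages, sep="\n")
```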
Variance Reduction
Figure: Two different policies due to an offset of the reward-to-go
Value Functions
Replace the scalar returns by state-dependent value functions (approximated
with neural networks):
Q̂t → Q(st, at) = Σ_{t′=t}^{T} E_{πθ}[ r(st′, at′) | st, at ]
b → V(st) = E_{at∼πθ(at|st)}[ Q(st, at) ]
The difference of the two value functions is called the advantage function.
A(st, at) = Q(st, at) − V(st) (16)
Value Functions
The difference between the state-action value and the state value function is
called the advantage:
Aπ(st, at) = Qπ(st, at) − Vπ(st) = r(st, at) + Vπ(st+1) − Vπ(st)
Advantage Actor-Critic Algorithm
Parameterize the actor πθ(at|st) and the critic Vφ(st) using neural networks.
Advantage Actor-Critic
1 Sample (si, ai, s′i, ri) from πθ(a|s)
2 Prepare D = {(si, yi)}, where yi = ri + V̂φ(s′i)
3 Fit Vφ(s) with loss L = (1/2) Σi ||V̂φ(si) − yi||², i.e. φ ← φ − ∇φL(φ)
4 Evaluate Aπ(si, ai) = ri + Vφ(s′i) − Vφ(si)
5 ∇θJ(θ) ≈ (1/N) Σi ∇θ log πθ(ai|si) Aπ(si, ai)
6 θ ← θ + ∇θJ(θ)
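A minimal sketch of one advantage actor-critic update with linear function approximators standing in for the neural networks of the slides; the feature dimension, learning rates, and the random dummy batch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, n_a = 4, 3                       # toy state dimension and action count
theta = np.zeros((d_s, n_a))          # linear softmax actor
phi = np.zeros(d_s)                   # linear critic V_phi(s) = phi . s
alpha_pi, alpha_v = 0.01, 0.1

def softmax(z):
    z = z - z.max(); e = np.exp(z); return e / e.sum()

def a2c_update(batch):
    """One advantage actor-critic update from a batch of (s, a, r, s_next)."""
    global theta, phi
    for s, a, r, s_next in batch:
        y = r + phi @ s_next                      # bootstrapped target y_i = r_i + V_phi(s'_i)
        td = phi @ s - y                          # value-fit residual
        phi -= alpha_v * td * s                   # gradient step on 1/2 ||V_phi(s) - y||^2
        adv = r + phi @ s_next - phi @ s          # advantage estimate A(s, a)
        pi = softmax(theta.T @ s)
        grad_log = np.outer(s, -pi); grad_log[:, a] += s   # grad_theta log pi(a|s)
        theta += alpha_pi * adv * grad_log        # policy gradient ascent

# dummy batch of random transitions, just to exercise the update
batch = [(rng.normal(size=d_s), rng.integers(n_a), rng.normal(), rng.normal(size=d_s))
         for _ in range(32)]
a2c_update(batch)
print(theta.shape, phi.shape)
```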
Quick Summary
Extend the Markov chain to a Markov decision process by adding actions
and rewards.
Deep neural networks are general differentiable functions. As long as we
have a loss function, they can be updated by backprop.
Reinforcement learning is a framework built on the Markov decision
process.
Actor-critic algorithms train not only the policy that performs the task but
also a value function that predicts future returns.
Emergence of Loop-like Algorithm
Learn to Discover
The expected reward guides the discovery of the update pattern.
Figure: Physical constraints imply the distribution of states
Markov Decision Process
From our point of view, one global Monte Carlo step is composed of a series
of decisions.
Figure: A sequence of local decisions forms a global movement
Markov Chain and Markov Decision Process
The global update trajectory in a Markov decision process is described by
P(s1 → sT) = P(s1) Π_{t=1}^{T} πθ(at|st) P(st+1|st, at) (17)
Use the policy gradient to find the global update rule (policy πθ):
π* = argmax_θ E_{τ∼πθ(τ)}[ Σ_t r(st, at) ] (18)
Ψ serves as an information carrier, or weak label, which weights the likelihood:
ĝ = E_{πθ}[ Σ_{t}^{∞} ∇θ log πθ(at|st) Ψt ] (19)
Asynchronous Advantage Actor-Critic Algorithm (A3C)
In modern deep learning frameworks, we write down a policy loss rather than
the policy gradient and let automatic differentiation do the work.
L(θ) = Σ_t log πθ(at|st) Âπ_t        (policy loss)
       − ||V(st) − yt||²             (value estimation)
       − λ Σ_a πθ(a|st) log πθ(a|st) (entropy regularization)
Figure: Diagram of the A3C high-level architecture
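A hedged PyTorch-style sketch of the combined loss for one batch, written as a quantity to be minimized (so the policy and entropy terms enter with flipped signs); the two-headed network, tensor shapes, and coefficients are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=16, n_actions=7):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)   # logits for pi_theta(a|s)
        self.value_head = nn.Linear(128, 1)            # V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a3c_loss(model, obs, actions, returns, lam=0.01):
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = (returns - values).detach()           # A_hat; no gradient into the critic here
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy_bonus = dist.entropy().mean()
    return policy_loss + 0.5 * value_loss - lam * entropy_bonus

model = ActorCritic()
obs = torch.randn(32, 16)
actions = torch.randint(0, 7, (32,))
returns = torch.randn(32)
loss = a3c_loss(model, obs, actions, returns)
loss.backward()                                        # autodiff provides the gradients
```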
Learn to Discover
We choose the square ice model as our arena because of its constraints.
Figure: The distribution of ice states is delta-like
Square Ice Model
The square ice model serves as the simplest toy model for verifying our idea.
H = Σ_i | Σ_{α∈i} sα | (20)
where the inner sum runs over the spins attached to vertex i.
Each vertex satisfies the ice rule (2 in, 2 out). There is an efficient method to
move between ice states, called the loop algorithm.
There are two things to be discovered: the ice rule and the loop update.
Demonstration of Long Loop Algorithm
Start with a lattice that entirely satisfies the ice rule and choose a
single vertex at random.
Demonstration of Long Loop Algorithm
Choose at random one of the two inward-pointing spins of the vertex
Trace to the next vertex and repeat, always choosing incoming spins
Demonstration of Long Loop Algorithm
Close the loop to complete one Monte Carlo step. All vertices still satisfy
the ice rule and the system remains an ice state.
Figure: A closed loop of length 6.
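A minimal sketch of the long-loop update on a periodic square-ice lattice, with arrows stored on horizontal and vertical bonds; the array layout and helper names are assumptions for illustration and are not the IceGame implementation.

```python
import numpy as np

def incoming_arrows(x, y, h, v, L):
    """Arrows pointing into vertex (x, y). h[x, y] = +1 means the bond
    (x, y)->(x+1, y) points east; v[x, y] = +1 means (x, y)->(x, y+1) points north.
    Returns (kind, ex, ey, other_end) for each incoming arrow."""
    arrows = []
    if h[(x - 1) % L, y] == +1:                      # west bond points into (x, y)
        arrows.append(("h", (x - 1) % L, y, ((x - 1) % L, y)))
    if h[x, y] == -1:                                # east bond points into (x, y)
        arrows.append(("h", x, y, ((x + 1) % L, y)))
    if v[x, (y - 1) % L] == +1:                      # south bond points into (x, y)
        arrows.append(("v", x, (y - 1) % L, (x, (y - 1) % L)))
    if v[x, y] == -1:                                # north bond points into (x, y)
        arrows.append(("v", x, y, (x, (y + 1) % L)))
    return arrows

def long_loop_update(h, v, rng):
    """One long-loop update: reverse arrows along a closed loop so that every
    vertex still obeys the 2-in/2-out ice rule afterwards."""
    L = h.shape[0]
    start = (int(rng.integers(L)), int(rng.integers(L)))
    current, last = start, None
    flipped = []
    while True:
        choices = [a for a in incoming_arrows(current[0], current[1], h, v, L)
                   if (a[0], a[1], a[2]) != last]    # never immediately backtrack
        kind, ex, ey, nxt = choices[rng.integers(len(choices))]
        (h if kind == "h" else v)[ex, ey] *= -1      # reverse the arrow; the defect moves to nxt
        flipped.append((kind, ex, ey))
        last, current = (kind, ex, ey), (int(nxt[0]), int(nxt[1]))
        if current == start:                         # defect annihilates at the start vertex
            return flipped                           # flipped bonds form the closed loop

L = 8
h = np.ones((L, L), dtype=int)                       # all-east / all-north state obeys the ice rule
v = np.ones((L, L), dtype=int)
rng = np.random.default_rng(0)
print("loop length:", len(long_loop_update(h, v, rng)))
```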
IceGame
We design an environment called IceGame with which the RL agent can
interact. IceGame is a factory for generating ice states, and the agent serves
as a worker in it.
IceGame: Action
Two types of actions:
Direction actions: ai flips the neighboring spin σi
Update execution: propose the traced loop as a Monte Carlo update
a = [a0, a1, a2, a3, a4, a5, update]
Icegame: Observation
Combine the configuration and physical quantities into the MDP state.
Local observation: Ol = [σi, ∆C, ∆E] (spins, configuration change, energy change)
Global observation: Og = St − S0 = Ct (trajectory)
Icegame: Reward function
The design of the reward function is crucial for a reinforcement learning task:
r(s, a) =  rg  if the proposed loop is accepted
           rs  if the move does not increase the energy
           rf  otherwise          (21)
Stepwise rewards rs are assigned a value that is small relative to the target
reward rg, so that the agent ultimately pursues the high rg:
rs / rg ∼ O(10⁻³)
We usually set rg = +1, rs = +1/N and rf = 0.
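A small sketch of the reward rule in Eq. (21); the function signature, i.e. the flags the environment reports after a step, is an assumption.

```python
def icegame_reward(loop_accepted, energy_increased, n_sites, r_g=1.0, r_f=0.0):
    """Reward of Eq. (21): terminal reward r_g for an accepted loop, a small
    stepwise reward r_s = 1/N for energy-non-increasing moves, r_f otherwise."""
    r_s = 1.0 / n_sites                  # keeps r_s / r_g ~ O(1e-3) for typical lattice sizes
    if loop_accepted:
        return r_g
    if not energy_increased:
        return r_s
    return r_f

print(icegame_reward(False, False, n_sites=32 * 32))   # stepwise reward
print(icegame_reward(True, False, n_sites=32 * 32))    # accepted-loop reward
```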
Network Architecture Design
Multi-channel network:
Local channel: linear (ReLU), 32 → linear (ReLU), 64
Global channel: 3×3 conv, 32 → 3×3 conv, 16 → flatten, 1024
Concat: 1088 → linear, 128 → parallel heads: linear, 7 (policy); linear, 1 (value)
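A hedged PyTorch sketch of the described two-channel actor-critic network; the local-observation size, the 32×32 global map, and the stride-2 convolutions are assumptions chosen so that the quoted widths (64, 1024, 1088, 128, 7, 1) work out.

```python
import torch
import torch.nn as nn

class RLLoopNet(nn.Module):
    """Two-channel actor-critic net matching the quoted widths (assumed input sizes)."""
    def __init__(self, local_dim=8, lattice=32):
        super().__init__()
        # local channel: linear(relu) 32 -> linear(relu) 64
        self.local = nn.Sequential(nn.Linear(local_dim, 32), nn.ReLU(),
                                   nn.Linear(32, 64), nn.ReLU())
        # global channel: 3x3 conv 32 -> 3x3 conv 16 -> flatten to 1024
        # (stride-2 convolutions on a 32x32 map give 16 * 8 * 8 = 1024 features)
        self.global_ = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Flatten())
        self.trunk = nn.Sequential(nn.Linear(64 + 1024, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, 7)   # 6 direction actions + "update"
        self.value_head = nn.Linear(128, 1)

    def forward(self, local_obs, global_obs):
        h = torch.cat([self.local(local_obs), self.global_(global_obs)], dim=-1)
        h = self.trunk(h)
        return self.policy_head(h), self.value_head(h)

net = RLLoopNet()
logits, value = net(torch.randn(2, 8), torch.randn(2, 1, 32, 32))
print(logits.shape, value.shape)   # torch.Size([2, 7]) torch.Size([2, 1])
```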
Sliding Horizon Mechanism
In order to apply the trained model to systems of larger size, we provide this
mechanism.
Figure: The trained network scans over the new environment
Learning Procedure
Behaviour evolves through the different stages of learning: the exploration and
exploitation process.
Statistics of Loop
Example: L = 16

Method       Acceptance   Efficiency   Loop Size
Algorithm    96.2%        33 ∼ 96 %    4 ∼ 480 (96)
NN Policy    94.1%        24 ∼ 81 %    4 ∼ 436 (51)
CNN Policy   96.2%        31 ∼ 90 %    4 ∼ 518 (239)

Efficiency = updated / visited
Loop Length Distribution: L = 16
Loop Length Distribution, L = 32
Loop Length Distribution, L = 16, 32, 64
In this special scenario, the agent pursues higher cumulative rewards and forms
an ambitious policy that creates larger loops.
Figure: Specialized ConvNet policy for generating longer loops
Stepwise Policy Distribution
As a reminder, our actions flip the corresponding neighboring spins:
a = [a0, a1, a2, a3, a4, a5]
s = [σ0, σ1, σ2, σ3, σ4, σ5]
For the loop algorithm, formulated as an MDP, the decision policy is
πalgo(a|s) = [0, 0, 0, 0, 0.5, 0.5]
Figure: Local observation of the spins
Stepwise Policy Distribution
Stepwise decision making performed by the trained agent.
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for small loops.
Figure: Small loop. Left: Algorithm, Right: ConvNet Policy
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for medium loops.
Figure: Medium loop. Left: Algorithm, Right: ConvNet Policy
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for large loops.
Figure: Large loop. Left: Algorithm, Right: ConvNet Policy
Neural Network Memory Effect
The memoryless algorithm is independent of the initial position.
Figure: Loop Algorithm: Same Initial Point and Same Configuration
Neural Network Memory Effect
The feedforward policy shows a locality preference.
Figure: Feedforward Policy: Same Initial Point and Same Configuration
Neural Network Memory Effect
The ConvNet policy performs similarly to the original loop algorithm.
Figure: ConvNet Policy: Same Initial Point and Same Configuration
Configuration Visited Heatmap
Number of visits to each site.
Figure: Left: Algorithm, Right: ConvNet
Correlation Time and Hybrid Method
The Markov decision process description allows us to travel through and search
the whole policy space.
Hybrid Ratio = (Policy Execution Times) / (Algorithm Execution Times)
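A sketch of how a hybrid chain might interleave the two kinds of updates at a fixed ratio; the callables standing in for the trained policy and the hand-coded loop algorithm are placeholders, not the actual implementation.

```python
import numpy as np

def hybrid_chain(state, policy_update, algorithm_update, hybrid_ratio=0.2,
                 n_steps=1000, seed=0):
    """Interleave policy-driven and algorithmic updates so that
    (# policy updates) / (# algorithm updates) is roughly hybrid_ratio."""
    rng = np.random.default_rng(seed)
    p_policy = hybrid_ratio / (1.0 + hybrid_ratio)   # fraction of steps given to the policy
    for _ in range(n_steps):
        if rng.uniform() < p_policy:
            state = policy_update(state)
        else:
            state = algorithm_update(state)
    return state

# placeholder callables just to exercise the control flow
final = hybrid_chain(0, lambda s: s + 1, lambda s: s - 1, hybrid_ratio=0.2, n_steps=10)
print(final)
```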
Auto-Correlation Time
We measure the autocorrelation time of the observable ρsym, the density of
symmetric vertices, at hybrid ratio 0.2.
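A standard estimator sketch for the integrated autocorrelation time of a Monte Carlo time series such as ρsym; the windowing rule (sum until the autocorrelation first turns negative) is a common convention assumed here, and the AR(1) test series is synthetic.

```python
import numpy as np

def integrated_autocorr_time(x):
    """tau_int = 1/2 + sum_t rho(t), summing normalized autocorrelations
    until they first drop below zero (simple windowing convention)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)   # rho(0), rho(1), ...
    tau = 0.5
    for t in range(1, n):
        if acf[t] < 0:
            break
        tau += acf[t]
    return tau

# toy AR(1) series with known correlation: tau_int ~ (1 + a) / (2 * (1 - a))
rng = np.random.default_rng(0)
a, x = 0.9, np.zeros(10000)
for i in range(1, len(x)):
    x[i] = a * x[i - 1] + rng.normal()
print(integrated_autocorr_time(x))   # should be roughly 9.5 for a = 0.9
```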
Conclusion
Conclusions
To propose acceptable loop updates in the square ice model, the machine
has to learn several things:
Realize the ice rule and satisfy it to keep the system in the ice manifold.
Distinguish open from closed loops, and propose the update at the appropriate
moment.
Figure: Successful update samples (orange dots) serve as support for the emergent
strategy
Conclusions
We conclude that:
We build the interface between the machine and the physical model on an MDP
The proposal operator can be parameterized by a neural network as πθ(a|s)
A loop-like update pattern emerges due to the ice rule
The machine realizes the update rule by interacting with the ice rule; successful
examples serve as its support
Both the ice rule (local) and the closed-loop condition (global) are learned
From the policy distribution, we can see how the agent makes decisions
The trained machine can scale up to larger sizes and cooperate with existing
algorithms
Thank You for Your Attention
Supplementary materials
Demonstration of Backpropagation
To understand backpropagation, we first consider a unit circuit.
Figure: Unit circuit in computational graph (figure courtesy: Stanford cs231n)
Demonstration of Backpropagation
To understand backpropagation, we first consider a unit circuit.
Figure: Unit circuit in computational graph: backprop
Score Function Gradient Estimator
The equality can be derived by explicitly writing out the expectation:
∇θ Ex[f(x)] = ∇θ ∫ dx p(x|θ) f(x) (22)
            = ∫ dx ∇θ p(x|θ) f(x) (23)
            = ∫ dx p(x|θ) [∇θ p(x|θ) / p(x|θ)] f(x) (24)
            = Ex[ ∇θ log p(x|θ) f(x) ] (25)
Applying the score function estimator to our objective function, we get the
policy gradient:
∇θJ(πθ) = ∇θ E_{τ∼π}[ Σ_t rt ] = E_{τ∼πθ(τ)}[ ∇θ log πθ(τ) R(τ) ] (26)
Policy Gradient Methods
The detailed form of the policy gradient is worked out as
∇θJ(πθ) = E_{τ∼πθ(τ)}[ ∇θ log πθ(τ) R(τ) ] (27)
        = E_{st+1,rt∼πθ(st,at)}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) Σ_{t=1}^{T} r(st, at) ] (28)
The explicit computational form is:
∇θJ(πθ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=1}^{T} ∇θ log πθ(a^i_t|s^i_t) ] [ Σ_{t=1}^{T} r(s^i_t, a^i_t) ] (29)
Baseline
To remove this instability, we subtract a constant baseline b from the
reward-to-go.
∇θJ(πθ) ≈ (1/N) Σ_{i}^{N} ∇θ log πθ(τi) ( Q̂(τi) − b ) (30)
b = (1/N) Σ_{i=1}^{N} R(τi) (31)
It is easy to show that subtracting a baseline is unbiased in expectation:
E[ ∇θ log πθ(τ) b ] = ∫ dτ πθ(τ) ∇θ log πθ(τ) b = b ∇θ ∫ dτ πθ(τ) = 0 (32)
Besides, by a variance analysis, we can obtain the variance-minimizing baseline
b = E[ g(τ)² R(τ) ] / E[ g(τ)² ], where g(τ) = ∇θ log πθ(τ).
Review of Policy Gradient
The general policy gradient is denoted as follows.
Policy Gradient
ĝ = E[ Σ_{t}^{∞} ∇θ log πθ(at|st) Ψt ] (33)
Ψ is denoted the goodness; it can be
1: maximum log-likelihood
Σ_t rt: vanilla REINFORCE
Σ_{t′=t} rt′ − b(st): causal REINFORCE algorithm with baseline
Qπ: actor-critic algorithm
Aπ: advantage actor-critic algorithm
Ψ serves as an information carrier, or weak label, which weights the likelihood.
Can we have Detailed Balance in MDP?
For a given policy, state-action pairs transition as a Markov chain.
PM((st+1, at+1)|(st, at)) = PM(st+1|st, at) π(at+1|st+1) (34)
The proposal distribution is
Q(s′|s) ∼ Π_{ti} π(a_{ti}|s_{ti}) PM(s_{ti+1}|s_{ti}, a_{ti}) (35)
Expected discounted state distribution:
dπ(s) = Σ_t γ^t Eπ[ 1(st = s) | s0 ] (36)