Learning to Discover Monte Carlo Algorithm on Spin Ice Manifold
Kai-Wen Zhao, Yin-Jer Kao
National Taiwan University
kelispinor@gmail.com
June 14, 2018
Overview
1 Introduction
2 Backgrounds
Markov Chain Monte Carlo
Reinforcement Learning
Deep Learning
Variance Reduction and Actor-Critic
3 Reinforcement Learning Loop Algorithm
Learn to Discover
Square Ice Model and Loop Algorithm
4 Experiments and Results
Learning Procedure
Statistics of Loop
Memory Effect
Correlation Time
5 Conclusion
6 Supplementary materials
Introduction
Introduction
What is physics?
Prediction
Purpose of machine learning:
Prediction
Generalization: |Ein(h) − Eout(h)| < ε
Introduction
What is physics?
More than prediction
Theoretical framework
Supervised learning scheme:
Fit the theory with neural network model
Weights cannot be interpreted
In Monte Carlo simulation:
Data: Configuration
Label: Physical observables
Usually implies we have already solved the problem
We fit the network at the algorithmic level, not at the theoretical level.
Reinforcement Learning
The machine learns how to behave in an environment by performing actions and
observing the results.
Figure: General-purpose framework for machine-world interaction.(Image courtesy
Notfruit/Wikimedia Commons)
Motivation
Can a machine invent or discover a Monte Carlo update method?
Efficient Monte Carlo methods are usually model-dependent, so
physicists have to design algorithms carefully for each model.
We want to build a general framework that can help us discover efficient
update methods.
The following questions will be addressed:
Can we establish a bridge between the machine and physics?
How do we parameterize a Monte Carlo simulation?
Can the machine realize the physical rules?
Can the machine discover efficient update strategies?
Backgrounds
Backgrounds
Markov Chain Monte Carlo
Markov Decision Process
Deep Learning
Neural Network
ConvNet
Backpropagation
Reinforcement Learning
Policy gradient
Value function
Actor-Critic algorithm
Markov Chain
Define a stochastic process for a system with a finite set of possible states
{s0, s1, ..., st, ..., sT }:
state st
transition operator T(st+1|st)
If the transition probability satisfies the Markov property, we call it a
Markov process.
Markov Property
p(st+1|st) = p(st+1|st, st−1, st−2, ...s0) (1)
The state captures all information about the history: st is a sufficient statistic of
the future.
p(s′) = ∫ ds T(s′|s) p(s) (2)
Markov Chain Monte Carlo
MCMC works by constructing and simulating a Markov chain whose
equilibrium distribution is the distribution of interest.
T(s′|s) = Q(s′|s) A(s′|s) (3)
If the transition operator satisfies the detailed balance condition, it guarantees
that the MCMC distribution will asymptotically converge to the target distribution.
p(s) T(s′|s) = p(s′) T(s|s′)
The simplest MCMC algorithm is the Metropolis-Hastings method, which
proceeds in two steps:
1 Draw a sample s′ from the proposal Q(s′|s)
2 Accept this proposal with probability A(s′|s) = min{1, [p(s′) Q(s|s′)] / [p(s) Q(s′|s)]}
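As a concrete illustration of the two steps above, here is a minimal random-walk Metropolis-Hastings sampler; the one-dimensional Gaussian target is an assumption chosen only for this example, not something used in the slides.

```python
import numpy as np

def metropolis_hastings(log_p, n_samples=10000, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1D target with log-density log_p."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step * rng.normal()          # symmetric proposal Q(s'|s)
        log_accept = log_p(x_prop) - log_p(x)     # A = min(1, p(s')/p(s)); Q cancels here
        if np.log(rng.uniform()) < log_accept:
            x = x_prop                            # accept the proposal
        samples[i] = x                            # otherwise keep the current state
    return samples

# Example: sample a standard normal target.
samples = metropolis_hastings(lambda x: -0.5 * x**2)
print(samples.mean(), samples.std())
```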
Markov Decision Process
Markov Decision Process M = {S, T, A, r}.
State Space, S: a set of states of the environment
Action Space, A: a set of actions from which the agent selects at
each timestep
Reward function, r: S × A → R: a scalar value that characterizes each state-action pair
Transition probability: p(r, st+1|st, at)
Reinforcement Learning
General-purpose framework for decision-making
Goal: select actions to maximise future reward
Reinforcement learning is for an agent with the capacity to act
Each action influences the agent's future state
Deep reinforcement learning uses a neural network as the function
approximator
Figure: Agent-environment loop: the agent's policy π(at|st) selects an action at;
the environment returns the new state st+1 and reward rt+1.
Neural Networks
A perceptron defines a mapping y = hθ(x) from input data to an output
target, where θ are the learnable parameters. The function f is called the
activation function; it provides the nonlinearity needed to produce rich feature
representations in a neural network.
hθ(x) = f (θx + b) (4)
Figure: Fully-Connected Neural Network
Convolutional Neural Networks
There are three major components of ConvNet:
Convolutional Layer
Pooling Layer
Fully-Connected Layer
Figure: ConvNet Architecture
Convolutional Neural Networks
Convolution:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) Kθ(i − m, j − n) (5)
where I is the input image indexed by m and n, K represents the
kernel, and S is the output at location (i, j).
There are three important characteristics that come with the use of
convolution operations:
Sparse interaction
Parameter sharing
Equivariance
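A minimal NumPy sketch of the discrete convolution in Eq. (5), computed only where the flipped kernel fits inside the image; the image and kernel values are arbitrary illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Direct implementation of S(i, j) = sum_{m,n} I(m, n) K(i - m, j - n),
    evaluated on the 'valid' region where the kernel fits inside the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]                 # convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])
print(conv2d(image, kernel))
```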
Demonstration of Convolution
Convolution:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) Kθ(i − m, j − n) (6)
Figure: Convolutional Layer Operation (figure courtesy: Deep Learning)
Demonstration of Convolution
Convolution:
Figure: Sparsity and Depth of Convolution (figure courtesy: Deep Learning)
Backpropagation
Update the weights with gradient descent (learning rate η):
θ ← θ − η ∇θJ (7)
It is not easy to compute the gradient with respect to all the weights of one giant
network:
y = net(x)
The efficient algorithm used to compute these gradients is backpropagation.
∂J/∂θ^(l)_ij = (∂J/∂y) (∂y/∂net^(l)) (∂net^(l)/∂θ^(l)_ij) (8)
Demonstration of Backpropagation
Two steps in computational graph (directed acyclic graph):
Forward pass: Inference
Backward pass: Learning
Figure: Gradient flows within computational graph
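To make the forward/backward picture concrete, here is a small NumPy sketch of a one-hidden-layer network trained by hand-written backpropagation on a toy regression problem; the architecture, data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                 # toy inputs
y = X @ np.array([[1.0], [-2.0], [0.5]])     # toy linear targets

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
eta = 0.1

for step in range(500):
    # forward pass (inference)
    h = np.maximum(0.0, X @ W1 + b1)         # ReLU hidden layer
    y_hat = h @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # backward pass (learning): chain rule dJ/dtheta = dJ/dy * dy/dnet * dnet/dtheta
    dy = (y_hat - y) / len(X)
    dW2, db2 = h.T @ dy, dy.sum(0)
    dh = dy @ W2.T * (h > 0)                 # gradient flows through the ReLU mask
    dW1, db1 = X.T @ dh, dh.sum(0)

    # gradient-descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print("final loss:", loss)
```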
Reinforcement Learning
The goal of reinforcement learning is to maximize the sum of future
rewards over trajectories τ = (s1, a1, s2, a2, ..., sT, aT):
J(π) = E_{τ∼π}[ Σ_t rt ] = ∫ dτ π(τ) R(τ) (9)
Figure: Integrate the path of events within MDP. (figure courtesy: Berkeley
cs294-112)
Score Function Gradient Estimator
To solve the optimization problem, we parameterize the policy (using
neural network) and compute the derivative of the objective function.
J(πθ) = E_{τ∼πθ}[ Σ_t rt ] (10)
Then apply gradient ascent to update the policy: θ ← θ + ∇θJ(πθ).
Score Function Gradient Estimator, Policy Gradient
Suppose that x is a random variable with probability density p(x|θ), f is a
scalar-valued function, and we want to calculate ∇θ Ex[f(x)].
∇θ Ex[f(x)] = Ex[ ∇θ log p(x|θ) f(x) ] (11)
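A quick numerical sanity check of Eq. (11), assuming p(x|θ) = N(θ, 1) and f(x) = x²; the analytic gradient d/dθ E[x²] = 2θ should match the score-function estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200000)   # x ~ p(x|theta) = N(theta, 1)

f = x ** 2                                          # scalar-valued f(x)
score = x - theta                                   # d/dtheta log N(x; theta, 1)

estimate = np.mean(score * f)                       # E[ grad_theta log p(x|theta) * f(x) ]
analytic = 2.0 * theta                              # d/dtheta E[x^2] = d/dtheta (theta^2 + 1)

print(estimate, analytic)                           # the two values should be close
```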
Policy Gradient: Online Model-free Method
The policy gradient does not require knowledge of the initial state distribution or
the environment dynamics; it is a model-free method.
πθ(τ) = µ(s1) Π_{t=1}^{T} πθ(at|st) p(st+1|st, at) (12)
∇θ log πθ(τ) = ∇θ [ log µ(s1) + Σ_{t=1}^{T} ( log πθ(at|st) + log p(st+1|st, at) ) ]
             = Σ_{t=1}^{T} ∇θ log πθ(at|st),
since the µ(s1) and p(st+1|st, at) terms do not depend on θ. (13)
The explicit computational form is:
∇θJ(πθ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=1}^{T} ∇θ log πθ(a^i_t|s^i_t) ] [ Σ_{t=1}^{T} r(s^i_t, a^i_t) ] (14)
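A minimal sketch of the estimator in Eq. (14) for a softmax policy on a toy three-armed bandit (length-one episodes); the bandit rewards and hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 2.0, 3.0])       # expected reward of each action (toy bandit)
theta = np.zeros(3)                           # softmax policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

eta, N = 0.1, 64
for it in range(300):
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(N):                        # N sampled trajectories (length-1 episodes)
        a = rng.choice(3, p=pi)
        r = true_reward[a] + rng.normal()     # noisy reward
        score = -pi.copy(); score[a] += 1.0   # grad_theta log pi(a) for a softmax policy
        grad += score * r                     # Eq. (14): (sum_t grad log pi) * (sum_t r)
    theta += eta * grad / N                   # gradient ascent on J(pi_theta)

print("learned policy:", softmax(theta))      # should put most mass on the best arm
```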
Variance Reduction
Causality: an action taken at time t cannot affect the reward at time t′ when
t′ < t.
∇θJ(πθ) = E_{πθ}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) Σ_{t′=t}^{T} r(st′, at′) ],
where Q̂t = Σ_{t′=t}^{T} r(st′, at′) is the reward-to-go. (15)
Baseline: remove the floating reference level for each trajectory.
∇θJ(πθ) = E_{πθ}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) ( Q̂t − b ) ]
(reward-to-go Q̂t minus baseline b)
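A small helper sketch for the two tricks above: per-timestep reward-to-go and a trajectory-averaged baseline; the array layout (N trajectories × T steps) and the toy numbers are assumptions.

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat_t = sum_{t' >= t} r_t' for one trajectory (causality trick)."""
    return np.cumsum(rewards[::-1])[::-1]

# rewards for N = 2 trajectories of length T = 3 (toy numbers)
R = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

Q_hat = np.stack([reward_to_go(r) for r in R])   # per-timestep reward-to-go
b = Q_hat.mean(axis=0)                           # baseline: average over trajectories
advantages = Q_hat - b                           # centered weights for the log-prob terms
print(Q_hat, advantages, sep="\n")
```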
Variance Reduction
Figure: Two different policies due to an offset of the reward-to-go
Value Functions
Replace the scalar returns by state-dependent value functions (approximated
with neural networks):
Q̂t → Q(st, at) = Σ_{t′=t}^{T} E_{πθ}[ r(st′, at′) | st, at ]
b → V(st) = E_{at∼πθ(at|st)}[ Q(st, at) ]
The difference of the two value functions is called the advantage function.
A(st, at) = Q(st, at) − V(st) (16)
Value Functions
The difference between the state-action value and the state value function is
called the advantage:
Aπ(st, at) = Qπ(st, at) − Vπ(st) = r(st, at) + Vπ(st+1) − Vπ(st)
Advantage Actor-Critic Algorithm
Parameterize the actor πθ(at|st) and the critic Vφ(st) using neural networks.
Advantage Actor-Critic
1 Sample (si, ai, s′i, ri) from πθ(a|s)
2 Prepare D = {(si, yi)}, where yi = ri + V̂φ(s′i)
3 Fit Vφ(s) with loss L = (1/2) Σi ||V̂φ(si) − yi||², i.e. φ ← φ − ∇φL(φ)
4 Evaluate Aπ(si, ai) = ri + Vφ(s′i) − Vφ(si)
5 ∇θJ(θ) ≈ (1/N) Σi ∇θ log πθ(ai|si) Aπ(si, ai)
6 θ ← θ + ∇θJ(θ)
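A minimal sketch of one advantage actor-critic update with linear function approximators standing in for the neural networks of the slides; the feature dimension, learning rates, and the random dummy batch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, n_a = 4, 3                       # toy state dimension and action count
theta = np.zeros((d_s, n_a))          # linear softmax actor
phi = np.zeros(d_s)                   # linear critic V_phi(s) = phi . s
alpha_pi, alpha_v = 0.01, 0.1

def softmax(z):
    z = z - z.max(); e = np.exp(z); return e / e.sum()

def a2c_update(batch):
    """One advantage actor-critic update from a batch of (s, a, r, s_next)."""
    global theta, phi
    for s, a, r, s_next in batch:
        y = r + phi @ s_next                      # bootstrapped target y_i = r_i + V_phi(s'_i)
        td = phi @ s - y                          # value-fit residual
        phi -= alpha_v * td * s                   # gradient step on 1/2 ||V_phi(s) - y||^2
        adv = r + phi @ s_next - phi @ s          # advantage estimate A(s, a)
        pi = softmax(theta.T @ s)
        grad_log = np.outer(s, -pi); grad_log[:, a] += s   # grad_theta log pi(a|s)
        theta += alpha_pi * adv * grad_log        # policy gradient ascent

# dummy batch of random transitions, just to exercise the update
batch = [(rng.normal(size=d_s), rng.integers(n_a), rng.normal(), rng.normal(size=d_s))
         for _ in range(32)]
a2c_update(batch)
print(theta.shape, phi.shape)
```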
Quick Summary
Extend the Markov chain to a Markov decision process by adding actions
and rewards.
Deep neural networks are general differentiable functions. As long as we
have a loss function, they can be updated by backprop.
Reinforcement learning is a framework built on the Markov decision
process.
Actor-critic algorithms train not only the policy that performs the task but
also a value function that predicts future returns.
Emergence of Loop-like Algorithm
Learn to Discover
The expected reward guides the discovery of the update pattern.
Figure: Physical constraints imply the distribution of states
Markov Decision Process
From our point of view, one global Monte Carlo step is composed of a series
of decisions.
Figure: A sequence of local decisions forms a global movement
Markov Chain and Markov Decision Process
The global update trajectory in a Markov decision process is described by
P(s1 → sT) = P(s1) Π_{t=1}^{T} πθ(at|st) P(st+1|st, at) (17)
Use the policy gradient to find the global update rule (policy πθ):
π* = argmax_θ E_{τ∼πθ(τ)}[ Σ_t r(st, at) ] (18)
Ψ serves as an information carrier, or weak label, which weights the likelihood:
ĝ = E_{πθ}[ Σ_{t}^{∞} ∇θ log πθ(at|st) Ψt ] (19)
Asynchronous Advantage Actor-Critic Algorithm (A3C)
In modern deep learning frameworks, we write down a policy loss rather than
the policy gradient and let automatic differentiation do the work.
L(θ) = Σ_t log πθ(at|st) Âπ_t        (policy loss)
       − ||V(st) − yt||²             (value estimation)
       − λ Σ_a πθ(a|st) log πθ(a|st) (entropy regularization)
Figure: Diagram of the A3C high-level architecture
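A hedged PyTorch-style sketch of the combined loss for one batch, written as a quantity to be minimized (so the policy and entropy terms enter with flipped signs); the two-headed network, tensor shapes, and coefficients are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=16, n_actions=7):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)   # logits for pi_theta(a|s)
        self.value_head = nn.Linear(128, 1)            # V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a3c_loss(model, obs, actions, returns, lam=0.01):
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    advantages = (returns - values).detach()           # A_hat; no gradient into the critic here
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy_bonus = dist.entropy().mean()
    return policy_loss + 0.5 * value_loss - lam * entropy_bonus

model = ActorCritic()
obs = torch.randn(32, 16)
actions = torch.randint(0, 7, (32,))
returns = torch.randn(32)
loss = a3c_loss(model, obs, actions, returns)
loss.backward()                                        # autodiff provides the gradients
```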
Learn to Discover
We choose the square ice model as our arena because of its constraints.
Figure: The distribution of ice states is delta-like
Square Ice Model
The square ice model serves as the simplest toy model for verifying our idea.
H = Σ_i | Σ_{α∈i} sα | (20)
where the inner sum runs over the spins attached to vertex i.
Each vertex satisfies the ice rule (2 in, 2 out). There is an efficient method to
move between ice states, called the loop algorithm.
There are two things to be discovered: the ice rule and the loop update.
Demonstration of Long Loop Algorithm
Start with a lattice that entirely satisfies the ice rule and choose a
single vertex at random.
Demonstration of Long Loop Algorithm
Choose at random one of the two inward-pointing spins of the vertex
Trace to the next vertex and repeat, always choosing incoming spins
Demonstration of Long Loop Algorithm
Close the loop to complete one Monte Carlo step. All vertices still satisfy
the ice rule and the system remains an ice state.
Figure: A closed loop of length 6.
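A minimal sketch of the long-loop update on a periodic square-ice lattice, with arrows stored on horizontal and vertical bonds; the array layout and helper names are assumptions for illustration and are not the IceGame implementation.

```python
import numpy as np

def incoming_arrows(x, y, h, v, L):
    """Arrows pointing into vertex (x, y). h[x, y] = +1 means the bond
    (x, y)->(x+1, y) points east; v[x, y] = +1 means (x, y)->(x, y+1) points north.
    Returns (kind, ex, ey, other_end) for each incoming arrow."""
    arrows = []
    if h[(x - 1) % L, y] == +1:                      # west bond points into (x, y)
        arrows.append(("h", (x - 1) % L, y, ((x - 1) % L, y)))
    if h[x, y] == -1:                                # east bond points into (x, y)
        arrows.append(("h", x, y, ((x + 1) % L, y)))
    if v[x, (y - 1) % L] == +1:                      # south bond points into (x, y)
        arrows.append(("v", x, (y - 1) % L, (x, (y - 1) % L)))
    if v[x, y] == -1:                                # north bond points into (x, y)
        arrows.append(("v", x, y, (x, (y + 1) % L)))
    return arrows

def long_loop_update(h, v, rng):
    """One long-loop update: reverse arrows along a closed loop so that every
    vertex still obeys the 2-in/2-out ice rule afterwards."""
    L = h.shape[0]
    start = (int(rng.integers(L)), int(rng.integers(L)))
    current, last = start, None
    flipped = []
    while True:
        choices = [a for a in incoming_arrows(current[0], current[1], h, v, L)
                   if (a[0], a[1], a[2]) != last]    # never immediately backtrack
        kind, ex, ey, nxt = choices[rng.integers(len(choices))]
        (h if kind == "h" else v)[ex, ey] *= -1      # reverse the arrow; the defect moves to nxt
        flipped.append((kind, ex, ey))
        last, current = (kind, ex, ey), (int(nxt[0]), int(nxt[1]))
        if current == start:                         # defect annihilates at the start vertex
            return flipped                           # flipped bonds form the closed loop

L = 8
h = np.ones((L, L), dtype=int)                       # all-east / all-north state obeys the ice rule
v = np.ones((L, L), dtype=int)
rng = np.random.default_rng(0)
print("loop length:", len(long_loop_update(h, v, rng)))
```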
IceGame
We design an environment called IceGame with which the RL agent can
interact. IceGame is a factory for generating ice states, and the agent serves
as a worker in it.
IceGame: Action
Two types of actions:
Direction actions: ai flips the neighboring spin σi
Update execution: propose the traced loop as a Monte Carlo update
a = [a0, a1, a2, a3, a4, a5, update]
Icegame: Observation
Combine the configuration and physical quantities into the MDP state.
Local observation: Ol = [σi, ∆C, ∆E] (spins, configuration change, energy change)
Global observation: Og = St − S0 = Ct (trajectory)
Icegame: Reward function
The design of the reward function is crucial for a reinforcement learning task:
r(s, a) =  rg  if the proposed loop is accepted
           rs  if the move does not increase the energy
           rf  otherwise          (21)
Stepwise rewards rs are assigned a value that is small relative to the target
reward rg, so that the agent ultimately pursues the high rg:
rs / rg ∼ O(10⁻³)
We usually set rg = +1, rs = +1/N and rf = 0.
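A small sketch of the reward rule in Eq. (21); the function signature, i.e. the flags the environment reports after a step, is an assumption.

```python
def icegame_reward(loop_accepted, energy_increased, n_sites, r_g=1.0, r_f=0.0):
    """Reward of Eq. (21): terminal reward r_g for an accepted loop, a small
    stepwise reward r_s = 1/N for energy-non-increasing moves, r_f otherwise."""
    r_s = 1.0 / n_sites                  # keeps r_s / r_g ~ O(1e-3) for typical lattice sizes
    if loop_accepted:
        return r_g
    if not energy_increased:
        return r_s
    return r_f

print(icegame_reward(False, False, n_sites=32 * 32))   # stepwise reward
print(icegame_reward(True, False, n_sites=32 * 32))    # accepted-loop reward
```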
Network Architecture Design
Multi-channel network:
Local channel: linear (ReLU), 32 → linear (ReLU), 64
Global channel: 3×3 conv, 32 → 3×3 conv, 16 → flatten, 1024
Concat: 1088 → linear, 128 → parallel heads: linear, 7 (policy); linear, 1 (value)
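A hedged PyTorch sketch of the described two-channel actor-critic network; the local-observation size, the 32×32 global map, and the stride-2 convolutions are assumptions chosen so that the quoted widths (64, 1024, 1088, 128, 7, 1) work out.

```python
import torch
import torch.nn as nn

class RLLoopNet(nn.Module):
    """Two-channel actor-critic net matching the quoted widths (assumed input sizes)."""
    def __init__(self, local_dim=8, lattice=32):
        super().__init__()
        # local channel: linear(relu) 32 -> linear(relu) 64
        self.local = nn.Sequential(nn.Linear(local_dim, 32), nn.ReLU(),
                                   nn.Linear(32, 64), nn.ReLU())
        # global channel: 3x3 conv 32 -> 3x3 conv 16 -> flatten to 1024
        # (stride-2 convolutions on a 32x32 map give 16 * 8 * 8 = 1024 features)
        self.global_ = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Flatten())
        self.trunk = nn.Sequential(nn.Linear(64 + 1024, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, 7)   # 6 direction actions + "update"
        self.value_head = nn.Linear(128, 1)

    def forward(self, local_obs, global_obs):
        h = torch.cat([self.local(local_obs), self.global_(global_obs)], dim=-1)
        h = self.trunk(h)
        return self.policy_head(h), self.value_head(h)

net = RLLoopNet()
logits, value = net(torch.randn(2, 8), torch.randn(2, 1, 32, 32))
print(logits.shape, value.shape)   # torch.Size([2, 7]) torch.Size([2, 1])
```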
Sliding Horizon Mechanism
In order to apply the trained model to systems of larger size, we provide this
mechanism.
Figure: The trained network scans over the new environment
Learning Procedure
Behaviour evolves through the different stages of learning: the exploration and
exploitation process.
Statistics of Loop
Example: L = 16

Method       Acceptance   Efficiency   Loop Size
Algorithm    96.2%        33 ∼ 96 %    4 ∼ 480 (96)
NN Policy    94.1%        24 ∼ 81 %    4 ∼ 436 (51)
CNN Policy   96.2%        31 ∼ 90 %    4 ∼ 518 (239)

Efficiency = updated / visited
Loop Length Distribution: L = 16
Loop Length Distribution, L = 32
Loop Length Distribution, L = 16, 32, 64
In this special scenario, the agent pursues higher cumulative rewards and forms
an ambitious policy that creates larger loops.
Figure: Specialized ConvNet policy for generating longer loops
Stepwise Policy Distribution
As a reminder, our actions flip the corresponding neighboring spins:
a = [a0, a1, a2, a3, a4, a5]
s = [σ0, σ1, σ2, σ3, σ4, σ5]
For the loop algorithm, formulated as an MDP, the decision policy is
πalgo(a|s) = [0, 0, 0, 0, 0.5, 0.5]
Figure: Local observation of the spins
Stepwise Policy Distribution
Stepwise decision making performed by the trained agent.
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for small loops.
Figure: Small loop. Left: Algorithm, Right: ConvNet Policy
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for medium loops.
Figure: Medium loop. Left: Algorithm, Right: ConvNet Policy
Generated Loop Patterns
Compare the loop patterns generated by the loop algorithm and by the trained
policy for large loops.
Figure: Large loop. Left: Algorithm, Right: ConvNet Policy
Neural Network Memory Effect
The memoryless algorithm is independent of the initial position.
Figure: Loop Algorithm: Same Initial Point and Same Configuration
Neural Network Memory Effect
The feedforward policy shows a locality preference.
Figure: Feedforward Policy: Same Initial Point and Same Configuration
Neural Network Memory Effect
The ConvNet policy performs similarly to the original loop algorithm.
Figure: ConvNet Policy: Same Initial Point and Same Configuration
Configuration Visited Heatmap
Number of visits to each site.
Figure: Left: Algorithm, Right: ConvNet
Correlation Time and Hybrid Method
The Markov decision process description allows us to travel through and search
the whole policy space.
Hybrid Ratio = (Policy Execution Times) / (Algorithm Execution Times)
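A sketch of how a hybrid chain might interleave the two kinds of updates at a fixed ratio; the callables standing in for the trained policy and the hand-coded loop algorithm are placeholders, not the actual implementation.

```python
import numpy as np

def hybrid_chain(state, policy_update, algorithm_update, hybrid_ratio=0.2,
                 n_steps=1000, seed=0):
    """Interleave policy-driven and algorithmic updates so that
    (# policy updates) / (# algorithm updates) is roughly hybrid_ratio."""
    rng = np.random.default_rng(seed)
    p_policy = hybrid_ratio / (1.0 + hybrid_ratio)   # fraction of steps given to the policy
    for _ in range(n_steps):
        if rng.uniform() < p_policy:
            state = policy_update(state)
        else:
            state = algorithm_update(state)
    return state

# placeholder callables just to exercise the control flow
final = hybrid_chain(0, lambda s: s + 1, lambda s: s - 1, hybrid_ratio=0.2, n_steps=10)
print(final)
```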
Auto-Correlation Time
We measure the autocorrelation time of the observable ρsym, the density of
symmetric vertices, at hybrid ratio 0.2.
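A standard estimator sketch for the integrated autocorrelation time of a Monte Carlo time series such as ρsym; the windowing rule (sum until the autocorrelation first turns negative) is a common convention assumed here, and the AR(1) test series is synthetic.

```python
import numpy as np

def integrated_autocorr_time(x):
    """tau_int = 1/2 + sum_t rho(t), summing normalized autocorrelations
    until they first drop below zero (simple windowing convention)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)   # rho(0), rho(1), ...
    tau = 0.5
    for t in range(1, n):
        if acf[t] < 0:
            break
        tau += acf[t]
    return tau

# toy AR(1) series with known correlation: tau_int ~ (1 + a) / (2 * (1 - a))
rng = np.random.default_rng(0)
a, x = 0.9, np.zeros(10000)
for i in range(1, len(x)):
    x[i] = a * x[i - 1] + rng.normal()
print(integrated_autocorr_time(x))   # should be roughly 9.5 for a = 0.9
```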
Conclusion
Conclusions
To propose acceptable loop updates in the square ice model, the machine
has to learn several things:
Realize the ice rule and satisfy it to keep the system in the ice manifold.
Distinguish open from closed loops, and propose the update at the appropriate
moment.
Figure: Successful update samples (orange dots) serve as support for the emergent
strategy
Conclusions
We conclude that:
We build the interface between the machine and the physical model on an MDP
The proposal operator can be parameterized by a neural network as πθ(a|s)
A loop-like update pattern emerges due to the ice rule
The machine realizes the update rule by interacting with the ice rule; successful
examples serve as its support
Both the ice rule (local) and the closed-loop condition (global) are learned
From the policy distribution, we can see how the agent makes decisions
The trained machine can scale up to larger sizes and cooperate with existing
algorithms
Thank You for Your Attention
Supplementary materials
Demonstration of Backpropagation
To understand backpropagation, we first consider a unit circuit.
Figure: Unit circuit in computational graph (figure courtesy: Stanford cs231n)
Demonstration of Backpropagation
To understand backpropagation, we first consider a unit circuit.
Figure: Unit circuit in computational graph: backprop
Score Function Gradient Estimator
The equality can be derived by explicitly writing out the expectation:
∇θ Ex[f(x)] = ∇θ ∫ dx p(x|θ) f(x) (22)
            = ∫ dx ∇θ p(x|θ) f(x) (23)
            = ∫ dx p(x|θ) [∇θ p(x|θ) / p(x|θ)] f(x) (24)
            = Ex[ ∇θ log p(x|θ) f(x) ] (25)
Applying the score function estimator to our objective function, we get the
policy gradient:
∇θJ(πθ) = ∇θ E_{τ∼π}[ Σ_t rt ] = E_{τ∼πθ(τ)}[ ∇θ log πθ(τ) R(τ) ] (26)
Policy Gradient Methods
The detailed form of the policy gradient is worked out as
∇θJ(πθ) = E_{τ∼πθ(τ)}[ ∇θ log πθ(τ) R(τ) ] (27)
        = E_{st+1,rt∼πθ(st,at)}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) Σ_{t=1}^{T} r(st, at) ] (28)
The explicit computational form is:
∇θJ(πθ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=1}^{T} ∇θ log πθ(a^i_t|s^i_t) ] [ Σ_{t=1}^{T} r(s^i_t, a^i_t) ] (29)
Baseline
To remove this instability, we subtract a constant baseline b from the
reward-to-go.
∇θJ(πθ) ≈ (1/N) Σ_{i}^{N} ∇θ log πθ(τi) ( Q̂(τi) − b ) (30)
b = (1/N) Σ_{i=1}^{N} R(τi) (31)
It is easy to show that subtracting a baseline is unbiased in expectation:
E[ ∇θ log πθ(τ) b ] = ∫ dτ πθ(τ) ∇θ log πθ(τ) b = b ∇θ ∫ dτ πθ(τ) = 0 (32)
Besides, by a variance analysis, we can obtain the variance-minimizing baseline
b = E[ g(τ)² R(τ) ] / E[ g(τ)² ], where g(τ) = ∇θ log πθ(τ).
Review of Policy Gradient
The general policy gradient is denoted as follows.
Policy Gradient
ĝ = E[ Σ_{t}^{∞} ∇θ log πθ(at|st) Ψt ] (33)
Ψ is denoted the goodness; it can be
1: maximum log-likelihood
Σ_t rt: vanilla REINFORCE
Σ_{t′=t} rt′ − b(st): causal REINFORCE algorithm with baseline
Qπ: actor-critic algorithm
Aπ: advantage actor-critic algorithm
Ψ serves as an information carrier, or weak label, which weights the likelihood.
Can we have Detailed Balance in MDP?
For a given policy, state-action pairs transition as a Markov chain.
PM((st+1, at+1)|(st, at)) = PM(st+1|st, at) π(at+1|st+1) (34)
The proposal distribution is
Q(s′|s) ∼ Π_{ti} π(a_{ti}|s_{ti}) PM(s_{ti+1}|s_{ti}, a_{ti}) (35)
Expected discounted state distribution:
dπ(s) = Σ_t γ^t Eπ[ 1(st = s) | s0 ] (36)