Chapter 5: Monte Carlo Methods
Seungjae Ryan Lee
New method: Monte Carlo method
● Do not assume complete knowledge of environment
○ Only need experience
○ Can use simulated experience
● Average sample returns
● Use Generalized Policy Iteration (GPI)
○ Prediction: compute value functions
○ Policy Improvement: improve policy from value functions
○ Control: discover optimal policy
Monte Carlo Prediction: Estimating vπ
● Estimate $v_\pi(s)$ by averaging sample returns observed after visits to s
● Converges as more returns are observed
● ex) If return +1 is observed 10 times, return 0 twice, and return -1 never, the estimate is (10·1 + 2·0 + 0·(-1)) / 12 ≈ 0.83
First-visit MC vs. Every-visit MC
● First-visit
○ Average of returns following first visits to states
○ Studied widely
○ Primary focus for this chapter
● Every-visit
○ Average returns following all visits to states
○ Extends more naturally to function approximation (Ch. 9) and eligibility traces (Ch. 12)
Sample trajectory: s1 s2 s3 s2 s3
● First-visit MC uses one return each for s1, s2, s3 (following their first visits)
● Every-visit MC uses one return for s1 and two returns each for s2 and s3
First-visit MC prediction in Practice:
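Below is a minimal Python sketch of first-visit MC prediction (a rough stand-in, not the slide's own pseudocode), assuming a hypothetical `generate_episode(policy)` helper that returns a list of `(state, action, reward)` tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the FIRST visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(state, action, reward), ...]
        # Compute the return G_t for every time step, working backwards
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = gamma * G + reward
            returns_at[t] = G
        # Record each state's first visit and average the return from that point
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        for state, t in first_visit.items():
            returns_sum[state] += returns_at[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```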
Blackjack Example
● States: (Sum of cards, Has usable ace, Dealer’s card)
● Action: Hit (request card), Stick (stop)
● Reward: +1, 0, -1 for win, draw, loss
● Policy: request cards if and only if sum < 20
● Difficult to use DP even though the environment dynamics are known
Blackjack Example Results
● Less common experiences have more uncertain estimates
○ ex) States with usable ace
MC vs. DP
● No bootstrapping
● Estimates for each state are independent
● Can estimate the value of a subset of all states
(Diagram: Monte Carlo vs. Dynamic Programming backups)
Soap Bubble Example
● Compute shape of soap surface for a closed wire frame
● Height of surface is average of heights at neighboring points
● Surface must meet the wire frame at the boundary
Soap Bubble Example: DP vs. MC
http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%205.pdf
DP
● Update each height from its neighboring heights
● Iteratively sweep the grid
MC
● Take random walk until boundary is reached
● Average sampled boundary height
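As an illustration of this random-walk idea (a minimal sketch under an assumed square-grid geometry, not taken from the slides); `boundary_height` is a made-up wire-frame function:

```python
import random

def mc_soap_height(x, y, n, boundary_height, num_walks=10_000):
    """Estimate the soap-film height at interior point (x, y) of an n x n grid
    by averaging the wire-frame heights where random walks exit the grid."""
    total = 0.0
    for _ in range(num_walks):
        i, j = x, y
        while 0 < i < n - 1 and 0 < j < n - 1:                  # still inside the frame
            di, dj = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            i, j = i + di, j + dj
        total += boundary_height(i, j)                          # height on the wire frame
    return total / num_walks

# Example: frame raised to height 1 along one edge, 0 elsewhere
print(mc_soap_height(5, 5, 11, lambda i, j: 1.0 if i == 10 else 0.0))
```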
Monte Carlo Prediction: Action Values
● Estimating action values $q_\pi(s, a)$ is more useful when a model is not available
○ Can determine a policy directly from action values, without a model
● Converges quadratically to $q_\pi(s, a)$ as infinitely many samples are collected
● Need exploration: every state-action pair must be visited infinitely often
https://www.youtube.com/watch?v=qaMdN6LS9rA
Exploring Starts (ES)
● Start each episode from a specified state-action pair, with every pair having nonzero probability of being selected as the start
● Cannot be relied on when learning from actual interaction with an environment
Monte Carlo ES
● Control: approximate optimal policies
● Use Generalized Policy Iteration (GPI)
○ Maintain approximate policy and approximate value function
○ Policy evaluation: Monte Carlo Prediction for one episode with start chosen by ES
○ Policy Improvement: Greedy selection
● No proof of convergence
Monte Carlo ES Pseudocode
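A minimal Python sketch of Monte Carlo ES (a rough stand-in, not the slide's pseudocode), assuming a hypothetical environment API: `env.actions`, `env.random_state_action()` for the exploring start, and `env.play(s0, a0, policy)` returning `(state, action, reward)` tuples and picking an arbitrary action in states the policy has not seen yet:

```python
from collections import defaultdict
import random

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo ES control: exploring starts + greedy policy improvement."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}                                         # state -> greedy action

    for _ in range(num_episodes):
        s0, a0 = env.random_state_action()              # exploring start
        episode = env.play(s0, a0, policy)              # follow policy after the start pair
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = gamma * G + r
            returns_at[t] = G
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (returns_at[t] - Q[(s, a)]) / counts[(s, a)]  # running average
            policy[s] = max(env.actions, key=lambda act: Q[(s, act)])  # greedy improvement
    return Q, policy
```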
Blackjack Example Revisited
● Prediction → Control
ε-soft Policy
● Avoid exploring starts → Add exploration to policy
● Soft policy: every action has nonzero probability of being selected
● ε-soft policy: every action has probability at least ε/|A(s)| of being selected
● ex) ε-greedy policy
○ Select the greedy action with probability 1 - ε
○ Select an action uniformly at random with probability ε (greedy action included)
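A one-function sketch of ε-greedy selection as described above, assuming `Q` maps `(state, action)` pairs to values:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick uniformly at random with probability epsilon (greedy action included),
    otherwise pick the greedy action; every action keeps probability >= epsilon/|A|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```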
ε-soft vs ε-greedy
(Diagram: a spectrum from the fully random policy (ε = 1) to the greedy policy (ε = 0); with the same ε = 0.1, the ε-greedy policy is the member of the ε-soft family closest to greedy, while other ε-soft policies are softer.)
On-policy ε-soft MC control Pseudocode
● On-policy: Evaluate / improve policy that is used to make decisions
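A minimal sketch of on-policy first-visit MC control with an ε-greedy (hence ε-soft) policy, again under a hypothetical `env.actions` / `env.play(policy_fn)` API rather than the slide's own pseudocode:

```python
from collections import defaultdict
import random

def on_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control for epsilon-soft policies."""
    Q = defaultdict(float)
    counts = defaultdict(int)

    def policy_fn(state):                               # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = env.play(policy_fn)                   # [(state, action, reward), ...]
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = gamma * G + r
            returns_at[t] = G
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (returns_at[t] - Q[(s, a)]) / counts[(s, a)]
    return Q

# Policy improvement is implicit: policy_fn always acts epsilon-greedily w.r.t. the latest Q.
```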
On-policy vs. Off-policy
● On-policy: Evaluate / improve policy that is used to make decisions
○ Requires an ε-soft policy: learns a near-optimal policy, but never the optimal one
○ Simple, low variance
● Off-policy: Evaluate / improve policy different from that used to generate data
○ Target policy π: the policy being evaluated and improved
○ Behavior policy b: the policy used to select actions and generate data
○ More powerful and general
○ High variance, slower convergence
○ Can learn from non-learning controller or human expert
Coverage assumption for off-policy learning
● To estimate values under π, every action that π might take must also be taken (at least occasionally) by b: $\pi(a|s) > 0 \Rightarrow b(a|s) > 0$
● b must be stochastic in states where it is not identical to π
Importance Sampling
● Trajectories have different probabilities under different policies
● Estimate expected value from one distribution given samples from another
● Weight returns by the importance sampling ratio $\rho_{t:T-1}$ (defined below)
○ Relative probability of trajectory occurring under the target and behavior policies
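The ratio itself (standard definition from Sutton & Barto); the transition probabilities cancel, so it depends only on the two policies:

```latex
\rho_{t:T-1}
  = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
                           {b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
  = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```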
Ordinary Importance Sampling
● Zero bias but unbounded variance
● With a single observed return: $V(s) = \rho_{t:T-1} G_t$
Ordinary Importance Sampling: Zero Bias
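The zero-bias claim in one line (the slide's derivation is an image): the scaled return has the correct expectation under the behavior policy,

```latex
\mathbb{E}_b\!\left[\rho_{t:T-1}\, G_t \,\middle|\, S_t = s\right] = v_\pi(s)
```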
Ordinary Importance Sampling: Unbounded Variance
● 1-state, 2-action undiscounted MDP
● Off-policy first-visit MC
● Variance of an estimator: $\mathrm{Var}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
Ordinary Importance Sampling: Unbounded Variance
● Just consider all-left episodes of different lengths
○ Any trajectory containing a right action has an importance sampling ratio of 0
○ An all-left trajectory with n left actions has an importance sampling ratio of $2^n$
Ordinary Importance Sampling: Unbounded Variance
(Figure: ordinary importance-sampling estimates of $v_\pi(s)$ plotted over episodes; a simulation sketch follows below.)
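As an illustration only, here is a small simulation sketch of this MDP as described in Sutton & Barto's example (target policy always takes left; behavior policy picks each action with probability 0.5; left loops back with probability 0.9 or terminates with reward +1, right terminates with reward 0). The sample average is unbiased for $v_\pi(s) = 1$ but has infinite variance, so it converges very erratically:

```python
import random

def ordinary_is_sample():
    """One behavior-policy episode; returns rho * G, the ordinary-IS sample."""
    rho = 1.0
    while True:
        if random.random() < 0.5:        # behavior policy picks "right"
            return 0.0                   # pi(right|s) = 0, so the ratio is 0
        rho *= 2.0                       # pi(left|s) / b(left|s) = 1 / 0.5
        if random.random() < 0.1:        # "left" terminates with reward +1
            return rho * 1.0             # G = 1 for this all-left episode

samples = [ordinary_is_sample() for _ in range(100_000)]
print(sum(samples) / len(samples))       # hovers around 1.0, but with huge spikes
```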
Weighted Importance Sampling
● Has bias that converges asymptotically to zero
● Strongly preferred due to lower variance
● With a single observed return the ratio cancels: $V(s) = \frac{\rho_{t:T-1} G_t}{\rho_{t:T-1}} = G_t$, an estimate of $v_b(s)$ rather than $v_\pi(s)$ (hence the bias)
Blackjack example for Importance Sampling
● Evaluated for a single state
○ player’s sum = 13, has usable ace, dealer’s card = 2
○ Behavior policy: uniform random policy
○ Target policy: stick iff player’s sum >= 20
Incremental Monte Carlo
● Update value without tracking all returns
● Ordinary importance sampling: average the scaled returns incrementally, $V_{n+1} = V_n + \frac{1}{n}\big(W_n G_n - V_n\big)$
● Weighted importance sampling: $V_{n+1} = V_n + \frac{W_n}{C_n}\big(G_n - V_n\big)$, with $C_{n+1} = C_n + W_{n+1}$ and $C_0 = 0$
Incremental Monte Carlo Pseudocode
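A minimal sketch of off-policy MC prediction using the incremental weighted-importance-sampling update above (hypothetical `generate_episode()`, `pi_prob(a, s)`, and `b_prob(a, s)` helpers, not the slide's pseudocode):

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, pi_prob, b_prob, num_episodes, gamma=1.0):
    """Estimate q_pi from behavior-policy episodes with weighted importance sampling."""
    Q = defaultdict(float)
    C = defaultdict(float)                      # cumulative sum of weights per (s, a)

    for _ in range(num_episodes):
        episode = generate_episode()            # generated by the behavior policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):       # work backwards through the episode
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # incremental weighted IS
            W *= pi_prob(a, s) / b_prob(a, s)
            if W == 0.0:                        # earlier steps would contribute nothing
                break
    return Q
```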
Off-policy Monte Carlo Control
● Off-policy: target policy π and behavior policy b
● Monte Carlo: Learn from samples without bootstrapping
● Control: Find optimal policy through GPI
Off-policy Monte Carlo Control Pseudocode
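A minimal sketch of off-policy MC control with weighted importance sampling, using an ε-greedy behavior policy over the current Q and a greedy target policy (hypothetical `env` API as in the earlier sketches):

```python
from collections import defaultdict
import random

def off_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """Learn a greedy target policy from epsilon-greedy behavior episodes."""
    Q = defaultdict(float)
    C = defaultdict(float)
    target = {}                                         # greedy target policy

    def behavior(state):                                # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = env.play(behavior)                    # [(state, action, reward), ...]
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(env.actions, key=lambda act: Q[(s, act)])
            if a != target[s]:                          # pi(a|s) = 0: stop this episode
                break
            b_greedy = 1 - epsilon + epsilon / len(env.actions)  # b's prob. of the greedy action
            W *= 1.0 / b_greedy                         # pi(a|s) = 1 for the greedy action
    return Q, target
```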
Discounting-aware Importance Sampling: Intuition*
● Exploit return’s internal structure to reduce variance
○ Return = Discounted sum of rewards
● Consider a myopic discount, e.g. γ = 0: the return $G_0 = R_1$ depends only on the first action, yet $\rho_{0:T-1}$ still multiplies in every later factor
○ Those later factors are irrelevant to the return: they only add variance
Discounting as Partial Termination*
● Consider discount as degree of partial termination
○ If γ = 0, all episodes terminate after receiving the first reward
○ If 0 ≤ γ < 1, the episode can be thought of as terminating after n steps with probability $(1-\gamma)\gamma^{n-1}$
○ Premature termination results in partial returns
● Full return as a weighted sum of flat (undiscounted) partial returns (see below)
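With the flat partial return $\bar{G}_{t:h} = R_{t+1} + R_{t+2} + \cdots + R_h$, the full return can be written as a weighted sum of flat partial returns (Sutton & Barto):

```latex
G_t = (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\, \bar{G}_{t:h}
      \;+\; \gamma^{\,T-t-1}\, \bar{G}_{t:T}
```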
Discounting-aware Ordinary Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$
● Estimator for discounting-aware ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \Big)}{|\mathcal{T}(s)|}$
Discounting-aware Weighted Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for weighted importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$
● Estimator for discounting-aware weighted importance sampling: same numerator as the discounting-aware ordinary case, but divided by the corresponding sum of weights with the returns removed: $\sum_{t \in \mathcal{T}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \Big)$
Per-decision Importance Sampling: Intuition*
● Unroll the return as a sum of discounted rewards
● Importance-sampling factors for actions taken after a reward are independent of that reward (expected value 1), so they can be dropped
Per-decision Importance Sampling: Process*
● Simplify the expectation of each term: $\mathbb{E}\big[\rho_{t:T-1} R_{t+k}\big] = \mathbb{E}\big[\rho_{t:t+k-1} R_{t+k}\big]$
● Equivalent expectation for the return (see below)
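Dropping the irrelevant factors term by term gives an equivalent expectation for the scaled return, with the per-decision return $\tilde{G}_t$ (Sutton & Barto):

```latex
\mathbb{E}\!\left[\rho_{t:T-1} G_t\right] = \mathbb{E}\!\left[\tilde{G}_t\right],
\qquad
\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma\, \rho_{t:t+1} R_{t+2}
              + \cdots + \gamma^{\,T-t-1}\, \rho_{t:T-1} R_T
```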
Per-decision Ordinary Importance Sampling*
● Estimator for ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$
● Estimator for per-decision ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$
Per-decision Weighted Importance Sampling?*
● Unclear if per-reward weighted importance sampling is possible
● All proposed estimators are inconsistent
○ Do not converge asymptotically to the true value
Summary
● Learn from experience (sample episodes)
○ Learn directly from interaction without model
○ Can learn with simulation
○ Can focus on a subset of states
○ No bootstrapping → less harmed by violation of Markov property
● Need to maintain exploration for Control
○ Exploring starts: unrealistic when learning from real experience
○ On-policy: maintain exploration in policy
○ Off-policy: separate behavior and target policies
■ Importance Sampling
● Ordinary importance sampling
● Weighted importance sampling
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai