Chapter 5: Monte Carlo Methods
Seungjae Ryan Lee
New method: Monte Carlo method
● Do not assume complete knowledge of environment
○ Only need experience
○ Can use simulated experience
● Average sample returns
● Use Generalized Policy Iteration (GPI)
○ Prediction: compute value functions
○ Policy Improvement: improve policy from value functions
○ Control: discover optimal policy
Monte Carlo Prediction: Estimating vπ
● Estimate $v_\pi(s)$ by averaging sample returns observed after visits to s
● Converges as more returns are observed
● ex) If return +1 is observed 10 times, return 0 twice, and return -1 never, the estimate is (10·1 + 2·0 + 0·(-1)) / 12 ≈ 0.83
First-visit MC vs. Every-visit MC
● First-visit
○ Average of returns following first visits to states
○ Studied widely
○ Primary focus for this chapter
● Every-visit
○ Average returns following all visits to states
○ Extends more naturally to function approximation (Ch. 9) and eligibility traces (Ch. 12)
Sample trajectory: s1 s2 s3 s2 s3
● First-visit MC uses one return each for s1, s2, s3 (following their first visits)
● Every-visit MC uses one return for s1 and two returns each for s2 and s3
First-visit MC prediction in Practice:
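Below is a minimal Python sketch of first-visit MC prediction (a rough stand-in, not the slide's own pseudocode), assuming a hypothetical `generate_episode(policy)` helper that returns a list of `(state, action, reward)` tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the FIRST visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(state, action, reward), ...]
        # Compute the return G_t for every time step, working backwards
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = gamma * G + reward
            returns_at[t] = G
        # Record each state's first visit and average the return from that point
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        for state, t in first_visit.items():
            returns_sum[state] += returns_at[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```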
Blackjack Example
● States: (Sum of cards, Has usable ace, Dealer’s card)
● Action: Hit (request card), Stick (stop)
● Reward: +1, 0, -1 for win, draw, loss
● Policy: request cards if and only if sum < 20
● Difficult to use DP even though the environment dynamics are known
Blackjack Example Results
● Less common experiences have more uncertain estimates
○ ex) States with usable ace
MC vs. DP
● No bootstrapping
● Estimates for each state are independent
● Can estimate the value of a subset of all states
(Diagram: Monte Carlo vs. Dynamic Programming backups)
Soap Bubble Example
● Compute shape of soap surface for a closed wire frame
● Height of surface is average of heights at neighboring points
● Surface must meet the wire frame at the boundary
Soap Bubble Example: DP vs. MC
http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%205.pdf
DP
● Update each height from its neighboring heights
● Iteratively sweep the grid
MC
● Take random walk until boundary is reached
● Average sampled boundary height
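As an illustration of this random-walk idea (a minimal sketch under an assumed square-grid geometry, not taken from the slides); `boundary_height` is a made-up wire-frame function:

```python
import random

def mc_soap_height(x, y, n, boundary_height, num_walks=10_000):
    """Estimate the soap-film height at interior point (x, y) of an n x n grid
    by averaging the wire-frame heights where random walks exit the grid."""
    total = 0.0
    for _ in range(num_walks):
        i, j = x, y
        while 0 < i < n - 1 and 0 < j < n - 1:                  # still inside the frame
            di, dj = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            i, j = i + di, j + dj
        total += boundary_height(i, j)                          # height on the wire frame
    return total / num_walks

# Example: frame raised to height 1 along one edge, 0 elsewhere
print(mc_soap_height(5, 5, 11, lambda i, j: 1.0 if i == 10 else 0.0))
```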
Monte Carlo Prediction: Action Values
● Estimating action values $q_\pi(s, a)$ is more useful when a model is not available
○ Can determine a policy directly from action values, without a model
● Converges quadratically to $q_\pi(s, a)$ as infinitely many samples are collected
● Need exploration: every state-action pair must be visited infinitely often
https://www.youtube.com/watch?v=qaMdN6LS9rA
Exploring Starts (ES)
● Start each episode from a specified state-action pair, with every pair having nonzero probability of being selected as the start
● Cannot be relied on when learning from actual interaction with an environment
Monte Carlo ES
● Control: approximate optimal policies
● Use Generalized Policy Iteration (GPI)
○ Maintain approximate policy and approximate value function
○ Policy evaluation: Monte Carlo Prediction for one episode with start chosen by ES
○ Policy Improvement: Greedy selection
● No proof of convergence
Monte Carlo ES Pseudocode
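A minimal Python sketch of Monte Carlo ES (a rough stand-in, not the slide's pseudocode), assuming a hypothetical environment API: `env.actions`, `env.random_state_action()` for the exploring start, and `env.play(s0, a0, policy)` returning `(state, action, reward)` tuples and picking an arbitrary action in states the policy has not seen yet:

```python
from collections import defaultdict
import random

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo ES control: exploring starts + greedy policy improvement."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}                                         # state -> greedy action

    for _ in range(num_episodes):
        s0, a0 = env.random_state_action()              # exploring start
        episode = env.play(s0, a0, policy)              # follow policy after the start pair
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = gamma * G + r
            returns_at[t] = G
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (returns_at[t] - Q[(s, a)]) / counts[(s, a)]  # running average
            policy[s] = max(env.actions, key=lambda act: Q[(s, act)])  # greedy improvement
    return Q, policy
```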
Blackjack Example Revisited
● Prediction → Control
ε-soft Policy
● Avoid exploring starts → Add exploration to policy
● Soft policy: every action has nonzero probability of being selected
● ε-soft policy: every action has probability at least ε/|A(s)| of being selected
● ex) ε-greedy policy
○ Select the greedy action with probability 1 - ε
○ Select an action uniformly at random with probability ε (greedy action included)
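A one-function sketch of ε-greedy selection as described above, assuming `Q` maps `(state, action)` pairs to values:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick uniformly at random with probability epsilon (greedy action included),
    otherwise pick the greedy action; every action keeps probability >= epsilon/|A|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```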
ε-soft vs ε-greedy
(Diagram: a spectrum from the fully random policy (ε = 1) to the greedy policy (ε = 0); with the same ε = 0.1, the ε-greedy policy is the member of the ε-soft family closest to greedy, while other ε-soft policies are softer.)
On-policy ε-soft MC control Pseudocode
● On-policy: Evaluate / improve policy that is used to make decisions
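A minimal sketch of on-policy first-visit MC control with an ε-greedy (hence ε-soft) policy, again under a hypothetical `env.actions` / `env.play(policy_fn)` API rather than the slide's own pseudocode:

```python
from collections import defaultdict
import random

def on_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control for epsilon-soft policies."""
    Q = defaultdict(float)
    counts = defaultdict(int)

    def policy_fn(state):                               # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = env.play(policy_fn)                   # [(state, action, reward), ...]
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = gamma * G + r
            returns_at[t] = G
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (returns_at[t] - Q[(s, a)]) / counts[(s, a)]
    return Q

# Policy improvement is implicit: policy_fn always acts epsilon-greedily w.r.t. the latest Q.
```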
On-policy vs. Off-policy
● On-policy: Evaluate / improve policy that is used to make decisions
○ Requires an ε-soft policy: learns a near-optimal policy, but never the optimal one
○ Simple, low variance
● Off-policy: Evaluate / improve policy different from that used to generate data
○ Target policy π: the policy being evaluated and improved
○ Behavior policy b: the policy used to select actions and generate data
○ More powerful and general
○ High variance, slower convergence
○ Can learn from non-learning controller or human expert
Coverage assumption for off-policy learning
● To estimate values under π, every action that π might take must also be taken (at least occasionally) by b: $\pi(a|s) > 0 \Rightarrow b(a|s) > 0$
● b must be stochastic in states where it is not identical to π
Importance Sampling
● Trajectories have different probabilities under different policies
● Estimate expected value from one distribution given samples from another
● Weight returns by the importance sampling ratio $\rho_{t:T-1}$ (defined below)
○ Relative probability of trajectory occurring under the target and behavior policies
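The ratio itself (standard definition from Sutton & Barto); the transition probabilities cancel, so it depends only on the two policies:

```latex
\rho_{t:T-1}
  = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
                           {b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
  = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```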
Ordinary Importance Sampling
● Zero bias but unbounded variance
● With a single observed return: $V(s) = \rho_{t:T-1} G_t$
Ordinary Importance Sampling: Zero Bias
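The zero-bias claim in one line (the slide's derivation is an image): the scaled return has the correct expectation under the behavior policy,

```latex
\mathbb{E}_b\!\left[\rho_{t:T-1}\, G_t \,\middle|\, S_t = s\right] = v_\pi(s)
```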
Ordinary Importance Sampling: Unbounded Variance
● 1-state, 2-action undiscounted MDP
● Off-policy first-visit MC
● Variance of an estimator: $\mathrm{Var}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
Ordinary Importance Sampling: Unbounded Variance
● Just consider all-left episodes of different lengths
○ Any trajectory containing a right action has an importance sampling ratio of 0
○ An all-left trajectory with n left actions has an importance sampling ratio of $2^n$
Ordinary Importance Sampling: Unbounded Variance
(Figure: ordinary importance-sampling estimates of $v_\pi(s)$ plotted over episodes; a simulation sketch follows below.)
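As an illustration only, here is a small simulation sketch of this MDP as described in Sutton & Barto's example (target policy always takes left; behavior policy picks each action with probability 0.5; left loops back with probability 0.9 or terminates with reward +1, right terminates with reward 0). The sample average is unbiased for $v_\pi(s) = 1$ but has infinite variance, so it converges very erratically:

```python
import random

def ordinary_is_sample():
    """One behavior-policy episode; returns rho * G, the ordinary-IS sample."""
    rho = 1.0
    while True:
        if random.random() < 0.5:        # behavior policy picks "right"
            return 0.0                   # pi(right|s) = 0, so the ratio is 0
        rho *= 2.0                       # pi(left|s) / b(left|s) = 1 / 0.5
        if random.random() < 0.1:        # "left" terminates with reward +1
            return rho * 1.0             # G = 1 for this all-left episode

samples = [ordinary_is_sample() for _ in range(100_000)]
print(sum(samples) / len(samples))       # hovers around 1.0, but with huge spikes
```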
Weighted Importance Sampling
● Has bias that converges asymptotically to zero
● Strongly preferred due to lower variance
● With a single observed return the ratio cancels: $V(s) = \frac{\rho_{t:T-1} G_t}{\rho_{t:T-1}} = G_t$, an estimate of $v_b(s)$ rather than $v_\pi(s)$ (hence the bias)
Blackjack example for Importance Sampling
● Evaluated for a single state
○ player’s sum = 13, has usable ace, dealer’s card = 2
○ Behavior policy: uniform random policy
○ Target policy: stick iff player’s sum >= 20
Incremental Monte Carlo
● Update value without tracking all returns
● Ordinary importance sampling: average the scaled returns incrementally, $V_{n+1} = V_n + \frac{1}{n}\big(W_n G_n - V_n\big)$
● Weighted importance sampling: $V_{n+1} = V_n + \frac{W_n}{C_n}\big(G_n - V_n\big)$, with $C_{n+1} = C_n + W_{n+1}$ and $C_0 = 0$
Incremental Monte Carlo Pseudocode
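A minimal sketch of off-policy MC prediction using the incremental weighted-importance-sampling update above (hypothetical `generate_episode()`, `pi_prob(a, s)`, and `b_prob(a, s)` helpers, not the slide's pseudocode):

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, pi_prob, b_prob, num_episodes, gamma=1.0):
    """Estimate q_pi from behavior-policy episodes with weighted importance sampling."""
    Q = defaultdict(float)
    C = defaultdict(float)                      # cumulative sum of weights per (s, a)

    for _ in range(num_episodes):
        episode = generate_episode()            # generated by the behavior policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):       # work backwards through the episode
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])   # incremental weighted IS
            W *= pi_prob(a, s) / b_prob(a, s)
            if W == 0.0:                        # earlier steps would contribute nothing
                break
    return Q
```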
Off-policy Monte Carlo Control
● Off-policy: target policy π and behavior policy b
● Monte Carlo: Learn from samples without bootstrapping
● Control: Find optimal policy through GPI
Off-policy Monte Carlo Control Pseudocode
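A minimal sketch of off-policy MC control with weighted importance sampling, using an ε-greedy behavior policy over the current Q and a greedy target policy (hypothetical `env` API as in the earlier sketches):

```python
from collections import defaultdict
import random

def off_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """Learn a greedy target policy from epsilon-greedy behavior episodes."""
    Q = defaultdict(float)
    C = defaultdict(float)
    target = {}                                         # greedy target policy

    def behavior(state):                                # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = env.play(behavior)                    # [(state, action, reward), ...]
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(env.actions, key=lambda act: Q[(s, act)])
            if a != target[s]:                          # pi(a|s) = 0: stop this episode
                break
            b_greedy = 1 - epsilon + epsilon / len(env.actions)  # b's prob. of the greedy action
            W *= 1.0 / b_greedy                         # pi(a|s) = 1 for the greedy action
    return Q, target
```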
Discounting-aware Importance Sampling: Intuition*
● Exploit return’s internal structure to reduce variance
○ Return = Discounted sum of rewards
● Consider a myopic discount, e.g. γ = 0: the return $G_0 = R_1$ depends only on the first action, yet $\rho_{0:T-1}$ still multiplies in every later factor
○ Those later factors are irrelevant to the return: they only add variance
Discounting as Partial Termination*
● Consider discount as degree of partial termination
○ If γ = 0, all episodes terminate after receiving the first reward
○ If 0 ≤ γ < 1, the episode can be thought of as terminating after n steps with probability $(1-\gamma)\gamma^{n-1}$
○ Premature termination results in partial returns
● Full return as a weighted sum of flat (undiscounted) partial returns (see below)
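With the flat partial return $\bar{G}_{t:h} = R_{t+1} + R_{t+2} + \cdots + R_h$, the full return can be written as a weighted sum of flat partial returns (Sutton & Barto):

```latex
G_t = (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\, \bar{G}_{t:h}
      \;+\; \gamma^{\,T-t-1}\, \bar{G}_{t:T}
```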
Discounting-aware Ordinary Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$
● Estimator for discounting-aware ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \Big)}{|\mathcal{T}(s)|}$
Discounting-aware Weighted Importance Sampling*
● Scale flat partial returns by a truncated importance sampling ratio
● Estimator for weighted importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$
● Estimator for discounting-aware weighted importance sampling: same numerator as the discounting-aware ordinary case, but divided by the corresponding sum of weights with the returns removed: $\sum_{t \in \mathcal{T}(s)} \Big( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \Big)$
Per-decision Importance Sampling: Intuition*
● Unroll the return as a sum of discounted rewards
● Importance-sampling factors for actions taken after a reward are independent of that reward (expected value 1), so they can be dropped
Per-decision Importance Sampling: Process*
● Simplify the expectation of each term: $\mathbb{E}\big[\rho_{t:T-1} R_{t+k}\big] = \mathbb{E}\big[\rho_{t:t+k-1} R_{t+k}\big]$
● Equivalent expectation for the return (see below)
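Dropping the irrelevant factors term by term gives an equivalent expectation for the scaled return, with the per-decision return $\tilde{G}_t$ (Sutton & Barto):

```latex
\mathbb{E}\!\left[\rho_{t:T-1} G_t\right] = \mathbb{E}\!\left[\tilde{G}_t\right],
\qquad
\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma\, \rho_{t:t+1} R_{t+2}
              + \cdots + \gamma^{\,T-t-1}\, \rho_{t:T-1} R_T
```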
Per-decision Ordinary Importance Sampling*
● Estimator for ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$
● Estimator for per-decision ordinary importance sampling: $V(s) = \dfrac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$
Per-decision Weighted Importance Sampling?*
● Unclear if per-reward weighted importance sampling is possible
● All proposed estimators are inconsistent
○ Do not converge asymptotically to the true value
Summary
● Learn from experience (sample episodes)
○ Learn directly from interaction without model
○ Can learn with simulation
○ Can focus on a subset of states
○ No bootstrapping → less harmed by violation of Markov property
● Need to maintain exploration for Control
○ Exploring starts: unrealistic when learning from real experience
○ On-policy: maintain exploration in policy
○ Off-policy: separate behavior and target policies
■ Importance Sampling
● Ordinary importance sampling
● Weighted importance sampling
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai