Chapter 7: n-step Bootstrapping
Seungjae Ryan Lee
Recap: MC vs TD
● Monte Carlo: wait until the end of the episode
○ Target: the full return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
○ MC error: G_t − V(S_t)
● 1-step TD / TD(0): wait only until the next time step
○ Bootstrapping target: R_{t+1} + γV(S_{t+1})
○ TD error: δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t)
n-step Bootstrapping
● Perform update based on intermediate number of rewards
● Freed from the “tyranny of the time step” of TD
○ Action selection still happens every time step (1), while the bootstrapping interval spans n steps
● Called n-step TD methods since they still bootstrap
n-step Bootstrapping
n-step TD Prediction
● Use the truncated n-step return as the target
○ Use n rewards, then bootstrap from the current value estimate:
G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n V_{t+n−1}(S_{t+n})
● Needs future rewards that are not available until timestep t + n
● V(S_t) therefore cannot be updated until timestep t + n
n-step TD Prediction: Pseudocode
Compute the n-step return G_{t:t+n}
Update V: V(S_t) ← V(S_t) + α [G_{t:t+n} − V(S_t)]
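To make the update concrete, here is a minimal Python sketch of the algorithm above. It assumes a tabular environment with integer states and an env.reset() / env.step(action) → (state, reward, done) interface; the names n_step_td, policy, and num_states are illustrative, not from the slides.

import numpy as np

def n_step_td(env, policy, n=4, alpha=0.1, gamma=1.0, episodes=100, num_states=21):
    """n-step TD prediction: V(S_tau) <- V(S_tau) + alpha * (G_{tau:tau+n} - V(S_tau))."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]       # states[t] = S_t, rewards[t] = R_t
        T, t = float('inf'), 0
        while True:
            if t < T:
                state, reward, done = env.step(policy(states[t]))
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                          # time whose value estimate is updated
            if tau >= 0:
                # Discounted sum of rewards R_{tau+1} .. R_{min(tau+n, T)}
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                      # bootstrap if the episode is not over
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

With n = 1 this behaves like TD(0); as n grows past the episode length it behaves like a Monte Carlo update.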
n-step TD Prediction: Convergence
● The n-step return has the error reduction property
○ Its expectation is a better estimate of v_π than V_{t+n−1} is, in a worst-state sense:
max_s |E_π[G_{t:t+n} | S_t = s] − v_π(s)| ≤ γ^n max_s |V_{t+n−1}(s) − v_π(s)|
● n-step TD methods therefore converge to the true value function under appropriate technical conditions
Random Walk Example
● Rewards only on exit (−1 on the left exit, +1 on the right exit); all other rewards are 0
● The n-step return propagates the terminal reward back through the n most recent states
[Figure: 19-state random walk S1 … S19 with R = −1 on the left exit and R = +1 on the right exit, plus a sample trajectory showing which state values a 1-step vs. a 2-step update changes]
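For a runnable version of this example, here is a sketch of the 19-state random walk under the same assumed interface as the n_step_td sketch earlier; the class name RandomWalk19 and the numbering of states are illustrative choices consistent with the description above.

import random

class RandomWalk19:
    """19-state random walk: states 1..19 are non-terminal, 0 and 20 are terminal.
    Reward is -1 on exiting left, +1 on exiting right, and 0 otherwise."""
    def __init__(self):
        self.state = 10

    def reset(self):
        self.state = 10                              # start in the center state
        return self.state

    def step(self, action=None):
        self.state += random.choice([-1, 1])         # the action is ignored; moves are random
        if self.state == 0:
            return self.state, -1.0, True
        if self.state == 20:
            return self.state, 1.0, True
        return self.state, 0.0, False

# Example usage with the earlier sketch:
# V = n_step_td(RandomWalk19(), policy=lambda s: None, n=2, alpha=0.1, num_states=21)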
Random Walk Example: n-step TD Prediction
● An intermediate value of n performs best (lowest average RMS error), between the 1-step TD and Monte Carlo extremes
n-step Sarsa
● Extend n-step TD prediction to control (Sarsa)
○ Use Q instead of V
○ Use an ε-greedy policy
● Redefine the n-step return with Q:
G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})
● The update then naturally extends to n-step Sarsa:
Q(S_t, A_t) ← Q(S_t, A_t) + α [G_{t:t+n} − Q(S_t, A_t)]
n-step Sarsa vs. Sarsa(0)
● Gridworld with nonzero reward only at the end
● n-step Sarsa can learn much more from a single episode: it strengthens the last n state–action values along the path, whereas one-step Sarsa strengthens only the last one
n-step Sarsa: Pseudocode
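A minimal Python sketch of the episodic n-step Sarsa loop, under the same tabular assumptions as the earlier sketch (integer states, env.step(a) → (state, reward, done)); eps_greedy, num_states, and num_actions are illustrative names rather than anything from the slides.

import numpy as np

def n_step_sarsa(env, n=4, alpha=0.1, gamma=1.0, epsilon=0.1,
                 episodes=500, num_states=50, num_actions=4):
    """Episodic n-step Sarsa with an epsilon-greedy policy."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]
        actions = [eps_greedy(states[0])]
        T, t = float('inf'), 0
        while True:
            if t < T:
                state, reward, done = env.step(actions[t])
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(state))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                      # bootstrap from Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q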
n-step Expected Sarsa
● Same n-step return as Sarsa except for the last element
○ Consider all possible actions in the last step instead of the single sampled action:
G_{t:t+n} = R_{t+1} + … + γ^{n−1}R_{t+n} + γ^n V̄_{t+n−1}(S_{t+n}), where V̄_t(s) = Σ_a π(a|s) Q_t(s, a)
● Same update rule as n-step Sarsa
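Only the final term of the return changes; a small sketch of that expected value, assuming the target policy is stored as an array of action probabilities pi[s, a] (an illustrative representation):

import numpy as np

def expected_value(Q, pi, s):
    """V_bar(s) = sum_a pi(a|s) * Q(s, a); replaces Q(S_{t+n}, A_{t+n}) in the n-step return."""
    return float(np.dot(pi[s], Q[s]))

# n-step Expected Sarsa target (only the last term differs from n-step Sarsa):
# G = R_{t+1} + ... + gamma**(n-1) * R_{t+n} + gamma**n * expected_value(Q, pi, S_{t+n})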
Off-policy n-step Learning
● Need importance sampling
○ ρ_{t:h} = Π_{k=t}^{min(h, T−1)} π(A_k|S_k) / b(A_k|S_k)
● Update the target policy π’s values with returns generated by the behavior policy b:
V(S_t) ← V(S_t) + α ρ_{t:t+n−1} [G_{t:t+n} − V(S_t)]
● Generalizes the on-policy case
○ If π = b, then ρ_{t:h} = 1 and the ordinary n-step TD update is recovered
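A sketch of the importance sampling ratio as a helper, again assuming the two policies are stored as probability arrays pi[s, a] and b[s, a] (illustrative names):

def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho

# On-policy special case: if pi == b, every factor is 1, so rho_{t:h} = 1
# and the update reduces to the ordinary (on-policy) n-step TD update.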
Off-policy n-step Sarsa
● Update Q instead of V:
Q(S_t, A_t) ← Q(S_t, A_t) + α ρ_{t+1:t+n} [G_{t:t+n} − Q(S_t, A_t)]
● Importance sampling ratio starts one step later for Q values
○ A_t is already chosen, so no ratio is needed for the first action
Off-policy n-step Sarsa: Pseudocode
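Rather than full pseudocode, here is a hedged sketch of a single off-policy n-step Sarsa update at time tau, mirroring the ratio definition above; all names are illustrative and Q is assumed to be a NumPy array indexed as Q[s, a].

def off_policy_n_step_sarsa_update(Q, pi, b, states, actions, rewards,
                                   tau, n, T, alpha=0.1, gamma=1.0):
    """Apply one off-policy n-step Sarsa update for the state-action pair at time tau."""
    # n-step return, bootstrapping from Q(S_{tau+n}, A_{tau+n}) if the episode continues
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    # Ratio rho_{tau+1 : tau+n}: starts at tau+1 because A_tau is already chosen
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    Q[states[tau], actions[tau]] += alpha * rho * (G - Q[states[tau], actions[tau]])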
Off-policy n-step Expected Sarsa
● Importance sampling ratio ends one step earlier for Expected Sarsa: ρ_{t+1:t+n−1} instead of ρ_{t+1:t+n}
● Use the expected n-step return (ending in V̄_{t+n−1}(S_{t+n})), since the last action is averaged over rather than sampled
Per-decision Off-policy Methods: Intuition*
● A more efficient off-policy n-step method
● Write returns recursively: G_{t:h} = R_{t+1} + γ G_{t+1:h}
● Naive importance sampling weights the whole return by ρ_t = π(A_t|S_t) / b(A_t|S_t)
○ If ρ_t = 0, the target becomes zero
○ The estimate shrinks toward zero, causing higher variance
Per-decision Off-policy Methods*
● Better: if ρ_t = 0, leave the estimate unchanged by falling back on the current value:
G_{t:h} = ρ_t (R_{t+1} + γ G_{t+1:h}) + (1 − ρ_t) V_{h−1}(S_t)
○ The second term, (1 − ρ_t) V_{h−1}(S_t), is called a control variate
● The expected update is unchanged, since E_b[ρ_t] = 1 and the control variate has expected value zero
● Used with the ordinary TD update, which itself contains no importance sampling factor
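A sketch of this recursion for state values, assuming the per-step ratios rho[t] = π(A_t|S_t) / b(A_t|S_t) have been precomputed along the stored trajectory (illustrative names):

def per_decision_return(rewards, states, rho, V, t, h, T, gamma=1.0):
    """G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t).
    If rho_t = 0, the target falls back to the current estimate V(S_t)
    instead of collapsing to zero."""
    if t == T:                   # past the terminal state: nothing left to add
        return 0.0
    if t == h:                   # horizon reached: bootstrap from the value estimate
        return V[states[t]]
    g = rewards[t + 1] + gamma * per_decision_return(rewards, states, rho, V,
                                                     t + 1, h, T, gamma)
    return rho[t] * g + (1.0 - rho[t]) * V[states[t]]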
Per-decision Off-policy Methods: Q*
● Start from Expected Sarsa’s n-step return for action values
● Off-policy form with a control variate:
G_{t:h} = R_{t+1} + γ [ ρ_{t+1} (G_{t+1:h} − Q_{h−1}(S_{t+1}, A_{t+1})) + V̄_{h−1}(S_{t+1}) ]
● Combined with the TD update algorithm, this is analogous to Expected Sarsa
http://auai.org/uai2018/proceedings/papers/282.pdf
n-step Tree Backup Algorithm
● Off-policy learning without importance sampling
● Update from the entire tree of estimated action values
○ Leaf action nodes (the actions not selected) contribute their estimated values to the target
○ The selected action node does not contribute directly; its policy probability weights all action values at the next level
n-step Tree Backup Algorithm: n-step Return
● 1-step return (the same as Expected Sarsa’s target):
G_{t:t+1} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a)
● 2-step return:
G_{t:t+2} = R_{t+1} + γ Σ_{a≠A_{t+1}} π(a|S_{t+1}) Q_{t+1}(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) [ R_{t+2} + γ Σ_a π(a|S_{t+2}) Q_{t+1}(S_{t+2}, a) ]
n-step Tree Backup Algorithm: n-step Return
● The 2-step return can be written recursively in terms of the next step’s return
● n-step return (recursive form):
G_{t:t+n} = R_{t+1} + γ Σ_{a≠A_{t+1}} π(a|S_{t+1}) Q_{t+n−1}(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) G_{t+1:t+n}
n-step Tree Backup Algorithm: Pseudocode
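A hedged Python sketch of the recursive tree-backup return defined above, assuming a tabular Q and target-policy probabilities pi[s, a]; the function name and arguments are illustrative.

import numpy as np

def tree_backup_return(rewards, states, actions, Q, pi, t, h, T, gamma=1.0):
    """G_{t:h} = R_{t+1}
                 + gamma * sum_{a != A_{t+1}} pi(a|S_{t+1}) * Q(S_{t+1}, a)
                 + gamma * pi(A_{t+1}|S_{t+1}) * G_{t+1:h}"""
    if t + 1 >= T:                                   # last transition: only the final reward remains
        return rewards[t + 1]
    s1 = states[t + 1]
    expected = float(np.dot(pi[s1], Q[s1]))          # expectation over all actions at S_{t+1}
    if t + 1 == h:                                   # horizon: bootstrap with the full expectation
        return rewards[t + 1] + gamma * expected
    a1 = actions[t + 1]
    off_path = expected - pi[s1, a1] * Q[s1, a1]     # leaf actions that were not taken
    on_path = pi[s1, a1] * tree_backup_return(rewards, states, actions, Q, pi,
                                              t + 1, h, T, gamma)
    return rewards[t + 1] + gamma * (off_path + on_path)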
A Unifying Algorithm: n-step Q(σ)*
● Unifies Sarsa, Tree Backup, and Expected Sarsa
○ Decide on each step whether to use the sampled action (Sarsa) or the expectation over all actions (Tree Backup)
A Unifying Algorithm: n-step Q(σ): Equations*
● σ_t ∈ [0, 1]: degree of sampling on timestep t (σ_t = 1 means full sampling, σ_t = 0 means pure expectation)
● Slide linearly between the two per-step weights:
○ Sarsa (σ = 1): importance sampling ratio ρ_t = π(A_t|S_t) / b(A_t|S_t)
○ Tree Backup (σ = 0): policy probability π(A_t|S_t)
● Combined recursive return:
G_{t:h} = R_{t+1} + γ (σ_{t+1} ρ_{t+1} + (1 − σ_{t+1}) π(A_{t+1}|S_{t+1})) (G_{t+1:h} − Q_{h−1}(S_{t+1}, A_{t+1})) + γ V̄_{h−1}(S_{t+1})
A Unifying Algorithm: n-step Q(σ): Pseudocode*
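A hedged sketch of the Q(σ) recursion that makes the per-step weight explicit, using the same illustrative conventions as the earlier sketches (arrays pi[s, a] and b[s, a], a tabular Q, and a sequence sigma of values in [0, 1]):

import numpy as np

def q_sigma_return(rewards, states, actions, Q, pi, b, sigma, t, h, T, gamma=1.0):
    """G_{t:h} = R_{t+1}
                 + gamma * (sigma_{t+1} * rho_{t+1} + (1 - sigma_{t+1}) * pi(A_{t+1}|S_{t+1}))
                         * (G_{t+1:h} - Q(S_{t+1}, A_{t+1}))
                 + gamma * V_bar(S_{t+1})"""
    if t + 1 >= T:                                   # final transition of the episode
        return rewards[t + 1]
    s1, a1 = states[t + 1], actions[t + 1]
    v_bar = float(np.dot(pi[s1], Q[s1]))             # expected value of S_{t+1} under pi
    if t + 1 == h:                                   # horizon reached: bootstrap
        return rewards[t + 1] + gamma * v_bar
    rho = pi[s1, a1] / b[s1, a1]                     # per-step importance sampling ratio
    # sigma = 1 recovers importance-sampled Sarsa; sigma = 0 recovers Tree Backup
    weight = sigma[t + 1] * rho + (1.0 - sigma[t + 1]) * pi[s1, a1]
    g_next = q_sigma_return(rewards, states, actions, Q, pi, b, sigma,
                            t + 1, h, T, gamma)
    return rewards[t + 1] + gamma * (weight * (g_next - Q[s1, a1]) + v_bar)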
Summary
● n-step: Look ahead to the next n rewards, states, and actions
+ Typically performs better than either MC or 1-step TD
+ Escapes the tyranny of the single time step
- Delay of n steps before learning
- More memory and computation per timestep
● Extended to Eligibility Traces (Ch. 12)
+ Minimize additional memory and computation
- More complex
● Two approaches to off-policy n-step learning
○ Importance sampling: high variance
○ Tree backup: bootstrapping effectively spans only a few steps if the policies are very different, even when n is large
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content at
● github.com/seungjaeryanlee
● www.endtoend.ai