Chapter 7: n-step Bootstrapping
Seungjae Ryan Lee
Recap: MC vs TD
● Monte Carlo: wait until the end of the episode
○ Target: the full return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
○ MC error: G_t − V(S_t)
● 1-step TD / TD(0): wait only until the next time step
○ Bootstrapping target: R_{t+1} + γV(S_{t+1})
○ TD error: δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t)
n-step Bootstrapping
● Perform update based on intermediate number of rewards
● Freed from the “tyranny of the time step” of TD
○ Action selection still happens every time step (1), while the bootstrapping interval spans n steps
● Called n-step TD methods since they still bootstrap
n-step Bootstrapping
n-step TD Prediction
● Use the truncated n-step return as the target
○ Use n rewards, then bootstrap from the current value estimate:
G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n V_{t+n−1}(S_{t+n})
● Needs future rewards that are not available until timestep t + n
● V(S_t) therefore cannot be updated until timestep t + n
n-step TD Prediction: Pseudocode
Compute the n-step return G_{t:t+n}
Update V: V(S_t) ← V(S_t) + α [G_{t:t+n} − V(S_t)]
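To make the update concrete, here is a minimal Python sketch of the algorithm above. It assumes a tabular environment with integer states and an env.reset() / env.step(action) → (state, reward, done) interface; the names n_step_td, policy, and num_states are illustrative, not from the slides.

import numpy as np

def n_step_td(env, policy, n=4, alpha=0.1, gamma=1.0, episodes=100, num_states=21):
    """n-step TD prediction: V(S_tau) <- V(S_tau) + alpha * (G_{tau:tau+n} - V(S_tau))."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]       # states[t] = S_t, rewards[t] = R_t
        T, t = float('inf'), 0
        while True:
            if t < T:
                state, reward, done = env.step(policy(states[t]))
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                          # time whose value estimate is updated
            if tau >= 0:
                # Discounted sum of rewards R_{tau+1} .. R_{min(tau+n, T)}
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                      # bootstrap if the episode is not over
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

With n = 1 this behaves like TD(0); as n grows past the episode length it behaves like a Monte Carlo update.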
n-step TD Prediction: Convergence
● The n-step return has the error reduction property
○ Its expectation is a better estimate of v_π than V_{t+n−1} is, in a worst-state sense:
max_s |E_π[G_{t:t+n} | S_t = s] − v_π(s)| ≤ γ^n max_s |V_{t+n−1}(s) − v_π(s)|
● n-step TD methods therefore converge to the true value function under appropriate technical conditions
Random Walk Example
● Rewards only on exit (−1 on the left exit, +1 on the right exit); all other rewards are 0
● The n-step return propagates the terminal reward back through the n most recent states
[Figure: 19-state random walk S1 … S19 with R = −1 on the left exit and R = +1 on the right exit, plus a sample trajectory showing which state values a 1-step vs. a 2-step update changes]
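For a runnable version of this example, here is a sketch of the 19-state random walk under the same assumed interface as the n_step_td sketch earlier; the class name RandomWalk19 and the numbering of states are illustrative choices consistent with the description above.

import random

class RandomWalk19:
    """19-state random walk: states 1..19 are non-terminal, 0 and 20 are terminal.
    Reward is -1 on exiting left, +1 on exiting right, and 0 otherwise."""
    def __init__(self):
        self.state = 10

    def reset(self):
        self.state = 10                              # start in the center state
        return self.state

    def step(self, action=None):
        self.state += random.choice([-1, 1])         # the action is ignored; moves are random
        if self.state == 0:
            return self.state, -1.0, True
        if self.state == 20:
            return self.state, 1.0, True
        return self.state, 0.0, False

# Example usage with the earlier sketch:
# V = n_step_td(RandomWalk19(), policy=lambda s: None, n=2, alpha=0.1, num_states=21)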
Random Walk Example: n-step TD Prediction
● An intermediate value of n performs best (lowest average RMS error), between the 1-step TD and Monte Carlo extremes
n-step Sarsa
● Extend n-step TD prediction to control (Sarsa)
○ Use Q instead of V
○ Use an ε-greedy policy
● Redefine the n-step return with Q:
G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})
● The update then naturally extends to n-step Sarsa:
Q(S_t, A_t) ← Q(S_t, A_t) + α [G_{t:t+n} − Q(S_t, A_t)]
n-step Sarsa vs. Sarsa(0)
● Gridworld with nonzero reward only at the end
● n-step Sarsa can learn much more from a single episode: it strengthens the last n state–action values along the path, whereas one-step Sarsa strengthens only the last one
n-step Sarsa: Pseudocode
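A minimal Python sketch of the episodic n-step Sarsa loop, under the same tabular assumptions as the earlier sketch (integer states, env.step(a) → (state, reward, done)); eps_greedy, num_states, and num_actions are illustrative names rather than anything from the slides.

import numpy as np

def n_step_sarsa(env, n=4, alpha=0.1, gamma=1.0, epsilon=0.1,
                 episodes=500, num_states=50, num_actions=4):
    """Episodic n-step Sarsa with an epsilon-greedy policy."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        states, rewards = [env.reset()], [0.0]
        actions = [eps_greedy(states[0])]
        T, t = float('inf'), 0
        while True:
            if t < T:
                state, reward, done = env.step(actions[t])
                states.append(state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(state))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                      # bootstrap from Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q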
n-step Expected Sarsa
● Same n-step return as Sarsa except for the last element
○ Consider all possible actions in the last step instead of the single sampled action:
G_{t:t+n} = R_{t+1} + … + γ^{n−1}R_{t+n} + γ^n V̄_{t+n−1}(S_{t+n}), where V̄_t(s) = Σ_a π(a|s) Q_t(s, a)
● Same update rule as n-step Sarsa
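Only the final term of the return changes; a small sketch of that expected value, assuming the target policy is stored as an array of action probabilities pi[s, a] (an illustrative representation):

import numpy as np

def expected_value(Q, pi, s):
    """V_bar(s) = sum_a pi(a|s) * Q(s, a); replaces Q(S_{t+n}, A_{t+n}) in the n-step return."""
    return float(np.dot(pi[s], Q[s]))

# n-step Expected Sarsa target (only the last term differs from n-step Sarsa):
# G = R_{t+1} + ... + gamma**(n-1) * R_{t+n} + gamma**n * expected_value(Q, pi, S_{t+n})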
Off-policy n-step Learning
● Need importance sampling
○ ρ_{t:h} = Π_{k=t}^{min(h, T−1)} π(A_k|S_k) / b(A_k|S_k)
● Update the target policy π’s values with returns generated by the behavior policy b:
V(S_t) ← V(S_t) + α ρ_{t:t+n−1} [G_{t:t+n} − V(S_t)]
● Generalizes the on-policy case
○ If π = b, then ρ_{t:h} = 1 and the ordinary n-step TD update is recovered
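A sketch of the importance sampling ratio as a helper, again assuming the two policies are stored as probability arrays pi[s, a] and b[s, a] (illustrative names):

def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho

# On-policy special case: if pi == b, every factor is 1, so rho_{t:h} = 1
# and the update reduces to the ordinary (on-policy) n-step TD update.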
Off-policy n-step Sarsa
● Update Q instead of V:
Q(S_t, A_t) ← Q(S_t, A_t) + α ρ_{t+1:t+n} [G_{t:t+n} − Q(S_t, A_t)]
● Importance sampling ratio starts one step later for Q values
○ A_t is already chosen, so no ratio is needed for the first action
Off-policy n-step Sarsa: Pseudocode
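Rather than full pseudocode, here is a hedged sketch of a single off-policy n-step Sarsa update at time tau, mirroring the ratio definition above; all names are illustrative and Q is assumed to be a NumPy array indexed as Q[s, a].

def off_policy_n_step_sarsa_update(Q, pi, b, states, actions, rewards,
                                   tau, n, T, alpha=0.1, gamma=1.0):
    """Apply one off-policy n-step Sarsa update for the state-action pair at time tau."""
    # n-step return, bootstrapping from Q(S_{tau+n}, A_{tau+n}) if the episode continues
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    # Ratio rho_{tau+1 : tau+n}: starts at tau+1 because A_tau is already chosen
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    Q[states[tau], actions[tau]] += alpha * rho * (G - Q[states[tau], actions[tau]])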
Off-policy n-step Expected Sarsa
● Importance sampling ratio ends one step earlier for Expected Sarsa: ρ_{t+1:t+n−1} instead of ρ_{t+1:t+n}
● Use the expected n-step return (ending in V̄_{t+n−1}(S_{t+n})), since the last action is averaged over rather than sampled
Per-decision Off-policy Methods: Intuition*
● A more efficient off-policy n-step method
● Write returns recursively: G_{t:h} = R_{t+1} + γ G_{t+1:h}
● Naive importance sampling weights the whole return by ρ_t = π(A_t|S_t) / b(A_t|S_t)
○ If ρ_t = 0, the target becomes zero
○ The estimate shrinks toward zero, causing higher variance
Per-decision Off-policy Methods*
● Better: if ρ_t = 0, leave the estimate unchanged by falling back on the current value:
G_{t:h} = ρ_t (R_{t+1} + γ G_{t+1:h}) + (1 − ρ_t) V_{h−1}(S_t)
○ The second term, (1 − ρ_t) V_{h−1}(S_t), is called a control variate
● The expected update is unchanged, since E_b[ρ_t] = 1 and the control variate has expected value zero
● Used with the ordinary TD update, which itself contains no importance sampling factor
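A sketch of this recursion for state values, assuming the per-step ratios rho[t] = π(A_t|S_t) / b(A_t|S_t) have been precomputed along the stored trajectory (illustrative names):

def per_decision_return(rewards, states, rho, V, t, h, T, gamma=1.0):
    """G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t).
    If rho_t = 0, the target falls back to the current estimate V(S_t)
    instead of collapsing to zero."""
    if t == T:                   # past the terminal state: nothing left to add
        return 0.0
    if t == h:                   # horizon reached: bootstrap from the value estimate
        return V[states[t]]
    g = rewards[t + 1] + gamma * per_decision_return(rewards, states, rho, V,
                                                     t + 1, h, T, gamma)
    return rho[t] * g + (1.0 - rho[t]) * V[states[t]]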
Per-decision Off-policy Methods: Q*
● Start from Expected Sarsa’s n-step return for action values
● Off-policy form with a control variate:
G_{t:h} = R_{t+1} + γ [ ρ_{t+1} (G_{t+1:h} − Q_{h−1}(S_{t+1}, A_{t+1})) + V̄_{h−1}(S_{t+1}) ]
● Combined with the TD update algorithm, this is analogous to Expected Sarsa
http://auai.org/uai2018/proceedings/papers/282.pdf
n-step Tree Backup Algorithm
● Off-policy learning without importance sampling
● Update from the entire tree of estimated action values
○ Leaf action nodes (the actions not selected) contribute their estimated values to the target
○ The selected action node does not contribute directly; its policy probability weights all action values at the next level
n-step Tree Backup Algorithm: n-step Return
● 1-step return (the same as Expected Sarsa’s target):
G_{t:t+1} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a)
● 2-step return:
G_{t:t+2} = R_{t+1} + γ Σ_{a≠A_{t+1}} π(a|S_{t+1}) Q_{t+1}(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) [ R_{t+2} + γ Σ_a π(a|S_{t+2}) Q_{t+1}(S_{t+2}, a) ]
n-step Tree Backup Algorithm: n-step Return
● The 2-step return can be written recursively in terms of the next step’s return
● n-step return (recursive form):
G_{t:t+n} = R_{t+1} + γ Σ_{a≠A_{t+1}} π(a|S_{t+1}) Q_{t+n−1}(S_{t+1}, a) + γ π(A_{t+1}|S_{t+1}) G_{t+1:t+n}
n-step Tree Backup Algorithm: Pseudocode
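A hedged Python sketch of the recursive tree-backup return defined above, assuming a tabular Q and target-policy probabilities pi[s, a]; the function name and arguments are illustrative.

import numpy as np

def tree_backup_return(rewards, states, actions, Q, pi, t, h, T, gamma=1.0):
    """G_{t:h} = R_{t+1}
                 + gamma * sum_{a != A_{t+1}} pi(a|S_{t+1}) * Q(S_{t+1}, a)
                 + gamma * pi(A_{t+1}|S_{t+1}) * G_{t+1:h}"""
    if t + 1 >= T:                                   # last transition: only the final reward remains
        return rewards[t + 1]
    s1 = states[t + 1]
    expected = float(np.dot(pi[s1], Q[s1]))          # expectation over all actions at S_{t+1}
    if t + 1 == h:                                   # horizon: bootstrap with the full expectation
        return rewards[t + 1] + gamma * expected
    a1 = actions[t + 1]
    off_path = expected - pi[s1, a1] * Q[s1, a1]     # leaf actions that were not taken
    on_path = pi[s1, a1] * tree_backup_return(rewards, states, actions, Q, pi,
                                              t + 1, h, T, gamma)
    return rewards[t + 1] + gamma * (off_path + on_path)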
A Unifying Algorithm: n-step Q(σ)*
● Unifies Sarsa, Tree Backup, and Expected Sarsa
○ Decide on each step whether to use the sampled action (Sarsa) or the expectation over all actions (Tree Backup)
A Unifying Algorithm: n-step Q(σ): Equations*
● σ_t ∈ [0, 1]: degree of sampling on timestep t (σ_t = 1 means full sampling, σ_t = 0 means pure expectation)
● Slide linearly between the two per-step weights:
○ Sarsa (σ = 1): importance sampling ratio ρ_t = π(A_t|S_t) / b(A_t|S_t)
○ Tree Backup (σ = 0): policy probability π(A_t|S_t)
● Combined recursive return:
G_{t:h} = R_{t+1} + γ (σ_{t+1} ρ_{t+1} + (1 − σ_{t+1}) π(A_{t+1}|S_{t+1})) (G_{t+1:h} − Q_{h−1}(S_{t+1}, A_{t+1})) + γ V̄_{h−1}(S_{t+1})
A Unifying Algorithm: n-step Q(σ): Pseudocode*
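A hedged sketch of the Q(σ) recursion that makes the per-step weight explicit, using the same illustrative conventions as the earlier sketches (arrays pi[s, a] and b[s, a], a tabular Q, and a sequence sigma of values in [0, 1]):

import numpy as np

def q_sigma_return(rewards, states, actions, Q, pi, b, sigma, t, h, T, gamma=1.0):
    """G_{t:h} = R_{t+1}
                 + gamma * (sigma_{t+1} * rho_{t+1} + (1 - sigma_{t+1}) * pi(A_{t+1}|S_{t+1}))
                         * (G_{t+1:h} - Q(S_{t+1}, A_{t+1}))
                 + gamma * V_bar(S_{t+1})"""
    if t + 1 >= T:                                   # final transition of the episode
        return rewards[t + 1]
    s1, a1 = states[t + 1], actions[t + 1]
    v_bar = float(np.dot(pi[s1], Q[s1]))             # expected value of S_{t+1} under pi
    if t + 1 == h:                                   # horizon reached: bootstrap
        return rewards[t + 1] + gamma * v_bar
    rho = pi[s1, a1] / b[s1, a1]                     # per-step importance sampling ratio
    # sigma = 1 recovers importance-sampled Sarsa; sigma = 0 recovers Tree Backup
    weight = sigma[t + 1] * rho + (1.0 - sigma[t + 1]) * pi[s1, a1]
    g_next = q_sigma_return(rewards, states, actions, Q, pi, b, sigma,
                            t + 1, h, T, gamma)
    return rewards[t + 1] + gamma * (weight * (g_next - Q[s1, a1]) + v_bar)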
Summary
● n-step: Look ahead to the next n rewards, states, and actions
+ Typically performs better than either MC or 1-step TD
+ Escapes the tyranny of the single time step
- Delay of n steps before learning
- More memory and computation per timestep
● Extended to Eligibility Traces (Ch. 12)
+ Minimize additional memory and computation
- More complex
● Two approaches to off-policy n-step learning
○ Importance sampling: high variance
○ Tree backup: bootstrapping effectively spans only a few steps if the policies are very different, even when n is large
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content at
● github.com/seungjaeryanlee
● www.endtoend.ai