Chapter 4: Dynamic Programming
Seungjae Ryan Lee
Dynamic Programming
● Algorithms to compute optimal policies given a perfect model of the environment
● Use value functions to structure the search for good policies
● Foundation of all methods discussed hereafter
(Diagram: spectrum of methods — Dynamic Programming, Monte Carlo, Temporal Difference)
Policy Evaluation (Prediction)
● Compute the state-value function v_π for some policy π
● Use the Bellman equation:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
Iterative Policy Evaluation
● Solving the linear system exactly is tedious → use iterative methods
● Define a sequence of approximate value functions v_0, v_1, v_2, …
● Expected update using the Bellman equation:
v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
○ Update based on the expectation over all possible next states
Iterative Policy Evaluation in Practice
● In-place methods usually converge faster than keeping two arrays
● Terminate policy evaluation when the largest change in a sweep, Δ = max_s |v_{k+1}(s) − v_k(s)|, is sufficiently small
Gridworld Example
● Deterministic state transitions
● Off-the-grid actions leave the state unchanged
● Undiscounted, episodic task
Policy Evaluation in Gridworld
● Equiprobable random policy (each of the four actions with probability 0.25)
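The expected update can be turned into a short program. Below is a minimal sketch (plain Python, not the original slide code) of in-place iterative policy evaluation for the equiprobable random policy on the 4×4 gridworld: terminal states in two opposite corners, reward −1 per step, undiscounted, and off-grid moves leave the state unchanged.

```python
# In-place iterative policy evaluation on the 4x4 gridworld (a sketch).
N = 4
TERMINAL = {0, N * N - 1}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, move):
    """Deterministic transition; off-grid moves leave the state unchanged."""
    row, col = divmod(s, N)
    r2, c2 = row + move[0], col + move[1]
    return r2 * N + c2 if 0 <= r2 < N and 0 <= c2 < N else s

def evaluate_random_policy(theta=1e-4):
    V = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Expected update: average over the 4 equiprobable actions,
            # each giving reward -1 (undiscounted, gamma = 1)
            v_new = sum(0.25 * (-1 + V[step(s, m)]) for m in MOVES)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place: new values are used immediately
        if delta < theta:
            return V

V = evaluate_random_policy()
for row in range(N):
    print([round(v, 1) for v in V[row * N:(row + 1) * N]])
```

Each sweep applies the expected update to every non-terminal state, and the loop stops once the largest per-sweep change falls below theta.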
Policy Improvement - One state
● Suppose we know v_π for some policy π
● For a state s, see if some action a ≠ π(s) would be better
● Check if q_π(s, a) > v_π(s)
○ If true, greedily selecting a in s (and following π thereafter) is better than following π throughout
○ Special case of the Policy Improvement Theorem
Policy Improvement Theorem
For policies π and π′, if q_π(s, π′(s)) ≥ v_π(s) for all states s,
then π′ is at least as good a policy as π: v_{π′}(s) ≥ v_π(s) for all s.
(Strict inequality in the conclusion if there is strict inequality at any state)
Policy Improvement
● Find a better policy from the computed value function
● Use the new greedy policy:
π′(s) = argmax_a q_π(s, a) = argmax_a Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
● Satisfies the conditions of the Policy Improvement Theorem by construction
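Greedy improvement can be illustrated with a few lines of Python. The MDP below is a hypothetical 5-state corridor (deterministic moves, reward −1 per step, rightmost state terminal) invented for the example; it is not from the slides.

```python
# Greedy policy improvement: pi'(s) = argmax_a [r + gamma * v(s')] for the
# deterministic transition (s, a) -> s'. Hypothetical corridor MDP.
N_STATES = 5          # states 0..4; state 4 is terminal
GAMMA = 1.0
ACTIONS = {"left": -1, "right": +1}

def step(s, a):
    """Move left/right, clamped to the corridor; reward -1 per step."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    return s2, -1

def greedy_policy(v):
    policy = {}
    for s in range(N_STATES - 1):  # skip the terminal state
        # q(s, a) = r + gamma * v(s') for the deterministic transition
        q = {a: step(s, a)[1] + GAMMA * v[step(s, a)[0]] for a in ACTIONS}
        policy[s] = max(q, key=q.get)
    return policy

# Value function of some policy; moving right toward the terminal state is
# clearly better, so the greedy policy picks "right" in every state.
v = [-4.0, -3.0, -2.0, -1.0, 0.0]
print(greedy_policy(v))  # {0: 'right', 1: 'right', 2: 'right', 3: 'right'}
```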
Guarantees of Policy Improvement
● If v_{π′} = v_π, then v_{π′} satisfies the Bellman Optimality Equation, so π and π′ must both be optimal.
→ Policy Improvement always returns a strictly better policy unless the policy is already optimal
Policy Iteration
● Alternate Policy Evaluation and Policy Improvement until the policy is stable
● Guaranteed improvement at each iteration
● Guaranteed convergence in a finite number of iterations for finite MDPs
Policy Iteration in Practice
● Initialize each evaluation with the previous policy's value function for quicker policy evaluation
● Often converges in surprisingly few iterations
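The full evaluate-improve loop can be sketched on a hypothetical two-state MDP; the transition table P and its action names below are invented for illustration, not taken from the slides.

```python
# Policy iteration sketch: alternate evaluation and greedy improvement
# until the policy is stable. P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def evaluate(policy, theta=1e-8):
    """In-place iterative policy evaluation for a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_iteration():
    policy = {s: "stay" for s in P}
    while True:
        V = evaluate(policy)
        stable = True
        for s in P:
            # Greedy improvement: pick the action maximizing q(s, a)
            q = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:  # no state changed its action: policy is optimal
            return policy, V

policy, V = policy_iteration()
print(policy, V)
```

Here the optimal behavior is to move to state 1 and stay there, which the loop finds after two improvement steps.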
Value Iteration
● “Truncate” policy evaluation
○ Don’t wait until the value estimates have converged
○ Update each state’s value just once per sweep
● Evaluation and improvement can be combined into one update operation:
v_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
○ The Bellman optimality equation turned into an update rule
Value Iteration in Practice
● Terminate when the largest change in a sweep, max_s |v_{k+1}(s) − v_k(s)|, is sufficiently small
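The max-backup update and the stopping rule can be sketched together. The two-state MDP below (transition table P, action names) is hypothetical, invented for the example.

```python
# Value iteration sketch: the Bellman optimality equation used directly as
# an update rule. P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]
            v_new = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy

V, policy = value_iteration()
print(V, policy)
```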
Asynchronous Dynamic Programming
● Don’t sweep over the entire state set systematically
○ Some states may be updated many times before others are updated once
○ Order or skip states to propagate information efficiently
● Can be intermixed with real-time interaction
○ Update states according to the agent’s experience
○ Allows focusing updates on the most relevant states
● To converge, all states must continue to be updated
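As a sketch of the asynchronous idea: update one state at a time in an arbitrary (here random) order instead of sweeping, and the values still converge as long as every state keeps being selected. The two-state MDP below is hypothetical, invented for the example.

```python
import random

# Asynchronous value iteration sketch: single-state Bellman optimality
# backups in random order, no systematic sweep.
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def async_value_iteration(n_updates=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in P}
    states = list(P)
    for _ in range(n_updates):
        s = rng.choice(states)  # any schedule works if every state recurs
        V[s] = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
    return V

V = async_value_iteration()
print(V)
```

With enough updates per state, this reaches the same fixed point that systematic sweeps would.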
Generalized Policy Iteration
● The general idea of letting policy evaluation and policy improvement interact
○ The policy is improved w.r.t. the current value function
○ The value function is updated toward the new policy’s value
● Describes almost all RL methods
● When the process stabilizes, the policy and value function are optimal
Efficiency of Dynamic Programming
● Worst-case time polynomial in the number of states and actions
○ Exponentially faster than direct search in policy space
● More practical than linear programming methods for larger problems
○ Asynchronous DP preferred for very large state spaces
● Typically converges far faster than the worst-case guarantee
○ Good initial values can speed up convergence
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
