Chapter 4: Dynamic Programming
Seungjae Ryan Lee
Dynamic Programming
● Algorithms to compute optimal policies given a perfect model of the environment
● Use value functions to structure the search for good policies
● Foundation of all methods discussed hereafter
(Diagram: spectrum of methods — Dynamic Programming, Monte Carlo, Temporal Difference)
Policy Evaluation (Prediction)
● Compute the state-value function v_π for some policy π
● Use the Bellman equation:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
Iterative Policy Evaluation
● Solving the linear system exactly is tedious → use iterative methods
● Define a sequence of approximate value functions v_0, v_1, v_2, …
● Expected update using the Bellman equation:
v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
○ Update based on the expectation over all possible next states
Iterative Policy Evaluation in Practice
● In-place methods usually converge faster than keeping two arrays
● Terminate policy evaluation when the largest change in a sweep, Δ = max_s |v_{k+1}(s) − v_k(s)|, is sufficiently small
Gridworld Example
● Deterministic state transitions
● Off-the-grid actions leave the state unchanged
● Undiscounted, episodic task
Policy Evaluation in Gridworld
● Equiprobable random policy (each of the four actions with probability 0.25)
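The expected update can be turned into a short program. Below is a minimal sketch (plain Python, not the original slide code) of in-place iterative policy evaluation for the equiprobable random policy on the 4×4 gridworld: terminal states in two opposite corners, reward −1 per step, undiscounted, and off-grid moves leave the state unchanged.

```python
# In-place iterative policy evaluation on the 4x4 gridworld (a sketch).
N = 4
TERMINAL = {0, N * N - 1}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, move):
    """Deterministic transition; off-grid moves leave the state unchanged."""
    row, col = divmod(s, N)
    r2, c2 = row + move[0], col + move[1]
    return r2 * N + c2 if 0 <= r2 < N and 0 <= c2 < N else s

def evaluate_random_policy(theta=1e-4):
    V = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Expected update: average over the 4 equiprobable actions,
            # each giving reward -1 (undiscounted, gamma = 1)
            v_new = sum(0.25 * (-1 + V[step(s, m)]) for m in MOVES)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place: new values are used immediately
        if delta < theta:
            return V

V = evaluate_random_policy()
for row in range(N):
    print([round(v, 1) for v in V[row * N:(row + 1) * N]])
```

Each sweep applies the expected update to every non-terminal state, and the loop stops once the largest per-sweep change falls below theta.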
Policy Improvement - One state
● Suppose we know v_π for some policy π
● For a state s, see if some action a ≠ π(s) would be better
● Check if q_π(s, a) > v_π(s)
○ If true, greedily selecting a in s (and following π thereafter) is better than following π throughout
○ Special case of the Policy Improvement Theorem
Policy Improvement Theorem
For policies π and π′, if q_π(s, π′(s)) ≥ v_π(s) for all states s,
then π′ is at least as good a policy as π: v_{π′}(s) ≥ v_π(s) for all s.
(Strict inequality in the conclusion if there is strict inequality at any state)
Policy Improvement
● Find a better policy from the computed value function
● Use the new greedy policy:
π′(s) = argmax_a q_π(s, a) = argmax_a Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
● Satisfies the conditions of the Policy Improvement Theorem by construction
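Greedy improvement can be illustrated with a few lines of Python. The MDP below is a hypothetical 5-state corridor (deterministic moves, reward −1 per step, rightmost state terminal) invented for the example; it is not from the slides.

```python
# Greedy policy improvement: pi'(s) = argmax_a [r + gamma * v(s')] for the
# deterministic transition (s, a) -> s'. Hypothetical corridor MDP.
N_STATES = 5          # states 0..4; state 4 is terminal
GAMMA = 1.0
ACTIONS = {"left": -1, "right": +1}

def step(s, a):
    """Move left/right, clamped to the corridor; reward -1 per step."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    return s2, -1

def greedy_policy(v):
    policy = {}
    for s in range(N_STATES - 1):  # skip the terminal state
        # q(s, a) = r + gamma * v(s') for the deterministic transition
        q = {a: step(s, a)[1] + GAMMA * v[step(s, a)[0]] for a in ACTIONS}
        policy[s] = max(q, key=q.get)
    return policy

# Value function of some policy; moving right toward the terminal state is
# clearly better, so the greedy policy picks "right" in every state.
v = [-4.0, -3.0, -2.0, -1.0, 0.0]
print(greedy_policy(v))  # {0: 'right', 1: 'right', 2: 'right', 3: 'right'}
```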
Guarantees of Policy Improvement
● If v_{π′} = v_π, then v_{π′} satisfies the Bellman Optimality Equation, so π and π′ must both be optimal.
→ Policy Improvement always returns a strictly better policy unless the policy is already optimal
Policy Iteration
● Alternate Policy Evaluation and Policy Improvement until the policy is stable
● Guaranteed improvement at each iteration
● Guaranteed convergence in a finite number of iterations for finite MDPs
Policy Iteration in Practice
● Initialize each evaluation with the previous policy's value function for quicker policy evaluation
● Often converges in surprisingly few iterations
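The full evaluate-improve loop can be sketched on a hypothetical two-state MDP; the transition table P and its action names below are invented for illustration, not taken from the slides.

```python
# Policy iteration sketch: alternate evaluation and greedy improvement
# until the policy is stable. P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def evaluate(policy, theta=1e-8):
    """In-place iterative policy evaluation for a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_iteration():
    policy = {s: "stay" for s in P}
    while True:
        V = evaluate(policy)
        stable = True
        for s in P:
            # Greedy improvement: pick the action maximizing q(s, a)
            q = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:  # no state changed its action: policy is optimal
            return policy, V

policy, V = policy_iteration()
print(policy, V)
```

Here the optimal behavior is to move to state 1 and stay there, which the loop finds after two improvement steps.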
Value Iteration
● “Truncate” policy evaluation
○ Don’t wait until the value estimates have converged
○ Update each state’s value just once per sweep
● Evaluation and improvement can be combined into one update operation:
v_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v_k(s')]
○ The Bellman optimality equation turned into an update rule
Value Iteration in Practice
● Terminate when the largest change in a sweep, max_s |v_{k+1}(s) − v_k(s)|, is sufficiently small
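The max-backup update and the stopping rule can be sketched together. The two-state MDP below (transition table P, action names) is hypothetical, invented for the example.

```python
# Value iteration sketch: the Bellman optimality equation used directly as
# an update rule. P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]
            v_new = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, policy

V, policy = value_iteration()
print(V, policy)
```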
Asynchronous Dynamic Programming
● Don’t sweep over the entire state set systematically
○ Some states may be updated many times before others are updated once
○ Order or skip states to propagate information efficiently
● Can be intermixed with real-time interaction
○ Update states according to the agent’s experience
○ Allows focusing updates on the most relevant states
● To converge, all states must continue to be updated
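As a sketch of the asynchronous idea: update one state at a time in an arbitrary (here random) order instead of sweeping, and the values still converge as long as every state keeps being selected. The two-state MDP below is hypothetical, invented for the example.

```python
import random

# Asynchronous value iteration sketch: single-state Bellman optimality
# backups in random order, no systematic sweep.
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def async_value_iteration(n_updates=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in P}
    states = list(P)
    for _ in range(n_updates):
        s = rng.choice(states)  # any schedule works if every state recurs
        V[s] = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
    return V

V = async_value_iteration()
print(V)
```

With enough updates per state, this reaches the same fixed point that systematic sweeps would.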
Generalized Policy Iteration
● The general idea of letting policy evaluation and policy improvement interact
○ The policy is improved w.r.t. the current value function
○ The value function is updated toward the new policy’s value
● Describes almost all RL methods
● When the process stabilizes, the policy and value function are optimal
Efficiency of Dynamic Programming
● Worst-case time polynomial in the number of states and actions
○ Exponentially faster than direct search in policy space
● More practical than linear programming methods for larger problems
○ Asynchronous DP preferred for very large state spaces
● Typically converges far faster than the worst-case guarantee
○ Good initial values can speed up convergence
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
