Understanding (Exact) Dynamic Programming through
Bellman Operators
Ashwin Rao
ICME, Stanford University
January 14, 2019
Overview
1 Vector Space of Value Functions
2 Bellman Operators
3 Contraction and Monotonicity
4 Policy Evaluation
5 Policy Iteration
6 Value Iteration
7 Policy Optimality
Vector Space of Value Functions
Assume State Space S consists of n states: {s1, s2, . . . , sn}
Assume Action Space A consists of m actions: {a1, a2, . . . , am}
This exposition extends easily to continuous state/action spaces too
We denote a stochastic policy as π(a|s) (probability of “a given s”)
Abusing notation, deterministic policy denoted as π(s) = a
Consider an n-dim vector space, each dim corresponding to a state in S
A vector in this space is a specific Value Function (VF) v : S → R, with coordinates [v(s1), v(s2), . . . , v(sn)]
Value Function (VF) for a policy π is denoted as vπ : S → R
Optimal VF denoted as v∗ : S → R such that for any s ∈ S,
v∗(s) = max_π vπ(s)
Some more notation
Denote R^a_s as the Expected Reward upon action a in state s
Denote P^a_{s,s'} as the probability of transition s → s' upon action a
Define
Rπ(s) = Σ_{a∈A} π(a|s) · R^a_s
Pπ(s, s') = Σ_{a∈A} π(a|s) · P^a_{s,s'}
Denote Rπ as the vector [Rπ(s1), Rπ(s2), . . . , Rπ(sn)]
Denote Pπ as the matrix [Pπ(si, si')], 1 ≤ i, i' ≤ n
Denote γ as the MDP discount factor
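As a concrete illustration of this notation, here is a minimal sketch (not from the slides) of how such an MDP might be encoded with NumPy. The 3-state, 2-action MDP, its rewards and its transition probabilities are all made-up toy values.

```python
import numpy as np

# Hypothetical toy MDP: n = 3 states, m = 2 actions, made-up numbers.
# R[a, s] plays the role of R^a_s; P[a, s, s'] plays the role of P^a_{s,s'}.
n, m = 3, 2
gamma = 0.9                                # MDP discount factor

R = np.array([[1.0, 0.0, 2.0],             # expected reward of action a0 in s0, s1, s2
              [0.5, 1.5, 0.0]])            # expected reward of action a1 in s0, s1, s2

P = np.array([[[0.7, 0.2, 0.1],            # transition probabilities under action a0
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4]],
              [[0.2, 0.6, 0.2],            # transition probabilities under action a1
               [0.5, 0.4, 0.1],
               [0.1, 0.1, 0.8]]])          # each row sums to 1

def R_pi(pi):
    """Rπ(s) = Σ_a π(a|s) · R^a_s, with pi[s, a] = π(a|s)."""
    return np.einsum('sa,as->s', pi, R)

def P_pi(pi):
    """Pπ(s, s') = Σ_a π(a|s) · P^a_{s,s'}."""
    return np.einsum('sa,asp->sp', pi, P)
```

A stochastic policy is then just an n × m matrix of probabilities π(a|s), and a deterministic policy can be one-hot encoded into the same shape.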
Bellman Operators Bπ and B∗
We define operators that transform a VF vector to another VF vector
Bellman Policy Operator Bπ (for policy π) operating on VF vector v:
Bπv = Rπ + γPπ · v
Bπ is a linear operator with fixed point vπ, meaning Bπvπ = vπ
Bellman Optimality Operator B∗ operating on VF vector v:
(B∗v)(s) = max_a {R^a_s + γ · Σ_{s'∈S} P^a_{s,s'} · v(s')}
B∗ is a non-linear operator with fixed point v∗, meaning B∗v∗ = v∗
Define a function G mapping a VF v to a deterministic “greedy”
policy G(v) as follows:
G(v)(s) = argmax_a {R^a_s + γ · Σ_{s'∈S} P^a_{s,s'} · v(s')}
B_{G(v)}v = B∗v for any VF v (Policy G(v) achieves the max in B∗)
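The two operators translate almost directly into code. The sketch below is illustrative only and assumes the toy arrays R, P, gamma and the helpers R_pi, P_pi from the earlier snippet are in scope.

```python
import numpy as np

# Assumes R, P, gamma, R_pi, P_pi from the earlier toy-MDP sketch are in scope.

def bellman_policy_operator(v, pi):
    """Bπ v = Rπ + γ · Pπ · v."""
    return R_pi(pi) + gamma * P_pi(pi) @ v

def bellman_optimality_operator(v):
    """(B∗ v)(s) = max_a {R^a_s + γ · Σ_{s'} P^a_{s,s'} · v(s')}."""
    q = R + gamma * P @ v        # q[a, s]: one-step lookahead value of action a in state s
    return q.max(axis=0)

def greedy_policy(v):
    """G(v)(s) = argmax_a {R^a_s + γ · Σ_{s'} P^a_{s,s'} · v(s')}."""
    q = R + gamma * P @ v
    return q.argmax(axis=0)      # deterministic policy: one action index per state
```

By construction, applying bellman_policy_operator with the one-hot encoding of greedy_policy(v) gives the same vector as bellman_optimality_operator(v), which is the identity B_{G(v)}v = B∗v stated above.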
Contraction and Monotonicity of Operators
Both Bπ and B∗ are γ-contraction operators in L∞ norm, meaning:
For any two VFs v1 and v2,
‖Bπv1 − Bπv2‖∞ ≤ γ · ‖v1 − v2‖∞
‖B∗v1 − B∗v2‖∞ ≤ γ · ‖v1 − v2‖∞
So we can invoke the Contraction Mapping Theorem to claim a unique fixed point
We use the notation v1 ≤ v2 for any two VFs v1, v2 to mean:
v1(s) ≤ v2(s) for all s ∈ S
Also, both Bπ and B∗ are monotonic, meaning:
For any two VFs v1 and v2,
v1 ≤ v2 ⇒ Bπv1 ≤ Bπv2
v1 ≤ v2 ⇒ B∗v1 ≤ B∗v2
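As a quick (non-rigorous) numerical spot-check of the contraction claim on the toy MDP, assuming the operators from the previous sketch are in scope:

```python
import numpy as np

# Assumes bellman_optimality_operator and gamma from the earlier sketches.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=3), rng.normal(size=3)   # two arbitrary VFs for the 3-state toy MDP

lhs = np.max(np.abs(bellman_optimality_operator(v1) - bellman_optimality_operator(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))
print(lhs <= rhs + 1e-12)                         # contraction in the sup norm: expect True
```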
Policy Evaluation
Bπ satisfies the conditions of the Contraction Mapping Theorem
Bπ has a unique fixed point vπ, meaning Bπvπ = vπ
This is a succinct representation of Bellman Expectation Equation
Starting with any VF v and repeatedly applying Bπ, we will reach vπ
lim_{N→∞} B^N_π v = vπ for any VF v
This is a succinct representation of the Policy Evaluation Algorithm
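Policy Evaluation is then just repeated application of Bπ until the iterates stop changing. A minimal sketch, assuming bellman_policy_operator, n and m from the earlier snippets; the uniform-random policy and the tolerance are arbitrary illustrative choices.

```python
import numpy as np

# Assumes bellman_policy_operator, n, m from the earlier sketches.

def evaluate_policy(pi, tol=1e-10):
    v = np.zeros(n)                                # any starting VF converges to v_pi
    while True:
        v_next = bellman_policy_operator(v, pi)    # one application of B_pi
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

uniform_pi = np.full((n, m), 1.0 / m)              # pi(a|s) = 1/m for every state
print(evaluate_policy(uniform_pi))
```

Since Bπ is affine, vπ could equivalently be obtained in closed form by solving the linear system (I − γPπ) v = Rπ; the iterative version is shown here because it mirrors the operator view.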
Policy Improvement
Let πk and v_{πk} denote the Policy and the VF for the Policy in iteration k of Policy Iteration
Policy Improvement Step is: πk+1 = G(v_{πk}), i.e. deterministic greedy
Earlier we argued that B∗v = B_{G(v)}v for any VF v. Therefore,
B∗ v_{πk} = B_{G(v_{πk})} v_{πk} = B_{πk+1} v_{πk}    (1)
We also know from the operator definitions that B∗v ≥ Bπv for all π, v
B∗ v_{πk} ≥ B_{πk} v_{πk} = v_{πk}    (2)
Combining (1) and (2), we get:
B_{πk+1} v_{πk} ≥ v_{πk}
Monotonicity of B_{πk+1} implies
B^N_{πk+1} v_{πk} ≥ . . . ≥ B^2_{πk+1} v_{πk} ≥ B_{πk+1} v_{πk} ≥ v_{πk}
v_{πk+1} = lim_{N→∞} B^N_{πk+1} v_{πk} ≥ v_{πk}
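The improvement argument can be checked numerically on the toy MDP: evaluating the greedy policy built from v_{πk} produces a VF that dominates it. This sketch assumes evaluate_policy, greedy_policy, n and m from the earlier illustrative snippets.

```python
import numpy as np

# Assumes evaluate_policy, greedy_policy, n, m from the earlier sketches.
old_pi = np.full((n, m), 1.0 / m)          # an arbitrary starting policy pi_k
v_old = evaluate_policy(old_pi)            # v_{pi_k}

new_actions = greedy_policy(v_old)         # pi_{k+1} = G(v_{pi_k})
new_pi = np.eye(m)[new_actions]            # one-hot encode the deterministic policy as pi(a|s)
v_new = evaluate_policy(new_pi)            # v_{pi_{k+1}}

print(np.all(v_new >= v_old - 1e-12))      # expect True: v_{pi_{k+1}} >= v_{pi_k}
```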
Policy Iteration
We have shown that in iteration k + 1 of Policy Iteration, v_{πk+1} ≥ v_{πk}
If v_{πk+1} = v_{πk}, the above inequalities would hold as equalities
So this would mean B∗ v_{πk} = v_{πk}
But B∗ has a unique fixed point v∗
So this would mean v_{πk} = v∗
Thus, at each iteration, Policy Iteration either strictly improves the VF or achieves the optimal VF v∗
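Putting evaluation and improvement together gives Policy Iteration. A minimal sketch using the pieces above; stopping when the greedy policy no longer changes is one common convention, not the only one.

```python
import numpy as np

# Assumes evaluate_policy, greedy_policy, n, m from the earlier sketches.

def policy_iteration():
    actions = np.zeros(n, dtype=int)               # arbitrary initial deterministic policy
    while True:
        pi = np.eye(m)[actions]                    # encode as pi(a|s)
        v = evaluate_policy(pi)                    # Policy Evaluation: v_{pi_k}
        improved = greedy_policy(v)                # Policy Improvement: G(v_{pi_k})
        if np.array_equal(improved, actions):      # no change => v is a fixed point of B_*
            return improved, v
        actions = improved

pi_star, v_star = policy_iteration()
print(pi_star, v_star)
```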
Value Iteration
B∗ satisfies the conditions of the Contraction Mapping Theorem
B∗ has a unique fixed point v∗, meaning B∗v∗ = v∗
This is a succinct representation of Bellman Optimality Equation
Starting with any VF v and repeatedly applying B∗, we will reach v∗
lim_{N→∞} B^N_∗ v = v∗ for any VF v
This is a succinct representation of the Value Iteration Algorithm
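Value Iteration similarly falls out of repeatedly applying B∗. A minimal sketch, assuming bellman_optimality_operator and n from the earlier snippets; the tolerance is an arbitrary illustrative choice.

```python
import numpy as np

# Assumes bellman_optimality_operator and n from the earlier sketches.

def value_iteration(tol=1e-10):
    v = np.zeros(n)                                 # any starting VF converges to v_*
    while True:
        v_next = bellman_optimality_operator(v)     # one application of B_*
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

v_star = value_iteration()
print(v_star)                                       # should match the VF found by Policy Iteration
```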
Greedy Policy from Optimal VF is an Optimal Policy
Earlier we argued that B_{G(v)}v = B∗v for any VF v. Therefore,
B_{G(v∗)} v∗ = B∗ v∗
But v∗ is the fixed point of B∗, meaning B∗v∗ = v∗. Therefore,
B_{G(v∗)} v∗ = v∗
But we know that B_{G(v∗)} has a unique fixed point v_{G(v∗)}. Therefore,
v∗ = v_{G(v∗)}
This says that simply following the deterministic greedy policy G(v∗)
(created from the Optimal VF v∗) in fact achieves the Optimal VF v∗
In other words, G(v∗) is an Optimal (Deterministic) Policy
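On the toy MDP, this result can be sanity-checked by evaluating the greedy policy extracted from v∗: its VF should coincide with v∗ up to numerical tolerance (using the helpers from the earlier sketches).

```python
import numpy as np

# Assumes value_iteration, greedy_policy, evaluate_policy, m from the earlier sketches.
v_star = value_iteration()
pi_star = np.eye(m)[greedy_policy(v_star)]          # deterministic greedy policy G(v_*)
gap = np.max(np.abs(evaluate_policy(pi_star) - v_star))
print(gap < 1e-8)                                   # expect True: v_{G(v_*)} = v_*
```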