An Introduction to Reinforcement Learning
Akshay A. Salunkhe
Roll No.: 183021002
Department of Electrical Engineering
Indian Institute of Technology Dharwad
June 8, 2018
Akshay A. Salunkhe (IITDh) Reinforcement Learning June 8, 2018 1 / 41
Overview
1 Introduction to Reinforcement Learning
2 MC MRP
3 Bellman Equation
4 MDP
5 Bellman Optimality Equations
Introduction to Reinforcement Learning
Real life example
Baby trying to walk...
The baby seeks support
You deny support, so the baby tries on its own
Trial and error
Figure 1: baby learning how to walk
Introduction to Reinforcement Learning
Introduction
The baby explores the environment
reward: reaching the mother
action: trying to crawl or walk
Figure 2: system model
Introduction to Reinforcement Learning
How RL Is Different
An active way of learning
Feedback delayed, not instantaneous
Only reward signal, no supervisor
Sequential data, NOT i.i.d.
MC MRP
Markov State
A state S_t is a Markov state iff

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, ..., S_t]    (1)

i.e. the future is independent of the past given the present.
MC MRP
Markov Process
A Markov Process (or Markov Chain) is a tuple <S, P>

S is a finite set of states
P is a state transition probability matrix,

P_{ss'} = Pr[S_{t+1} = s' | S_t = s]    (2)
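A Markov chain is fully specified by its transition matrix, and trajectories can be sampled from it directly. A minimal Python sketch, using a hypothetical 3-state chain (the state names and probabilities below are illustrative, not the slides' Student example):

```python
import random

# Hypothetical 3-state Markov chain; names and probabilities are
# illustrative, not the Student example from the slides.
STATES = ["Study", "Rest", "Sleep"]
P = {
    "Study": {"Study": 0.5, "Rest": 0.4, "Sleep": 0.1},
    "Rest":  {"Study": 0.6, "Rest": 0.2, "Sleep": 0.2},
    "Sleep": {"Sleep": 1.0},   # absorbing terminal state
}

def sample_episode(start, max_steps=50, rng=random):
    """Roll out one trajectory by repeatedly sampling S_{t+1} ~ P[. | S_t]."""
    path, s = [start], start
    for _ in range(max_steps):
        if s == "Sleep":       # terminal: no further transitions
            break
        s = rng.choices(list(P[s]), weights=list(P[s].values()))[0]
        path.append(s)
    return path

random.seed(0)
episode = sample_episode("Study")
```

Each row of P sums to 1, so each `rng.choices` call is a draw from a proper conditional distribution over successor states.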
MC MRP
Example: Student Markov Process (or Markov Chain)
A simple Markov chain example with its transition probability matrix
Figure 3: Student Markov Process
MC MRP
Markov Reward Process
A Markov Reward Process is a Markov chain with values: a tuple <S, P, R, γ>

S is a finite set of states
P is a state transition probability matrix,

P_{ss'} = Pr[S_{t+1} = s' | S_t = s]    (3)

R is a reward function,

R_s = E[R_{t+1} | S_t = s]    (4)

γ is a discount factor, γ ∈ [0, 1]
MC MRP
Return
The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (5)
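Over a finite reward sequence, the sum in Eq. (5) becomes a simple backward fold. A small sketch (the reward sequence is illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):   # fold from the back: G = R + gamma * G_next
        g = r + gamma * g
    return g

# Illustrative reward sequence with gamma = 1/2:
g = discounted_return([-2, -2, -2, 10], 0.5)   # -2 - 1 - 0.5 + 1.25 = -2.25
```

Folding from the back avoids tracking powers of γ explicitly: each step multiplies the accumulated tail by γ once.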
MC MRP
The discount γ is used because:
the math works out well (sums stay bounded)
our model of the environment is not perfect, so near-term rewards are trusted more
it avoids infinite returns in cyclic Markov processes
MC MRP
State Value Function of an MRP
The state value function v(s) of an MRP is the expected return starting from state s,

v(s) = E[G_t | S_t = s]    (6)

The expectation is over all paths starting from s.
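Since v(s) is an expectation over paths, it can be estimated by averaging the returns of many sampled episodes (the Monte Carlo idea). A sketch on a hypothetical two-state MRP for which the exact value is known in closed form:

```python
import random

# Hypothetical two-state MRP (illustrative, not from the slides):
# from state "A", with prob 0.5 we stay in "A" and get reward +1,
# otherwise we move to the terminal state "T" with reward 0.
def rollout(gamma, rng):
    g, discount, s = 0.0, 1.0, "A"
    while s == "A":
        if rng.random() < 0.5:
            g += discount * 1.0   # stayed in A, collected +1
        else:
            s = "T"               # terminated, reward 0
        discount *= gamma
    return g

def mc_value(gamma, episodes=100_000, seed=0):
    """Estimate v(A) = E[G_t | S_t = A] by averaging sampled returns."""
    rng = random.Random(seed)
    return sum(rollout(gamma, rng) for _ in range(episodes)) / episodes

v_est = mc_value(gamma=0.9)
# exact value solves v = 0.5 * (1 + gamma * v), i.e. v = 0.5 / (1 - 0.45)
```

With γ = 0.9 the estimate should land close to 0.5/0.55 ≈ 0.91, illustrating that the expectation really is "over all the paths".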
MC MRP
Sample Returns for Student MRP
Starting from C1, with γ = 1/2:
Figure 4: Student MRP
MC MRP
State Value for C1
The value of C1 is the return averaged over all outward paths from C1
Figure 6: State values
MC MRP
State Values for Student MRP for γ=0
Figure 7: State values
MC MRP
State Values for Student MRP for γ=0.9
Figure 8: State values
Bellman Equation
Bellman Equation
The value function can be decomposed into two parts:
the immediate reward R_{t+1}
the discounted value of the successor state, γ v(S_{t+1})

v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
Bellman Equation
Example for state C3

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')    (7)
Bellman Equation
Bellman Equation in matrix form
The Bellman equation can be written in matrix form as

v = R + γ P v

where v and R are vectors over states and P is the transition matrix.
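For a small finite MRP the matrix equation can be solved directly as v = (I − γP)^{-1} R. A sketch for a hypothetical 2-state chain, writing out the closed-form 2×2 inverse so no linear-algebra library is needed:

```python
def solve_mrp_2state(P, R, gamma):
    """Direct solution v = (I - gamma*P)^(-1) R for a 2-state MRP,
    written out with the closed-form 2x2 inverse."""
    a = 1.0 - gamma * P[0][0]; b = -gamma * P[0][1]
    c = -gamma * P[1][0];      d = 1.0 - gamma * P[1][1]
    det = a * d - b * c
    # inv([[a, b], [c, d]]) = [[d, -b], [-c, a]] / det
    v0 = (d * R[0] - b * R[1]) / det
    v1 = (-c * R[0] + a * R[1]) / det
    return [v0, v1]

# Hypothetical chain: state 0 pays +1 and moves to 0 or 1 equally;
# state 1 is absorbing with zero reward. All numbers are illustrative.
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
v = solve_mrp_2state(P, R, gamma=0.9)
```

The result satisfies the Bellman equation exactly: plugging v back into R + γPv reproduces v, which is the sense in which the direct solution "solves" the linear system.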
Bellman Equation
Solving the Bellman Equation
It is a linear equation
A direct solution, v = (I − γP)^{-1} R, is possible for a finite MRP
MDP
Markov Decision Process
A Markov Decision Process is a tuple <S, A, P, R, γ>

S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix,

P^a_{ss'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]    (8)

R is a reward function,

R^a_s = E[R_{t+1} | S_t = s, A_t = a]    (9)

γ is a discount factor, γ ∈ [0, 1]
MDP
Example: student MDP
An MDP is an MRP with decisions
All states are Markov
An MRP arises when we fix a policy for the MDP
Figure 9: student MDP with actions
MDP
Policy
A policy π is a distribution over actions given states,

π(a|s) = Pr[A_t = a | S_t = s]

A policy fully defines the behaviour of an agent
MDP policies depend only on the current state,
i.e. policies are stationary (time-independent):

A_t ∼ π(· | S_t), ∀t > 0    (10)
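A stationary stochastic policy is just a lookup table of distributions, one per state, that the agent samples from at every time-step. A sketch with illustrative states and actions (not taken from the slides' figures):

```python
import random

# A stationary stochastic policy as a table pi(a | s).
# States, actions, and probabilities here are illustrative.
pi = {
    "C1": {"study": 0.5, "facebook": 0.5},
    "C2": {"study": 0.8, "sleep": 0.2},
}

def act(state, rng=random):
    """Sample A_t ~ pi(. | S_t); the same table is consulted at every
    time-step, which is what makes the policy stationary."""
    actions = list(pi[state])
    return rng.choices(actions, weights=[pi[state][a] for a in actions])[0]

random.seed(1)
a = act("C1")
```

Because `act` depends only on the current state, not on t or on the history, it realizes the condition A_t ∼ π(·|S_t) for all t.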
MDP
Value Functions: MDP
State Value Function:
The state value function v_π(s) of an MDP is the expected return starting from s and then following policy π,

v_π(s) = E_π[G_t | S_t = s]    (11)

Action Value Function:
The action value function q_π(s, a) is the expected return starting from s, taking action a, and then following policy π,

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (12)
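The two value functions are linked: the state value is the policy-weighted average of the action values, v_π(s) = Σ_a π(a|s) q_π(s, a). A minimal sketch with illustrative numbers:

```python
def v_from_q(pi_s, q_s):
    """v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

# Illustrative policy and action values for a single state s.
pi_s = {"left": 0.4, "right": 0.6}
q_s = {"left": 1.0, "right": 3.0}
v = v_from_q(pi_s, q_s)   # 0.4*1.0 + 0.6*3.0 = 2.2
```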
MDP
State Value Functions for Student MDP
Figure 10: State values
MDP
Value Functions: MDP
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can similarly be decomposed:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
MDP
Value Functions: MDP
Figure 11: recursive relation between state value functions
MDP
Value Functions: MDP
Figure 12: recursive relation between action value functions
MDP
Example: Student MDP
Figure 13: calculating State value for Student MDP
MDP
Bellman Expectation Equation: Matrix Form for MDP
The Bellman expectation equation can be expressed in matrix form as

v_π = R^π + γ P^π v_π

with direct solution v_π = (I − γ P^π)^{-1} R^π
Bellman Optimality Equations
Optimal Value Function
The optimal state value function v_*(s) is the maximum state value function over all policies,

v_*(s) = max_π v_π(s)

The optimal action value function q_*(s, a) is the maximum action value function over all policies,

q_*(s, a) = max_π q_π(s, a)

An MDP is solved when we know the optimal action-value function
Bellman Optimality Equations
Optimal State-Value Function for Student MDP
Bellman Optimality Equations
Optimal Action-Value Function for Student MDP
Bellman Optimality Equations
Optimal Policy for student MDP
Bellman Optimality Equations
Bellman Optimality Equations
The optimal state value and action value functions are recursively related by the Bellman optimality equations:

v_*(s) = max_a [ R^a_s + γ Σ_{s'} P^a_{ss'} v_*(s') ]

q_*(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s', a')
Bellman Optimality Equations
Solving the Bellman Optimality Equations
The presence of the max operator makes the equations nonlinear
There is no closed-form solution in general; solution methods include
value iteration
policy iteration
linear programming
Bellman Optimality Equations
Value iteration
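The slide's figure is not available in this text version; as a sketch, value iteration repeatedly applies the Bellman optimality backup until the values stop changing. All states, actions, and rewards below are illustrative:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ],
    swept over all states until no value moves by more than tol."""
    v = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
                for a in A[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

# Hypothetical 2-state MDP; every number is illustrative.
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
P = {("s0", "stay"): {"s0": 1.0},
     ("s0", "go"):   {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "stay"): 1.0, ("s0", "go"): 0.0, ("s1", "stay"): 2.0}
v = value_iteration(S, A, P, R, gamma=0.5)
```

Convergence is guaranteed because the backup is a γ-contraction; here the fixed point is v(s1) = 2/(1 − 0.5) = 4 and v(s0) = 2.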
Bellman Optimality Equations
Policy iteration
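Likewise for policy iteration (the slide's figure is unavailable): alternate iterative evaluation of the current policy with greedy improvement until the policy stops changing. The toy MDP below is illustrative:

```python
def policy_iteration(S, A, P, R, gamma, sweeps=1000):
    """Alternate (i) iterative evaluation of the current policy and
    (ii) greedy improvement, until the policy is stable."""
    pi = {s: A[s][0] for s in S}      # arbitrary initial policy
    v = {s: 0.0 for s in S}

    def q(s, a):
        return R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())

    while True:
        for _ in range(sweeps):       # policy evaluation: v -> v_pi
            for s in S:
                v[s] = q(s, pi[s])
        new_pi = {s: max(A[s], key=lambda a: q(s, a)) for s in S}
        if new_pi == pi:              # policy stable => optimal
            return pi, v
        pi = new_pi

# Toy 2-state MDP; all numbers are illustrative. Only "go" from s0
# reaches the rewarding absorbing state s1.
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
P = {("s0", "stay"): {"s0": 1.0},
     ("s0", "go"):   {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0, ("s1", "stay"): 2.0}
pi, v = policy_iteration(S, A, P, R, gamma=0.5)
```

Starting from the all-"stay" policy, one improvement step switches s0 to "go", after which the policy is stable and therefore greedy with respect to its own value function.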
Bellman Optimality Equations
The End