Value Functions and Markov Decision Process
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : easwar.subramanian@tcs.com / cs5500.2020@iith.ac.in
August 12, 2022
Overview
1. Review
2. Value Function
3. Markov Decision Process
Review
Markov Property

A state $s_t$ of a stochastic process $\{s_t\}_{t \in T}$ is said to have the Markov property if
$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \cdots, s_t)$$
The state $s_t$ at time $t$ captures all relevant information from the history and is a sufficient statistic of the future.
State Transition Matrix
State Transition Probability
For a Markov state $s$ and a successor state $s'$, the state transition probability is defined by
$$P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$$
The state transition matrix $P$ then collects the transition probabilities from all states $s$ to all successor states $s'$ (with each row summing to 1):
$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ \vdots & & & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}$$
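As a concrete illustration (not taken from the lecture), a transition matrix for a three-state chain can be written down directly; the only structural requirement is that each row is a probability distribution over successor states. A minimal sketch with assumed numbers:

```python
# A minimal sketch of a state transition matrix (the numbers are assumptions
# chosen for illustration, not from the slides).
import numpy as np

# P[i, j] = Pr(s_{t+1} = j | s_t = i)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

# Each row must sum to 1, since it is a distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)
```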
Markov Chain
A stochastic process $\{s_t\}_{t \in T}$ is a Markov process or Markov chain if it satisfies the Markov property for every state $s_t$. It is represented by the tuple $\langle S, P \rangle$, where $S$ denotes the set of states and $P$ denotes the state transition probability.

There is no notion of reward or action.
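Since a Markov chain is fully specified by $\langle S, P \rangle$, a trajectory can be sampled by repeatedly drawing the next state from the row of $P$ indexed by the current state. A small sketch, with assumed state names and probabilities:

```python
# Sampling a trajectory from a Markov chain <S, P>; the states and
# probabilities here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
states = ["sunny", "cloudy", "rainy"]
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])

s = 0                       # start in "sunny"
trajectory = [states[s]]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```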
Markov Reward Process

A Markov reward process is a tuple $\langle S, P, R, \gamma \rangle$; it is a Markov chain with values, where

- $S$ : (finite) set of states
- $P$ : state transition probability
- $R$ : the reward for being in state $s_t$ is given by a deterministic function $R$, i.e. $r_{t+1} = R(s_t)$
- $\gamma$ : discount factor such that $\gamma \in [0, 1]$
- In general, the reward function can also be an expectation, $R(s_t = s) = E[r_{t+1} \mid s_t = s]$

There is no notion of action.
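To make the definition concrete, the sketch below builds a tiny MRP (the states, transition matrix, rewards and discount factor are assumptions for illustration) and samples the discounted return $G_t = \sum_k \gamma^k r_{t+k+1}$ along one trajectory.

```python
# A toy Markov reward process <S, P, R, gamma>; all numbers are assumed.
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])      # state 2 is absorbing
R = np.array([-1.0, -1.0, 0.0])      # deterministic reward R(s_t), received as r_{t+1}
gamma = 0.9

def sampled_return(s, horizon=100):
    """Sample G_t = sum_k gamma^k r_{t+k+1} along one trajectory starting in s."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        G += discount * R[s]         # r_{t+1} = R(s_t)
        discount *= gamma
        s = rng.choice(len(R), p=P[s])
    return G

print(sampled_return(0))
```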
Value Function
Snakes and Ladders : Revisited
- Reward $R$ : $R(s) = -1$ for $s \in \{s_1, \cdots, s_{99}\}$ and $R(s_{100}) = 0$
- Discount factor $\gamma = 1$
Snakes and Ladders : Revisited
Question : Are all intermediate states equally 'valuable' just because they have equal reward?
Value Function
The value function $V(s)$ gives the long-term value of state $s \in S$:
$$V(s) = E(G_t \mid s_t = s) = E\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \right)$$

- The value function $V(s)$ determines the value of being in state $s$
- $V(s)$ measures the potential future rewards we may get from being in state $s$
- $V(s)$ is independent of $t$
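One way to see what $V(s) = E(G_t \mid s_t = s)$ measures is to estimate it by brute force: average the sampled returns over many trajectories that start in $s$. A rough sketch, reusing `sampled_return`, `P`, `R` and `gamma` from the toy MRP sketch above (all of which were assumptions):

```python
# Monte Carlo estimate of V(s): average of sampled returns from state s.
def mc_value_estimate(s, n_episodes=10_000):
    return sum(sampled_return(s) for _ in range(n_episodes)) / n_episodes

print(mc_value_estimate(0))   # approximate long-term value of state 0
```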
Value Function Computation : Example
Consider the following MRP (transition diagram on the slide). Assume $\gamma = 1$.

- $V(s_1) = 6.8$
- $V(s_2) = 1 + \gamma \cdot 6 = 7$
- $V(s_3) = 3 + \gamma \cdot 6 = 9$
- $V(s_4) = 6$
Example : Snakes and Ladders
Question : How can we evaluate the value of each state in a large MRP such as 'Snakes and Ladders'?
Decomposition of Value Function
Let $s$ and $s'$ be the states at time steps $t$ and $t+1$ (so $s'$ is a successor of $s$). The value function can be decomposed into the sum of two parts:

- the immediate reward $r_{t+1}$
- the discounted value of the next state $s'$, i.e. $\gamma V(s')$

$$V(s) = E(G_t \mid s_t = s) = E\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \right) = E\left( r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s \right)$$
Decomposition of Value Function
Recall that
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
Then
$$
\begin{aligned}
V(s) &= E(G_t \mid s_t = s) = E\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \right) \\
&= E\left( r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \right) \\
&= E(r_{t+1} \mid s_t = s) + \sum_{k=1}^{\infty} \gamma^k \, E(r_{t+k+1} \mid s_t = s) \\
&= E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P(s' \mid s) \sum_{k=0}^{\infty} \gamma^k \, E(r_{t+k+2} \mid s_t = s, s_{t+1} = s') \\
&= E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P(s' \mid s) \sum_{k=0}^{\infty} \gamma^k \, E(r_{t+k+2} \mid s_{t+1} = s') \qquad \text{(Markov property)} \\
&= E(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s)
\end{aligned}
$$
Value Function : Evaluation
We have
$$V(s) = E(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s)$$
For a state $s$ with successor states $s'_a, s'_b, s'_c, s'_d$ (as in the diagram on the slide), this becomes
$$V(s) = R(s) + \gamma \left[ P_{s s'_a} V(s'_a) + P_{s s'_b} V(s'_b) + P_{s s'_c} V(s'_c) + P_{s s'_d} V(s'_d) \right]$$
Value Function Computation : Example
Consider the following MRP (same diagram as before). Assume $\gamma = 1$.

- $V(s_4) = 6$
- $V(s_3) = 3 + \gamma \cdot 6 = 9$
- $V(s_2) = 1 + \gamma \cdot 6 = 7$
- $V(s_1) = -1 + \gamma \cdot (0.6 \cdot 7 + 0.4 \cdot 9) = 6.8$
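The hand computation above can be checked numerically. The transition structure below is inferred from the numbers on the slide ($s_1 \to s_2$ with probability 0.6, $s_1 \to s_3$ with probability 0.4, $s_2 \to s_4$, $s_3 \to s_4$, and $s_4$ terminal), so treat it as an assumption rather than the exact diagram in the lecture.

```python
# Backward evaluation of the four-state example MRP with gamma = 1.
gamma = 1.0
R = {"s1": -1.0, "s2": 1.0, "s3": 3.0, "s4": 6.0}   # rewards read off the slide

V = {}
V["s4"] = R["s4"]                                    # terminal: no successor value
V["s3"] = R["s3"] + gamma * V["s4"]                  # 3 + 6 = 9
V["s2"] = R["s2"] + gamma * V["s4"]                  # 1 + 6 = 7
V["s1"] = R["s1"] + gamma * (0.6 * V["s2"] + 0.4 * V["s3"])   # -1 + 7.8 = 6.8
print(V)
```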
Bellman Equation for Markov Reward Process
$$V(s) = E(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s)$$
For the successor states $s' \in S$ of $s$, reached with transition probabilities $P_{ss'}$, we can rewrite the above equation (using the definition of expectation) as
$$V(s) = E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P_{ss'} V(s')$$
This is the Bellman equation for value functions.
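The Bellman equation also suggests a simple evaluation algorithm: start from an arbitrary guess and repeatedly apply the backup $V \leftarrow R + \gamma P V$ until the values stop changing. A sketch ($P$, $R$ and $\gamma$ are assumed to be given as NumPy arrays and a scalar):

```python
# Iterative evaluation of an MRP via repeated Bellman backups.
import numpy as np

def evaluate_mrp(P, R, gamma, tol=1e-8, max_iters=10_000):
    V = np.zeros(len(R))
    for _ in range(max_iters):
        V_new = R + gamma * (P @ V)    # Bellman backup for all states at once
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```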
Snakes and Ladders
Question : How can we evaluate the value of (all) states using the value function decomposition?
$$V(s) = E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P_{ss'} V(s')$$
Bellman Equation in Matrix Form
Let $S = \{1, 2, \cdots, n\}$ and let $P$ be known. Then the Bellman equation can be written as
$$V = R + \gamma P V$$
where
$$\begin{bmatrix} V(1) \\ V(2) \\ \vdots \\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(n) \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & & & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ V(2) \\ \vdots \\ V(n) \end{bmatrix}$$
Solving for $V$, we get
$$V = (I - \gamma P)^{-1} R$$
The discount factor should satisfy $\gamma < 1$ for the inverse to exist.
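For a small state space this closed-form solution can be computed directly. The sketch below performs a linear solve of $(I - \gamma P)V = R$ rather than forming the inverse explicitly, which is the numerically preferable route; the toy $P$, $R$ and $\gamma$ are assumptions.

```python
# Closed-form evaluation of an MRP: solve (I - gamma P) V = R.
import numpy as np

def solve_mrp(P, R, gamma):
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

P = np.array([[0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])       # toy chain, last state absorbing (assumed)
R = np.array([-1.0, 1.0, 0.0])
print(solve_mrp(P, R, gamma=0.9))
```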
Example : Snakes and Ladders
- We can now compute the value of states in such a 'large' MRP using the matrix form of the Bellman equation
- With $R(s) = -1$ per play, the value computed for a particular state gives (in magnitude) the expected number of plays needed to reach the goal state $s_{100}$ from that state
Few Remarks on Discounting
$$V(s) = E(G_t \mid s_t = s) = E\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \right)$$

- Mathematically convenient to discount rewards
- Avoids infinite returns in cyclic and infinite-horizon settings
- The discount rate determines the present value of future rewards
- Offers a trade-off between being 'myopic' and 'far-sighted'
- In certain classes of MDPs it is sometimes possible to use undiscounted rewards (i.e. $\gamma = 1$), for example if all sequences terminate
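For example, with a constant reward $r$ at every step the discounted return reduces to a geometric series, which is finite precisely because $\gamma < 1$:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r = \frac{r}{1 - \gamma}, \qquad 0 \le \gamma < 1,$$
whereas the undiscounted sum ($\gamma = 1$) diverges for any $r \neq 0$.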
Markov Decision Process
Markov Decision Process
A Markov decision process is a tuple $\langle S, A, P, R, \gamma \rangle$ where

- $S$ : (finite) set of states
- $A$ : (finite) set of actions
- $P$ : state transition probability
  $$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a), \quad a_t \in A$$
- $R$ : the reward for taking action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$ is given by the deterministic function $R$, i.e. $r_{t+1} = R(s_t, a_t, s_{t+1})$
- $\gamma$ : discount factor such that $\gamma \in [0, 1]$
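A minimal sketch of how the tuple $\langle S, A, P, R, \gamma \rangle$ might be laid out in code; the states, actions, probabilities and reward rule are all assumptions made up for illustration.

```python
# A toy MDP <S, A, P, R, gamma>.
import numpy as np

S = [0, 1]                          # finite set of states
A = ["left", "right"]               # finite set of actions
gamma = 0.95

# P[a][s, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a); each row sums to 1.
P = {
    "left":  np.array([[0.9, 0.1],
                       [0.8, 0.2]]),
    "right": np.array([[0.1, 0.9],
                       [0.2, 0.8]]),
}

def R(s, a, s_next):
    """Deterministic reward R(s_t, a_t, s_{t+1}); a made-up rule for the sketch."""
    return 1.0 if s_next == 1 else 0.0
```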
Wealth Management Problem
- States $S$ : current value of the portfolio and current valuation of the instruments in the portfolio
- Actions $A$ : buy / sell instruments of the portfolio
- Reward $R$ : return on the portfolio compared to the previous decision epoch
Navigation Problem
- States $S$ : squares of the grid
- Actions $A$ : any of the four possible directions
- Reward $R$ : $-1$ for every move made until reaching the goal state
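The bullets above translate almost directly into code. A tiny deterministic grid-world sketch (the 4 x 4 size and the goal square are assumptions):

```python
# A deterministic grid-world navigation MDP: states are squares, actions are
# the four directions, reward is -1 per move until the goal is reached.
N = 4                                    # assumed 4 x 4 grid
GOAL = (3, 3)                            # assumed goal square
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward); moving into a wall leaves the state unchanged."""
    if state == GOAL:
        return state, 0.0
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), N - 1)
    c = min(max(state[1] + dc, 0), N - 1)
    return (r, c), -1.0
```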
Example : Atari Games
- States $S$ : the set of all possible (Atari) images
- Actions $A$ : move the paddle up or down
- Reward $R$ : +1 for making the opponent miss the ball; $-1$ if the agent misses the ball; 0 otherwise
Flow Diagram
- The goal is to choose a sequence of actions such that the expected total discounted future reward $E(G_t \mid s_t = s)$ is maximized, where
  $$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
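The flow diagram corresponds to the usual agent-environment loop: at each step the agent picks an action, the environment returns the next state and reward, and the return $G_t$ accumulates. The sketch below reuses the grid-world `step`, `ACTIONS` and `GOAL` defined earlier (assumed objects, not the lecture's code) and a uniformly random policy.

```python
# One episode of agent-environment interaction, accumulating the return G_0.
import random

random.seed(0)
gamma = 1.0
state, G, discount = (0, 0), 0.0, 1.0
for t in range(100):
    action = random.choice(list(ACTIONS))    # a (poor) uniformly random policy
    state, reward = step(state, action)
    G += discount * reward
    discount *= gamma
    if state == GOAL:
        break
print("Return G_0 =", G)
```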
Windy Grid World : Stochastic Environment
Recall that, given an MDP $\langle S, A, P, R, \gamma \rangle$, the state transition probability $P$ is defined as
$$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a), \quad a_t \in A$$

- In general, note that even after choosing action $a$ in state $s$ (as prescribed by the policy), the next state $s'$ need not be a fixed state
Finite and Infinite Horizon MDPs
- If $T$ is fixed and finite, the resultant MDP is a finite horizon MDP
  - Wealth management problem
- If $T$ is infinite, the resultant MDP is an infinite horizon MDP
  - Certain Atari games
- When $|S|$ is finite, the MDP is called a finite state MDP
Grid World Example
Question : Is the grid world a finite or an infinite horizon problem? Why?
(Stochastic shortest path MDPs)

- For finite horizon MDPs and stochastic shortest path MDPs, one can use $\gamma = 1$
