Markov Chains and Reward Process
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : easwar.subramanian@tcs.com / cs5500.2020@iith.ac.in
August 5, 2022
Administrivia
I Please consult Prof. Vineeth for all queries related to registration and other
administrative issues.
I If need be, register for CS 5500 instead of AI 3000 (relevant for CS, PhD students).
Overview
1 Review
2 Mathematical Framework for Decision Making
3 Markov Chains
4 Markov Reward Process
Review
Types of Learning : Summary
Figure Source: Saggie
Characteristics of Reinforcement Learning
I Observations are non-i.i.d. and sequential in nature
I The agent’s actions (may) affect the subsequent observations seen
I There is no supervisor; only a reward signal (feedback)
I Reward or feedback can be delayed
Reinforcement Learning : History
Slide Credit: RL Course, Abir Das
Course Setup
Mathematical Framework for Decision Making
RL Framework : Notations
Figure Source: Sutton and Barto
Markov Decision Process
I A Markov Decision Process (MDP) provides a mathematical framework for modeling the
decision-making process
I Can formally describe the working of the environment and agent in the RL setting
I Can handle a huge variety of interesting settings
F Multi-arm Bandits - Single state MDPs
F Optimal Control - Continuous MDPs
I The core problem in solving an MDP is to find an ’optimal’ policy (or behaviour) for the
decision maker (agent) so as to maximize the total future reward
Markov Chains
Random Variables and Stochastic Process
Random Variable (Non-mathematical definition)
A random variable is a variable whose value depends on the outcome of a random
phenomenon
I Outcome of a coin toss
I Outcome of the roll of a die
Stochastic Process
A stochastic or random process, denoted by {st}t∈T , can be defined as a collection of
random variables that is indexed by some mathematical set T
I Index set has the interpretation of time
I The set T is, typically, N or R
Notations
I Typically, in optimal control problems, the index set is continuous (say R)
I Throughout this course (RL), the index set is always discrete (say N)
I Let {st}t∈T be a stochastic process
I Let st be the state at time t of the stochastic process {st}t∈T
Markov Property
Markov Property
A state st of a stochastic process {st}t∈T is said to have the Markov property if
$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \cdots, s_t)$$
The state st at time t captures all relevant information from history and is a sufficient
statistic of the future
Transition Probability
State Transition Probability
For a Markov state s and a successor state s', the state transition probability is defined by
$$P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$$
The state transition matrix P then denotes the transition probabilities from all states s to all
successor states s' (with each row summing to 1)
$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ \vdots & & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}$$
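As a small illustration (not from the original slides), here is a Python/NumPy sketch of how such a transition matrix can be represented and validated; the number of states and the probabilities below are made up.

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row holds P(s' | s)
# for a fixed current state s, so every row must sum to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])

# Sanity checks: entries are valid probabilities and rows are normalized
assert np.all(P >= 0) and np.all(P <= 1)
assert np.allclose(P.sum(axis=1), 1.0)

# P[i, j] is the probability of moving from state i to state j in one step
print(P[0, 2])
```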
Markov Chain
A stochastic process {st}t∈T is a Markov process or Markov chain if the sequence of
random states satisfies the Markov property. It is represented by the tuple ⟨S, P⟩, where S
denotes the set of states and P denotes the state transition probability
Example 1 : Simple Two State Markov Chain
I State S = {Sunny, Rainy}
I Transition Probability Matrix
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$$
Figure Source: https://bookdown.org/probability
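A minimal sampling sketch for this two-state chain, assuming NumPy; the trajectory length and random seed are arbitrary choices.

```python
import numpy as np

states = ["Sunny", "Rainy"]
P = np.array([[0.8, 0.2],    # row 0: transitions from Sunny
              [0.7, 0.3]])   # row 1: transitions from Rainy

rng = np.random.default_rng(0)

def sample_chain(start: int, steps: int) -> list[str]:
    """Sample a state sequence of the given length from the chain."""
    s, traj = start, [states[start]]
    for _ in range(steps):
        s = rng.choice(2, p=P[s])   # next state drawn from row P[s]
        traj.append(states[s])
    return traj

print(sample_chain(start=0, steps=10))
# P(tomorrow is Rainy | today is Sunny) is just the entry P[0, 1] = 0.2
print(P[0, 1])
```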
Markov Chain : Example Revisited
State S = {Sunny, Rainy} and Transition Probability Matrix
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$$
I Probability that tomorrow will be ’Rainy’ given today is ’Sunny’ = 0.2
Figure Source: https://bookdown.org/probability
Multi-Step Transitions
I Probability that day-after-tomorrow will be ’Rainy’ given today is ’Sunny’ is given
by 0.2 * 0.3 + 0.8 * 0.2 = 0.22
In general, if the one-step transition matrix is given by
$$P = \begin{pmatrix} P_{ss} & P_{sr} \\ P_{rs} & P_{rr} \end{pmatrix}$$
then the two-step transition matrix is given by
$$P^{(2)} = \begin{pmatrix} P_{ss}P_{ss} + P_{sr}P_{rs} & P_{ss}P_{sr} + P_{sr}P_{rr} \\ P_{rs}P_{ss} + P_{rr}P_{rs} & P_{rs}P_{sr} + P_{rr}P_{rr} \end{pmatrix} = P^2$$
Figure Source: https://bookdown.org/probability
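A quick numerical check of the two-step calculation, assuming NumPy (index 0 = Sunny, 1 = Rainy); the n-step case follows the same pattern.

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.7, 0.3]])

P2 = P @ P                       # two-step transition matrix P^(2) = P^2
print(P2[0, 1])                  # P(Rainy in two days | Sunny today) = 0.22

# n-step transitions follow the same pattern: P^(n) = P^n
P5 = np.linalg.matrix_power(P, 5)
print(P5)
```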
Multi-Step Transitions
In general, the n-step transition matrix is given by
$$P^{(n)} = P^n$$
Assumption
We made an important assumption in arriving at the above expression: that the one-step
transition matrix stays constant through time, i.e., is independent of time
I Markov chains generated using such transition matrices are called homogeneous
Markov chains
I For much of this course, we will consider homogeneous Markov chains, for which the
transition probabilities depend on the length of time interval [t1, t2] but not on the
exact time instants
Markov Chains : Examples
Example 2 : One dimensional random walk
A walker flips a coin every time slot to decide which ’way’ to go.
$$s_{t+1} = \begin{cases} s_t + 1 & \text{with probability } p \\ s_t - 1 & \text{with probability } 1 - p \end{cases}$$
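A short simulation sketch of this random walk (NumPy assumed; the start state, p, horizon, and seed are arbitrary choices).

```python
import numpy as np

def random_walk(s0: int = 0, p: float = 0.5, steps: int = 20, seed: int = 0):
    """Simulate s_{t+1} = s_t + 1 w.p. p, and s_t - 1 w.p. 1 - p."""
    rng = np.random.default_rng(seed)
    s, path = s0, [s0]
    for _ in range(steps):
        s += 1 if rng.random() < p else -1
        path.append(s)
    return path

print(random_walk())
```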
Example 3 : Simple Grid World
I S = {s1, s2, s3, s4, s5, s6, s7}
I P as shown above
I Example Markov Chains with s2 as start state
F {s2, s3, s2, s1, s2, · · · }
F {s2, s2, s3, s4, s3, · · · }
Slide Credit: Emma Brunskill, CS234 Stanford
Markov Chains : Examples
Example 4 : Dice roll experiment
Let {st}t∈T model the stochastic process representing the cumulative sum of rolls of a fair
six-sided die
Example 5 : Natural Language Processing
Let {st}t∈T model the stochastic process that keeps track of the chain of letters in a
sentence. Consider an example
Tomorrow is a sunny day
I We normally don’t ask what the probability of the character ’a’ appearing is, given that
the previous character is ’d’
I Sentence formation is typically non-Markovian
Notion of Absorbing State
Absorbing State
A state s ∈ S is called an absorbing state if it is impossible to leave the state. That is,
$$P_{ss'} = \begin{cases} 1, & \text{if } s' = s \\ 0, & \text{otherwise} \end{cases}$$
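For a chain given by a transition matrix, absorbing states can be read off directly. A sketch (NumPy assumed; the example matrix is made up) that lists states whose self-transition probability is 1:

```python
import numpy as np

# Hypothetical 3-state chain in which state 2 is absorbing
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])

# A state s is absorbing iff P[s, s] = 1 (hence all other entries in row s are 0)
absorbing = [s for s in range(P.shape[0]) if np.isclose(P[s, s], 1.0)]
print(absorbing)  # -> [2]
```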
Markov Reward Process
Markov Reward Process
Markov Reward Process
A Markov reward process is a tuple ⟨S, P, R, γ⟩. It is a Markov chain with values, where
I S : (Finite) set of states
I P : State transition probability
I R : Reward for being in state st is given by a deterministic function R
rt+1 = R(st)
I γ : Discount factor such that γ ∈ [0, 1]
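A minimal sketch of an MRP container and a rollout that produces the reward sequence rt+1 = R(st); the concrete states, P, and R below are placeholders, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MRP:
    P: np.ndarray                      # state transition matrix
    R: Callable[[int], float]          # deterministic reward function R(s)
    gamma: float                       # discount factor in [0, 1]

def rollout(mrp: MRP, s0: int, steps: int, seed: int = 0):
    """Sample a state/reward sequence: being in s_t yields r_{t+1} = R(s_t)."""
    rng = np.random.default_rng(seed)
    s, states, rewards = s0, [s0], []
    for _ in range(steps):
        rewards.append(mrp.R(s))
        s = rng.choice(len(mrp.P), p=mrp.P[s])   # next state from row P[s]
        states.append(s)
    return states, rewards

# Toy 2-state example with made-up numbers
mrp = MRP(P=np.array([[0.9, 0.1], [0.5, 0.5]]),
          R=lambda s: -1.0 if s == 0 else 0.0,
          gamma=0.9)
print(rollout(mrp, s0=0, steps=5))
```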
Simple Grid World : Revisited
I For the Markov chain {s2, s3, s2, s1, s2, · · · } the corresponding reward sequence is
{−1, 0, −1, −6, −1, · · · }
No notion of action
Slide Credit: Emma Brunskill, CS234 Stanford
Example : Snakes and Ladders
Example : Snakes and Ladders
I States S : {s1, s2, · · · , s100}
I Transition Probability P :
F What is the probability of moving from state 2 to state 6 in one step ?
F What are the states that can be visited in one step from state 2 ?
F What is the probability of moving from state 2 to state 4 ?
F Can we transition from state 15 to state 7 in one step ?
Question : Is the transition matrix independent of time ?
Question : Can we formulate the game of Snakes and Ladders as an MRP ?
We need to define a suitable reward function and discount factor
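A sketch of how the one-step transition matrix for a Snakes and Ladders style board could be constructed; the jump map below is a made-up example (not the board in the figure), the die is fair and six-sided, and overshooting square 100 is handled by staying put, which is one common rule.

```python
import numpy as np

N = 100
# Hypothetical snakes (head -> tail) and ladders (bottom -> top)
jumps = {16: 6, 48: 26, 62: 19, 3: 38, 28: 84, 80: 99}

P = np.zeros((N, N))          # squares s1..s100 mapped to indices 0..99
for s in range(N - 1):        # square 100 (index 99) is absorbing: the game ends
    for d in range(1, 7):     # fair six-sided die
        target = s + 1 + d    # 1-based square after moving d steps
        if target > N:        # overshooting: stay put
            target = s + 1
        target = jumps.get(target, target)   # follow any snake or ladder
        P[s, target - 1] += 1.0 / 6.0
P[N - 1, N - 1] = 1.0

assert np.allclose(P.sum(axis=1), 1.0)   # every row is a valid distribution
print(P[1, :8])   # one-step probabilities out of square 2 (index 1)
```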
On Rewards : Total Return
I At each time step t, there is a reward rt+1 associated with being in state st
I Ideally, we would like the agent to pick trajectories along which the accumulated
reward is high
Question : How can we formalize this ?
Answer : If the reward sequence is given by {rt+1, rt+2, rt+3, · · · }, then, we want to
maximize the sum
rt+1 + rt+2 + rt+3 + · · ·
Define Gt to be
$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots = \sum_{k=0}^{\infty} r_{t+k+1}$$
The goal of the agent is to pick such paths that maximize Gt
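In code, the (undiscounted) return from time t is just the sum of the remaining rewards. A tiny sketch, using the reward sequence from the grid world slide as an example:

```python
def total_return(rewards, t=0):
    """G_t = r_{t+1} + r_{t+2} + ... for a finite reward sequence
    rewards = [r_1, r_2, ...] (index 0 holds r_1)."""
    return sum(rewards[t:])

rewards = [-1, 0, -1, -6, -1]      # example reward sequence
print(total_return(rewards))       # G_0 = -9
print(total_return(rewards, t=2))  # G_2 = -8
```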
Total (Discounted) Return
Recall that,
$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots = \sum_{k=0}^{\infty} r_{t+k+1}$$
I In the case that the underlying stochastic process has infinitely many terms, the above
summation could be divergent
Therefore, we introduce a discount factor γ ∈ [0, 1] and redefine Gt as
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
I Gt is the total discounted return starting from time t
I If γ < 1, then the infinite sum has a finite value provided the reward sequence is bounded
I If γ is close to 0, the agent is concerned only with immediate reward(s) (myopic)
I If γ is close to 1, the agent weighs future rewards more strongly (far-sighted)
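A small sketch of the discounted return; with γ = 1 it reduces to the plain sum above (the reward sequence is again only illustrative).

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-1, 0, -1, -6, -1]
print(discounted_return(rewards, gamma=1.0))   # undiscounted total return: -9
print(discounted_return(rewards, gamma=0.9))   # discounting shrinks distant rewards
```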
Few Remarks on Discounting
I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic and infinite horizon setting
I Discount rate determines the present value of future reward
I Offers a trade-off between being ’myopic’ and ’far-sighted’
I In finite MDPs, it is sometimes possible to use undiscounted reward (i.e. γ = 1), for
example, if all sequences terminate
Snakes and Ladders : Revisited
Question : What can be a suitable reward function and discount factor to describe
’Snakes and Ladders’ as a Markov reward process ?
I Goal : From any given state reach s100 in as few steps as possible
I Reward R : R(s) = −1 for s ∈ {s1, · · · , s99} and R(s100) = 0
I Discount Factor γ = 1
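Putting the pieces together, a rough simulation sketch of this MRP (same made-up jump map as in the earlier sketch): with R(s) = −1 per non-terminal step and γ = 1, the return of an episode is simply minus the number of moves needed to reach s100.

```python
import numpy as np

N = 100
jumps = {16: 6, 48: 26, 62: 19, 3: 38, 28: 84, 80: 99}   # hypothetical board
rng = np.random.default_rng(0)

def episode_return(start: int = 1) -> int:
    """Return G_0 for one game: -1 reward per move until square 100 is reached."""
    s, G = start, 0
    while s < N:
        G -= 1                          # R(s) = -1 for every non-terminal state
        d = rng.integers(1, 7)          # fair die roll
        target = s + d if s + d <= N else s   # overshooting: stay put
        s = jumps.get(target, target)         # follow any snake or ladder
    return G

returns = [episode_return() for _ in range(10_000)]
print(np.mean(returns))   # estimate of the expected return (minus expected #moves)
```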
