Markov Chains and Reward Process
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : easwar.subramanian@tcs.com / cs5500.2020@iith.ac.in
August 5, 2022
Administrivia
I Please consult Prof. Vineeth for all queries related to registration and other
administrative issues.
I If need be, register for CS 5500 instead of AI 3000 (relevant for CS, PhD students).
Overview
1 Review
2 Mathematical Framework for Decision Making
3 Markov Chains
4 Markov Reward Process
Review
Types of Learning : Summary
Figure Source: Saggie
Characteristics of Reinforcement Learning
I Observations are non-i.i.d. and sequential in nature
I The agent’s actions (may) affect the subsequent observations seen
I There is no supervisor; only a reward signal (feedback)
I Reward or feedback can be delayed
Reinforcement Learning : History
Slide Credit: RL Course, Abir Das
Course Setup
Mathematical Framework for Decision Making
RL Framework : Notations
Figure Source: Sutton and Barto
Markov Decision Process
I A Markov Decision Process (MDP) provides a mathematical framework for modeling the
decision-making process
I Can formally describe the working of the environment and agent in the RL setting
I Can handle a huge variety of interesting settings
F Multi-arm Bandits - Single state MDPs
F Optimal Control - Continuous MDPs
I The core problem in solving an MDP is to find an ’optimal’ policy (or behaviour) for the
decision maker (agent) so as to maximize the total future reward
Markov Chains
Random Variables and Stochastic Process
Random Variable (Non-mathematical definition)
A random variable is a variable whose value depends on the outcome of a random
phenomenon
I Outcome of a coin toss
I Outcome of the roll of a die
Stochastic Process
A stochastic or random process, denoted by {st}t∈T , can be defined as a collection of
random variables that is indexed by some mathematical set T
I Index set has the interpretation of time
I The set T is, typically, N or R
Notations
I Typically, in optimal control problems, the index set is continuous (say R)
I Throughout this course (RL), the index set is always discrete (say N)
I Let {st}t∈T be a stochastic process
I Let st be the state at time t of the stochastic process {st}t∈T
Markov Property
Markov Property
A state st of a stochastic process {st}t∈T is said to have the Markov property if
$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \cdots, s_t)$$
The state st at time t captures all relevant information from history and is a sufficient
statistic of the future
Transition Probability
State Transition Probability
For a Markov state s and a successor state s', the state transition probability is defined by
$$P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$$
The state transition matrix P then denotes the transition probabilities from all states s to all
successor states s' (with each row summing to 1)
$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ \vdots & & \ddots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix}$$
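As a small illustration (not from the original slides), here is a Python/NumPy sketch of how such a transition matrix can be represented and validated; the number of states and the probabilities below are made up.

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row holds P(s' | s)
# for a fixed current state s, so every row must sum to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])

# Sanity checks: entries are valid probabilities and rows are normalized
assert np.all(P >= 0) and np.all(P <= 1)
assert np.allclose(P.sum(axis=1), 1.0)

# P[i, j] is the probability of moving from state i to state j in one step
print(P[0, 2])
```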
Markov Chain
A stochastic process {st}t∈T is a Markov process or Markov chain if the sequence of
random states satisfies the Markov property. It is represented by the tuple ⟨S, P⟩, where S
denotes the set of states and P denotes the state transition probability
Example 1 : Simple Two State Markov Chain
I State S = {Sunny, Rainy}
I Transition Probability Matrix
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$$
Figure Source: https://bookdown.org/probability
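A minimal sampling sketch for this two-state chain, assuming NumPy; the trajectory length and random seed are arbitrary choices.

```python
import numpy as np

states = ["Sunny", "Rainy"]
P = np.array([[0.8, 0.2],    # row 0: transitions from Sunny
              [0.7, 0.3]])   # row 1: transitions from Rainy

rng = np.random.default_rng(0)

def sample_chain(start: int, steps: int) -> list[str]:
    """Sample a state sequence of the given length from the chain."""
    s, traj = start, [states[start]]
    for _ in range(steps):
        s = rng.choice(2, p=P[s])   # next state drawn from row P[s]
        traj.append(states[s])
    return traj

print(sample_chain(start=0, steps=10))
# P(tomorrow is Rainy | today is Sunny) is just the entry P[0, 1] = 0.2
print(P[0, 1])
```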
Markov Chain : Example Revisited
State S = {Sunny, Rainy} and Transition Probability Matrix
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$$
I Probability that tomorrow will be ’Rainy’ given today is ’Sunny’ = 0.2
Figure Source: https://bookdown.org/probability
Multi-Step Transitions
I Probability that day-after-tomorrow will be ’Rainy’ given today is ’Sunny’ is given
by 0.2 * 0.3 + 0.8 * 0.2 = 0.22
In general, if the one-step transition matrix is given by
$$P = \begin{pmatrix} P_{ss} & P_{sr} \\ P_{rs} & P_{rr} \end{pmatrix}$$
then the two-step transition matrix is given by
$$P^{(2)} = \begin{pmatrix} P_{ss}P_{ss} + P_{sr}P_{rs} & P_{ss}P_{sr} + P_{sr}P_{rr} \\ P_{rs}P_{ss} + P_{rr}P_{rs} & P_{rs}P_{sr} + P_{rr}P_{rr} \end{pmatrix} = P^2$$
Figure Source: https://bookdown.org/probability
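A quick numerical check of the two-step calculation, assuming NumPy (index 0 = Sunny, 1 = Rainy); the n-step case follows the same pattern.

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.7, 0.3]])

P2 = P @ P                       # two-step transition matrix P^(2) = P^2
print(P2[0, 1])                  # P(Rainy in two days | Sunny today) = 0.22

# n-step transitions follow the same pattern: P^(n) = P^n
P5 = np.linalg.matrix_power(P, 5)
print(P5)
```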
Multi-Step Transitions
In general, the n-step transition matrix is given by
$$P^{(n)} = P^n$$
Assumption
We made an important assumption in arriving at the above expression: that the one-step
transition matrix stays constant through time, i.e., is independent of time
I Markov chains generated using such transition matrices are called homogeneous
Markov chains
I For much of this course, we will consider homogeneous Markov chains, for which the
transition probabilities depend on the length of time interval [t1, t2] but not on the
exact time instants
Markov Chains : Examples
Example 2 : One dimensional random walk
A walker flips a coin every time slot to decide which ’way’ to go.
$$s_{t+1} = \begin{cases} s_t + 1 & \text{with probability } p \\ s_t - 1 & \text{with probability } 1 - p \end{cases}$$
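A short simulation sketch of this random walk (NumPy assumed; the start state, p, horizon, and seed are arbitrary choices).

```python
import numpy as np

def random_walk(s0: int = 0, p: float = 0.5, steps: int = 20, seed: int = 0):
    """Simulate s_{t+1} = s_t + 1 w.p. p, and s_t - 1 w.p. 1 - p."""
    rng = np.random.default_rng(seed)
    s, path = s0, [s0]
    for _ in range(steps):
        s += 1 if rng.random() < p else -1
        path.append(s)
    return path

print(random_walk())
```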
Example 3 : Simple Grid World
I S = {s1, s2, s3, s4, s5, s6, s7}
I P as shown above
I Example Markov Chains with s2 as start state
F {s2, s3, s2, s1, s2, · · · }
F {s2, s2, s3, s4, s3, · · · }
Slide Credit: Emma Brunskill, CS234 Stanford
Markov Chains : Examples
Example 4 : Dice roll experiment
Let {st}t∈T model the stochastic process representing the cumulative sum of rolls of a fair
six-sided die
Example 5 : Natural Language Processing
Let {st}t∈T model the stochastic process that keeps track of the chain of letters in a
sentence. Consider an example
Tomorrow is a sunny day
I We normally don’t ask what the probability of the character ’a’ appearing is, given that
the previous character is ’d’
I Sentence formation is typically non-Markovian
Notion of Absorbing State
Absorbing State
A state s ∈ S is called an absorbing state if it is impossible to leave the state. That is,
$$P_{ss'} = \begin{cases} 1, & \text{if } s' = s \\ 0, & \text{otherwise} \end{cases}$$
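For a chain given by a transition matrix, absorbing states can be read off directly. A sketch (NumPy assumed; the example matrix is made up) that lists states whose self-transition probability is 1:

```python
import numpy as np

# Hypothetical 3-state chain in which state 2 is absorbing
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])

# A state s is absorbing iff P[s, s] = 1 (hence all other entries in row s are 0)
absorbing = [s for s in range(P.shape[0]) if np.isclose(P[s, s], 1.0)]
print(absorbing)  # -> [2]
```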
Markov Reward Process
Markov Reward Process
Markov Reward Process
A Markov reward process is a tuple ⟨S, P, R, γ⟩. It is a Markov chain with values, where
I S : (Finite) set of states
I P : State transition probability
I R : Reward for being in state st is given by a deterministic function R
rt+1 = R(st)
I γ : Discount factor such that γ ∈ [0, 1]
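A minimal sketch of an MRP container and a rollout that produces the reward sequence rt+1 = R(st); the concrete states, P, and R below are placeholders, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MRP:
    P: np.ndarray                      # state transition matrix
    R: Callable[[int], float]          # deterministic reward function R(s)
    gamma: float                       # discount factor in [0, 1]

def rollout(mrp: MRP, s0: int, steps: int, seed: int = 0):
    """Sample a state/reward sequence: being in s_t yields r_{t+1} = R(s_t)."""
    rng = np.random.default_rng(seed)
    s, states, rewards = s0, [s0], []
    for _ in range(steps):
        rewards.append(mrp.R(s))
        s = rng.choice(len(mrp.P), p=mrp.P[s])   # next state from row P[s]
        states.append(s)
    return states, rewards

# Toy 2-state example with made-up numbers
mrp = MRP(P=np.array([[0.9, 0.1], [0.5, 0.5]]),
          R=lambda s: -1.0 if s == 0 else 0.0,
          gamma=0.9)
print(rollout(mrp, s0=0, steps=5))
```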
Simple Grid World : Revisited
I For the Markov chain {s2, s3, s2, s1, s2, · · · } the corresponding reward sequence is
{−1, 0, −1, −6, −1, · · · }
No notion of action
Slide Credit: Emma Brunskill, CS234 Stanford
Example : Snakes and Ladders
Example : Snakes and Ladders
I States S : {s1, s2, · · · , s100}
I Transition Probability P :
F What is the probability of moving from state 2 to state 6 in one step ?
F What are the states that can be visited in one step from state 2 ?
F What is the probability of moving from state 2 to state 4 ?
F Can we transition from state 15 to state 7 in one step ?
Question : Is the transition matrix independent of time ?
Question : Can we formulate the game of Snakes and Ladders as an MRP ?
We need to define a suitable reward function and discount factor
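A sketch of how the one-step transition matrix for a Snakes and Ladders style board could be constructed; the jump map below is a made-up example (not the board in the figure), the die is fair and six-sided, and overshooting square 100 is handled by staying put, which is one common rule.

```python
import numpy as np

N = 100
# Hypothetical snakes (head -> tail) and ladders (bottom -> top)
jumps = {16: 6, 48: 26, 62: 19, 3: 38, 28: 84, 80: 99}

P = np.zeros((N, N))          # squares s1..s100 mapped to indices 0..99
for s in range(N - 1):        # square 100 (index 99) is absorbing: the game ends
    for d in range(1, 7):     # fair six-sided die
        target = s + 1 + d    # 1-based square after moving d steps
        if target > N:        # overshooting: stay put
            target = s + 1
        target = jumps.get(target, target)   # follow any snake or ladder
        P[s, target - 1] += 1.0 / 6.0
P[N - 1, N - 1] = 1.0

assert np.allclose(P.sum(axis=1), 1.0)   # every row is a valid distribution
print(P[1, :8])   # one-step probabilities out of square 2 (index 1)
```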
On Rewards : Total Return
I At each time step t, there is a reward rt+1 associated with being in state st
I Ideally, we would like the agent to pick trajectories along which the accumulated
reward is high
Question : How can we formalize this ?
Answer : If the reward sequence is given by {rt+1, rt+2, rt+3, · · · }, then, we want to
maximize the sum
rt+1 + rt+2 + rt+3 + · · ·
Define Gt to be
$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots = \sum_{k=0}^{\infty} r_{t+k+1}$$
The goal of the agent is to pick such paths that maximize Gt
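In code, the (undiscounted) return from time t is just the sum of the remaining rewards. A tiny sketch, using the reward sequence from the grid world slide as an example:

```python
def total_return(rewards, t=0):
    """G_t = r_{t+1} + r_{t+2} + ... for a finite reward sequence
    rewards = [r_1, r_2, ...] (index 0 holds r_1)."""
    return sum(rewards[t:])

rewards = [-1, 0, -1, -6, -1]      # example reward sequence
print(total_return(rewards))       # G_0 = -9
print(total_return(rewards, t=2))  # G_2 = -8
```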
Total (Discounted) Return
Recall that,
$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots = \sum_{k=0}^{\infty} r_{t+k+1}$$
I In the case that the underlying stochastic process has infinitely many terms, the above
summation could be divergent
Therefore, we introduce a discount factor γ ∈ [0, 1] and redefine Gt as
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
I Gt is the total discounted return starting from time t
I If γ < 1, then the infinite sum has a finite value provided the reward sequence is bounded
I If γ is close to 0, the agent is concerned only with immediate reward(s) (myopic)
I If γ is close to 1, the agent weighs future rewards more strongly (far-sighted)
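A small sketch of the discounted return; with γ = 1 it reduces to the plain sum above (the reward sequence is again only illustrative).

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-1, 0, -1, -6, -1]
print(discounted_return(rewards, gamma=1.0))   # undiscounted total return: -9
print(discounted_return(rewards, gamma=0.9))   # discounting shrinks distant rewards
```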
Few Remarks on Discounting
I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic and infinite horizon setting
I Discount rate determines the present value of future reward
I Offers a trade-off between being ’myopic’ and ’far-sighted’
I In finite MDPs, it is sometimes possible to use undiscounted reward (i.e. γ = 1), for
example, if all sequences terminate
Snakes and Ladders : Revisited
Question : What can be a suitable reward function and discount factor to describe
’Snakes and Ladders’ as a Markov reward process ?
I Goal : From any given state reach s100 in as few steps as possible
I Reward R : R(s) = −1 for s ∈ {s1, · · · , s99} and R(s100) = 0
I Discount Factor γ = 1
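Putting the pieces together, a rough simulation sketch of this MRP (same made-up jump map as in the earlier sketch): with R(s) = −1 per non-terminal step and γ = 1, the return of an episode is simply minus the number of moves needed to reach s100.

```python
import numpy as np

N = 100
jumps = {16: 6, 48: 26, 62: 19, 3: 38, 28: 84, 80: 99}   # hypothetical board
rng = np.random.default_rng(0)

def episode_return(start: int = 1) -> int:
    """Return G_0 for one game: -1 reward per move until square 100 is reached."""
    s, G = start, 0
    while s < N:
        G -= 1                          # R(s) = -1 for every non-terminal state
        d = rng.integers(1, 7)          # fair die roll
        target = s + d if s + d <= N else s   # overshooting: stay put
        s = jumps.get(target, target)         # follow any snake or ladder
    return G

returns = [episode_return() for _ in range(10_000)]
print(np.mean(returns))   # estimate of the expected return (minus expected #moves)
```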
