Reinforcement Learning:
Markov Chain and Monte Carlo
MSC-IT Part-1
By - Ajay Chaurasiya
What is Reinforcement Learning?
 “Teach by experience”
 For each step, an agent will:
Execute an action
Observe a new state
Receive a reward
 The agent takes actions in an environment to maximize its cumulative reward (a minimal loop sketch follows)
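A minimal sketch of this loop, assuming a hypothetical `env` object with `reset()` and `step(action)` methods in the style of common RL toolkits (the interface names are assumptions, not part of the slides):

```python
# Minimal agent-environment loop sketch; the env/policy interface is assumed,
# in the style of common RL toolkits, purely for illustration.
def run_episode(env, policy):
    state = env.reset()                          # observe the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # execute an action
        state, reward, done = env.step(action)   # observe a new state, receive a reward
        total_reward += reward                   # the quantity the agent maximizes
    return total_reward
```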
Main points in Reinforcement learning
 Input
 Output
 Training
 The model keeps learning continuously.
 The best solution is decided based on the maximum reward.
 Example: The problem is as follows: we have an agent and a reward, with
many hurdles in between. The agent is supposed to find the best possible
path to reach the reward.
The image shows a robot, a diamond, and fire. The goal of the robot is to get the
reward, the diamond, while avoiding the hurdle, the fire.
The robot learns by trying all the possible paths and then choosing the path that
reaches the reward with the fewest hurdles. Each right step gives the robot a
reward, and each wrong step subtracts from its reward. The total reward is
calculated when it reaches the final reward, the diamond.
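A toy sketch of this reward accounting, where the step rewards (+1 for a safe step, -1 for fire, +10 for the diamond) are illustrative assumptions, not values from the slides:

```python
# Candidate paths written as the sequence of cells the robot steps on
# ('.' = safe, 'F' = fire, 'D' = diamond); reward values are illustrative.
def total_reward(path):
    reward = 0
    for cell in path:
        if cell == 'F':
            reward -= 1       # each wrong step subtracts reward
        elif cell == 'D':
            reward += 10      # the final reward: the diamond
        else:
            reward += 1       # each right step earns a reward
    return reward

path_through_fire = ['.', 'F', '.', 'D']
path_around_fire  = ['.', '.', '.', '.', 'D']
print(total_reward(path_through_fire))   # 1 - 1 + 1 + 10 = 11
print(total_reward(path_around_fire))    # 1 + 1 + 1 + 1 + 10 = 14, the better path
```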
Markov Chain Learning
 A Markov chain is a probabilistic model.
 It describes a sequence of possible events in which the probability of each
event depends only on the state attained in the previous event.
 In continuous time, it is also known as a Markov process.
 The Markov property states that the future depends only on the present and
not on the past.
 Moving from one state to another is called a transition, and its probability is
called a transition probability (see the sketch below).
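A small sketch of a Markov chain as a transition-probability table; the weather states and probabilities are made up purely for illustration:

```python
import random

# P(next state | current state); states and numbers are illustrative only.
transitions = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.2},
    'Rainy': {'Sunny': 0.4, 'Rainy': 0.6},
}

def simulate(start, steps):
    """Sample a state sequence; each next state depends only on the current one."""
    state, chain = start, [start]
    for _ in range(steps):
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        chain.append(state)
    return chain

print(simulate('Sunny', 5))   # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny']
```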
 Example: A robot car wants to travel far, quickly
1. Three states: Cool, Warm, Overheated
2. Two actions: Slow, Fast
3. Going faster earns double the reward
Note: In a Markov Decision Process, for a given state and action, the transition
probabilities to the possible next states always sum to one.
An MDP can be represented by 5 important elements (a sketch follows the list):
 State(S)
 Actions(A)
 Transition Probability(Pᵃₛ₁ₛ₂)
 Reward Probability(Rᵃₛ₁ₛ₂)
 Discount Factor (γ)
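As a sketch, the robot-car example from the previous slide can be written with these five elements; the specific transition probabilities and rewards below are assumptions for illustration, not values given in the slides:

```python
# MDP as its 5 elements (S, A, P, R, gamma); numbers are assumed for illustration.
states  = ['Cool', 'Warm', 'Overheated']
actions = ['Slow', 'Fast']

# P[s][a] -> list of (probability, next_state); probabilities for each (s, a) sum to 1.
P = {
    'Cool': {'Slow': [(1.0, 'Cool')],
             'Fast': [(0.5, 'Cool'), (0.5, 'Warm')]},
    'Warm': {'Slow': [(0.5, 'Cool'), (0.5, 'Warm')],
             'Fast': [(1.0, 'Overheated')]},
}

# Going fast earns double the reward of going slow (illustrative values).
R = {'Slow': 1, 'Fast': 2}

gamma = 0.9   # discount factor (assumed value)

# Sanity check for the note on the previous slide: the transition probabilities
# out of each (state, action) pair sum to one.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9
```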
Continue..
Rewards
 Based on the action our agent performs, it receives a reward. A reward is nothing
but a numerical value, say, +1 for a good action and -1 for a bad action.
 The total amount of reward the agent receives from the environment is called the
return. We can formulate the return as
R(t) = r(t+1) + r(t+2) + r(t+3) + r(t+4) + … + r(T)
Discount factor
 Since we don't have any final state for a continuing task, the return
R(t) = r(t+1) + r(t+2) + r(t+3) + … will sum up to ∞.
 That is why we introduce the notion of a discount factor. We can redefine the return
with a discount factor as R(t) = r(t+1) + γ r(t+2) + γ² r(t+3) + …
 The value of the discount factor lies between 0 and 1 (a short sketch follows).
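A small sketch computing the discounted return from a list of rewards, following the formula above (the reward values are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """R(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Arbitrary rewards for illustration; with gamma < 1 the discounted sum stays
# finite even for very long (continuing) tasks.
print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```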
Monte Carlo Learning
 The Monte Carlo method for reinforcement learning learns directly from
episodes of experience, without any prior knowledge of the MDP transitions. Here,
the random component is the return (reward).
 Below are key characteristics of the Monte Carlo (MC) method:
1. There is no model (the agent does not know the MDP state transitions)
2. The agent learns from sampled experience
3. It learns the state value vπ(s) under policy π as the average return observed
over all sampled episodes (value = average return)
4. Values are updated only after a complete episode
5. There is no bootstrapping
6. It can only be used for episodic problems
Continue..
 In the Monte Carlo method, instead of the expected return we use the empirical
return that the agent has sampled while following the policy.
 Example: Gems collection
 The agent follows the policy and completes an episode; along the way, at each
step it collects rewards in the form of gems. To get the state value, the agent
sums up all the gems collected in each episode starting from that state.
Continue..
 Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems
 Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems
 Thus, the state value under the Monte Carlo method, vπ(S05), is 14.33 gems based on
3 samples following policy π (see the sketch below).
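In code, the estimate above is just the average of the sampled returns (using the gem counts from the slide):

```python
# Sampled returns (gems) from the three episodes starting at state S05.
sample_returns = [13, 15, 15]

# Monte Carlo estimate of v_pi(S05): the mean of the observed returns.
v_s05 = sum(sample_returns) / len(sample_returns)
print(round(v_s05, 2))   # 14.33
```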
 First-visit Monte Carlo: even if the agent comes back to the same state multiple
times in an episode, only the first visit is counted. The detailed steps are as below
(a code sketch follows the list):
 To evaluate state s, first set the number of visits N(s) = 0 and the total return
TR(s) = 0 (these values are updated across episodes)
 The first time-step t that state s is visited in an episode, increment the counter
N(s) = N(s) + 1
 Increment the total return TR(s) = TR(s) + Gₜ
 The value is estimated by the mean return V(s) = TR(s)/N(s)
 By the law of large numbers, V(s) → vπ(s) (the true value under policy π)
as N(s) approaches infinity
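A sketch of first-visit Monte Carlo prediction following these steps, assuming each episode is given as a list of (state, reward) pairs (that input format is an assumption for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) = TR(s)/N(s) from complete episodes.

    `episodes` is a list of episodes, each a list of (state, reward) pairs,
    where the reward is the one received after leaving that state (assumed format).
    """
    N  = defaultdict(int)     # number of first visits to each state
    TR = defaultdict(float)   # total return accumulated for each state
    for episode in episodes:
        # Compute the return G_t following each time step (working backwards).
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # Count only the first visit to each state within the episode.
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state not in seen:
                seen.add(state)
                N[state]  += 1
                TR[state] += G_t
    return {s: TR[s] / N[s] for s in N}

# Example with one short, assumed episode: state 'A' is visited twice but
# only its first visit contributes to the estimate.
print(first_visit_mc([[('A', 2), ('B', 3), ('A', 5)]]))   # {'A': 10.0, 'B': 8.0}
```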
Thank You !