Monte Carlo Methods
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: surya@iiti.ac.in
Quick Recap
 Model based methods:
– Policy iteration algorithm
• Iterative policy evaluation
• Policy improvement
– Value iteration algorithm
• Iterative policy evaluation + Policy improvement
 These methods assume the availability of a model of the environment
– Markov decision process (MDP)
Introduction
 In previous techniques such as policy iteration and value iteration, we have assumed that the agent has complete knowledge of the environment
 Knowledge of the model is available in the form of an MDP
– Reward and transition dynamics
 This is a very strong assumption and is not true in real-life situations
Introduction
 How do we find a policy when the environment model is not available?
 In such situations, we need to go for model-free reinforcement learning
 Monte Carlo based methods are one class of methods that can be used to solve RL problems in a model-free setting
Introduction
 In general, Monte Carlo methods (or Monte Carlo experiments)
– are a broad class of computational algorithms that use repeated random sampling to obtain numerical results (such as probabilities and expected values)
– they are often used in physical and mathematical problems where it is difficult to derive probabilities and expected values using basic principles and concepts
Introduction
 Example: application of Monte Carlo methods
 Probability of getting an outcome (1, 2, 3, 4, 5 or 6) with a symmetric (fair) die
– Simple to solve by symmetry
– Solution: 1/6 for each outcome
 Probability of getting an outcome with an asymmetric (biased) die
– we cannot use a symmetry argument to find the probabilities of the outcomes
– Solution: Monte Carlo methods (see the sketch below)
• Perform an experiment: toss the die a large number of times (n), count the appearances n_i of each outcome i, and estimate the probability as Pr(i) = n_i / n
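As a small illustration (not part of the original slides), the following Python sketch estimates the outcome probabilities of a biased die by repeated random sampling; the bias weights used here are hypothetical:

```python
import random
from collections import Counter

def estimate_die_probabilities(weights, n_tosses=100_000, seed=0):
    """Estimate Pr(i) = n_i / n for a (possibly biased) six-sided die
    by repeated random sampling."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    # Sample n_tosses outcomes according to the (unknown to us) bias.
    tosses = rng.choices(faces, weights=weights, k=n_tosses)
    counts = Counter(tosses)
    return {face: counts[face] / n_tosses for face in faces}

if __name__ == "__main__":
    # Hypothetical bias: face 6 is twice as likely as each other face.
    estimates = estimate_die_probabilities([1, 1, 1, 1, 1, 2])
    print(estimates)  # each estimate approaches the true probability as n grows
```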
Introduction
 Unlike model-based techniques, in the case of Monte Carlo methods
–we assume that we do not have knowledge of the state-to-next-state transition dynamics given actions, that is, p(s’, r | s, a).
–so here, we estimate the value function V(s) or the Q function Q(s, a) from experience.
–we then use these estimates to find optimal policies.
Introduction
 Monte Carlo methods require only experience, that is
–they sample states, actions, and rewards, while interacting
with the environment.
–they are a way to solve RL problems based on averaging
sample returns.
Introduction
 Since we are going to average returns, we focus on Monte Carlo methods for episodic tasks
–with a continuing (non-episodic) task, it would be difficult to compute the average
–value estimates and policies are updated only once an episode ends.
Introduction
 In model-based learning, we used the Bellman equation
– We could use it because we had information about how state “s” transitions into the next state s’, since we had a model of the environment.
Here, we had the knowledge of all the transitions (s → s’) so we could just
update a state value by calculating it.
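For reference, the Bellman expectation equation used in this model-based update can be written, in the notation of these slides, as:

Vπ(s) = Σ_a π(a|s) Σ_{s’, r} p(s’, r | s, a) [ r + γ Vπ(s’) ]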
Introduction
 In model free learning
– we do not have the transition function p(s’, r | s, a).
– so, we update the states by averaging the returns we experience
while traveling through those states.
– so here, we have to actually explore, starting from state “s”, and see
• what the next state and action look like from experience (i.e., sample a state and action)
• and update the value of that state “s” by averaging the results as we explore.
How does sampling returns differ from before?
Figure: sampling returns (left) vs. the backup diagram for vπ (right), with states and actions labelled
How does sampling returns differ from before?
 In sampling returns:
–we update the value of state s based on samples of episodes
going through the state (left image).
 In comparison, in the backup diagram:
–first, we check one step ahead to all of the next states s’,
and
–use that to update state s.
Learning State Value function in Monte Carlo Prediction
 In model based techniques
– we have seen that the value of a state is the expected return starting from
that state and then following a particular policy
 In model free techniques
– An easy way to estimate it based on experience would be to average the
returns we observe after visiting a state.
– As we interact with the environment more and observe more returns, the
average should converge to the expected value.
– That is the idea behind all Monte Carlo methods.
Learning State Value function in Monte Carlo Prediction
 Suppose we want to estimate Vπ(s), the value of a state s
–Vπ(s) is the value of state s under policy π, estimated from a set of episodes obtained by following policy π and passing through state s.
–each occurrence of state s in an episode is called a “visit to s”
–we may visit state s many times in a single episode
–let us call the first time we visit “s” in an episode the “first visit to s”.
Learning State Value function in Monte Carlo Prediction
 Now we have two types of Monte Carlo methods to compute
the value of a state:
– Considering first visit to s
– Considering every visit to s
 In first-visit MC:
– the first-visit MC method estimates Vπ(s) by averaging just the
returns following first visits to s in a set of episodes
 In every-visit MC:
– the every-visit MC method estimates Vπ(s) as the average of the
returns following all the visits to s in a set of episodes
Learning State Value function in Monte Carlo Prediction
An episode represented
by a red arrow
State S6 is visited two times
First-visit to S6: at time instant t=2 (after visiting state S5)
Next visit to S6: at time instant t=8 (after visiting state S10)
Monte Carlo Prediction Algorithm for Learning V(s)
Algorithm for the first-visit MC method for estimating Vπ(s)
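The algorithm on this slide appears as an image; the following is a minimal Python sketch of first-visit MC prediction, assuming a hypothetical generate_episode(policy) helper that returns a list of (state, action, reward) triples for one episode:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V_pi(s) by averaging the returns that follow the first visit to s."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first-visit returns per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        # generate_episode is assumed to return [(s0, a0, r1), (s1, a1, r2), ...]
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk backwards through the episode, accumulating the discounted return G_t.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if s does not occur earlier in the episode.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```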
Monte Carlo Estimation of Action Values
 As we have seen, if we have a model of the environment, it is
quite easy to determine the policy from the state values
– we look one step ahead and choose whichever action leads to the best combination of reward and next state
 But if we do not have a model of the environment, state values
are not enough
– In that case, it is useful to estimate action values (the values of
different actions in a state) rather than state values.
Monte Carlo Estimation of Action Values
 Thus, the main goal of MC methods is to estimate the optimal
action values q∗.
 To obtain q∗, we first look at policy evaluation for action
values.
 This means we are going to estimate qπ(s, a), the expected return when we start in state s, take action a, and then follow policy π.
Monte Carlo Estimation of Action Values
 This is similar to what we discussed for state values (Vπ),
– Here, we are talking about visiting a state-action pair, rather than just a
state.
 More specifically
– a single state may have several actions.
– so by visiting a state we have several options (that is, several actions we can
take).
– when we talk about state-action pair, we are always talking about taking
that specific action in that specific state.
Monte Carlo Estimation of Action Values
 Now we have two types of Monte Carlo methods to compute the value of a state-action pair:
– Considering first visit to (s, a)
– Considering every visit to (s, a)
 In the first-visit MC method
– first-visit MC method estimates the value of a state-action pair as the average of
returns following the first time in each episode that the state was visited and the action
was selected.
 In the every-visit MC method
– every-visit MC method estimates the value of a state-action pair as the average of the
returns that have followed visits to the state in which the action was selected.
It is shown that these methods converge quadratically to the true expected values as the number
of visits to each state-action pair approaches infinity.
Monte Carlo Estimation of Action Values
 Issues
–the only problem is that many state-action pairs may never
be visited.
 For example:
–if we have a deterministic policy, we shall only get one action per state (the one that the policy favors).
–And hence, we shall observe returns for only one action.
Monte Carlo Estimation of Action Values
 This is the general problem of maintaining exploration
– for policy evaluation to work, we need to make sure that we visit all
actions in every state.
– one way to go about this is to specify that every episode starts in some state-action pair, and that every pair has a probability > 0 of being selected as the start pair.
– this is called the assumption of exploring starts.
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: a Monte Carlo control algorithm assuming exploring starts (first-visit MC method)
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: a Monte Carlo control algorithm assuming exploring starts – an elaborated version (first-visit MC method)
Note: the returns are averaged using the first-visit MC method
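The Monte Carlo ES algorithm on these slides appears as an image; the sketch below is a minimal Python rendering under the exploring-starts assumption, with a hypothetical environment interface (random_state_action(), step(s, a), actions(s)) standing in for the real environment:

```python
import random
from collections import defaultdict

def monte_carlo_es(env, num_episodes, gamma=1.0):
    """Monte Carlo ES (exploring starts), first-visit version:
    estimates Q(s, a) and improves a greedy policy pi."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    pi = {}  # deterministic greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has a nonzero chance of starting an episode.
        s, a = env.random_state_action()
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = pi[s] if s in pi else random.choice(env.actions(s))

        # First-visit updates of Q, followed by greedy policy improvement.
        G = 0.0
        visited = [(s_e, a_e) for (s_e, a_e, _) in episode]
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if (s_t, a_t) not in visited[:t]:  # first-visit check
                returns_sum[(s_t, a_t)] += G
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
                # Policy improvement: greedy with respect to Q at state s_t.
                pi[s_t] = max(env.actions(s_t), key=lambda act: Q[(s_t, act)])
    return Q, pi
```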
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
An episode represented
by a red arrow
State S6 is visited two times
First-visit to S6: at time instant t=2 (after visiting state S5)
Next visit to S6: at time instant t=8 (after visiting state S10)
Image source: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/
Monte Carlo Estimation of Action Values
 But this assumption does not always hold.
– the assumption of exploring starts
 For example,
– if we learn directly from actual interaction with the environment, we cannot rely on choosing helpful starting conditions.
 A more common approach is to consider only stochastic policies in which the probability of every action in every state is nonzero.
Monte Carlo Control
 We now look at how our MC estimation can be used in control.
– that is, to approximate optimal policies
 The idea is to follow generalized policy iteration (GPI)
– Here, we will maintain an approximate policy and an approximate
value function.
 We continuously alter the value function to be a better
approximation for the policy, and the policy is continuously
improved (similar to previous techniques)
Monte Carlo Control
Generalized Policy Iteration (GPI)
Monte Carlo Control
 The policy evaluation part is done exactly as in the previous techniques,
– the only change is that here we evaluate state-action pairs rather than states.
 The policy improvement part is done by taking greedy actions in each state.
 So, for any action-value function “q”, and for every state “s”, the greedy policy chooses the action with the maximal action value:
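The greedy policy referred to here (the equation itself appears as an image on the original slide) is:

π(s) = argmax_a q(s, a)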
Monte Carlo Control without Exploring Starts
 How can we avoid the unlikely assumption of exploring
starts?
 The only general way to ensure that all actions are selected
infinitely often is for the agent to continue to select them.
 There are two approaches to ensuring this
– on-policy methods, and
– off-policy methods.
Monte Carlo Control without Exploring Starts
 On-policy methods
–these methods attempt to evaluate or improve the policy that
is used to make decisions.
–so here, we try to evaluate or improve the policy that we
have.
 Off-policy methods:
–we have two policies: we try to evaluate or improve one of them, while the other is used to generate the behaviour (the data).
Monte Carlo Control without Exploring Starts
 For now, we focus on an on-policy method that does not use the assumption of exploring starts.
 In on-policy control methods, the policy is generally soft
– this means π(a|s) > 0 for all s ∈ S and all a ∈ A(s)
 There are many possible variations of on-policy methods.
– One possibility is to gradually shift the policy toward a deterministic optimal policy (there are many ways to do this).
 Here we use ε-greedy policies, which most of the time choose an action that has maximal estimated action value, but with probability ε instead select an action at random
Monte Carlo Control without Exploring Starts
 ε-Greedy Policies
– most of the time we choose an action that has the maximal estimated action value, but with probability ε we instead select an action at random
– that is, each non-greedy action is given the minimal probability of selection, ε/|A(s)|, and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action
– the ε-greedy policies are examples of ε-soft policies, defined as policies for which π(a|s) ≥ ε/|A(s)| for all states and actions, for some ε > 0 (a small selection sketch follows)
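A minimal sketch (not from the slides) of ε-greedy action selection consistent with these probabilities; Q is assumed to be a mapping (e.g., a defaultdict) from (state, action) pairs to estimated action values:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon, rng=random):
    """Greedy action with prob. 1 - eps + eps/|A(s)|;
    each non-greedy action with prob. eps/|A(s)|."""
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore: uniform over all actions
    return max(actions, key=lambda a: Q[(state, a)])    # exploit: greedy action
```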
Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm (first-visit MC method)
Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm – an elaborated version (first-visit MC method)
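The ε-soft on-policy control algorithm on these two slides appears as an image; the sketch below is a minimal Python rendering with an ε-greedy behaviour policy, again assuming a hypothetical environment interface (reset(), step(s, a), actions(s)):

```python
import random
from collections import defaultdict

def on_policy_first_visit_mc_control(env, num_episodes, epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control for epsilon-soft (epsilon-greedy) policies."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _ in range(num_episodes):
        # Generate an episode following the current epsilon-greedy policy.
        s = env.reset()
        episode, done = [], False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)                     # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])  # exploit
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next

        # First-visit updates of Q(s, a) from the sampled returns;
        # policy improvement is implicit (the policy is epsilon-greedy w.r.t. Q).
        G = 0.0
        visited = [(s_e, a_e) for (s_e, a_e, _) in episode]
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if (s_t, a_t) not in visited[:t]:
                returns_sum[(s_t, a_t)] += G
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
    return Q
```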
References
 Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press (5.3 Monte Carlo Control): http://incompleteideas.net/book/ebook/node53.html
 Monte Carlo Methods: https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-5-monte-carlo-methods-25067003bb0f
 Monte Carlo Methods – An Example: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
 Reinforcement Learning: Model-free MC Learner with Code Implementation: https://medium.com/@ngao7/reinforcement-learning-model-free-mc-learner-with-code-implementation-f9f475296dcb
 Monte Carlo Methods in Reinforcement Learning — Part 1: On-policy Methods: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a
 Monte Carlo Method for Learning State-Value Functions – First Visit Method – Reinforcement Learning Tutorial: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/
Thank You