Monte Carlo Methods
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: surya@iiti.ac.in
Quick Recap
 Model based methods:
– Policy iteration algorithm
• Iterative policy evaluation
• Policy improvement
– Value iteration algorithm
• Iterative policy evaluation + Policy improvement
 These methods assume the availability of a model of the environment
– Markov decision process (MDP)
Introduction
 In previous techniques such as policy iteration and value iteration, we have assumed that the agent has complete knowledge of the environment
 Knowledge of the model is available in the form of an MDP
– Reward and transition dynamics
 This is a very strong assumption and is not true in real-life situations
Introduction
 How do we find a policy when the environment model is not available?
 In such situations, we need to go for model-free reinforcement learning
 Monte Carlo based methods are one class of methods that can be used to solve RL problems in a model-free setting
Introduction
 In general, Monte Carlo methods (or Monte Carlo experiments)
– are a broad class of computational algorithms that use repeated random sampling to obtain numerical results (such as probabilities and expected values)
– they are often used in physical and mathematical problems where it is difficult to derive probabilities and expected values using basic principles and concepts
Introduction
 Example: application of Monte Carlo methods
 Probability of getting an outcome (1, 2, 3, 4, 5 or 6) with a symmetric (fair) die
– Simple to solve by symmetry
– Solution: 1/6 for each outcome
 Probability of getting an outcome with an asymmetric (biased) die
– we cannot use a symmetry argument to find the probabilities of the outcomes
– Solution: Monte Carlo methods (see the sketch below)
• Perform an experiment: toss the die a large number of times (n), count the appearances n_i of each outcome i, and estimate the probability as Pr(i) = n_i / n
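As a small illustration (not part of the original slides), the following Python sketch estimates the outcome probabilities of a biased die by repeated random sampling; the bias weights used here are hypothetical:

```python
import random
from collections import Counter

def estimate_die_probabilities(weights, n_tosses=100_000, seed=0):
    """Estimate Pr(i) = n_i / n for a (possibly biased) six-sided die
    by repeated random sampling."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    # Sample n_tosses outcomes according to the (unknown to us) bias.
    tosses = rng.choices(faces, weights=weights, k=n_tosses)
    counts = Counter(tosses)
    return {face: counts[face] / n_tosses for face in faces}

if __name__ == "__main__":
    # Hypothetical bias: face 6 is twice as likely as each other face.
    estimates = estimate_die_probabilities([1, 1, 1, 1, 1, 2])
    print(estimates)  # each estimate approaches the true probability as n grows
```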
Introduction
 Unlike model-based techniques, in the case of Monte Carlo methods
–we assume that we do not have knowledge of the state-to-next-state transition dynamics given actions, that is, p(s’, r | s, a).
–so here, we estimate the value function V(s) or the Q function Q(s, a) from experience.
–we then use these estimates to find optimal policies.
Introduction
 Monte Carlo methods require only experience, that is
–they sample states, actions, and rewards, while interacting
with the environment.
–they are a way to solve RL problems based on averaging
sample returns.
Introduction
 Since we are going to average returns, we focus on Monte Carlo methods for episodic tasks
–with a continuing (non-episodic) task, it would be difficult to compute the average
–value estimates and policies are updated only once an episode ends.
Introduction
 In model-based learning, we used the Bellman equation
– We could use it because we had information about how state “s” transitions into the next state s’, since we had a model of the environment.
Here, we had the knowledge of all the transitions (s → s’) so we could just
update a state value by calculating it.
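For reference, the Bellman expectation equation used in this model-based update can be written, in the notation of these slides, as:

Vπ(s) = Σ_a π(a|s) Σ_{s’, r} p(s’, r | s, a) [ r + γ Vπ(s’) ]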
Introduction
 In model free learning
– we do not have the transition function p(s’, r | s, a).
– so, we update the states by averaging the returns we experience
while traveling through those states.
– so here, we have to actually explore, starting from state “s”, and see
• what the next state and action look like from experience (i.e., sample a state and action)
• and update the value of that state “s” by averaging the results as we explore.
How does sampling returns differ from before?
Figure: sampling returns (left) vs. the backup diagram for vπ (right), with states and actions labelled
How does sampling returns differ from before?
 In sampling returns:
–we update the value of state s based on samples of episodes
going through the state (left image).
 In comparison, in the backup diagram:
–first, we check one step ahead to all of the next states s’,
and
–use that to update state s.
Learning State Value function in Monte Carlo Prediction
 In model based techniques
– we have seen that the value of a state is the expected return starting from
that state and then following a particular policy
 In model free techniques
– An easy way to estimate it based on experience would be to average the
returns we observe after visiting a state.
– As we interact with the environment more and observe more returns, the
average should converge to the expected value.
– That is the idea behind all Monte Carlo methods.
Learning State Value function in Monte Carlo Prediction
 Suppose we want to estimate Vπ(s), the value of a state s
–Vπ(s) is the value of state s under policy π, estimated from a set of episodes obtained by following policy π and passing through state s.
–each occurrence of state s in an episode is called a “visit to s”
–we may visit state s many times in a single episode
–let us call the first time we visit “s” in an episode the “first visit to s”.
Learning State Value function in Monte Carlo Prediction
 Now we have two types of Monte Carlo methods to compute
the value of a state:
– Considering first visit to s
– Considering every visit to s
 In first-visit MC:
– the first-visit MC method estimates Vπ(s) by averaging just the
returns following first visits to s in a set of episodes
 In every-visit MC:
– the every-visit MC method estimates Vπ(s) as the average of the
returns following all the visits to s in a set of episodes
Learning State Value function in Monte Carlo Prediction
An episode represented
by a red arrow
State S6 is visited two times
First-visit to S6: at time instant t=2 (after visiting state S5)
Next visit to S6: at time instant t=8 (after visiting state S10)
Monte Carlo Prediction Algorithm for Learning V(s)
Algorithm for the first-visit MC method for estimating Vπ(s)
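The algorithm on this slide appears as an image; the following is a minimal Python sketch of first-visit MC prediction, assuming a hypothetical generate_episode(policy) helper that returns a list of (state, action, reward) triples for one episode:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V_pi(s) by averaging the returns that follow the first visit to s."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first-visit returns per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        # generate_episode is assumed to return [(s0, a0, r1), (s1, a1, r2), ...]
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk backwards through the episode, accumulating the discounted return G_t.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if s does not occur earlier in the episode.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```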
Monte Carlo Estimation of Action Values
 As we have seen, if we have a model of the environment, it is
quite easy to determine the policy from the state values
– we look one step ahead and choose whichever action leads to the best combination of reward and next state
 But if we do not have a model of the environment, state values
are not enough
– In that case, it is useful to estimate action values (the values of
different actions in a state) rather than state values.
Monte Carlo Estimation of Action Values
 Thus, the main goal of MC methods is to estimate the optimal
action values q∗.
 To obtain q∗, we first look at policy evaluation for action
values.
 This means we are going to estimate qπ(s, a), the expected return when we start in state s, take action a, and then follow policy π.
Monte Carlo Estimation of Action Values
 This is similar to what we discussed for state values (Vπ),
– Here, we are talking about visiting a state-action pair, rather than just a
state.
 More specifically
– a single state may have several actions.
– so by visiting a state we have several options (that is, several actions we can
take).
– when we talk about state-action pair, we are always talking about taking
that specific action in that specific state.
Monte Carlo Estimation of Action Values
 Now we have two types of Monte Carlo methods to compute the value of a state-action pair:
– Considering first visit to (s, a)
– Considering every visit to (s, a)
 In the first-visit MC method
– first-visit MC method estimates the value of a state-action pair as the average of
returns following the first time in each episode that the state was visited and the action
was selected.
 In the every-visit MC method
– every-visit MC method estimates the value of a state-action pair as the average of the
returns that have followed visits to the state in which the action was selected.
It is shown that these methods converge quadratically to the true expected values as the number
of visits to each state-action pair approaches infinity.
Monte Carlo Estimation of Action Values
 Issues
–the only problem is that many state-action pairs may never
be visited.
 For example:
–if we have a deterministic policy, we shall only get one action per state (the one that the policy favors).
–And hence, we shall observe returns for only one action.
Monte Carlo Estimation of Action Values
 This is the general problem of maintaining exploration
– for policy evaluation to work, we need to make sure that we visit all
actions in every state.
– one way to go about this is to specify that every episode starts in some state-action pair, and that every pair has a probability > 0 of being selected as the start pair.
– this is called the assumption of exploring starts.
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: a Monte Carlo control algorithm assuming exploring starts (first-visit MC method)
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: a Monte Carlo control algorithm assuming exploring starts – an elaborated version (first-visit MC method)
Note: the returns are averaged using the first-visit MC method
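The Monte Carlo ES algorithm on these slides appears as an image; the sketch below is a minimal Python rendering under the exploring-starts assumption, with a hypothetical environment interface (random_state_action(), step(s, a), actions(s)) standing in for the real environment:

```python
import random
from collections import defaultdict

def monte_carlo_es(env, num_episodes, gamma=1.0):
    """Monte Carlo ES (exploring starts), first-visit version:
    estimates Q(s, a) and improves a greedy policy pi."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    pi = {}  # deterministic greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has a nonzero chance of starting an episode.
        s, a = env.random_state_action()
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = pi[s] if s in pi else random.choice(env.actions(s))

        # First-visit updates of Q, followed by greedy policy improvement.
        G = 0.0
        visited = [(s_e, a_e) for (s_e, a_e, _) in episode]
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if (s_t, a_t) not in visited[:t]:  # first-visit check
                returns_sum[(s_t, a_t)] += G
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
                # Policy improvement: greedy with respect to Q at state s_t.
                pi[s_t] = max(env.actions(s_t), key=lambda act: Q[(s_t, act)])
    return Q, pi
```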
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
An episode represented
by a red arrow
State S6 is visited two times
First-visit to S6: at time instant t=2 (after visiting state S5)
Next visit to S6: at time instant t=8 (after visiting state S10)
Image source: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/
Monte Carlo Estimation of Action Values
 But this assumption does not always hold.
– the assumption of exploring starts
 For example,
– if we learn directly from actual interaction with the environment, we cannot rely on choosing helpful starting conditions.
 A more common approach is to consider only stochastic policies in which the probability of every action in every state is nonzero.
Monte Carlo Control
 We now look at how our MC estimation can be used in control.
– that is, to approximate optimal policies
 The idea is to follow generalized policy iteration (GPI)
– Here, we will maintain an approximate policy and an approximate
value function.
 We continuously alter the value function to be a better
approximation for the policy, and the policy is continuously
improved (similar to previous techniques)
Monte Carlo Control
Generalized Policy Iteration (GPI)
Monte Carlo Control
 The policy evaluation part is done exactly as in the previous techniques,
– the only change is that here we evaluate state-action pairs rather than states.
 The policy improvement part is done by taking greedy actions in each state.
 So, for any action-value function “q”, and for every state “s”, the greedy policy chooses the action with the maximal action value:
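The greedy policy referred to here (the equation itself appears as an image on the original slide) is:

π(s) = argmax_a q(s, a)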
Monte Carlo Control without Exploring Starts
 How can we avoid the unlikely assumption of exploring
starts?
 The only general way to ensure that all actions are selected
infinitely often is for the agent to continue to select them.
 There are two approaches to ensuring this
– on-policy methods, and
– off-policy methods.
Monte Carlo Control without Exploring Starts
 On-policy methods
–these methods attempt to evaluate or improve the policy that
is used to make decisions.
–so here, we try to evaluate or improve the policy that we
have.
 Off-policy methods:
–we have two policies: we try to evaluate or improve one of them, while the other is used to generate the behaviour (the data).
Monte Carlo Control without Exploring Starts
 For now, we focus on an on-policy method that does not use the assumption of exploring starts.
 In on-policy control methods, the policy is generally soft
– this means π(a|s) > 0 for all s ∈ S and all a ∈ A(s)
 There are many possible variations of on-policy methods.
– One possibility is to gradually shift the policy toward a deterministic optimal policy (there are many ways to do this).
 Here we use ε-greedy policies, which most of the time choose an action that has maximal estimated action value, but with probability ε instead select an action at random
Monte Carlo Control without Exploring Starts
 ε-Greedy Policies
– most of the time we choose an action that has the maximal estimated action value, but with probability ε we instead select an action at random
– that is, each non-greedy action is given the minimal probability of selection, ε/|A(s)|, and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action
– the ε-greedy policies are examples of ε-soft policies, defined as policies for which π(a|s) ≥ ε/|A(s)| for all states and actions, for some ε > 0 (a small selection sketch follows)
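A minimal sketch (not from the slides) of ε-greedy action selection consistent with these probabilities; Q is assumed to be a mapping (e.g., a defaultdict) from (state, action) pairs to estimated action values:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon, rng=random):
    """Greedy action with prob. 1 - eps + eps/|A(s)|;
    each non-greedy action with prob. eps/|A(s)|."""
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore: uniform over all actions
    return max(actions, key=lambda a: Q[(state, a)])    # exploit: greedy action
```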
Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm (first-visit MC method)
Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm – an elaborated version (first-visit MC method)
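The ε-soft on-policy control algorithm on these two slides appears as an image; the sketch below is a minimal Python rendering with an ε-greedy behaviour policy, again assuming a hypothetical environment interface (reset(), step(s, a), actions(s)):

```python
import random
from collections import defaultdict

def on_policy_first_visit_mc_control(env, num_episodes, epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control for epsilon-soft (epsilon-greedy) policies."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _ in range(num_episodes):
        # Generate an episode following the current epsilon-greedy policy.
        s = env.reset()
        episode, done = [], False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)                     # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])  # exploit
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next

        # First-visit updates of Q(s, a) from the sampled returns;
        # policy improvement is implicit (the policy is epsilon-greedy w.r.t. Q).
        G = 0.0
        visited = [(s_e, a_e) for (s_e, a_e, _) in episode]
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if (s_t, a_t) not in visited[:t]:
                returns_sum[(s_t, a_t)] += G
                returns_count[(s_t, a_t)] += 1
                Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
    return Q
```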
References
 Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press (5.3 Monte Carlo Control): http://incompleteideas.net/book/ebook/node53.html
 Monte Carlo Methods: https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-5-monte-carlo-methods-25067003bb0f
 Monte Carlo Methods – An Example: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
 Reinforcement Learning: Model-free MC Learner with Code Implementation: https://medium.com/@ngao7/reinforcement-learning-model-free-mc-learner-with-code-implementation-f9f475296dcb
 Monte Carlo Methods in Reinforcement Learning — Part 1: On-policy Methods: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a
 Monte Carlo Method for Learning State-Value Functions – First Visit Method – Reinforcement Learning Tutorial: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/
Thank You