Reinforcement Learning:
Markov Chain and Monte Carlo
MSC-IT Part-1
By - Ajay Chaurasiya
What is Reinforcement Learning?
 “Teach by experience”
 For each step, an agent will:
Execute an action
Observe a new state
Receive a reward
 The agent takes actions in an environment to maximize its cumulative reward (a minimal loop sketch follows)
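A minimal sketch of this loop, assuming a hypothetical `env` object with `reset()` and `step(action)` methods in the style of common RL toolkits (the interface names are assumptions, not part of the slides):

```python
# Minimal agent-environment loop sketch; the env/policy interface is assumed,
# in the style of common RL toolkits, purely for illustration.
def run_episode(env, policy):
    state = env.reset()                          # observe the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # execute an action
        state, reward, done = env.step(action)   # observe a new state, receive a reward
        total_reward += reward                   # the quantity the agent maximizes
    return total_reward
```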
Main points in Reinforcement learning
 Input
 Output
 Training
 The model keeps learning continuously.
 The best solution is decided based on the maximum reward.
 Example: The problem is as follows: we have an agent and a reward, with
many hurdles in between. The agent is supposed to find the best possible
path to reach the reward.
The image shows a robot, a diamond, and fire. The goal of the robot is to get the
reward, the diamond, while avoiding the hurdle, the fire.
The robot learns by trying all the possible paths and then choosing the path that
reaches the reward with the fewest hurdles. Each right step gives the robot a
reward, and each wrong step subtracts from its reward. The total reward is
calculated when it reaches the final reward, the diamond.
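A toy sketch of this reward accounting, where the step rewards (+1 for a safe step, -1 for fire, +10 for the diamond) are illustrative assumptions, not values from the slides:

```python
# Candidate paths written as the sequence of cells the robot steps on
# ('.' = safe, 'F' = fire, 'D' = diamond); reward values are illustrative.
def total_reward(path):
    reward = 0
    for cell in path:
        if cell == 'F':
            reward -= 1       # each wrong step subtracts reward
        elif cell == 'D':
            reward += 10      # the final reward: the diamond
        else:
            reward += 1       # each right step earns a reward
    return reward

path_through_fire = ['.', 'F', '.', 'D']
path_around_fire  = ['.', '.', '.', '.', 'D']
print(total_reward(path_through_fire))   # 1 - 1 + 1 + 10 = 11
print(total_reward(path_around_fire))    # 1 + 1 + 1 + 1 + 10 = 14, the better path
```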
Markov Chain Learning
 A Markov chain is a probabilistic model.
 It describes a sequence of possible events in which the probability of each
event depends only on the state attained in the previous event.
 In continuous time, it is also known as a Markov process.
 The Markov property states that the future depends only on the present and
not on the past.
 Moving from one state to another is called a transition, and its probability is
called a transition probability (see the sketch below).
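A small sketch of a Markov chain as a transition-probability table; the weather states and probabilities are made up purely for illustration:

```python
import random

# P(next state | current state); states and numbers are illustrative only.
transitions = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.2},
    'Rainy': {'Sunny': 0.4, 'Rainy': 0.6},
}

def simulate(start, steps):
    """Sample a state sequence; each next state depends only on the current one."""
    state, chain = start, [start]
    for _ in range(steps):
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        chain.append(state)
    return chain

print(simulate('Sunny', 5))   # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny']
```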
 Example: A robot car wants to travel far, quickly
1. Three states: Cool, Warm, Overheated
2. Two actions: Slow, Fast
3. Going faster earns double the reward
Note: In a Markov Decision Process, for a given state and action, the transition
probabilities to the possible next states always sum to one.
An MDP can be represented by 5 important elements (a sketch follows the list):
 State(S)
 Actions(A)
 Transition Probability(Pᵃₛ₁ₛ₂)
 Reward Probability(Rᵃₛ₁ₛ₂)
 Discount Factor (γ)
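As a sketch, the robot-car example from the previous slide can be written with these five elements; the specific transition probabilities and rewards below are assumptions for illustration, not values given in the slides:

```python
# MDP as its 5 elements (S, A, P, R, gamma); numbers are assumed for illustration.
states  = ['Cool', 'Warm', 'Overheated']
actions = ['Slow', 'Fast']

# P[s][a] -> list of (probability, next_state); probabilities for each (s, a) sum to 1.
P = {
    'Cool': {'Slow': [(1.0, 'Cool')],
             'Fast': [(0.5, 'Cool'), (0.5, 'Warm')]},
    'Warm': {'Slow': [(0.5, 'Cool'), (0.5, 'Warm')],
             'Fast': [(1.0, 'Overheated')]},
}

# Going fast earns double the reward of going slow (illustrative values).
R = {'Slow': 1, 'Fast': 2}

gamma = 0.9   # discount factor (assumed value)

# Sanity check for the note on the previous slide: the transition probabilities
# out of each (state, action) pair sum to one.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9
```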
Continue..
Rewards
 Based on the action our agent performs, it receives a reward. A reward is nothing
but a numerical value, say, +1 for a good action and -1 for a bad action.
 The total amount of reward the agent receives from the environment is called the
return. We can formulate the return as
R(t) = r(t+1) + r(t+2) + r(t+3) + r(t+4) + … + r(T)
Discount factor
 Since we don't have any final state for a continuing task, the return
R(t) = r(t+1) + r(t+2) + r(t+3) + … will sum up to ∞.
 That is why we introduce the notion of a discount factor. We can redefine the return
with a discount factor as R(t) = r(t+1) + γ r(t+2) + γ² r(t+3) + …
 The value of the discount factor lies between 0 and 1 (a short sketch follows).
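A small sketch computing the discounted return from a list of rewards, following the formula above (the reward values are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """R(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Arbitrary rewards for illustration; with gamma < 1 the discounted sum stays
# finite even for very long (continuing) tasks.
print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```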
Monte Carlo Learning
 The Monte Carlo method for reinforcement learning learns directly from
episodes of experience, without any prior knowledge of the MDP transitions. Here,
the random component is the return (reward).
 Below are key characteristics of the Monte Carlo (MC) method:
1. There is no model (the agent does not know the MDP state transitions)
2. The agent learns from sampled experience
3. It learns the state value vπ(s) under policy π as the average return observed
over all sampled episodes (value = average return)
4. Values are updated only after a complete episode
5. There is no bootstrapping
6. It can only be used for episodic problems
Continue..
 In the Monte Carlo method, instead of the expected return we use the empirical
return that the agent has sampled while following the policy.
 Example: Gems collection
 The agent follows the policy and completes an episode; along the way, at each
step it collects rewards in the form of gems. To get the state value, the agent
sums up all the gems collected in each episode starting from that state.
Continue..
 Return(Sample 01) = 2 + 1 + 2 + 2 + 1 + 5 = 13 gems
 Return(Sample 02) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Return(Sample 03) = 2 + 3 + 1 + 3 + 1 + 5 = 15 gems
 Observed mean return (based on 3 samples) = (13 + 15 + 15)/3 = 14.33 gems
 Thus, the state value under the Monte Carlo method, vπ(S05), is 14.33 gems based on
3 samples following policy π (see the sketch below).
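In code, the estimate above is just the average of the sampled returns (using the gem counts from the slide):

```python
# Sampled returns (gems) from the three episodes starting at state S05.
sample_returns = [13, 15, 15]

# Monte Carlo estimate of v_pi(S05): the mean of the observed returns.
v_s05 = sum(sample_returns) / len(sample_returns)
print(round(v_s05, 2))   # 14.33
```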
 First-visit Monte Carlo: even if the agent comes back to the same state multiple
times in an episode, only the first visit is counted. The detailed steps are as below
(a code sketch follows the list):
 To evaluate state s, first set the number of visits N(s) = 0 and the total return
TR(s) = 0 (these values are updated across episodes)
 The first time-step t that state s is visited in an episode, increment the counter
N(s) = N(s) + 1
 Increment the total return TR(s) = TR(s) + Gₜ
 The value is estimated by the mean return V(s) = TR(s)/N(s)
 By the law of large numbers, V(s) → vπ(s) (the true value under policy π)
as N(s) approaches infinity
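A sketch of first-visit Monte Carlo prediction following these steps, assuming each episode is given as a list of (state, reward) pairs (that input format is an assumption for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) = TR(s)/N(s) from complete episodes.

    `episodes` is a list of episodes, each a list of (state, reward) pairs,
    where the reward is the one received after leaving that state (assumed format).
    """
    N  = defaultdict(int)     # number of first visits to each state
    TR = defaultdict(float)   # total return accumulated for each state
    for episode in episodes:
        # Compute the return G_t following each time step (working backwards).
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # Count only the first visit to each state within the episode.
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state not in seen:
                seen.add(state)
                N[state]  += 1
                TR[state] += G_t
    return {s: TR[s] / N[s] for s in N}

# Example with one short, assumed episode: state 'A' is visited twice but
# only its first visit contributes to the estimate.
print(first_visit_mc([[('A', 2), ('B', 3), ('A', 5)]]))   # {'A': 10.0, 'B': 8.0}
```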
Thank You !