AURORA’S SCIENTIFIC AND TECHNOLOGICAL INSTITUTE
MARKOV DECISION PROCESS
- An Overview
Presented By:
M. Narasimha Naik – 21M95A6603
Guide:
Dr. M. Sridhar (HOD)
CONTENTS
• Introduction
• Definition
• Components of Markov Decision Process
• Working of Markov Decision Process
• Challenges
• Limitations
INTRODUCTION
The Markov decision process is a stochastic decision-making
tool based on the Markov Property principle. It is used to make
optimal decisions for dynamic systems while considering their
current state and the environment in which they operate. MDP is
a key component of reinforcement learning applications and is
widely employed to design intelligent systems. Several
industries, such as robotic process automation, manufacturing,
finance & economics, and logistics, use MDPs regularly to carry
out their day-to-day tasks.
DEFINITION
The Markov decision process (MDP) is a mathematical tool used for decision-making problems where
the outcomes are partially random and partially controllable.
In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It
provides a mathematical framework for modeling decision making in situations where outcomes are
partly random and partly under the control of a decision maker. MDPs are useful for
studying optimization problems solved via dynamic programming. MDPs were known at least as early
as the 1950s;[1] a core body of research on Markov decision processes resulted from Ronald Howard's
1960 book, Dynamic Programming and Markov Processes.[2] They are used in many disciplines,
including robotics, automatic control, economics and manufacturing. The name of MDPs comes from
the Russian mathematician Andrey Markov as they are an extension of Markov chains.
COMPONENTS OF MARKOV DECISION PROCESS
1. Agent: A reinforcement learning agent is the entity that we are training to make correct decisions. For example, a robot that is being trained to move around a house without crashing.
2. Environment: The environment is the surroundings with which the agent interacts. For example, the house where the robot moves. The agent cannot manipulate the environment; it can only control its own actions. In other words, the robot can’t control where a table is in the house, but it can walk around it.
3. State: The state defines the current situation of the agent. This can be the exact position of the robot in the house, the alignment of its two legs, or its current posture. It all depends on how you address the problem.
4. Action: The choice that the agent makes at the current time step. For example, the robot can move its right or left leg, raise its arm, lift an object, or turn right/left. We know the set of actions (decisions) that the agent can perform in advance.
5. Policy: A policy is the thought process behind picking an action. In practice, it’s a probability distribution assigned to the set of actions. Highly rewarding actions will have a high probability and vice versa. If an action has a low probability, it doesn’t mean it won’t be picked at all; it’s just less likely to be picked.
6. Reward: The feedback obtained after taking an action in a given state. The sketch below ties these pieces together.
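To make these components concrete, here is a minimal, hypothetical sketch of the agent–environment loop in Python. The states, actions, transition behavior, and rewards are invented purely for illustration and are not taken from the slides.

```python
import random

# Toy environment: three states, three actions (all invented for illustration).
STATES = ["hallway", "kitchen", "living_room"]
ACTIONS = ["forward", "left", "right"]

def step(state, action):
    """Environment response: return (next_state, reward) for a state-action pair."""
    next_state = random.choice(STATES)                 # stochastic transition
    reward = 1.0 if next_state == "kitchen" else -0.1  # reward depends on the state reached
    return next_state, reward

def policy(state):
    """A trivially simple policy: pick any action uniformly at random."""
    return random.choice(ACTIONS)

state = "hallway"
for t in range(5):                       # five time steps of the agent-environment loop
    action = policy(state)               # agent: choose an action in the current state
    state, reward = step(state, action)  # environment: answer with next state and reward
    print(t, action, state, reward)
```

Each loop iteration is one time step: the agent consults its policy, and the environment answers with a new state and a reward.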
DIAGRAM OF MDP
WORKING OF MARKOV DECISION PROCESS
The MDP model operates by using key elements such as the agent, states, actions, rewards, and optimal policies. The agent is the system responsible for making decisions and performing actions. It operates in an environment that describes the set of states the agent can be in as it transitions from one state to another. The MDP defines the mechanism by which a state and the agent’s action lead to the next state. Moreover, the agent receives a reward depending on the action it performs and the state it attains (the current state). The policy of the MDP model prescribes the agent’s next action depending on its current state.
The MDP framework has the following key components:
• S: states (s ∈ S)
• A: actions (a ∈ A)
• P(St+1 | St, At): transition probabilities
• R(s): reward
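As a rough sketch, these four components can be written down directly as Python data; the two-state, two-action example and its numbers are invented for illustration.

```python
# S: states and A: actions
S = ["s0", "s1"]
A = ["stay", "move"]

# P[s][a][s'] = P(s' | s, a): transition probabilities
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s]: reward received in state s
R = {"s0": 0.0, "s1": 1.0}
```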
MARKOV DECISION PROCESS
The graphical representation of the MDP model is as follows:
The MDP model uses the Markov Property, which states that the future can be determined only from the present state
that encapsulates all the necessary information from the past. The Markov Property can be evaluated by using this
equation:
P[St+1 | St] = P[St+1 | S1, S2, S3, ..., St]
According to this equation, the probability of the next state St+1 given only the present state St is the same as its probability given the entire history of states (S1, S2, S3, ..., St). In other words, an MDP needs only the present/current state to evaluate the next actions, with no dependence on earlier states or actions.
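A small sketch of what this means in practice: when a chain is simulated from a transition matrix, only the row for the current state is ever consulted; the earlier history plays no role. The matrix below is invented for illustration.

```python
import numpy as np

P = np.array([[0.7, 0.3],   # row i is the distribution of S_t+1 given S_t = i
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    # Only P[state] is used here -- nothing from `trajectory` before the last entry.
    state = int(rng.choice(2, p=P[state]))
    trajectory.append(state)
print(trajectory)
```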
POLICY
As mentioned earlier, a policy defines the thought process behind making a decision (picking an action). It defines the behavior of an RL agent.
Formally, a policy is a probability distribution over the set of actions A, given the current state s. It gives the probability of picking action a in state s.
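A minimal sketch of such a policy, with two hypothetical states and three actions; the probabilities are illustrative only.

```python
import numpy as np

ACTIONS = ["left", "right", "forward"]

# pi[s] is the probability distribution over ACTIONS in state s.
pi = {
    "s0": [0.1, 0.1, 0.8],    # the highly rewarding "forward" gets a high probability
    "s1": [0.45, 0.45, 0.1],  # "forward" is unlikely here, but can still be picked
}

rng = np.random.default_rng(0)

def act(state):
    """Sample an action according to pi(. | state)."""
    return rng.choice(ACTIONS, p=pi[state])

print(act("s0"))
```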
VALUE FUNCTION
A value function is the long-term value of a state or an action. In other words, it is the expected return from a state or an action. This is what we are actually interested in optimizing.
State value function for an MRP (Markov reward process):
The state value function v(s) is the expected return starting from state s.
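In standard notation (the slide shows the formula as an image), v(s) = E[ Gt | St = s ], where the return Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + ... and γ is the discount factor. For a small MRP this expectation can be computed exactly by solving the linear system v = R + γPv, as in the sketch below; the two-state MRP is invented for illustration.

```python
import numpy as np

P = np.array([[0.7, 0.3],    # state-transition matrix of the MRP
              [0.4, 0.6]])
R = np.array([0.0, 1.0])     # expected immediate reward in each state
gamma = 0.9                  # discount factor

# v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)                     # long-term value of each state
```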
BELLMAN EQUATION
The Bellman equation helps us find the optimal policies and value functions. The optimal value function is the one yielding the maximum value compared to all other value functions. Similarly, the optimal policy is the one that results in the optimal value function.
Since the optimal value function is the one with the highest value, it is the maximum over the Q function: V*(s) = max_a Q*(s, a).
The Bellman equation for the value and Q functions is:
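In standard textbook form (the original slide displays these as images), writing R(s, a) for the expected immediate reward and γ for the discount factor:

Vπ(s) = Σa π(a | s) [ R(s, a) + γ Σs' P(s' | s, a) Vπ(s') ]

Qπ(s, a) = R(s, a) + γ Σs' P(s' | s, a) Σa' π(a' | s') Qπ(s', a')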
• Finally, there is the Bellman optimality equation (a standard form is given after this list).
• To solve this equation, two algorithms are used:
• Value iteration.
• Policy iteration.
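For reference, a standard form of the Bellman optimality equation referred to above (shown as an image in the original slide) is:

V*(s) = max_a [ R(s, a) + γ Σs' P(s' | s, a) V*(s') ]

Value iteration applies this equation directly as an update rule, repeating the backup until the values stop changing. The sketch below runs it on a tiny two-state, two-action MDP invented for illustration.

```python
import numpy as np

# P[s, a, s'] = P(s' | s, a); R[s, a] = expected immediate reward. Numbers are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
gamma, theta = 0.9, 1e-8

V = np.zeros(2)
while True:
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)            # V(s) <- max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy (optimal) action in each state
print(V, policy)
```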
EXAMPLE: GRID WORLD
A maze-like problem
• The agent lives in a grid
• Walls block the agent’s path
Noisy movement: actions do not always go as planned (see the transition sketch after this list)
• If the agent takes action North:
• 80% of the time it moves to the cell to the North (if there is no wall there)
• 10% of the time it moves West; 10% of the time it moves East
• If the sampled move is blocked by a wall, the agent stays put
The agent receives a reward each time step
• A “living” reward (which can be negative)
• An additional reward at the pit or the target (bad or good, respectively); the agent exits the grid world afterward
Goal: maximize the sum of rewards
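A hypothetical sketch of the noisy transition rule described above: 80% of the time the intended direction, 10% each to the two perpendicular directions, and the agent stays put when the sampled move is blocked. The grid size and wall position are invented for illustration.

```python
import random

WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}                     # an illustrative wall cell

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
NOISE = {"N": ["N", "W", "E"], "S": ["S", "E", "W"],
         "E": ["E", "N", "S"], "W": ["W", "S", "N"]}

def step(pos, action):
    """Apply an action with 80/10/10 noise; stay put if the move is blocked."""
    actual = random.choices(NOISE[action], weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nx, ny = pos[0] + dx, pos[1] + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return pos                   # blocked by a wall or the grid edge
    return (nx, ny)

print(step((0, 0), "N"))
```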
CHALLENGES
MDPs come with their own set of challenges, such as the curse of dimensionality: the state and action spaces grow exponentially as the number of state and action variables increases. This can be addressed with approximation techniques like function approximation, value iteration networks, or deep reinforcement learning. Additionally, data scientists must specify the transition and reward functions to use MDPs, and these may not be known or easy to obtain in real-world problems.
LIMITATIONS
MDPs are not always suitable or sufficient for every data science problem.
They have several limitations, such as the Markov assumption that the future
state and reward depend only on the current state and action, not on the
history of previous states and actions. Additionally, MDPs assume a single-
agent environment, which may not be true in some scenarios with
competition, cooperation, or communication with other agents. Lastly,
MDPs assume that the agent is rational and consistent in its preferences and
actions, which may not reflect reality due to emotions, biases, or social
norms. These limitations can be addressed by using higher-order MDPs,
multi-agent MDPs, behavioral MDPs, bounded rationality MDPs, or social
MDPs.
CONCLUSION
To conclude, a Markov decision process is a Markov reward process with actions, in which an agent has to make decisions based on optimal values and an optimal policy.
As an example application, such an analysis can help hospital management provide additional benefits to a doctor who does a very good job of treating patients; in this way, the method gives an idea of the relative efficiency of doctors in the hospital.