AURORA’S SCIENTIFIC AND TECHNOLOGICAL INSTITUTE
MARKOV DECISION PROCESS
- An Overview
Presented By:
M. Narasimha Naik – 21M95A6603
Guide:
Dr. M. Sridhar (HOD)
CONTENTS
• Introduction
• Definition
• Components of Markov Decision Process
• Working of Markov Decision Process
• Challenges
• Limitations
INTRODUCTION
The Markov decision process is a stochastic decision-making
tool based on the Markov Property principle. It is used to make
optimal decisions for dynamic systems while considering their
current state and the environment in which they operate. MDP is
a key component of reinforcement learning applications and is
widely employed to design intelligent systems. Several
industries, such as robotic process automation, manufacturing,
finance & economics, and logistics, use MDPs regularly to carry
out their day-to-day tasks.
DEFINITION
The Markov decision process (MDP) is a mathematical tool used for decision-making problems where
the outcomes are partially random and partially controllable.
In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It
provides a mathematical framework for modeling decision making in situations where outcomes are
partly random and partly under the control of a decision maker. MDPs are useful for
studying optimization problems solved via dynamic programming. MDPs were known at least as early
as the 1950s;[1] a core body of research on Markov decision processes resulted from Ronald Howard's
1960 book, Dynamic Programming and Markov Processes.[2] They are used in many disciplines,
including robotics, automatic control, economics and manufacturing. The name of MDPs comes from
the Russian mathematician Andrey Markov as they are an extension of Markov chains.
COMPONENTS OF MARKOV DECISION PROCESS
1. Agent: A reinforcement learning agent is the entity that we are training to make correct decisions. For example, a robot that is being trained to move around a house without crashing.
2. Environment: The environment is the surroundings with which the agent interacts. For example, the house where the robot moves. The agent cannot manipulate the environment; it can only control its own actions. In other words, the robot can’t control where a table is in the house, but it can walk around it.
3. State: The state defines the current situation of the agent. This can be the exact position of the robot in the house, the alignment of its two legs, or its current posture. It all depends on how you address the problem.
4. Action: The choice that the agent makes at the current time step. For example, the robot can move its right or left leg, raise its arm, lift an object, or turn right/left. We know the set of actions (decisions) that the agent can perform in advance.
5. Policy: A policy is the thought process behind picking an action. In practice, it’s a probability distribution assigned to the set of actions. Highly rewarding actions will have a high probability and vice versa. If an action has a low probability, it doesn’t mean it won’t be picked at all; it’s just less likely to be picked.
6. Reward: The feedback obtained after taking an action in a given state. The sketch below ties these pieces together.
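To make these components concrete, here is a minimal, hypothetical sketch of the agent–environment loop in Python. The states, actions, transition behavior, and rewards are invented purely for illustration and are not taken from the slides.

```python
import random

# Toy environment: three states, three actions (all invented for illustration).
STATES = ["hallway", "kitchen", "living_room"]
ACTIONS = ["forward", "left", "right"]

def step(state, action):
    """Environment response: return (next_state, reward) for a state-action pair."""
    next_state = random.choice(STATES)                 # stochastic transition
    reward = 1.0 if next_state == "kitchen" else -0.1  # reward depends on the state reached
    return next_state, reward

def policy(state):
    """A trivially simple policy: pick any action uniformly at random."""
    return random.choice(ACTIONS)

state = "hallway"
for t in range(5):                       # five time steps of the agent-environment loop
    action = policy(state)               # agent: choose an action in the current state
    state, reward = step(state, action)  # environment: answer with next state and reward
    print(t, action, state, reward)
```

Each loop iteration is one time step: the agent consults its policy, and the environment answers with a new state and a reward.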
DIAGRAM OF MDP
WORKING OF MARKOV DECISION PROCESS
The MDP model operates by using key elements such as the agent, states, actions, rewards, and optimal policies. The agent is the system responsible for making decisions and performing actions. It operates in an environment that describes the set of states the agent can be in as it transitions from one state to another. The MDP defines the mechanism by which a state and the agent’s action lead to the next state. Moreover, the agent receives a reward depending on the action it performs and the state it attains (the current state). The policy of the MDP model prescribes the agent’s next action depending on its current state.
The MDP framework has the following key components:
• S: states (s ∈ S)
• A: actions (a ∈ A)
• P(St+1 | St, At): transition probabilities
• R(s): reward
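As a rough sketch, these four components can be written down directly as Python data; the two-state, two-action example and its numbers are invented for illustration.

```python
# S: states and A: actions
S = ["s0", "s1"]
A = ["stay", "move"]

# P[s][a][s'] = P(s' | s, a): transition probabilities
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s]: reward received in state s
R = {"s0": 0.0, "s1": 1.0}
```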
MARKOV DECISION PROCESS
The graphical representation of the MDP model is as follows:
The MDP model uses the Markov Property, which states that the future can be determined only from the present state
that encapsulates all the necessary information from the past. The Markov Property can be evaluated by using this
equation:
P[St+1 | St] = P[St+1 | S1, S2, S3, ..., St]
According to this equation, the probability of the next state St+1 given only the present state St is the same as its probability given the entire history of states (S1, S2, S3, ..., St). In other words, an MDP needs only the present/current state to evaluate the next actions, with no dependence on earlier states or actions.
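A small sketch of what this means in practice: when a chain is simulated from a transition matrix, only the row for the current state is ever consulted; the earlier history plays no role. The matrix below is invented for illustration.

```python
import numpy as np

P = np.array([[0.7, 0.3],   # row i is the distribution of S_t+1 given S_t = i
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10):
    # Only P[state] is used here -- nothing from `trajectory` before the last entry.
    state = int(rng.choice(2, p=P[state]))
    trajectory.append(state)
print(trajectory)
```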
POLICY
As mentioned earlier, a policy defines the thought process behind making a decision (picking an action). It defines the behavior of an RL agent.
Formally, a policy is a probability distribution over the set of actions A, given the current state s. It gives the probability of picking action a in state s.
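A minimal sketch of such a policy, with two hypothetical states and three actions; the probabilities are illustrative only.

```python
import numpy as np

ACTIONS = ["left", "right", "forward"]

# pi[s] is the probability distribution over ACTIONS in state s.
pi = {
    "s0": [0.1, 0.1, 0.8],    # the highly rewarding "forward" gets a high probability
    "s1": [0.45, 0.45, 0.1],  # "forward" is unlikely here, but can still be picked
}

rng = np.random.default_rng(0)

def act(state):
    """Sample an action according to pi(. | state)."""
    return rng.choice(ACTIONS, p=pi[state])

print(act("s0"))
```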
VALUE FUNCTION
A value function is the long-term value of a state or an action. In other words, it is the expected return from a state or an action. This is what we are actually interested in optimizing.
State value function for an MRP (Markov reward process):
The state value function v(s) is the expected return starting from state s.
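In standard notation (the slide shows the formula as an image), v(s) = E[ Gt | St = s ], where the return Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + ... and γ is the discount factor. For a small MRP this expectation can be computed exactly by solving the linear system v = R + γPv, as in the sketch below; the two-state MRP is invented for illustration.

```python
import numpy as np

P = np.array([[0.7, 0.3],    # state-transition matrix of the MRP
              [0.4, 0.6]])
R = np.array([0.0, 1.0])     # expected immediate reward in each state
gamma = 0.9                  # discount factor

# v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)                     # long-term value of each state
```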
BELLMAN EQUATION
The Bellman equation helps us find the optimal policies and value functions. The optimal value function is the one yielding the maximum value compared to all other value functions. Similarly, the optimal policy is the one that results in the optimal value function.
Since the optimal value function is the one with the highest value, it is the maximum over the Q function: V*(s) = max_a Q*(s, a).
The Bellman equation for the value and Q functions is:
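In standard textbook form (the original slide displays these as images), writing R(s, a) for the expected immediate reward and γ for the discount factor:

Vπ(s) = Σa π(a | s) [ R(s, a) + γ Σs' P(s' | s, a) Vπ(s') ]

Qπ(s, a) = R(s, a) + γ Σs' P(s' | s, a) Σa' π(a' | s') Qπ(s', a')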
• Finally, there is the Bellman optimality equation (a standard form is given after this list).
• To solve this equation, two algorithms are used:
• Value iteration.
• Policy iteration.
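For reference, a standard form of the Bellman optimality equation referred to above (shown as an image in the original slide) is:

V*(s) = max_a [ R(s, a) + γ Σs' P(s' | s, a) V*(s') ]

Value iteration applies this equation directly as an update rule, repeating the backup until the values stop changing. The sketch below runs it on a tiny two-state, two-action MDP invented for illustration.

```python
import numpy as np

# P[s, a, s'] = P(s' | s, a); R[s, a] = expected immediate reward. Numbers are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
gamma, theta = 0.9, 1e-8

V = np.zeros(2)
while True:
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)            # V(s) <- max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy (optimal) action in each state
print(V, policy)
```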
EXAMPLE: GRID WORLD
A maze-like problem
• The agent lives in a grid
• Walls block the agent’s path
Noisy movement: actions do not always go as planned (see the transition sketch after this list)
• If the agent takes action North:
• 80% of the time it moves to the cell to the North (if there is no wall there)
• 10% of the time it moves West; 10% of the time it moves East
• If the sampled move is blocked by a wall, the agent stays put
The agent receives a reward each time step
• A “living” reward (which can be negative)
• An additional reward at the pit or the target (bad or good, respectively); the agent exits the grid world afterward
Goal: maximize the sum of rewards
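A hypothetical sketch of the noisy transition rule described above: 80% of the time the intended direction, 10% each to the two perpendicular directions, and the agent stays put when the sampled move is blocked. The grid size and wall position are invented for illustration.

```python
import random

WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}                     # an illustrative wall cell

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
NOISE = {"N": ["N", "W", "E"], "S": ["S", "E", "W"],
         "E": ["E", "N", "S"], "W": ["W", "S", "N"]}

def step(pos, action):
    """Apply an action with 80/10/10 noise; stay put if the move is blocked."""
    actual = random.choices(NOISE[action], weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nx, ny = pos[0] + dx, pos[1] + dy
    if (nx, ny) in WALLS or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
        return pos                   # blocked by a wall or the grid edge
    return (nx, ny)

print(step((0, 0), "N"))
```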
CHALLENGES
MDPs come with their own set of challenges, such as the curse of dimensionality: the state and action spaces grow exponentially as the number of state and action variables increases. This can be addressed with approximation techniques like function approximation, value iteration networks, or deep reinforcement learning. Additionally, data scientists must specify the transition and reward functions to use MDPs, and these may not be known or easy to obtain in real-world problems.
LIMITATIONS
MDPs are not always suitable or sufficient for every data science problem.
They have several limitations, such as the Markov assumption that the future
state and reward depend only on the current state and action, not on the
history of previous states and actions. Additionally, MDPs assume a single-
agent environment, which may not be true in some scenarios with
competition, cooperation, or communication with other agents. Lastly,
MDPs assume that the agent is rational and consistent in its preferences and
actions, which may not reflect reality due to emotions, biases, or social
norms. These limitations can be addressed by using higher-order MDPs,
multi-agent MDPs, behavioral MDPs, bounded rationality MDPs, or social
MDPs.
CONCLUSION
To conclude, a Markov decision process is a Markov reward process with actions, in which an agent has to make decisions based on optimal values and an optimal policy.
As an example application, such an analysis can help hospital management provide additional benefits to a doctor who does a very good job of treating patients; in this way, the method gives an idea of the relative efficiency of doctors in the hospital.