Reinforcement Learning ⇒ Dynamic Programming ⇒
Markov Decision Process
Subject: Machine Learning
Dr. Varun Kumar
Subject: Machine Learning Dr. Varun Kumar Lecture 9 1 / 16
Outline
1 Introduction to Reinforcement Learning
2 Application of Reinforcement Learning
3 Approach for Studying Reinforcement Learning
4 Basics of Dynamic Programming
5 Markov Decision Process
6 References
Introduction to reinforcement learning:
Key Features
1 There is no supervisor for performing the learning process.
2 Instead of a supervisor, there is a critic that evaluates the end outcome.
3 If the outcome is favorable, the whole process is rewarded; otherwise, the whole process is penalized.
4 This learning process is based on reward and penalty.
5 The critic converts the primary reinforcement signal into a heuristic reinforcement signal.
6 Primary reinforcement signal → the signal observed from the environment.
7 Heuristic reinforcement signal → a higher-quality signal.
Difference between critic and supervisor
Consider a complex system described as follows.
Note
⇒ The critic does not provide a step-by-step solution.
⇒ The critic does not provide any method, training data, suitable learning system, or logical operation for making the necessary correction if the output does not reach the expected value.
⇒ It comments only on the end output, whereas a supervisor helps in many ways.
Block diagram of reinforcement learning
Block diagram
Aim of reinforcement learning
⇒ To minimize the cost-to-go function.
⇒ Cost-to-go function → the expectation of the cumulative cost of actions taken over a sequence of steps, rather than the immediate cost.
⇒ Learning system: it discovers several actions and feeds them back to the environment.
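The cost-to-go idea can be sketched numerically. Below is a minimal Monte Carlo illustration (the function name, the discount factor, and the uniform cost model are all hypothetical, not from the slides): average the discounted cumulative cost over many simulated step sequences instead of looking only at the immediate cost.

```python
import random

def cost_to_go(step_cost, n_steps=50, gamma=0.9, n_runs=2000, seed=0):
    """Average discounted cumulative cost over many simulated trajectories."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        # Cumulative cost of one sequence of steps, discounted by gamma per step.
        run = sum(gamma ** t * step_cost(rng) for t in range(n_steps))
        total += run
    return total / n_runs

# Each step incurs a random cost in [0, 2], so the expected immediate cost is 1,
# while the expected cost-to-go approaches 1 / (1 - gamma) = 10 for long horizons.
estimate = cost_to_go(lambda rng: rng.uniform(0.0, 2.0))
print(round(estimate, 1))
```

The gap between the immediate cost (about 1) and the cost-to-go (about 10) is exactly the distinction the slide draws.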
Application of reinforcement learning
Major application areas
♦ Game theory.
♦ Simulation-based optimization.
♦ Operations research.
♦ Control theory.
♦ Swarm intelligence.
♦ Multi-agent systems.
♦ Information theory.
Note :
⇒ Reinforcement learning is also called approximate dynamic programming.
Approach for studying reinforcement learning
Classical approach: Learning takes place through a process of reward and penalty, with the goal of achieving highly skilled behavior.
Modern approach:
⇒ Based on a mathematical framework, such as dynamic programming.
⇒ It decides on a course of action by considering possible future stages without actually experiencing them.
⇒ It emphasizes planning.
⇒ It is a credit-assignment problem.
⇒ Credit or blame is distributed among the interacting decisions.
Dynamic programming
Basics
⇒ How can an agent/decision maker/learning system improve its long-term performance in a stochastic environment?
⇒ Attaining improved long-term performance without disrupting short-term performance.
Markov decision process (MDP)
Markov decision process (MDP):
Key features of MDP
♦ The environment is modeled through a probabilistic framework; some known probability mass function (pmf) may form the basis for this modeling.
♦ It consists of a finite set of discrete states.
♦ States do not carry any past statistics.
♦ A set of discrete sample data is generated through a well-defined pmf.
♦ For each environmental state, there is a finite set of possible actions that may be taken by the agent.
♦ Every time the agent takes an action, a certain cost is incurred.
♦ States are observed, actions are taken, and costs are incurred at discrete times.
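These ingredients can be written down as plain data. The following is a minimal sketch (state names, action names, and all probabilities are illustrative, not from the slides): a finite state set, a finite action set per state, a transition pmf for each state-action pair, and a cost per action.

```python
# Hypothetical two-state MDP expressed as plain Python data.
mdp = {
    "states": ["s1", "s2"],
    "actions": {"s1": ["stay", "go"], "s2": ["stay"]},
    # transition[(state, action)] is a pmf over the next state.
    "transition": {
        ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
        ("s1", "go"):   {"s1": 0.2, "s2": 0.8},
        ("s2", "stay"): {"s2": 1.0},
    },
    # cost[(state, action)] is the cost incurred when the agent acts.
    "cost": {("s1", "stay"): 1.0, ("s1", "go"): 5.0, ("s2", "stay"): 0.0},
}

# Each transition pmf must sum to one over the next states; it depends only
# on the current state and action, never on earlier history.
for pmf in mdp["transition"].values():
    assert abs(sum(pmf.values()) - 1.0) < 1e-9
print("all transition pmfs sum to 1")
```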
Continued–
An MDP operates in a stochastic environment; it is essentially a random process.
The decision (action) is a time-dependent random variable.
Mathematical description:
⇒ Si is the ith state at sample instant n.
⇒ Sj is the next state at sample instant n + 1.
⇒ pij is known as the transition probability, ∀ 1 ≤ i ≤ K and 1 ≤ j ≤ K:
pij(Ai) = P(Xn+1 = Sj | Xn = Si, An = Ai)
⇒ Ai is the ith action taken by the agent at sample instant n.
Markov chain rule
The Markov chain rule is based on the partition theorem.
Statement of the partition theorem: Let B1, ..., BN form a partition of Ω; then for any event A,
P(A) = Σ_{i=1}^{N} P(A ∩ Bi) = Σ_{i=1}^{N} P(A | Bi) P(Bi)
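The partition theorem can be checked on a small example. The sketch below uses a fair six-sided die (the choice of partition and event is illustrative): B1 = {1, 2, 3} and B2 = {4, 5, 6} partition Ω, and A is the event "the roll is even".

```python
from fractions import Fraction as F

# Fair six-sided die; B1, B2 partition the sample space omega.
omega = {1, 2, 3, 4, 5, 6}
B = [{1, 2, 3}, {4, 5, 6}]
A = {2, 4, 6}  # "roll is even"

def P(event):
    """Probability of an event under the uniform distribution on omega."""
    return F(len(event & omega), len(omega))

direct = P(A)
# P(A) = sum_i P(A ∩ Bi) = sum_i P(A | Bi) P(Bi)
via_intersections = sum(P(A & Bi) for Bi in B)
via_conditionals = sum((P(A & Bi) / P(Bi)) * P(Bi) for Bi in B)
assert direct == via_intersections == via_conditionals == F(1, 2)
print(direct)  # → 1/2
```

Both decompositions recover P(A) exactly, which is all the partition theorem claims.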
Markov property
1 The basic property of a Markov chain is that only the most recent point in the trajectory affects what happens next:
P(Xn+1 | Xn, Xn−1, ..., X0) = P(Xn+1 | Xn)
2 Transition matrix (stochastic matrix):
P = [ p11  p12  ...  p1K ]
    [ p21  p22  ...  p2K ]
    [ ...  ...  ...  ... ]
    [ pK1  pK2  ...  pKK ]
⇒ Each row sums to unity → Σ_j pij = 1
⇒ That is, p11 + p12 + ... + p1K = 1, but a column sum such as p11 + p21 + ... + pK1 need not equal 1.
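The row-sum property is easy to verify numerically. A sketch with an illustrative 3-state matrix (the values are hypothetical): each row is a pmf over the next state, so rows sum to one while columns carry no such constraint.

```python
# Illustrative 3x3 row-stochastic transition matrix.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]

# Each row is a pmf over the next state, so every row sums to 1 ...
row_sums = [sum(row) for row in P]
assert all(abs(s - 1.0) < 1e-12 for s in row_sums)

# ... but the column sums are unconstrained.
col_sums = [sum(P[i][j] for i in range(3)) for j in range(3)]
print([round(s, 3) for s in col_sums])  # → [0.6, 1.3, 1.1]
```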
Continued–
3 n-step transition probability:
Statement: Let X0, X1, X2, ... be a Markov chain with state space S = {1, 2, ..., N}. Recall that the elements of the transition matrix P are defined as:
pij = P(X1 = j | X0 = i) = P(Xn+1 = j | Xn = i) for any n.
⇒ pij is the probability of making a transition from state i to state j in a single step.
Q What is the probability of making a transition from state i to state j over two steps? In other words, what is P(X2 = j | X0 = i)?
Ans The (i, j) entry of P², i.e. pij^(2) = Σ_k pik pkj.
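This can be confirmed with a small matrix product. A sketch (the 2-state matrix is illustrative): the two-step probability P(X2 = j | X0 = i) is the (i, j) entry of P², obtained by summing over the intermediate state k.

```python
# Illustrative 2-state transition matrix.
P = [
    [0.5, 0.5],
    [0.2, 0.8],
]

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P2 = matmul(P, P)  # P2[i][j] = P(X2 = j | X0 = i)

# Direct check for i = 0, j = 1: sum over the intermediate state k.
direct = sum(P[0][k] * P[k][1] for k in range(2))
assert abs(P2[0][1] - direct) < 1e-12
print(round(P2[0][1], 10))  # → 0.65
```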
References
E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
J. Grus, Data Science from Scratch: First Principles with Python. O'Reilly Media, 2019.
T. M. Mitchell, The Discipline of Machine Learning. Carnegie Mellon University, School of Computer Science, 2006, vol. 9.
S. Haykin, Neural Networks and Learning Machines, 3/E. Pearson Education India, 2010.