An Introduction to Reinforcement Learning
Akshay A. Salunkhe
Roll No.: 183021002
Department of Electrical Engineering
Indian Institute of Technology Dharwad
June 8, 2018
Akshay A. Salunkhe (IITDh) Reinforcement Learning June 8, 2018 1 / 41
Overview
1 Introduction to Reinforcement Learning
2 MC MRP
3 Bellman Equation
4 MDP
5 Bellman Optimality Equations
Introduction to Reinforcement Learning
Real life example
Baby trying to walk...
The baby seeks support
You deny support, so the baby tries on its own
Trial and error
Figure 1: baby learning how to walk
Introduction to Reinforcement Learning
Introduction
The baby explores the environment
reward: reaching the mother
action: trying to crawl or walk
Figure 2: system model
Introduction to Reinforcement Learning
How RL Is Different
An active way of learning
Feedback delayed, not instantaneous
Only reward signal, no supervisor
Sequential data, NOT i.i.d.
MC MRP
Markov State
A state S_t is a Markov state iff

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, ..., S_t]    (1)

i.e. the future is independent of the past given the present.
MC MRP
Markov Process
A Markov Process (or Markov Chain) is a tuple <S, P>

S is a finite set of states
P is a state transition probability matrix,

P_{ss'} = Pr[S_{t+1} = s' | S_t = s]    (2)
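A Markov chain is fully specified by its transition matrix, and trajectories can be sampled from it directly. A minimal Python sketch, using a hypothetical 3-state chain (the state names and probabilities below are illustrative, not the slides' Student example):

```python
import random

# Hypothetical 3-state Markov chain; names and probabilities are
# illustrative, not the Student example from the slides.
STATES = ["Study", "Rest", "Sleep"]
P = {
    "Study": {"Study": 0.5, "Rest": 0.4, "Sleep": 0.1},
    "Rest":  {"Study": 0.6, "Rest": 0.2, "Sleep": 0.2},
    "Sleep": {"Sleep": 1.0},   # absorbing terminal state
}

def sample_episode(start, max_steps=50, rng=random):
    """Roll out one trajectory by repeatedly sampling S_{t+1} ~ P[. | S_t]."""
    path, s = [start], start
    for _ in range(max_steps):
        if s == "Sleep":       # terminal: no further transitions
            break
        s = rng.choices(list(P[s]), weights=list(P[s].values()))[0]
        path.append(s)
    return path

random.seed(0)
episode = sample_episode("Study")
```

Each row of P sums to 1, so each `rng.choices` call is a draw from a proper conditional distribution over successor states.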
MC MRP
Example: Student Markov Process (or Markov Chain)
A simple Markov chain example with its transition probability matrix
Figure 3: Student Markov Process
MC MRP
Markov Reward Process
A Markov Reward Process is a Markov chain with values: a tuple <S, P, R, γ>

S is a finite set of states
P is a state transition probability matrix,

P_{ss'} = Pr[S_{t+1} = s' | S_t = s]    (3)

R is a reward function,

R_s = E[R_{t+1} | S_t = s]    (4)

γ is a discount factor, γ ∈ [0, 1]
MC MRP
Return
The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (5)
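Over a finite reward sequence, the sum in Eq. (5) becomes a simple backward fold. A small sketch (the reward sequence is illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):   # fold from the back: G = R + gamma * G_next
        g = r + gamma * g
    return g

# Illustrative reward sequence with gamma = 1/2:
g = discounted_return([-2, -2, -2, 10], 0.5)   # -2 - 1 - 0.5 + 1.25 = -2.25
```

Folding from the back avoids tracking powers of γ explicitly: each step multiplies the accumulated tail by γ once.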
MC MRP
The discount γ is used because:
the math works out well (sums stay bounded)
our model of the environment is not perfect, so near-term rewards are trusted more
it avoids infinite returns in cyclic Markov processes
MC MRP
State Value Function of an MRP
The state value function v(s) of an MRP is the expected return starting from state s,

v(s) = E[G_t | S_t = s]    (6)

The expectation is over all paths starting from s.
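Since v(s) is an expectation over paths, it can be estimated by averaging the returns of many sampled episodes (the Monte Carlo idea). A sketch on a hypothetical two-state MRP for which the exact value is known in closed form:

```python
import random

# Hypothetical two-state MRP (illustrative, not from the slides):
# from state "A", with prob 0.5 we stay in "A" and get reward +1,
# otherwise we move to the terminal state "T" with reward 0.
def rollout(gamma, rng):
    g, discount, s = 0.0, 1.0, "A"
    while s == "A":
        if rng.random() < 0.5:
            g += discount * 1.0   # stayed in A, collected +1
        else:
            s = "T"               # terminated, reward 0
        discount *= gamma
    return g

def mc_value(gamma, episodes=100_000, seed=0):
    """Estimate v(A) = E[G_t | S_t = A] by averaging sampled returns."""
    rng = random.Random(seed)
    return sum(rollout(gamma, rng) for _ in range(episodes)) / episodes

v_est = mc_value(gamma=0.9)
# exact value solves v = 0.5 * (1 + gamma * v), i.e. v = 0.5 / (1 - 0.45)
```

With γ = 0.9 the estimate should land close to 0.5/0.55 ≈ 0.91, illustrating that the expectation really is "over all the paths".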
MC MRP
Sample Returns for Student MRP
Starting from C1, with γ = 1/2:
Figure 4: Student MRP
MC MRP
State Value for C1
The value of C1 is the return averaged over all outward paths from C1
Figure 6: State values
MC MRP
State Values for Student MRP for γ=0
Figure 7: State values
MC MRP
State Values for Student MRP for γ=0.9
Figure 8: State values
Bellman Equation
Bellman Equation
The value function can be decomposed into two parts:
the immediate reward R_{t+1}
the discounted value of the successor state, γ v(S_{t+1})

v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
Bellman Equation
Example for state C3

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')    (7)
Bellman Equation
Bellman Equation in matrix form
The Bellman equation can be written in matrix form as

v = R + γ P v

where v and R are vectors over states and P is the transition matrix.
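For a small finite MRP the matrix equation can be solved directly as v = (I − γP)^{-1} R. A sketch for a hypothetical 2-state chain, writing out the closed-form 2×2 inverse so no linear-algebra library is needed:

```python
def solve_mrp_2state(P, R, gamma):
    """Direct solution v = (I - gamma*P)^(-1) R for a 2-state MRP,
    written out with the closed-form 2x2 inverse."""
    a = 1.0 - gamma * P[0][0]; b = -gamma * P[0][1]
    c = -gamma * P[1][0];      d = 1.0 - gamma * P[1][1]
    det = a * d - b * c
    # inv([[a, b], [c, d]]) = [[d, -b], [-c, a]] / det
    v0 = (d * R[0] - b * R[1]) / det
    v1 = (-c * R[0] + a * R[1]) / det
    return [v0, v1]

# Hypothetical chain: state 0 pays +1 and moves to 0 or 1 equally;
# state 1 is absorbing with zero reward. All numbers are illustrative.
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
v = solve_mrp_2state(P, R, gamma=0.9)
```

The result satisfies the Bellman equation exactly: plugging v back into R + γPv reproduces v, which is the sense in which the direct solution "solves" the linear system.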
Bellman Equation
Solving the Bellman Equation
It is a linear equation
A direct solution, v = (I − γP)^{-1} R, is possible for a finite MRP
MDP
Markov Decision Process
A Markov Decision Process is a tuple <S, A, P, R, γ>

S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix,

P^a_{ss'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]    (8)

R is a reward function,

R^a_s = E[R_{t+1} | S_t = s, A_t = a]    (9)

γ is a discount factor, γ ∈ [0, 1]
MDP
Example: student MDP
An MDP is an MRP with decisions
All states are Markov
An MRP arises when we fix a policy for the MDP
Figure 9: student MDP with actions
MDP
Policy
A policy π is a distribution over actions given states,

π(a|s) = Pr[A_t = a | S_t = s]

A policy fully defines the behaviour of an agent
MDP policies depend only on the current state,
i.e. policies are stationary (time-independent):

A_t ∼ π(· | S_t), ∀t > 0    (10)
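A stationary stochastic policy is just a lookup table of distributions, one per state, that the agent samples from at every time-step. A sketch with illustrative states and actions (not taken from the slides' figures):

```python
import random

# A stationary stochastic policy as a table pi(a | s).
# States, actions, and probabilities here are illustrative.
pi = {
    "C1": {"study": 0.5, "facebook": 0.5},
    "C2": {"study": 0.8, "sleep": 0.2},
}

def act(state, rng=random):
    """Sample A_t ~ pi(. | S_t); the same table is consulted at every
    time-step, which is what makes the policy stationary."""
    actions = list(pi[state])
    return rng.choices(actions, weights=[pi[state][a] for a in actions])[0]

random.seed(1)
a = act("C1")
```

Because `act` depends only on the current state, not on t or on the history, it realizes the condition A_t ∼ π(·|S_t) for all t.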
MDP
Value Functions: MDP
State Value Function:
The state value function v_π(s) of an MDP is the expected return starting from s and then following policy π,

v_π(s) = E_π[G_t | S_t = s]    (11)

Action Value Function:
The action value function q_π(s, a) is the expected return starting from s, taking action a, and then following policy π,

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (12)
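The two value functions are linked: the state value is the policy-weighted average of the action values, v_π(s) = Σ_a π(a|s) q_π(s, a). A minimal sketch with illustrative numbers:

```python
def v_from_q(pi_s, q_s):
    """v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

# Illustrative policy and action values for a single state s.
pi_s = {"left": 0.4, "right": 0.6}
q_s = {"left": 1.0, "right": 3.0}
v = v_from_q(pi_s, q_s)   # 0.4*1.0 + 0.6*3.0 = 2.2
```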
MDP
State Value Functions for Student MDP
Figure 10: State values
MDP
Value Functions: MDP
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can similarly be decomposed:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
MDP
Value Functions: MDP
Figure 11: recursive relation between state value functions
MDP
Value Functions: MDP
Figure 12: recursive relation between action value functions
MDP
Example: Student MDP
Figure 13: calculating State value for Student MDP
MDP
Bellman Expectation Equation: Matrix Form for MDP
The Bellman expectation equation can be expressed in matrix form as

v_π = R^π + γ P^π v_π

with direct solution v_π = (I − γ P^π)^{-1} R^π
Bellman Optimality Equations
Optimal Value Function
The optimal state value function v_*(s) is the maximum state value function over all policies,

v_*(s) = max_π v_π(s)

The optimal action value function q_*(s, a) is the maximum action value function over all policies,

q_*(s, a) = max_π q_π(s, a)

An MDP is solved when we know the optimal action-value function
Bellman Optimality Equations
Optimal State-Value Function for Student MDP
Bellman Optimality Equations
Optimal Action-Value Function for Student MDP
Bellman Optimality Equations
Optimal Policy for student MDP
Bellman Optimality Equations
Bellman Optimality Equations
The optimal state value and action value functions are recursively related by the Bellman optimality equations:

v_*(s) = max_a [ R^a_s + γ Σ_{s'} P^a_{ss'} v_*(s') ]

q_*(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s', a')
Bellman Optimality Equations
Solving the Bellman Optimality Equations
The presence of the max operator makes the equations nonlinear
There is no closed-form solution in general; solution methods include
value iteration
policy iteration
linear programming
Bellman Optimality Equations
Value iteration
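The slide's figure is not available in this text version; as a sketch, value iteration repeatedly applies the Bellman optimality backup until the values stop changing. All states, actions, and rewards below are illustrative:

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ],
    swept over all states until no value moves by more than tol."""
    v = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
                for a in A[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

# Hypothetical 2-state MDP; every number is illustrative.
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
P = {("s0", "stay"): {"s0": 1.0},
     ("s0", "go"):   {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "stay"): 1.0, ("s0", "go"): 0.0, ("s1", "stay"): 2.0}
v = value_iteration(S, A, P, R, gamma=0.5)
```

Convergence is guaranteed because the backup is a γ-contraction; here the fixed point is v(s1) = 2/(1 − 0.5) = 4 and v(s0) = 2.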
Bellman Optimality Equations
Policy iteration
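Likewise for policy iteration (the slide's figure is unavailable): alternate iterative evaluation of the current policy with greedy improvement until the policy stops changing. The toy MDP below is illustrative:

```python
def policy_iteration(S, A, P, R, gamma, sweeps=1000):
    """Alternate (i) iterative evaluation of the current policy and
    (ii) greedy improvement, until the policy is stable."""
    pi = {s: A[s][0] for s in S}      # arbitrary initial policy
    v = {s: 0.0 for s in S}

    def q(s, a):
        return R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())

    while True:
        for _ in range(sweeps):       # policy evaluation: v -> v_pi
            for s in S:
                v[s] = q(s, pi[s])
        new_pi = {s: max(A[s], key=lambda a: q(s, a)) for s in S}
        if new_pi == pi:              # policy stable => optimal
            return pi, v
        pi = new_pi

# Toy 2-state MDP; all numbers are illustrative. Only "go" from s0
# reaches the rewarding absorbing state s1.
S = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
P = {("s0", "stay"): {"s0": 1.0},
     ("s0", "go"):   {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 0.0, ("s1", "stay"): 2.0}
pi, v = policy_iteration(S, A, P, R, gamma=0.5)
```

Starting from the all-"stay" policy, one improvement step switches s0 to "go", after which the policy is stable and therefore greedy with respect to its own value function.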
Bellman Optimality Equations
The End