Q_Learning.ppt

Reinforcement Learning
BY:
DR. SUKHNANDAN KAUR
ASSISTANT PROFESSOR,
CSED, TIET

How can an agent learn behaviors when it doesn’t have a
supervisor to tell it how to perform?
This problem is called reinforcement learning.
Rewards:
Positive / Negative

Reinforcement Learning (cont.)
The goal is to get the agent to act in the world so as
to maximize its rewards.
The agent has to figure out which of its actions led to the
reward or punishment.

Components of Reinforcement Learning
1. Agent
2. Environment
3. Rewards
4. State
5. Action(s)

Q Learning
An AI agent can directly derive an optimal policy from its environment
without needing to build a model beforehand.
Q-learning is a model-free learning technique that can be
used to find the optimal action-selection policy using a Q
function.
The optimal Q function gives, for each possible state-action pair,
the largest expected return achievable by any policy π.

In reinforcement learning we want to learn a function Q(S, A) that scores each
action A in a state S, so that the agent can pick the actions that give the
maximum cumulative reward.
Example Q table (rows are states, columns are actions):
      a1    a2
s1    -1    1
s2     0    0.5
cumulative reward 1 = Q(s1, a1) + Q(s2, a1) = -1 + 0 = -1
cumulative reward 2 = Q(s1, a2) + Q(s2, a2) = 1 + 0.5 = 1.5
cumulative reward 3 = Q(s1, a2) + Q(s2, a1) = 1 + 0 = 1
cumulative reward 4 = Q(s1, a1) + Q(s2, a2) = -1 + 0.5 = -0.5
maximum cumulative reward = max(cumulative reward 1, cumulative reward 2,
cumulative reward 3, cumulative reward 4) = 1.5, obtained by choosing a2 in both states.
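The four sums above can be checked mechanically. The following is a minimal Python sketch that enumerates every combination of one action per state from the slide's Q table and picks the largest total. The function name `cumulative_rewards` and the dictionary layout are illustrative choices, not part of the original slides.

```python
from itertools import product

# Q table from the slide: rows are states, columns are actions.
Q = {
    "s1": {"a1": -1.0, "a2": 1.0},
    "s2": {"a1": 0.0, "a2": 0.5},
}

def cumulative_rewards(q_table):
    """Map each (action in s1, action in s2) choice to its summed Q value."""
    states = list(q_table)
    action_sets = [list(q_table[s]) for s in states]
    totals = {}
    for choice in product(*action_sets):
        totals[choice] = sum(q_table[s][a] for s, a in zip(states, choice))
    return totals

totals = cumulative_rewards(Q)
best_choice = max(totals, key=totals.get)
print(best_choice, totals[best_choice])  # ('a2', 'a2') 1.5
```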

How Does Q-Learning Work?
1. Initialize Q
2. Choose an action from Q
3. Take the action
4. Measure the reward
5. Update Q (then repeat from step 2)
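The steps listed above can be sketched as a small tabular Q-learning loop. Everything specific here is an assumption made for illustration and does not come from the slides: the five-state corridor environment, the `step` function, the hyperparameters `ALPHA`, `GAMMA` and `EPSILON`, and the occasional random action used for exploration. The update line is the standard Q-learning rule, Q(s, a) ← Q(s, a) + α (r + γ max Q(s', ·) − Q(s, a)).

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed hyperparameters

def step(state, action):
    """Hypothetical environment: reward 1 for reaching the last state, else 0."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, rng):
    """Mostly pick the best-known action; sometimes pick a random one."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]    # initialize Q
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng)             # choose action from Q
            next_state, reward, done = step(state, action)    # take action, get reward
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])  # update Q
            state = next_state
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # action chosen in each non-terminal state
```

In this toy corridor the learned greedy policy should move right in every non-terminal state, since that is the only way to reach the reward.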

Example
The user is in State = 1 and has two available actions: “Go to house” and “Retrieve Flower”.
Q(State = 1, Action = “Retrieve Flower”) = 0.5
Q(State = 1, Action = “Go to house”) = 3.0
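Given these two Q values, an agent that acts greedily in State 1 would pick the action with the larger value. A minimal Python sketch of that lookup follows; the helper name `greedy_action` and the dictionary keyed by (state, action) are illustrative assumptions.

```python
# Q values from the example slide, keyed by (state, action).
q_values = {
    (1, "Retrieve Flower"): 0.5,
    (1, "Go to house"): 3.0,
}

def greedy_action(q, state):
    """Return the action with the highest Q value in the given state."""
    candidates = {a: v for (s, a), v in q.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action(q_values, 1))  # Go to house
```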

Reactive Agent Algorithm (accessible or observable state)
Repeat:
 s ← sensed state
 If s is terminal then exit
 a ← choose action (given s)
 Perform a

Reactive Agent Algorithm using a policy
Repeat:
 s ← sensed state
 If s is terminal then exit
 a ← P(s) /* Choose action using policy P */
 Perform a
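The policy-driven loop above translates almost line for line into Python. This is a sketch under stated assumptions: `run_reactive_agent`, the `CounterWorld` environment and the `max_steps` safety cap are hypothetical names introduced for illustration, and the policy is a plain lookup table from state to action.

```python
def run_reactive_agent(sense, is_terminal, policy, perform, max_steps=1000):
    """Repeat: sense the state, stop if terminal, look up an action, perform it."""
    trace = []
    for _ in range(max_steps):          # cap added so the sketch always halts
        s = sense()                     # s <- sensed state
        if is_terminal(s):              # if s is terminal then exit
            break
        a = policy(s)                   # a <- P(s)
        perform(a)                      # perform a
        trace.append((s, a))
    return trace

class CounterWorld:
    """Hypothetical environment: a position on a line, terminal at position 3."""
    def __init__(self):
        self.position = 0

    def sense(self):
        return self.position

    def perform(self, action):
        self.position += 1 if action == "right" else -1

world = CounterWorld()
policy_table = {0: "right", 1: "right", 2: "right"}   # the policy P as a lookup table
trace = run_reactive_agent(world.sense, lambda s: s == 3, policy_table.get, world.perform)
print(trace)  # [(0, 'right'), (1, 'right'), (2, 'right')]
```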

Approaches
• Learn a policy directly: a function mapping from states to actions,
Q(S, A)
where S = {s1, s2, s3, s4} and A = {a1, a2, a3}
• Learn utility values for states (i.e., the value function)
If the outcome of performing an action at a state is deterministic, then the
agent can update the utility value U() of states:
◦ U(old state) = reward + U(new state)
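As a rough illustration of the deterministic utility update, the sketch below uses the usual backward form, in which the utility of the state the agent leaves equals the reward received on the transition plus the utility of the state it reaches. That reading, the three-step path, its rewards and the function name `update_utilities` are all assumptions made for this example.

```python
def update_utilities(transitions, utilities, sweeps=10):
    """Apply U(old) = reward + U(new) repeatedly over known deterministic transitions.

    transitions: list of (old_state, reward, new_state) tuples.
    """
    for _ in range(sweeps):
        for old, reward, new in transitions:
            utilities[old] = reward + utilities[new]
    return utilities

# Hypothetical path A -> B -> C -> goal with a step cost of 1 and a final reward of 10.
U = {"A": 0.0, "B": 0.0, "C": 0.0, "goal": 0.0}
path = [("A", -1.0, "B"), ("B", -1.0, "C"), ("C", 10.0, "goal")]
update_utilities(path, U)
print(U)  # {'A': 8.0, 'B': 9.0, 'C': 10.0, 'goal': 0.0}
```

Several sweeps are needed because the reward at the end of the path has to propagate backward one state at a time.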

Exploration / Exploitation policy
Wacky approach (exploration): act randomly in
hopes of eventually exploring entire environment
Greedy approach (exploitation): act to maximize
utility using current estimate
Reasonable balance: act more wacky (exploratory)
when the agent has little idea of the environment; act more
greedily when its model is close to correct.
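One common way to get this balance is an epsilon-greedy rule with a decaying epsilon. The slides do not name this technique, so it is swapped in here as an assumption: with probability epsilon the agent acts randomly (wacky), otherwise it acts greedily, and epsilon shrinks as training progresses. The names `epsilon_greedy` and `decayed_epsilon` and the decay constants are illustrative.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                     # explore
    return max(range(len(q_row)), key=q_row.__getitem__)     # exploit

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Assumed schedule: exponential decay from start, never dropping below end."""
    return max(end, start * decay ** episode)

rng = random.Random(0)
q_row = [0.2, 0.8, 0.1]   # hypothetical Q values for three actions in one state
early = [epsilon_greedy(q_row, decayed_epsilon(0), rng) for _ in range(1000)]
late = [epsilon_greedy(q_row, decayed_epsilon(500), rng) for _ in range(1000)]
# Fraction of times the best action (index 1) was chosen, early vs. late in training.
print(early.count(1) / 1000, late.count(1) / 1000)
```

Early on the best action is chosen only about as often as chance would give; late in training it is chosen almost every time.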

Summary
Reinforcement learning is an active area of research.
It is applicable to game playing, robot controllers, and other domains.
