Reinforcement Learning
Learning Through Interaction
Sutton:
“When an infant plays, waves its arms, or looks about, it has no explicit
teacher, but it does have a direct sensorimotor connection to its
environment. Exercising this connection produces a wealth of
information about cause and effect, about the consequences of
actions, and about what to do in order to achieve goals”
• Reinforcement learning is a computational approach to this type of
learning. It adopts an AI perspective to model learning through
interaction.
• A single learner (the agent) interacts with the system by taking an
action. Upon this action it receives a reward and jumps to
the next state. This makes online learning plausible.
Reinforcement Learning
Reinforcement Learning
Reinforcement Objective
• Learning the relation between the current situation (state) and the
action to be taken in order to optimize a “payment”
Predicting the expected future reward given the current state (s):
1. Which actions should we take in order to maximize our gain?
2. Which actions should we take in order to maximize the click rate?
• The action that is taken influences the next step (a “closed loop”)
• The learner has to discover which actions to take (in ML terminology:
some features of the feature vector are functions of the others)
RL - Elements
• State (s) - The place where the agent is right now.
Examples:
1. A position on a chess board
2. A potential customer in a sales website
• Action (a) - An action that the agent can take while it is in a state.
Examples:
1. Knight captures bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action
Examples:
1. A better (or worse) position
2. More money or more clicks
Basic Elements (Cont)
• Policy (π) - The “strategy” by which the agent decides which action to take.
Abstractly speaking, the policy is simply a probability function over actions that is
defined for each state
• Episode – A sequence of states and the actions taken in them
• Vπ(s) - The value function of a state s when following policy π. Mostly it is the
expected reward (e.g. in chess, the expected final outcome of the game if we
follow a given strategy)
• V(s) - Similar to Vπ(s) without a fixed policy (the expected reward over all
possible trajectories starting from s)
• Q(s,a) - The analog of V(s) on the state-action plane: the value of taking action a in state s
Examples
• Tic Tac Toe
• GridWorld (0, -1, 10, 5)
• We wish to find the best slot machine (best = max reward).
Strategy
Play! ... and find the machine with the biggest reward (on average)
• At the beginning we pick each machine randomly
• After several steps we gain some knowledge
How do we choose which machine to play?
1. Should we always use the best machine so far?
2. Should we pick one randomly?
3. Any other mechanism?
Slot Machines n-armed bandit
• The common trade-off
1. Always play the best machine so far - Exploitation
We may miss better machines due to statistical “noise”
2. Choose a machine randomly - Exploration
We often don’t play the optimal machine
Epsilon Greedy
We exploit with probability (1 - ε) and explore with probability ε
Typically ε = 0.1 (a sketch of this rule follows below)
Exploration, Exploitation & Epsilon Greedy
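A minimal sketch of ε-greedy action selection for the n-armed bandit, assuming sample-average value estimates; the machine payouts and parameters below are illustrative, not taken from the deck:

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Play an n-armed bandit with epsilon-greedy selection and
    sample-average estimates of each machine's reward."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    q_est = np.zeros(n)      # estimated value of each machine
    counts = np.zeros(n)     # number of times each machine was played
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(n)            # explore: random machine
        else:
            a = int(np.argmax(q_est))      # exploit: best machine so far
        r = rng.normal(true_means[a], 1.0) # noisy reward from machine a
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update
        total_reward += r
    return q_est, total_reward

# Example: three machines with different average payouts
q, total = epsilon_greedy_bandit([1.0, 1.5, 2.0])
print(q, total)
```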
• Some problems (like the n-armed bandit) are “Next Best Action” problems:
1. A single given state
2. A set of options that are associated with this state
3. A reward for each action
• Sometimes we wish to learn journeys
Examples:
1. Teach a robot to go from point A to point B
2. Find the fastest way to drive home
Episodes
• Episode
1. A “time series” of states {S1, S2, S3, .., SK}
2. For each state Si there is a set of options {O1, O2, .., Oki}
3. The learning formula (the “gradient”) depends not only on the immediate
rewards but on the next state as well
Episode (Cont.)
• The observed sequence:
st, at, Rt+1, st+1, at+1, Rt+2, ..., sT, aT, RT+1   (s - state, a - action, R - reward)
• We optimize our goal function, commonly maximizing the expected discounted return (a short computation sketch follows below):
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ... + γ^l Rt+l+1,   0 < γ ≤ 1 (discount / “aging” factor)
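A minimal sketch of how the discounted return Gt is computed from a recorded reward sequence; the reward values below are illustrative:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for t = 0,
    given the rewards R_1, R_2, ... observed after time t."""
    g = 0.0
    # iterate backwards so each reward is discounted the right number of times
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5
```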
Classical Example
The Pole Balancing
Find the exact force to apply
in order to keep the pole up.
The reward is 1 for every time step in which
the pole does not fall.
Reinforcement Learning – Foundation
Markov Property
Pr{ St+1 = s’, Rt+1 = r | S0, A0, R1, . . . , St-1, At-1, Rt , St , At }= Pr{ St+1 = s’, Rt+1 = r | St , At }
i.e. : The current state captures the entire history
• Markov processes are fully determined by the transition matrix P
Markov Process (or Markov Chain)
A tuple <S,P> where
S - set of states (mostly finite),
P a state transition probability matrix. Namely: Pss’= P [St+1 = s’ | St = s]
Markov Decision Process -MDP
A Markov Reward Process -MRP (Markov Chain with Values)
A tuple < S,P, R, γ>
S ,P as in Markov process,
R a reward function Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1] (as in Gt )
State Value Function for MRP:
v(s) = E [Gt | St = s]
MDP-cont
Bellman Eq.
• v(s) = E[Gt | St = s] = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s] =
E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s] = E[Rt+1 + γGt+1 | St = s]
We get a recursion rule (a worked example follows below):
v(s) = E[Rt+1 + γ v(St+1) | St = s]
Similarly we can define a value on the state-action space:
Q(s,a) = E[Gt | St = s, At = a]
MDP - an MRP with a finite set of actions A
MDP-cont
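For a small MRP the Bellman recursion is just a linear system, v = R + γ P v, which can be solved directly; a minimal sketch with an illustrative two-state transition matrix and reward vector (not from the deck):

```python
import numpy as np

# Illustrative 2-state MRP: transition matrix P and reward vector R
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, -1.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # value of each state under this MRP
```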
• Recall - policy π is the strategy: it maps states to (distributions over) actions.
π(a|s) = P[At = a | St = s]
We assume that for each time t and state S, π(·|St) is fixed (π is stationary)
For an MDP, a given policy π induces a reward function and transition matrix:
R -> Rπ, P -> Pπ
We modify V & Q accordingly:
Vπ(s) = Eπ[Gt | St = s]
Qπ(s,a) = Eπ[Gt | St = s, At = a]
Policy
• For V (or Q), the optimal value function v* is defined for each state s:
v*(s) = max_π vπ(s)   (π - policy)
Solving the MDP ≡ finding the optimal value function!
Optimal Policy
π ≥ π’ if vπ(s) ≥ vπ’(s) ∀s
Theorem
For every MDP there exists an optimal policy
Optimal Value Function
• If we know q*(s,a) we can find the optimal policy by acting greedily: π*(s) = argmax_a q*(s,a)
Optimal Value (Cont)
• Dynamic programming
• Monte Carlo
• TD methods
Objectives
Prediction - Evaluate the value function of a given policy
Control - Find the optimal policy (and optimal value function)
Solution Methods
• A class of algorithms used in many applications such as graph theory
(shortest path) and bioinformatics. It requires two essential properties of the problem:
1. It can be decomposed into sub-problems
2. Sub-solutions can be cached and reused
An RL MDP satisfies both
• We assume full knowledge of the MDP!
Prediction
Input: MDP and policy π
Output: Value function vπ
Control
Input: MDP
Output: Optimal value function v*, optimal policy π*
Dynamic Programming
• Given a policy π and the MDP, we wish to find Vπ(s)
Vπ(s) = Eπ[Rt+1 + γ vπ(St+1) | St = s]
• Since the policy and the MDP are known, this is a linear system of equations in the values vπ(s),
but solving it directly is extremely tedious. Let’s do something iterative instead (Sutton & Barto); a sketch follows below.
Prediction – Optimal Value Function
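A minimal sketch of iterative policy evaluation, assuming the policy-induced transition matrix Pπ and reward vector Rπ are known; the arrays below are illustrative:

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman expectation backup
    v <- R_pi + gamma * P_pi v  until convergence."""
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Illustrative 3-state example; the last state is absorbing
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
R_pi = np.array([0.0, 1.0, 0.0])
print(policy_evaluation(P_pi, R_pi))
```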
• Following the previous algorithm, one can alternate evaluation with a policy-improvement step (often greedy with
respect to the current value function) until the policy, and hence the value function, is optimal
Policy Improvement (policy iteration)
• Policy iteration requires repeated policy evaluation, which can be heavy.
• We can instead study V* directly and obtain the policy from it.
• The idea is the Bellman optimality equation: v*(s) = max_a ( Rs^a + γ Σ_s′ Pss′^a v*(s′) )
• Hence we can find V* iteratively, as the sketch below shows (and derive the optimal policy)
Value Iteration
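A minimal sketch of value iteration, assuming the full MDP is given as per-action transition matrices P[a] and rewards R[a]; all arrays here are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: array of shape (n_actions, n_states, n_states),
    R: array of shape (n_actions, n_states) with R[a, s] = E[R_{t+1} | s, a].
    Repeatedly apply v <- max_a (R[a] + gamma * P[a] v)."""
    n_states = R.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v          # shape (n_actions, n_states)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            policy = q.argmax(axis=0)  # greedy policy w.r.t. v*
            return v_new, policy
        v = v_new

# Illustrative 2-action, 2-state MDP
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])    # action 1: switch
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```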
Reinforcement Learning
• The formula supports online updates
• Bootstrapping
• Mostly we don’t have the MDP
DP -Remarks
• A model-free method (we don’t need the MDP)
1. It learns from generated episodes.
2. It must complete an episode to obtain the required average.
3. It is unbiased
• For a policy π
S0, A0, R1, ..., St ~ π
We use the empirical mean return rather than the expected return (see the sketch below):
V(St) = V(St) + (1 / N(St)) [Gt − V(St)]   N(St) - number of visits to St so far
For non-stationary cases we update differently:
V(St) = V(St) + α [Gt − V(St)]
In MC one must terminate the episode to get the value (we calculate the mean
explicitly), hence in grid-like problems it may work poorly
Monte Carlo Methods
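A minimal sketch of every-visit Monte Carlo prediction with the running-mean update, assuming episodes are given as lists of (state, reward-after-leaving-state) pairs; the sample episodes are illustrative:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Every-visit Monte Carlo evaluation: average the observed returns G_t
    per state, using the incremental mean V <- V + (1/N)(G - V)."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:          # episode: [(state, reward), ...]
        g = 0.0
        # walk the episode backwards to accumulate the discounted return
        for state, reward in reversed(episode):
            g = reward + gamma * g
            N[state] += 1
            V[state] += (g - V[state]) / N[state]
    return dict(V)

# Illustrative episodes over states 'A' and 'B'
episodes = [[('A', 0.0), ('B', 1.0)],
            [('A', 0.0), ('B', 0.0), ('B', 1.0)]]
print(mc_prediction(episodes))
```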
• Learn the optimal policy (using the Q function):
Monte Carlo Control
Temporal Difference - TD
• Motivation - combining DP & MC
As in MC - learning from experience, no explicit MDP
As in DP - bootstrapping, no need to complete the episodes
Prediction
Recall that for MC we have V(St) = V(St) + α [Gt − V(St)],
where Gt is known only at the end of the episode.
TD - Methods (Cont.)
• A TD method only needs to wait until the next step (TD(0)):
V(St) = V(St) + α [Rt+1 + γ V(St+1) − V(St)]
We can see that the two methods use different targets:
MC - Gt
TD - Rt+1 + γ V(St+1)
• Hence TD is a bootstrapping method
The estimation of V given a policy is straightforward, since acting with the policy
generates St+1 (a sketch follows below).
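A minimal sketch of TD(0) prediction on a stream of transitions, assuming each transition is a (state, reward, next_state, done) tuple; the sample data is illustrative:

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=0.9):
    """TD(0): after every step, update V(s) toward the bootstrapped target
    R + gamma * V(s') instead of waiting for the full return G_t."""
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])
    return dict(V)

# Illustrative transitions: (state, reward, next_state, episode_terminated)
transitions = [('A', 0.0, 'B', False), ('B', 1.0, 'T', True),
               ('A', 0.0, 'B', False), ('B', 0.0, 'A', False), ('A', 1.0, 'T', True)]
print(td0_prediction(transitions))
```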
Driving Home Problem
TD vs. MC - Summary
MC
• High variance, unbiased
• Good convergence
• Easy to understand
• Low sensitivity to initial conditions (i.c.)
TD
• More efficient
• Converges to Vπ (biased)
• More sensitive to i.c.
SARSA
• An on-policy method for learning Q (update after every step):
Q(St, At) = Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
The next step is using SARSA to build a control algorithm: we
learn the Q function on-policy and update the policy toward
greediness (see the sketch below).
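A minimal sketch of one SARSA step with an ε-greedy behaviour policy, assuming a tabular Q stored as a dict of dicts; the state and action names are illustrative:

```python
import random
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q[state][action]
ACTIONS = ['up', 'down', 'left', 'right']

def epsilon_greedy(state, epsilon=0.1):
    """Behaviour policy: mostly greedy w.r.t. Q, sometimes random."""
    if random.random() < epsilon or not Q[state]:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def sarsa_step(s, a, r, s_next, a_next, alpha=0.5, gamma=0.9, done=False):
    """On-policy update: the target uses the action a_next that the
    behaviour policy actually chose in s_next."""
    target = r + (0.0 if done else gamma * Q[s_next][a_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Illustrative single step
s, a = (0, 0), epsilon_greedy((0, 0))
s_next, r = (0, 1), -1.0
a_next = epsilon_greedy(s_next)
sarsa_step(s, a, r, s_next, a_next)
print(Q[s][a])
```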
On Policy Control Algorithm
Example Windy Grid-World
Q-learning - Off Policy
• Rather than bootstrapping on the action the behaviour policy actually takes next, we simply use
the best action for the next state:
Q(St, At) = Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
The control algorithm is then straightforward (a sketch follows below).
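A minimal sketch of the off-policy Q-learning update, reusing the tabular layout from the SARSA sketch above; the names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q[state][action]
ACTIONS = ['up', 'down', 'left', 'right']

def q_learning_step(s, a, r, s_next, alpha=0.5, gamma=0.9, done=False):
    """Off-policy update: the target maxes over actions in s_next,
    regardless of which action the behaviour policy will take there."""
    best_next = max(Q[s_next][b] for b in ACTIONS)
    target = r + (0.0 if done else gamma * best_next)
    Q[s][a] += alpha * (target - Q[s][a])

# Illustrative single step
q_learning_step(s=(0, 0), a='right', r=-1.0, s_next=(0, 1))
print(Q[(0, 0)]['right'])
```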
Value Function Approx.
• Sometimes we have a large-scale RL problem
1. TD-Gammon (Appendix)
2. Go (DeepMind)
3. Helicopter control (continuous)
• Our objectives are still control & prediction, but we have a huge
number of states.
• The tabular solutions that we presented are not scalable.
• Value function approximation will allow us to use models!
Value Function (Cont)
• Consider a large (continuous) MDP. We approximate
Vπ(s) ≈ V′π(s, w)
Qπ(s,a) ≈ Q′π(s, a, w)   w - set of function parameters
• We can train the parameters with both TD & MC.
• We can generalize values to unseen states
Type of Approximations
1. Linear Combinations
2. Neural networks (leading to DQN)
3. Wavelet solutions
Function Approximation - Techniques
• Define a feature vector X(S) for the state S, e.g.
Distance from target
Trend of a stock
Chess board configuration
• Training methods for W
• SGD
A linear function approximator takes the form V′π(S, W) = <X(S), W> (a sketch follows below)
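A minimal sketch of semi-gradient TD(0) with a linear approximator V′(s, w) = <x(s), w>, assuming a user-supplied feature map; everything here (feature function, data) is illustrative:

```python
import numpy as np

def features(state):
    """Illustrative feature map X(S): here the state is a scalar position."""
    return np.array([1.0, state, state ** 2])   # bias, position, position^2

def semi_gradient_td0(transitions, alpha=0.01, gamma=0.9, n_features=3):
    """Linear TD(0): w <- w + alpha * (R + gamma*<x',w> - <x,w>) * x."""
    w = np.zeros(n_features)
    for state, reward, next_state, done in transitions:
        x, x_next = features(state), features(next_state)
        v, v_next = x @ w, (0.0 if done else x_next @ w)
        td_error = reward + gamma * v_next - v
        w += alpha * td_error * x        # gradient of <x, w> w.r.t. w is x
    return w

# Illustrative transitions: (state, reward, next_state, done)
transitions = [(0.0, 0.0, 1.0, False), (1.0, 1.0, 2.0, False), (2.0, 5.0, 0.0, True)]
print(semi_gradient_td0(transitions * 100))
```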
RL-Based Problems
• There is no supervisor, only rewards; the solutions become:
Deep RL
Why use Deep RL?
• It allows us to find an optimal model (value/policy)
• It allows us to optimize a model
• Commonly we will use SGD
Examples
• Autonomous cars
• Atari
• DeepMind
• TD-Gammon
Q-Network
• We follow the value function approximation approach:
Q(s, a, w) ≈ Q*(s, a)
Q-Learning
• We simply treat the TD target as a supervised regression target (see the sketch below):
Target
r + γ max_a′ Q(s′, a′, w)
Loss - MSE
( r + γ max_a′ Q(s′, a′, w) − Q(s, a, w) )²
• We solve it with SGD
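A minimal sketch of one SGD step on this MSE loss for a tiny linear Q-network over state features; the feature sizes and data are illustrative, not the deck's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES = 4, 8
w = rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))  # one weight row per action

def q_values(state_features, w):
    """Linear Q-network: Q(s, a, w) = w[a] . x(s) for every action a."""
    return w @ state_features

def q_sgd_step(w, x, a, r, x_next, gamma=0.99, lr=0.01, done=False):
    """One SGD step on the loss (r + gamma*max_a' Q(s',a',w) - Q(s,a,w))^2.
    The target is treated as a constant (no gradient flows through it)."""
    target = r + (0.0 if done else gamma * q_values(x_next, w).max())
    td_error = target - q_values(x, w)[a]
    w = w.copy()
    w[a] += lr * td_error * x          # gradient of Q(s,a,w) w.r.t. w[a] is x
    return w

# Illustrative transition
x, x_next = rng.normal(size=N_FEATURES), rng.normal(size=N_FEATURES)
w = q_sgd_step(w, x, a=2, r=1.0, x_next=x_next)
print(q_values(x, w))
```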
Q-Network - Stability Issues
Divergences
• Correlation between successive samples (non-i.i.d. data)
• The policy is not necessarily stationary (which influences the Q values)
• The scale of the rewards and Q values is unknown
Deep Q-Network
Experience Replay
Replay data from the past with the current weights W.
This removes correlations in the data:
• Pick at with an (ε-)greedy policy
• Store the tuple (st, at, rt+1, st+1) in a replay memory
• Sample from the memory and calculate the MSE (a sketch follows below)
Experience Replay
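A minimal sketch of an experience-replay buffer, assuming uniform random sampling of minibatches; all names here are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) tuples and samples uncorrelated minibatches."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of the stream
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Illustrative usage
buf = ReplayBuffer()
for t in range(100):
    buf.push(s=t, a=t % 4, r=1.0, s_next=t + 1, done=(t % 10 == 9))
batch = buf.sample(8)
print(len(batch), batch[0])
```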
DQN (Cont)
Fixed Target Q-Network
In order to handle oscillations,
we calculate targets with respect to old parameters w⁻:
r + γ max_a′ Q(s′, a′, w⁻)
The loss becomes
( r + γ max_a′ Q(s′, a′, w⁻) − Q(s, a, w) )²
and periodically we set w⁻ <- w (see the snippet below)
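A minimal sketch of how the fixed-target trick changes the update from the Q-network sketch above: keep a frozen copy w⁻ for the target and sync it only every few steps; the sync interval and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(4, 8))   # online parameters, updated every step
w_target = w.copy()                       # frozen copy w- used only in the target
SYNC_EVERY = 100

def dqn_step(step, w, w_target, x, a, r, x_next, gamma=0.99, lr=0.01, done=False):
    """TD target uses w_target; the gradient only touches the online weights w."""
    target = r + (0.0 if done else gamma * (w_target @ x_next).max())
    td_error = target - (w @ x)[a]
    w[a] += lr * td_error * x
    if step % SYNC_EVERY == 0:            # periodic copy: w-  <-  w
        w_target = w.copy()
    return w, w_target

x, x_next = rng.normal(size=8), rng.normal(size=8)
w, w_target = dqn_step(1, w, w_target, x, a=0, r=1.0, x_next=x_next)
```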
DQN –Summary
Many further methods:
• RewardValue
• Double DQN
• Parallel Updates
Requires another lecture
Policy Gradient
• We have discussed:
1. Function approximations
2. Algorithms in which the policy is learned through the value functions
We can instead parametrize the policy directly with parameters θ:
πθ(s, a) = P[a | s, θ]
Remark: we focus on the model-free case!
Reinforcement Learning
Policy Based - Good & Bad
Good
Better in high-dimensional (or continuous) action spaces
Faster convergence
Bad
Less efficient (high variance)
Local minima
Example: Rock-Paper-Scissors
How to optimize a policy?
• We assume it is differentiable and work with the log-likelihood
• We further assume a Gibbs (softmax) distribution, i.e.
the policy is proportional to an exponentiated score of the features:
πθ(s, a) ∝ e^(−θᵀΦ(s,a))
Differentiating with respect to θ gives the policy-gradient update.
We can also use a Gaussian policy.
Optimize Policy (Cont.)
Actor-Critic
Critic - Updates the action-value function parameters w
Actor - Updates the policy parameters θ in the direction suggested by the critic (see the sketch below)
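A minimal sketch of a one-step actor-critic update with a softmax policy over a discrete action set and a linear critic; all of it is illustrative, not the deck's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES = 3, 5
theta = np.zeros((N_ACTIONS, N_FEATURES))   # actor (policy) parameters
w = np.zeros(N_FEATURES)                     # critic (value) parameters

def softmax_policy(x, theta):
    prefs = theta @ x
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def actor_critic_step(x, a, r, x_next, theta, w,
                      gamma=0.99, alpha_w=0.05, alpha_theta=0.01, done=False):
    """Critic: TD(0) update of w.  Actor: policy-gradient step scaled by the TD error."""
    td_error = r + (0.0 if done else gamma * (w @ x_next)) - w @ x
    w = w + alpha_w * td_error * x                        # critic update
    p = softmax_policy(x, theta)
    grad_log_pi = -np.outer(p, x)                         # d log pi / d theta for all actions
    grad_log_pi[a] += x                                   # plus the chosen action's features
    theta = theta + alpha_theta * td_error * grad_log_pi  # actor update
    return theta, w

x, x_next = rng.normal(size=N_FEATURES), rng.normal(size=N_FEATURES)
a = rng.choice(N_ACTIONS, p=softmax_policy(x, theta))
theta, w = actor_critic_step(x, a, 1.0, x_next, theta, w)
```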
Reinforcement Learning
• Rather than learning value functions, we learn action probabilities. Let At be the action
taken at time t:
Pr(At = a) = πt(a) = e^(Ht(a)) / Σ_{b=1..k} e^(Ht(b))
H - numerical preference
We assume a Gibbs/Boltzmann distribution
R̄t - the average reward until time t
Rt - the reward at time t
Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At))
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)   ∀a ≠ At
Gradient Bandit Algorithm (a sketch follows below)
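A minimal sketch of the gradient bandit algorithm with softmax preferences and an average-reward baseline; the bandit means are illustrative:

```python
import numpy as np

def gradient_bandit(true_means, steps=2000, alpha=0.1, seed=0):
    """Learn numerical preferences H(a); actions are drawn from the
    softmax of H, and preferences are updated against the reward baseline."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    H = np.zeros(k)              # numerical preferences
    avg_reward = 0.0             # baseline: average reward so far
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                           # softmax over preferences
        a = rng.choice(k, p=pi)
        r = rng.normal(true_means[a], 1.0)
        avg_reward += (r - avg_reward) / t       # incremental average of rewards
        # preference updates: raise the chosen action, lower the others
        H -= alpha * (r - avg_reward) * pi
        H[a] += alpha * (r - avg_reward)
    return H

print(gradient_bandit([1.0, 1.5, 2.0]))
```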
Further Reading
• Sutton & Barto
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
Pole balancing - https://www.youtube.com/watch?v=Lt-KLtkDlh8
• DeepMind papers
• David Silver - YouTube and ucl.ac.uk
• TD-Gammon
Thank you