Reinforcement Learning (RL)
Mehdi Elahi
Free University of Bozen / Bolzano

www.linkedin.com/in/mehdielahi
Introduction
§  Supervised Learning:
	the input-output examples are known
§  Reinforcement Learning:
	the input-output examples are unknown;
	instead, rewards and punishments are known
Motivation
•  Typical Active Learning
[Figure: Mean Absolute Error (MAE) vs. number of iterations for Strategy 1 and Strategy 2]
More info on Active Learning:
Rubens, Neil; Elahi, Mehdi; Sugiyama, Masashi; Kaplan, Dain. "Active Learning in Recommender Systems." Recommender Systems Handbook, Springer US (2015).
Motivation
•  Adaptive Active Learning
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the Adaptive Strategy, with the switching point marked]
More info on Adaptive Active Learning:
Elahi, Mehdi; Ricci, Francesco; Rubens, Neil. "A Survey of Active Learning in Collaborative Filtering Recommender Systems." Computer Science Review (2016).
Motivation
•  Adaptive Active Learning
[Figure: MAE vs. number of iterations for the Adaptive Strategy alone]
More info on Adaptive Active Learning:
Elahi, Mehdi; Ricci, Francesco; Rubens, Neil. "A Survey of Active Learning in Collaborative Filtering Recommender Systems." Computer Science Review (2016).
n-Armed Bandit

§  A slot machine with n arms, each of which gives a different reward
§  In every play, we should find the best arm to maximize the total reward

Predict the next reward → Choose the best arm → Learn from the reward
Example

§  Example:

	1st play	2nd play	3rd play
§  Every play is an Action (a)
§  Then the system makes a transition to the next State (s)
§  In every play a reward (r) is given, based on the chosen arm
§  How to play is a Policy (π), which maps states to actions
Action Value

§  Action value Qt(a):
	the estimated value of an action (a)
§  This method is called sample-average:

	$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$

	where $Q_t(a)$ is the action value at time t, $r_i$ are the rewards, and $k_a$ is the number of times action a is chosen
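As a minimal sketch (the function name is illustrative, not from the slides), the sample-average estimate is just the mean of the rewards observed so far for an arm:

```python
def sample_average(rewards):
    """Q_t(a): the mean of all rewards received so far for action a."""
    return sum(rewards) / len(rewards)

print(sample_average([1.0, 0.0, 2.0]))  # 1.0
```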
Optimal Action Value

§  Optimal action value Q*(a):
	the true value of an action (a)

§  The law of large numbers guarantees the convergence:

	$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$
Estimation

$$
\begin{aligned}
Q_{k+1} &= \frac{1}{k+1}\sum_{i=1}^{k+1} r_i \\
&= \frac{1}{k+1}\Big[r_{k+1} + \sum_{i=1}^{k} r_i\Big] \\
&= \frac{1}{k+1}\big[r_{k+1} + kQ_k + Q_k - Q_k\big] \\
&= \frac{1}{k+1}\big[r_{k+1} + (k+1)Q_k - Q_k\big] \\
&= Q_k + \frac{1}{k+1}\big[r_{k+1} - Q_k\big]
\end{aligned}
$$
New estimation = Old estimation + Step size × [Target − Old estimation]
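This derivation means the sample average can be maintained incrementally, without storing past rewards. A minimal Python sketch (names are illustrative):

```python
def incremental_update(q_old, reward, k):
    """New estimate = old estimate + step size * (target - old estimate).

    k is the number of rewards already averaged into q_old,
    so the step size is 1 / (k + 1)."""
    return q_old + (reward - q_old) / (k + 1)

# Equivalent to averaging all rewards directly:
q = 0.0
for k, r in enumerate([1.0, 0.0, 2.0]):
    q = incremental_update(q, r, k)
print(q)  # 1.0, the mean of [1.0, 0.0, 2.0]
```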
Challenges
§  Stationary and non-stationary problems
§  Exploration and exploitation
§  Reinforcement comparison
Stationary and non-Stationary
§  If the rewards are fixed, we have a stationary problem.
§  You can keep what you have learned.
§  But in many cases the rewards change over time.
	This is called a non-stationary problem.
§  This means that once you have learned, you cannot keep it forever.
[Figure: rewards (r) vs. number of plays (t), an example of a non-stationary problem]
Stationary and non-Stationary

$$
\begin{aligned}
Q_k &= Q_{k-1} + \alpha\,[r_k - Q_{k-1}] \\
&= \alpha r_k + (1-\alpha)Q_{k-1} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 Q_{k-2} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 \alpha r_{k-2} + \dots + (1-\alpha)^{k-1}\alpha r_1 + (1-\alpha)^k Q_0 \\
&= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha(1-\alpha)^{k-i} r_i, \qquad 0 < \alpha \le 1
\end{aligned}
$$
Weight
§  Introducing the weight (α):
	weights recent rewards more heavily than long-past ones.
This method is called the exponential, recency-weighted average.
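A minimal sketch of the constant-step-size update (function name and reward sequence are illustrative), showing how a recent change in reward dominates the estimate:

```python
def constant_alpha_update(q_old, reward, alpha):
    """Exponential, recency-weighted average: recent rewards weigh more."""
    return q_old + alpha * (reward - q_old)

q = 0.0
for r in [1.0, 1.0, 5.0]:   # the reward jumps at the end
    q = constant_alpha_update(q, r, alpha=0.9)
print(q)  # ~4.6: the estimate tracks the most recent reward closely
```

With a large α the estimate adapts quickly to a non-stationary reward; with a small α it averages over a longer history.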
Exploration and Exploitation
§  Exploit:
	use what has already been learned in order to obtain a better reward
	Example: choosing the best action

§  Explore:
	learn from what has not been selected before, by trying other possible options
	Example: choosing a random action
Exploration and Exploitation
§  ε-Greedy:

	$a_t^* = \arg\max_a Q_t(a)$  (the best action: the maximum action value)

	$a_t = a_t^*$ with probability $1-\varepsilon$  (choosing the best action: Exploitation)
	$a_t =$ random action with probability $\varepsilon$  (choosing a not-the-best action: Exploration)

§  Softmax:
	Example: the Boltzmann method
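The ε-greedy rule can be sketched in a few lines of Python (names are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability 1 - epsilon exploit (argmax Q); else explore (random arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation

print(epsilon_greedy([0.2, 1.5, 0.7], epsilon=0.0))  # 1 (purely greedy)
```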
Performance
[Figure: average performance of ε-greedy methods vs. the greedy method]
Boltzmann Distribution

$$
P_t(a) = \frac{e^{Q_t(a)/T}}{\sum_{b=1}^{n} e^{Q_t(b)/T}}
$$

Gives the probability of choosing the action a at the play t; T is the Temperature.

Lower temperature → more greedy action selection → more Exploitation
Higher temperature → less greedy action selection → more Exploration
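A minimal sketch of the Boltzmann (softmax) probabilities, illustrating the temperature effect described above (function name is illustrative):

```python
import math

def boltzmann_probs(q_values, temperature):
    """P(a) = exp(Q(a)/T) / sum_b exp(Q(b)/T)."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Low temperature -> nearly greedy (exploitation);
# high temperature -> nearly uniform (exploration).
print(boltzmann_probs([1.0, 2.0], temperature=0.1))    # second arm gets almost all the mass
print(boltzmann_probs([1.0, 2.0], temperature=100.0))  # close to [0.5, 0.5]
```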
Reinforcement Comparison
§  We know that in RL:
	actions with large rewards should be followed more often than actions with small rewards
§  If the reward is 5, is it large or small?
§  A natural reference reward is the average of previously received rewards:
	large rewards > reference reward
	small rewards < reference reward

§  Methods based on this idea are called Reinforcement Comparison
§  This method:
	introduces the probability of choosing an action into the action-selection process:
	high rewards should increase the probability of reselecting the action that was taken
More Challenges
§  Optimistic Initial Value
§  Associative Search
Optimistic Initial Value
§  Imagine we set the initial action value very high (say +5 instead of 0)
§  Whatever action is chosen, the next action value would be less than +5
§  The system may be DISAPPOINTED!!
§  It ends up with temporary exploration actions
	Well-suited for stationary problems
	But not for non-stationary problems
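The effect can be sketched with a purely greedy learner and deterministic rewards (both are simplifying assumptions for illustration; names are invented):

```python
def greedy_plays(true_rewards, q_init, plays=20):
    """Purely greedy selection with sample-average updates; returns the arms tried."""
    q = [float(q_init)] * len(true_rewards)
    counts = [0] * len(true_rewards)
    tried = set()
    for _ in range(plays):
        a = max(range(len(q)), key=q.__getitem__)      # always pick the current best
        tried.add(a)
        counts[a] += 1
        q[a] += (true_rewards[a] - q[a]) / counts[a]   # sample-average update
    return tried

print(greedy_plays([1.0, 0.5, 2.0], q_init=5.0))  # {0, 1, 2}: every arm gets "disappointing" first pulls
print(greedy_plays([1.0, 0.5, 2.0], q_init=0.0))  # {0}: stuck on the first arm it tried
```

With the optimistic start, each pull lowers that arm's estimate below +5, so the greedy rule moves on to untried arms; with a zero start, the first rewarding arm looks best forever.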
Associative Search
§  Associative:
	inputs are mapped to outputs; learn the best output for each input
§  Non-Associative:
	learn (find) one best output
	Example: the bandit machine is changing over time
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
No averaging
ε = 0.1
α = 0.9
Exploration
Non-stationary problem
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
5-fold averaging
ε = 0.1
α = 0.9
Non-stationary problem
Preliminary Result
[Figure: MAE vs. number of iterations for Strategy 1, Strategy 2, and the RL Strategy]
10-fold averaging
ε = 0.1
α = 0.9
Non-stationary problem
RL References
•  R. S. Sutton et al., Reinforcement Learning: An Introduction, The MIT Press, Cambridge
•  B. Bakker, Decision Making in Intelligent Systems, Lecture 2, UvA, Amsterdam
•  C. Rothkopf, N-Armed Bandit Problems, FIAS
Thank	you!	
www.linkedin.com/in/mehdielahi
