Multi-armed Bandits
Dongmin Lee
RLI Study
Reference
- Reinforcement Learning : An Introduction,
Richard S. Sutton & Andrew G. Barto
(Link)
- Multi-Armed Bandits (멀티 암드 밴딧), 송호연
(https://brunch.co.kr/@chris-song/62)
Index
1. Why do you need to know Multi-armed Bandits?
2. A k-armed Bandit Problem
3. Simple-average Action-value Methods
4. A simple Bandit Algorithm
5. Gradient Bandit Algorithm
6. Summary
1. Why do you need to know
Multi-armed Bandits(MAB)?
1. Why do you need to know MAB?
- Reinforcement Learning (RL) uses training information that
‘Evaluates’ (rather than ‘Instructs’) the actions taken
- Evaluative feedback indicates ‘How good the action taken was’
- For simplicity, we consider a ‘Nonassociative’ setting: only one situation
- Most prior work on evaluative feedback was done in this setting
- ‘Nonassociative’ + ‘Evaluative feedback’ -> MAB
- MAB introduces the basic learning methods used in later chapters
1. Why do you need to know MAB?
In my opinion,
- We can hardly understand RL without first knowing MAB.
- MAB deals with ‘Exploitation & Exploration’, one of the core ideas in RL.
- The ideas behind MAB reappear throughout the full reinforcement learning problem.
- MAB is useful in many different fields.
2. A k-armed Bandit Problem
2. A k-armed Bandit Problem
Do you know what MAB is?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
- Slot machine -> Bandit
- Slot machine’s lever -> Arm
- N slot machines -> Multi-armed Bandits
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
Among the various slot machines,
which slot machine
should I put my money in
and pull the lever?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
How can you make the best return
on your investment?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
MAB is an algorithm created to
optimize investment across the slot machines
2. A k-armed Bandit Problem
A K-armed Bandit Problem
A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$t$ – discrete time step or play number
$A_t$ – action at time $t$
$R_t$ – reward at time $t$
$q_*(a)$ – true value (expected reward) of action $a$
A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
In our k-armed bandit problem, each of the k actions has
an expected or mean reward given that that action is selected;
let us call this the value of that action.
3. Simple-average Action-value Methods
3. Simple-average Action-value Methods
Simple-average Method
Simple-average Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$Q_t(a)$ converges to $q_*(a)$
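As an illustration (not from the original slides), here is a minimal Python sketch of the simple-average method; the function name and the toy plays below are assumptions for the example:

```python
import numpy as np

def sample_average_estimates(actions, rewards, k):
    """Simple-average method: Q_t(a) = (sum of rewards when a was taken)
    / (number of times a was taken); actions never taken default to 0."""
    totals = np.zeros(k)   # sum of rewards per action
    counts = np.zeros(k)   # number of times each action was taken
    for a, r in zip(actions, rewards):
        totals[a] += r
        counts[a] += 1
    return np.where(counts > 0, totals / np.maximum(counts, 1), 0.0)

# Example: a 3-armed bandit after five plays
acts = [0, 1, 1, 2, 0]
rews = [1.0, 0.0, 2.0, 0.5, 3.0]
print(sample_average_estimates(acts, rews, k=3))  # -> [2.0, 1.0, 0.5]
```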
3. Simple-average Action-value Methods
Action-value Methods
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- 𝜀-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- 𝜀-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Greedy Action Selection Method
$\operatorname{argmax}_a f(a)$ – a value of $a$ at which $f(a)$ takes its maximal value
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Greedy Action Selection Method
Greedy action selection always exploits
current knowledge to maximize immediate reward
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
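A minimal Python sketch of greedy selection over the current estimates $Q_t(a)$ (the array of estimates below is illustrative):

```python
import numpy as np

# Greedy action selection: always pick the action with the highest current
# estimate; np.argmax breaks ties by choosing the first maximal index.
def greedy_action(Q):
    return int(np.argmax(Q))

Q = np.array([0.2, 1.5, 0.7])   # illustrative estimates for a 3-armed bandit
print(greedy_action(Q))          # -> 1
```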
Greedy Action Selection Method
Greedy Action Selection Method’s disadvantage
Is it a good idea to always select the greedy action,
exploiting current knowledge
and maximizing only the current immediate reward?
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- 𝜀-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
𝜀-greedy Action Selection Method
Exploitation is the right thing to do to maximize
the expected reward on a single step,
but exploration may produce a greater total reward in the long run.
𝜀-greedy Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
𝜀 – probability of taking a random action in an 𝜀-greedy policy
𝜀-greedy Action Selection Method
Exploitation
Exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
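A minimal ε-greedy sketch in Python, reusing the same illustrative estimates; the RNG seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# epsilon-greedy: with probability epsilon explore (uniform random action),
# otherwise exploit the greedy action.
def epsilon_greedy_action(Q, epsilon=0.1):
    if rng.random() < epsilon:            # Exploration
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))              # Exploitation

Q = np.array([0.2, 1.5, 0.7])
print(epsilon_greedy_action(Q, epsilon=0.1))
```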
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- 𝜀-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Upper-Confidence-Bound(UCB) Action Selection Method
$\ln t$ – natural logarithm of $t$
$N_t(a)$ – the number of times that action $a$ has been selected prior to time $t$
$c$ – the number $c > 0$ controls the degree of exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Upper-Confidence-Bound(UCB) Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The idea of UCB action selection is that
the square-root term is a measure of
the uncertainty (or potential) in the estimate of $a$’s value –
the chance that this slot machine
may turn out to be the optimal one
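A minimal Python sketch of the UCB rule shown in the slide image, $A_t = \operatorname{argmax}_a \big[ Q_t(a) + c\sqrt{\ln t / N_t(a)} \big]$; treating untried actions as maximally uncertain, and the toy numbers, are assumptions:

```python
import numpy as np

# UCB action selection: value estimate plus an exploration bonus that shrinks
# as an action is selected more often.
def ucb_action(Q, N, t, c=2.0):
    untried = np.where(N == 0)[0]
    if len(untried) > 0:               # try every action at least once
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q = np.array([0.2, 1.5, 0.7])          # current value estimates (illustrative)
N = np.array([3, 10, 1])               # times each action has been selected
print(ucb_action(Q, N, t=14, c=2.0))
```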
Upper-Confidence-Bound(UCB) Action Selection Method
UCB Action Selection Method’s disadvantage
UCB is more difficult than 𝜀-greedy to extend beyond bandits
to the more general reinforcement learning settings.
One difficulty is dealing with nonstationary problems;
another difficulty is dealing with large state spaces.
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
4. A simple Bandit Algorithm
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
Incremental Implementation
$Q_n$ denotes the estimate of an action’s value
after that action has been selected $n-1$ times (i.e., after rewards $R_1, \dots, R_{n-1}$)
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$
holds even for $n = 1$,
obtaining $Q_2 = R_1$ for arbitrary $Q_1$
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The step size $\frac{1}{n}$ varies from step to step (it is not constant)
and is suitable for stationary problems
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The expression $[\mathit{Target} - \mathit{OldEstimate}]$ is an error in the estimate.
The target is presumed to indicate a desirable direction
in which to move, though it may be noisy.
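Putting the pieces together, a minimal sketch of the simple bandit algorithm with incremental sample-average updates and ε-greedy selection; the Gaussian reward model and the constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate),
# with StepSize = 1/n and Target = the sampled reward R_n.
def run_bandit(q_star, steps=1000, epsilon=0.1):
    k = len(q_star)
    Q = np.zeros(k)                      # value estimates
    N = np.zeros(k)                      # selection counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))     # explore
        else:
            a = int(np.argmax(Q))        # exploit
        r = rng.normal(q_star[a], 1.0)   # sample a reward for the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)
    return Q

print(run_bandit([0.1, 0.9, 0.5]))       # estimates approach the true q*(a)
```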
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Why do you think it should be changed from $\frac{1}{n}$ to $\alpha$?
Tracking a Nonstationary Problem
Why do you think it should be changed from $\frac{1}{n}$ to $\alpha$?
We often encounter RL problems that are effectively nonstationary.
In such cases it makes sense to give more weight
to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use
a constant step-size parameter.
The step-size parameter 𝛼 ∈ (0,1] is constant.
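A minimal sketch of the constant step-size update; the helper name and toy values are illustrative:

```python
# Q_{n+1} = Q_n + alpha * (R_n - Q_n), with alpha in (0, 1] held fixed,
# so recent rewards get more weight than long-past rewards.
def update_constant_alpha(Q, a, reward, alpha=0.1):
    Q[a] += alpha * (reward - Q[a])
    return Q

Q = [0.0, 0.0, 0.0]                      # illustrative estimates
Q = update_constant_alpha(Q, a=1, reward=2.0)
print(Q)                                 # -> [0.0, 0.2, 0.0]
```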
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$
$\quad\;\; = Q_n + \alpha R_n - \alpha Q_n$
$\quad\;\; = \alpha R_n + (1 - \alpha)Q_n$
By the same recursion,
$\therefore\; Q_n = \alpha R_{n-1} + (1 - \alpha)Q_{n-1}$
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
$\therefore\; Q_{n-1} = \alpha\left[R_{n-2} + (1-\alpha)R_{n-3} + \dots + (1-\alpha)^{n-3}R_1\right] + (1-\alpha)^{n-2}Q_1$
$(1-\alpha)^2 Q_{n-1} = (1-\alpha)^2\alpha R_{n-2} + (1-\alpha)^3\alpha R_{n-3} + \dots + (1-\alpha)^{n-1}\alpha R_1 + (1-\alpha)^n Q_1$
$\quad = (1-\alpha)^2\left\{\alpha R_{n-2} + (1-\alpha)\alpha R_{n-3} + \dots + (1-\alpha)^{n-3}\alpha R_1 + (1-\alpha)^{n-2}Q_1\right\}$
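Unrolling the recursion all the way back to $Q_1$ gives the exponential recency-weighted average from Sutton & Barto:

```latex
Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
```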
Tracking a Nonstationary Problem
Sequences of step-size parameters often converge
very slowly or need considerable tuning
in order to obtain a satisfactory convergence rate.
Thus, step-size parameters should be tuned effectively.
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
5. Gradient Bandit Algorithm
5. Gradient Bandit Algorithm
In addition to the simple bandit algorithm,
there is another approach that uses a gradient method as a bandit algorithm
5. Gradient Bandit Algorithm
We consider learning a numerical 𝑝𝑟𝑒𝑓𝑒𝑟𝑒𝑛𝑐𝑒
for each action 𝑎, which we denote 𝐻𝑡(𝑎).
The larger the preference, the more often that action is taken,
but the preference has no interpretation in terms of reward.
In other words, just because the preference $H_t(a)$ is large,
the reward is not necessarily large.
However, if the reward is large, it can affect the preference $H_t(a)$
5. Gradient Bandit Algorithm
The action probabilities are determined according to
a soft-max distribution (i.e., a Gibbs or Boltzmann distribution)
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
5. Gradient Bandit Algorithm
$\pi_t(a)$ – probability of selecting action $a$ at time $t$
Initially all action preferences are the same
so that all actions have an equal probability of being selected.
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
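A minimal soft-max sketch in Python; subtracting the max before exponentiating is a standard numerical-stability detail, not something stated in the slides:

```python
import numpy as np

# pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))
def softmax_probs(H):
    z = np.exp(H - np.max(H))      # shift by max for numerical stability
    return z / z.sum()

H = np.zeros(4)                    # initially all preferences are equal
print(softmax_probs(H))            # -> [0.25, 0.25, 0.25, 0.25]
```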
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
There is a natural learning algorithm for this setting
based on the idea of stochastic gradient ascent.
On each step, after selecting action $A_t$ and receiving the reward $R_t$,
the action preferences are updated.
Selected action $A_t$
Non-selected actions
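A minimal Python sketch of the gradient bandit update, using a running average of rewards as the baseline $\bar{R}_t$; the Gaussian reward model and the constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# H_{t+1}(A_t) = H_t(A_t) + alpha * (R_t - baseline) * (1 - pi_t(A_t))
# H_{t+1}(a)   = H_t(a)   - alpha * (R_t - baseline) * pi_t(a),  a != A_t
def gradient_bandit_step(H, baseline, alpha, q_star):
    pi = np.exp(H - H.max()); pi /= pi.sum()      # soft-max probabilities
    a = rng.choice(len(H), p=pi)                  # sample A_t
    r = rng.normal(q_star[a], 1.0)                # observe R_t
    one_hot = np.eye(len(H))[a]
    H += alpha * (r - baseline) * (one_hot - pi)  # both update cases at once
    return H, r

H, baseline, q_star = np.zeros(3), 0.0, [0.1, 0.9, 0.5]
for t in range(1, 501):
    H, r = gradient_bandit_step(H, baseline, alpha=0.1, q_star=q_star)
    baseline += (r - baseline) / t                # incremental average of rewards
print(H)                                          # preference grows for the best arm
```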
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
?
What does $\bar{R}_t$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$\bar{R}_t \in \mathbb{R}$ is the average of all the rewards.
The $\bar{R}_t$ term serves as a baseline.
If the reward is higher than the baseline,
then the probability of taking $A_t$ in the future is increased,
and if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction.
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
?
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
- Stochastic approximation to gradient ascent
in the gradient bandit algorithm
- Expected reward
- Expected reward by the law of total expectation: $\mathbb{E}[R_t] = \mathbb{E}\big[\,\mathbb{E}[R_t \mid A_t]\,\big]$
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
The gradient sums to zero over all the actions, $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$
– as $H_t(a)$ is changed, some actions’ probabilities go up
and some go down, but the sum of the changes must be
zero because the sum of the probabilities is always one.
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
$\mathbb{E}[R_t] = \mathbb{E}\big[\,\mathbb{E}[R_t \mid A_t]\,\big]$
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Please refer to page 40 of the slides linked in the reference slide
What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
∴ We can substitute a sample of the expectation above
for the performance gradient in $H_{t+1}(a) \doteq H_t(a) + \alpha\,\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}$
6. Summary
6. Summary
In this chapter, ‘Exploitation & Exploration‘ is the core idea.
6. Summary
Action-value Methods
- Greedy Action Selection Method
- 𝜀-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
6. Summary
A simple bandit algorithm : Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Gradient Bandit Algorithm
6. Summary
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
A parameter study of the various bandit algorithms
Reinforcement Learning is LOVE♥
Thank you