Chapter 2: Multi-armed Bandits
Seungjae Ryan Lee
One-armed Bandit
● Slot machine
● Each spin (action) is independent
Multi-armed Bandit problem
● Multiple slot machines to choose from
● Simplified setting that avoids the complexities of full RL problems
○ No observations
○ Actions have no delayed effects
10-armed Testbed
● 10 actions, 10 reward distributions
● Rewards drawn from stationary probability distributions
Expected Reward
● Knowing the expected reward q*(a) would trivialize the problem
● Estimate q*(a) with Q_t(a)
Sample-average
● Estimate by averaging received rewards:
  Q_t(a) = (sum of rewards when a was taken prior to t) / (number of times a was taken prior to t)
● Default value (e.g. 0) if the action was never selected
● Q_t(a) converges to q*(a) as the denominator goes to infinity
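The sample-average estimate above can be sketched in a few lines of Python (the function name `sample_average` is mine, not from the slides):

```python
def sample_average(rewards, default=0.0):
    """Sample-average estimate Q(a): the mean of rewards received for an action.

    Returns the default value if the action has never been selected."""
    if not rewards:
        return default
    return sum(rewards) / len(rewards)

assert sample_average([]) == 0.0          # never-selected action -> default
assert sample_average([1.0, 0.0, 2.0]) == 1.0
```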
Greedy method
● Always select greedily: A_t = argmax_a Q_t(a)
● No exploration
● Often stuck in suboptimal actions

Agent: "Eat the usual cereal?" (probability 1)
ε-greedy method
● Select a random action with probability ε
● All Q_t(a) converge to q*(a) as the denominators go to infinity

Agent: "Try a new cereal?" (probability 0.1) or "Eat the usual cereal?" (probability 0.9)
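ε-greedy action selection is a one-branch rule; a minimal sketch (the function name is my own):

```python
import random

def epsilon_greedy(Q, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))        # explore: uniform random action
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: highest estimate

# With epsilon = 0 this reduces to the pure greedy method.
assert epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0) == 1
```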
Greedy vs. ε-greedy
Incremental Implementation
● Don’t store the reward from every step
● Compute the average incrementally:
  Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
Nonstationary problem
● q*(a) changes over time
● Want to give new experience more weight
Exponentially weighted average
● Constant step-size parameter α:
  Q_{n+1} = Q_n + α [R_n − Q_n]
● Gives more weight to recent rewards
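The constant step-size variant only changes the 1/n factor to a fixed α; a sketch (function name mine):

```python
def update_constant(Q, reward, alpha=0.1):
    """Constant step-size update: recent rewards get exponentially more weight."""
    return Q + alpha * (reward - Q)

# The estimate tracks a shifted reward level instead of averaging over all history.
Q = 0.0
for _ in range(200):
    Q = update_constant(Q, 1.0)
# Q has moved close to 1.0 (within 0.9^200 of it)
```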
Weighted average
● Never completely converges
● Desirable in nonstationary problems

Sample-average
● Guaranteed convergence
● Converges slowly; needs tuning
● Seldom used in applications
Optimistic Initial Values
● Set initial action values optimistically (e.g. +5)
● Temporarily encourages exploration
● Doesn’t work in nonstationary problems

Example trajectory: Q = +5 → (R = 0) → +4.5 → (R = 0.1) → +4.06 → (R = −0.1) → +3.64 → (R = 0) → +3.28
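The value trajectory on the slide (+5 → +4.5 → +4.06 → +3.64 → +3.28) can be reproduced with a constant step-size update; the numbers are consistent with α = 0.1, which I am assuming here:

```python
def updated(Q, reward, alpha=0.1):
    # Constant step-size update toward the observed reward
    return Q + alpha * (reward - Q)

Q = 5.0                            # optimistic initial value
trajectory = [Q]
for r in [0.0, 0.1, -0.1, 0.0]:    # rewards from the slide
    Q = updated(Q, r)
    trajectory.append(round(Q, 2))
# trajectory == [5.0, 4.5, 4.06, 3.64, 3.28]
```

Because every early reward falls short of the inflated estimate, the greedy agent keeps getting "disappointed" and tries other arms, which is exactly the temporary exploration the slide describes.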
Optimistic Greedy vs. Realistic ε-greedy
Upper Confidence Bound (UCB)
● Takes into account each action’s potential to be optimal:
  A_t = argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]
● Selected less often → more potential
● Difficult to extend beyond multi-armed bandits
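A minimal sketch of UCB action selection (function name mine; untried actions are treated as maximally uncertain, following the textbook convention):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """UCB: pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    best, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a            # untried actions are selected first
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best, best_score = a, score
    return best

# An action selected far less often can win on its exploration bonus
# even when the value estimates are equal.
assert ucb_action([1.0, 1.0], [100, 1], t=101) == 1
```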
UCB vs. ε-greedy
Gradient Bandit Algorithms
● Learn a numerical preference H_t(a) for each action
● Convert to probabilities with softmax:
  π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}
Gradient Bandit: Stochastic Gradient Descent
● Update preferences with SGD:
  H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t))
  H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a)   for all a ≠ A_t
● Baseline R̄_t: average of all rewards so far
● Increase probability if reward is above baseline
● Decrease probability if reward is below baseline
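The two preference-update cases collapse into one line using an indicator for the selected action. A sketch (function names mine; the max-subtraction in softmax is a standard numerical-stability trick, not from the slides):

```python
import math

def softmax(H):
    # Preferences -> action probabilities
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(H, action, reward, baseline, alpha=0.1):
    """One SGD step on the preferences H after taking `action`."""
    pi = softmax(H)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == action else 0.0) - pi[a])
        for a, h in enumerate(H)
    ]

# A reward above the baseline raises the chosen action's preference
# and lowers everyone else's.
H = gradient_bandit_update([0.0, 0.0, 0.0], action=1, reward=1.0, baseline=0.0)
assert H[1] > 0 and H[0] < 0 and H[2] < 0
```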
Gradient Bandit: Results
Parameter Study
● Check each method’s performance at its best setting
● Check hyperparameter sensitivity
Associative Search (Contextual Bandit)
● Observe some context that can inform the decision
● Intermediate between the multi-armed bandit and the full RL problem
○ Need to learn a policy associating observations with actions
○ Each action only affects the immediate reward
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
