Chapter 2: Multi-armed Bandits
Seungjae Ryan Lee
One-armed Bandit
● Slot machine
● Each spin (action) is independent
Multi-armed Bandit problem
● Multiple slot machines to choose from
● Simplified setting that avoids the complexities of full RL problems
○ No observations
○ Actions have no delayed effects
10-armed Testbed
● 10 actions, 10 reward distributions
● Rewards drawn from stationary probability distributions
Expected Reward
● Knowing the expected reward q*(a) would trivialize the problem
● Estimate q*(a) with Q_t(a)
Sample-average
● Estimate by averaging received rewards:
  Q_t(a) = (sum of rewards when a was taken prior to t) / (number of times a was taken prior to t)
● Default value (e.g. 0) if the action was never selected
● Q_t(a) converges to q*(a) as the denominator goes to infinity
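The sample-average estimate above can be sketched in a few lines of Python (the function name `sample_average` is mine, not from the slides):

```python
def sample_average(rewards, default=0.0):
    """Sample-average estimate Q(a): the mean of rewards received for an action.

    Returns the default value if the action has never been selected."""
    if not rewards:
        return default
    return sum(rewards) / len(rewards)

assert sample_average([]) == 0.0          # never-selected action -> default
assert sample_average([1.0, 0.0, 2.0]) == 1.0
```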
Greedy method
● Always select greedily: A_t = argmax_a Q_t(a)
● No exploration
● Often stuck in suboptimal actions

Agent: "Eat the usual cereal?" (probability 1)
ε-greedy method
● Select a random action with probability ε
● All Q_t(a) converge to q*(a) as the denominators go to infinity

Agent: "Try a new cereal?" (probability 0.1) or "Eat the usual cereal?" (probability 0.9)
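ε-greedy action selection is a one-branch rule; a minimal sketch (the function name is my own):

```python
import random

def epsilon_greedy(Q, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))        # explore: uniform random action
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: highest estimate

# With epsilon = 0 this reduces to the pure greedy method.
assert epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0) == 1
```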
Greedy vs. ε-greedy
Incremental Implementation
● Don’t store the reward from every step
● Compute the average incrementally:
  Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
Nonstationary problem
● q*(a) changes over time
● Want to give new experience more weight
Exponentially weighted average
● Constant step-size parameter α:
  Q_{n+1} = Q_n + α [R_n − Q_n]
● Gives more weight to recent rewards
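The constant step-size variant only changes the 1/n factor to a fixed α; a sketch (function name mine):

```python
def update_constant(Q, reward, alpha=0.1):
    """Constant step-size update: recent rewards get exponentially more weight."""
    return Q + alpha * (reward - Q)

# The estimate tracks a shifted reward level instead of averaging over all history.
Q = 0.0
for _ in range(200):
    Q = update_constant(Q, 1.0)
# Q has moved close to 1.0 (within 0.9^200 of it)
```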
Weighted average
● Never completely converges
● Desirable in nonstationary problems

Sample-average
● Guaranteed convergence
● Converges slowly; needs tuning
● Seldom used in applications
Optimistic Initial Values
● Set initial action values optimistically (e.g. +5)
● Temporarily encourages exploration
● Doesn’t work in nonstationary problems

Example trajectory: Q = +5 → (R = 0) → +4.5 → (R = 0.1) → +4.06 → (R = −0.1) → +3.64 → (R = 0) → +3.28
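The value trajectory on the slide (+5 → +4.5 → +4.06 → +3.64 → +3.28) can be reproduced with a constant step-size update; the numbers are consistent with α = 0.1, which I am assuming here:

```python
def updated(Q, reward, alpha=0.1):
    # Constant step-size update toward the observed reward
    return Q + alpha * (reward - Q)

Q = 5.0                            # optimistic initial value
trajectory = [Q]
for r in [0.0, 0.1, -0.1, 0.0]:    # rewards from the slide
    Q = updated(Q, r)
    trajectory.append(round(Q, 2))
# trajectory == [5.0, 4.5, 4.06, 3.64, 3.28]
```

Because every early reward falls short of the inflated estimate, the greedy agent keeps getting "disappointed" and tries other arms, which is exactly the temporary exploration the slide describes.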
Optimistic Greedy vs. Realistic ε-greedy
Upper Confidence Bound (UCB)
● Takes into account each action’s potential to be optimal:
  A_t = argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]
● Selected less often → more potential
● Difficult to extend beyond multi-armed bandits
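A minimal sketch of UCB action selection (function name mine; untried actions are treated as maximally uncertain, following the textbook convention):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """UCB: pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    best, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a            # untried actions are selected first
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best, best_score = a, score
    return best

# An action selected far less often can win on its exploration bonus
# even when the value estimates are equal.
assert ucb_action([1.0, 1.0], [100, 1], t=101) == 1
```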
UCB vs. ε-greedy
Gradient Bandit Algorithms
● Learn a numerical preference H_t(a) for each action
● Convert to probabilities with softmax:
  π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}
Gradient Bandit: Stochastic Gradient Descent
● Update preferences with SGD:
  H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t))
  H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a)   for all a ≠ A_t
● Baseline R̄_t: average of all rewards so far
● Increase probability if reward is above baseline
● Decrease probability if reward is below baseline
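The two preference-update cases collapse into one line using an indicator for the selected action. A sketch (function names mine; the max-subtraction in softmax is a standard numerical-stability trick, not from the slides):

```python
import math

def softmax(H):
    # Preferences -> action probabilities
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(H, action, reward, baseline, alpha=0.1):
    """One SGD step on the preferences H after taking `action`."""
    pi = softmax(H)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == action else 0.0) - pi[a])
        for a, h in enumerate(H)
    ]

# A reward above the baseline raises the chosen action's preference
# and lowers everyone else's.
H = gradient_bandit_update([0.0, 0.0, 0.0], action=1, reward=1.0, baseline=0.0)
assert H[1] > 0 and H[0] < 0 and H[2] < 0
```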
Gradient Bandit: Results
Parameter Study
● Check each method’s performance at its best setting
● Check hyperparameter sensitivity
Associative Search (Contextual Bandit)
● Observe some context that can inform the decision
● Intermediate between the multi-armed bandit and the full RL problem
○ Need to learn a policy associating observations with actions
○ Each action only affects the immediate reward
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai
