Multi-armed bandits
• At each time t, pick arm i;
• get an independent payoff $f_t$ with mean $u_i$
• Classic model for the exploration–exploitation tradeoff
• Extensively studied (Robbins ’52, Gittins ’79)
• Typically assume each arm is tried multiple times
• Goal: minimize regret
[Figure: K arms with unknown means $u_1, u_2, \ldots, u_K$]
$R_T = T\,u_{\mathrm{opt}} - \mathbb{E}\!\left[\sum_{t=1}^{T} f_t\right]$
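To make the regret notion concrete, here is a minimal sketch of a K-armed bandit run scored by cumulative regret; the UCB1 index rule, payoff noise, and horizon are illustrative assumptions, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 5000
u = rng.uniform(0, 1, K)           # unknown arm means u_1..u_K
u_opt = u.max()

counts = np.zeros(K)               # times each arm was tried
means = np.zeros(K)                # empirical payoff means
reward = 0.0

for t in range(1, T + 1):
    if t <= K:                     # try each arm once first
        i = t - 1
    else:                          # UCB1 index: optimism under uncertainty
        i = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
    f_t = u[i] + rng.normal(0, 0.1)    # noisy payoff with mean u[i]
    counts[i] += 1
    means[i] += (f_t - means[i]) / counts[i]
    reward += f_t

regret = T * u_opt - reward        # realized analogue of R_T above
print(f"cumulative regret R_T = {regret:.1f} (sublinear in T)")
```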
Infinite-armed bandits
[Figure: infinitely many arms with payoffs $p_1, p_2, p_3, \ldots, p_\infty$]
In many applications, the number of arms is huge (sponsored search, sensor selection)
We cannot try each arm even once
Assumptions on the payoff function f are essential
Optimizing Noisy, Unknown Functions
• Given: Set of possible inputs D;
black-box access to unknown function f
• Want: Adaptive choice of inputs $x_1, x_2, \ldots$ from D maximizing the total payoff $\sum_t f(x_t)$
• Many applications: robotic control [Lizotte
et al. ’07], sponsored search [Pande &
Olston, ’07], clinical trials, …
• Sampling is expensive
• Algorithms evaluated using regret
Goal: minimize the average regret $\frac{1}{T} R_T$, where $R_T = \sum_{t=1}^{T}\big[\max_{x \in D} f(x) - f(x_t)\big]$
Running example: Noisy Search
• How to find the hottest point in a building?
• Many noisy sensors available but sampling is expensive
• D: set of sensors; $f(x_i)$: temperature at the sensor $x_i$ chosen at step i
• Observe $y_i = f(x_i) + \varepsilon_i$ (noisy reading)
• Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries
Relating to us: Active learning for PMF
A bandit setting for movie recommendation
• Task: recommend movies to a new user
• M-armed bandit: each movie item is an arm of the bandit
• For a new user i: at each round t, pick a movie j and observe a rating $X_{ij}$
• Goal: maximize the cumulative reward, i.e., the sum of the ratings of all recommended movies
Model: PMF
• $X = UV + E$, where U is an N×K matrix, V is a K×M matrix, and E is an N×M matrix of zero-mean Gaussian noise
• Assume the movie features V are fully observed; the user feature $U_i$ is unknown at first
• $X_i(j) = U_i V_j + \varepsilon$ (regard the i-th row vector of X as a function $X_i$)
• $X_i(\cdot)$ is a random linear function
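As a sketch of this linear-bandit view (the dimensions, noise level, and UCB width below are illustrative assumptions, not from the slides), one can maintain a Bayesian linear-regression posterior over the unknown user feature $U_i$ and pick movies by a UCB score:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, T = 200, 5, 50            # movies, latent dims, rounds (illustrative)
V = rng.normal(size=(K, M))     # known movie features (columns V_j)
u_i = rng.normal(size=K)        # true user feature U_i (unknown to the learner)
sigma2 = 0.25                   # rating noise variance

# Bayesian linear regression posterior over U_i, with prior N(0, I)
A = np.eye(K)                   # posterior precision
b = np.zeros(K)

for t in range(1, T + 1):
    cov = np.linalg.inv(A)
    mu = cov @ b
    # UCB over the M linear arms: posterior mean + confidence width
    width = np.sqrt(np.log(t + 1) * np.einsum('km,kl,lm->m', V, cov, V))
    j = int(np.argmax(V.T @ mu + width))
    x_ij = u_i @ V[:, j] + rng.normal(0, np.sqrt(sigma2))  # observed rating
    A += np.outer(V[:, j], V[:, j]) / sigma2               # posterior update
    b += x_ij * V[:, j] / sigma2
```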
Key insight: Exploit correlation
• Sampling f(x) at one point x yields information about f(x’)
for points x’ near x
• In this paper:
Model correlation using a Gaussian process (GP) prior for f
[Figure: building floorplan; temperature is spatially correlated]
Gaussian Processes to model payoff f
• Gaussian process (GP) = normal distribution over functions
• Finite marginals are multivariate Gaussians
• Closed form formulae for Bayesian posterior update exist
• Parameterized by covariance function K(x,x’) = Cov(f(x),f(x’))
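The closed-form posterior update fits in a few lines of NumPy. This sketch assumes 1-D inputs, the squared-exponential kernel introduced two slides later, and an illustrative noise variance:

```python
import numpy as np

def sq_exp(a, b, h=0.3):
    """Squared-exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / h ** 2)

def gp_posterior(X, y, Xs, noise=0.05, h=0.3):
    """Closed-form GP posterior mean and variance at test points Xs."""
    K = sq_exp(X, X, h) + noise * np.eye(len(X))   # train covariance + noise
    Ks = sq_exp(X, Xs, h)                          # train-test covariances
    Kss = sq_exp(Xs, Xs, h)                        # test covariance
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y                                 # posterior mean
    var = np.diag(Kss - Ks.T @ sol)                # posterior variance
    return mu, var
```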
[Figure: from a 1-D normal distribution, to an n-D multivariate normal, to a Gaussian process (an ∞-dimensional Gaussian)]
Thinking about GPs
• Kernel function K(x, x’) specifies covariance
• Encodes smoothness assumptions
[Figure: GP sample paths f(x) over x, with the Gaussian marginal P(f(x)) at a fixed x]
Example of GPs
• Squared-exponential kernel: $K(x,x') = \exp\big(-(x-x')^2/h^2\big)$
[Figure: samples from P(f) for bandwidth h = 0.1 (rapidly varying) and h = 0.3 (smooth); kernel value decays with distance |x − x'|]
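A short sketch of how such prior samples are drawn; the grid size, jitter, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.linspace(0, 1, 200)

for h in (0.1, 0.3):
    K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / h ** 2)
    K += 1e-8 * np.eye(len(xs))                  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    samples = L @ rng.normal(size=(len(xs), 3))  # three draws from P(f)
    # larger h gives smoother functions; smaller h gives wigglier ones
```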
Gaussian process optimization
[e.g., Jones et al ’98]
[Figure: unknown function f(x) and sampled points]
Goal: adaptively pick inputs $x_1, x_2, \ldots$ such that the average regret $\frac{1}{T}\sum_{t=1}^{T}\big[f(x^*) - f(x_t)\big] \to 0$
Key question: how should we pick samples?
So far, only heuristics:
Expected Improvement [Močkus et al. ‘78]
Most Probable Improvement [Močkus ‘89]
Used successfully in machine learning [Ginsbourger et al. ‘08,
Jones ‘01, Lizotte et al. ’07]
No theoretical guarantees on their regret!
Simple algorithm for GP optimization
• In each round t do:
• Pick $x_t = \arg\max_x \mu_{t-1}(x)$, the maximizer of the current posterior mean
• Observe $y_t = f(x_t) + \varepsilon_t$
• Use Bayes' rule to get the updated posterior mean $\mu_t$
Can get stuck in local maxima!
Uncertainty sampling
Pick: $x_t = \arg\max_x \sigma^2_{t-1}(x)$, the point of maximal posterior variance
That's equivalent to (greedily) maximizing the information gain $I(y_A; f) = H(f) - H(f \mid y_A)$
Popular objective in Bayesian experimental design (where the goal is pure exploration of f)
But… it wastes samples by exploring f everywhere!
Avoiding unnecessary samples
Key insight: Never need to sample where the Upper Confidence Bound (UCB) is below the best lower bound!
[Figure: confidence bands around f; the best lower bound rules out regions whose UCB falls below it]
Upper Confidence Bound (UCB) Algorithm
Naturally trades off exploration and exploitation; no samples are wasted
Regret bounds: classic [Auer ’02] & linear f [Dani et al. ‘07]
But none in the GP optimization setting! (popular heuristic)
Pick the input that maximizes the Upper Confidence Bound (UCB):
$x_t = \arg\max_{x \in D} \; \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)$
How should we choose $\beta_t$? Need theory!
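A minimal GP-UCB loop, reusing the gp_posterior helper sketched earlier; the test function and the constants in the $\beta_t$ schedule are illustrative assumptions, only the $\Theta(\log t)$ growth matters:

```python
import numpy as np

rng = np.random.default_rng(3)
D = np.linspace(0, 1, 500)                           # finite set of candidate inputs
f = lambda x: np.sin(6 * x) + 0.5 * np.cos(15 * x)   # hypothetical unknown payoff

X, y = [0.5], [f(0.5) + rng.normal(0, 0.05)]         # one seed observation
for t in range(2, 60):
    mu, var = gp_posterior(np.array(X), np.array(y), D)  # helper from above
    beta_t = 2 * np.log(len(D) * t ** 2)             # Theta(log t); constants illustrative
    ucb = mu + np.sqrt(beta_t) * np.sqrt(np.maximum(var, 0))
    # beta_t = 0 recovers greedy posterior-mean sampling (can get stuck);
    # ranking by variance alone recovers uncertainty sampling (pure exploration)
    x_t = float(D[np.argmax(ucb)])
    X.append(x_t)
    y.append(f(x_t) + rng.normal(0, 0.05))
```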
How well does UCB work?
• Intuitively, performance should depend on how
“learnable” the function is
“Easy” vs. “Hard”
The quicker the confidence bands collapse, the easier the function is to learn
Key idea: rate of collapse ↔ growth of information gain
[Figure: GP samples with bandwidth h = 0.3 (“easy”, smooth) vs. h = 0.1 (“hard”, rapidly varying)]
Learnability and information gain
• We show that regret bounds depend on how quickly we can gain information
• Mathematically: $R_T = O^*\!\big(\sqrt{T\,\gamma_T}\big)$, where $\gamma_T$ is the maximal information gain after T samples
• Establishes a novel connection between GP optimization and Bayesian experimental design
Performance of optimistic sampling
Theorem: If we choose $\beta_t = \Theta(\log t)$, then with high probability,
$R_T = O^*\!\big(\sqrt{T\,\gamma_T}\big)$
Hereby $\gamma_T = \max_{A \subseteq D,\,|A| \le T} I(y_A; f)$ is the maximal information gain due to sampling
The slower $\gamma_T$ grows, the easier f is to learn
Key question: How quickly does $\gamma_T$ grow?
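For Gaussian noise the information gain has a closed form, $I(y_A; f) = \tfrac{1}{2}\log\det(I + \sigma^{-2} K_A)$, so $\gamma_T$ can be approximated greedily; the near-optimality of the greedy rule follows from the submodularity discussed on the next slide. A sketch, where the kernel matrix and noise level are illustrative assumptions:

```python
import numpy as np

def info_gain(K_A, sigma2=0.05):
    """I(y_A; f) = 0.5 * log det(I + K_A / sigma2) for Gaussian noise."""
    n = len(K_A)
    _, logdet = np.linalg.slogdet(np.eye(n) + K_A / sigma2)
    return 0.5 * logdet

def greedy_gamma(K_full, T, sigma2=0.05):
    """Greedy approximation to gamma_T = max_{|A| <= T} I(y_A; f).
    Near-optimal (1 - 1/e factor) because information gain is submodular."""
    A = []
    for _ in range(T):
        best = max((i for i in range(len(K_full)) if i not in A),
                   key=lambda i: info_gain(K_full[np.ix_(A + [i], A + [i])], sigma2))
        A.append(best)
    return info_gain(K_full[np.ix_(A, A)], sigma2), A
```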
Learnability and information gain
• Information gain exhibits diminishing returns (submodularity)
[Krause & Guestrin ’05]
• Our bounds depend on the “rate” at which the returns diminish
[Figure: information gain curves, little diminishing returns (hard to learn) vs. returns that diminish fast (easy to learn)]
Dealing with high dimensions
Theorem: For various popular kernels, we have:
• Linear: $\gamma_T = O(d \log T)$
• Squared-exponential: $\gamma_T = O\big((\log T)^{d+1}\big)$
• Matérn with $\nu > 1$: $\gamma_T = O\big(T^{d(d+1)/(2\nu + d(d+1))} \log T\big)$
Smoothness of f helps battle the curse of dimensionality!
Our bounds rely on submodularity of the information gain
What if f is not from a GP?
• In practice, f may not be Gaussian
Theorem: Let f lie in the RKHS of kernel K with $\|f\|_K \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = 2B^2 + 300\,\gamma_t \log^3(t/\delta)$. Then with high probability,
$R_T = O^*\!\big(\sqrt{T}\,\big(B\sqrt{\gamma_T} + \gamma_T\big)\big)$
• Frees us from knowing the “true prior”
• Intuitively, the bound depends on the “complexity” of the function through its RKHS norm
Experiments: UCB vs. heuristics
• Temperature data
• 46 sensors deployed at Intel Research, Berkeley
• Collected data for 5 days (1 sample/minute)
• Want to adaptively find the highest temperature as quickly as possible
• Traffic data
• Speed data from 357 sensors deployed along highway I-880 South
• Collected during 6am–11am, for one month
• Want to find the most congested (lowest-speed) area as quickly as possible
Comparison: UCB vs. heuristics
GP-UCB compares favorably with existing heuristics
Assumptions on f
• Linear? [Dani et al. ’07]: fast convergence, but a strong assumption
• Lipschitz-continuous (bounded slope) [Kleinberg ‘08]: very flexible, but convergence is slow in high dimensions
Conclusions
• First theoretical guarantees and convergence rates
for GP optimization
• Both true prior and agnostic case covered
• Performance depends on “learnability”, captured by
maximal information gain
• Connects GP Bandit Optimization & Experimental Design!
• Performance on real data comparable to other heuristics
Editor's Notes
• #2: Explanation of k-armed bandit!
• #4: Repeat what f is – give an example!
• #5: Floorplan looks funny (pixelated)
• #6: Floorplan looks funny (pixelated)
• #17: Add cartoon plot for $\gamma_T$; need axes, etc.