Lab Seminar: Contextual Bandit Survey
Sangwoo Mo
KAIST
swmo@kaist.ac.kr
August 4, 2016
Overview
1 Problem Setting
2 Naïve Approach: Reduce to MAB
3 Stochastic Contextual Bandit
UCB & Thompson Sampling
Arbitrary Set of Policies
4 Adversarial Contextual Bandit
5 Supervised Learning to Contextual Bandit
Problem Setting
Multi-Armed Bandit
At each time t, the agent selects an arm a_t ∈ {1, ..., K}
Then, the agent receives a reward r_t (= r_{a_t,t}) from the environment
If r_{i,t} is drawn i.i.d. from some distribution, we call it a stochastic bandit; if
r_{i,t} is chosen by the environment, we call it an adversarial bandit
The goal of MAB is to find the policy π ∈ Π with
π(a_1, r_1, ..., a_{t−1}, r_{t−1}) = a_t
which minimizes the regret¹

R_T := \max_{i=1,\dots,K} \mathbb{E}\Big[ \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t} \Big]

¹ Properly speaking, cumulative pseudo-regret.
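To make the protocol concrete, the following is a minimal simulation sketch of the stochastic MAB interaction and the pseudo-regret computation; the Bernoulli environment and the uniformly random policy are illustrative assumptions, not part of the setup above.

```python
import random

def run_bandit(policy, arm_means, T, rng=random.Random(0)):
    """Play T rounds: the policy maps the history to an arm, the environment returns
    a reward, and the pseudo-regret compares against the best fixed arm in hindsight."""
    history, total_reward = [], 0.0
    for t in range(T):
        arm = policy(history)                        # a_t = pi(a_1, r_1, ..., a_{t-1}, r_{t-1})
        reward = float(rng.random() < arm_means[arm])
        history.append((arm, reward))
        total_reward += reward
    pseudo_regret = T * max(arm_means) - sum(arm_means[a] for a, _ in history)
    return total_reward, pseudo_regret

# Usage: a uniformly random policy on a 3-armed Bernoulli bandit.
means = [0.2, 0.5, 0.7]
random_policy = lambda history: random.randrange(len(means))
print(run_bandit(random_policy, means, T=10000))
```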
Contextual Bandit
In a contextual bandit, the agent receives additional information
(= a context) c_t ∈ C¹ at the beginning of time t
In a stochastic contextual bandit, the reward r_{i,t} can be represented as
a function of the context c_{i,t} and noise ε_{i,t}:

r_{i,t} = f(c_{i,t}) + ε_{i,t}

or simply r_{i,t} = f_i(c_t) + ε_{i,t} if c_t is independent of i
In an adversarial contextual bandit, the reward r_{i,t} is selected by the
environment, as in the non-contextual MAB

¹ Many papers write c_{i,t} to emphasize that each arm i has a corresponding context c_{i,t}. The two notations
are equivalent, since we can construct a single vector c_t by concatenating the c_{i,t}'s.
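As a concrete instance of the stochastic model above, here is a toy environment sketch with a linear f; the linear mapping and noise scale are simplifying assumptions used only for illustration.

```python
import numpy as np

class LinearContextualEnv:
    """Toy stochastic contextual bandit: r_{i,t} = <c_{i,t}, theta*> + noise."""

    def __init__(self, n_arms, d, noise=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta_star = self.rng.normal(size=d)   # unknown to the agent
        self.n_arms, self.d, self.noise = n_arms, d, noise

    def contexts(self):
        # One context vector per arm at the start of round t.
        return self.rng.normal(size=(self.n_arms, self.d))

    def reward(self, contexts, arm):
        return contexts[arm] @ self.theta_star + self.noise * self.rng.normal()

# One round of interaction, with the agent (arbitrarily) choosing arm 0.
env = LinearContextualEnv(n_arms=4, d=5)
C = env.contexts()
r = env.reward(C, arm=0)
```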
Optimal Regret Bound
Stochastic Bandit: Ω(log T)¹
Adversarial Bandit: Ω(√(KT))²
Contextual Bandit: Ω(d√T)³
¹ Lai & Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
² Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS, 1995. By a minimax strategy.
Note that an adversarial bandit can be thought of as a 2-player game between the agent and the environment.
³ Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008. Remark that the lower bound is Ω(√T)
even for the stochastic contextual bandit, since the contexts may arrive adversarially.
Naïve Approach: Reduce to MAB
Naïve Approach: Reduce to MAB
Approach 1: assume the context set is finite (|C| = N)
Run an MAB algorithm (e.g., EXP3) for each context independently (see the sketch below)
The regret bound is O(√(TNK log K))¹ (w/ EXP3)
Approach 2: assume the policy space is finite (|H| = M)
Run an MAB algorithm (e.g., EXP3) on policies instead of arms
The regret bound is O(√(TM log M)) (w/ EXP3)

¹ \sum_{c=1}^{N} O(√(n_c K log K)) ≤ O(√(TN) · √(K log K)), where n_c is the number of times context c is observed (by the Cauchy–Schwarz inequality).
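A minimal sketch of Approach 1: keep one independent MAB learner per observed context. The tiny epsilon-greedy learner below is a stand-in assumption; the slide uses EXP3, and any MAB algorithm can be plugged in through the factory argument.

```python
import random

class EpsilonGreedyMAB:
    """Tiny stand-in MAB learner (the slide uses EXP3; any MAB algorithm works here)."""
    def __init__(self, n_arms, eps=0.1):
        self.eps, self.counts, self.means = eps, [0] * n_arms, [0.0] * n_arms
    def select(self):
        if random.random() < self.eps:
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=self.means.__getitem__)
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class PerContextBandit:
    """Approach 1: run an independent MAB instance for every distinct context."""
    def __init__(self, n_arms, make_learner=EpsilonGreedyMAB):
        self.n_arms, self.make_learner, self.learners = n_arms, make_learner, {}
    def select(self, context):
        learner = self.learners.setdefault(context, self.make_learner(self.n_arms))
        return learner.select()
    def update(self, context, arm, reward):
        self.learners[context].update(arm, reward)

# Usage: contexts are hashable labels; rewards depend on (context, arm).
agent = PerContextBandit(n_arms=2)
for t in range(10000):
    ctx = random.choice(["morning", "evening"])
    arm = agent.select(ctx)
    best = 0 if ctx == "morning" else 1
    agent.update(ctx, arm, float(random.random() < (0.8 if arm == best else 0.3)))
```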
Stochastic Contextual Bandit
UCB & Thompson Sampling
Review: Index Policy and Greedy Algorithm
Since the Gittins index¹, index policies have become one of the most popular
strategies for MAB problems
Idea: at each time t, define a score s_{i,t} (= index) for each arm i, and
select the arm with the highest score
Question: how do we define a proper s_{i,t}?
Naïve approach: use the empirical mean²! (greedy algorithm)
However, the naïve greedy algorithm may incur O(T) regret

¹ Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.
² Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means
accurately and quickly (explore–exploit dilemma).
Review: UCB1
Assume r_{i,t} ∼ P_i with support [0, 1] and mean μ_i
Idea: select seldom-selected arms more often and frequently-selected arms less often.
In other words, add a confidence bonus¹!
UCB1²: define the score as

s_{i,t} = \hat{\mu}_{i,t} + \sqrt{ \frac{2 \log t}{n_{i,t}} }

where μ̂_{i,t} is the empirical mean and n_{i,t} is the number of times arm i has been selected
The UCB1 policy guarantees the optimal regret O(log T)
Also, there are other choices for UCB (e.g., KL-UCB³, Bayes-UCB⁴); a minimal UCB1 sketch follows below

¹ We call this bonus the UCB (upper confidence bound). Thus, score = estimated mean + UCB.
² Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
³ Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.
⁴ Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
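A minimal, illustrative UCB1 sketch; the Bernoulli test environment in the usage example is an assumption, not part of the algorithm.

```python
import math
import random

class UCB1:
    """Minimal UCB1 index policy for K arms with rewards in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # n_{i,t}: times arm i was pulled
        self.means = [0.0] * n_arms  # empirical mean reward of arm i

    def select(self, t):
        # Pull each arm once before using the index.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # Score = empirical mean + sqrt(2 log t / n_i).
        scores = [m + math.sqrt(2 * math.log(t) / n)
                  for m, n in zip(self.means, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Usage on a toy Bernoulli bandit.
probs = [0.2, 0.5, 0.7]
agent = UCB1(len(probs))
for t in range(1, 10001):
    arm = agent.select(t)
    agent.update(arm, float(random.random() < probs[arm]))
```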
LinUCB
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ*) where E[r_{i,t}] = c_{i,t}^T θ* (c_{i,t}, θ* ∈ R^d)
Like UCB1, we want to define the score as

s_{i,t} = c_{i,t}^T \hat{\theta}_t + UCB_{i,t}

Question: how do we choose a proper UCB_{i,t}?
LinUCB
Idea: let θ̂_t be the ridge regression estimator of θ*

\hat{\theta}_t = (C_t^T C_t + \lambda I_d)^{-1} C_t^T R_t

where C_t is the matrix whose rows are c_1, ..., c_{t−1} and R_t = (r_1, ..., r_{t−1})
Then, the following inequality holds with probability 1 − δ (for all i and t ≤ T):

\big| c_{i,t}^T \hat{\theta}_t - c_{i,t}^T \theta^* \big| \le (\epsilon + 1) \sqrt{ c_{i,t}^T A_t^{-1} c_{i,t} }

where A_t = C_t^T C_t + I_d and \epsilon = \sqrt{ \tfrac{1}{2} \log \tfrac{2TK}{\delta} }
LinUCB
LinUCB¹: define the score as

s_{i,t} = c_{i,t}^T \hat{\theta}_t + \alpha \sqrt{ c_{i,t}^T A_t^{-1} c_{i,t} }

The regret bound (with probability 1 − δ) is

O\Big( d \sqrt{ T \log \tfrac{1 + T}{\delta} } \Big)

so the LinUCB policy guarantees the optimal regret Õ(d√T)
Also, there are other choices for UCB (e.g., LinRel², CoFineUCB³); a LinUCB sketch follows below

¹ Li et al. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
² Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
³ Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.
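A minimal LinUCB-style sketch under the shared-parameter model above; the value of α, the ridge parameter, and the synthetic test problem are illustrative assumptions (the paper's disjoint variant keeps a separate θ per arm).

```python
import numpy as np

class LinUCB:
    """Minimal shared-parameter LinUCB sketch: one theta, per-arm contexts in R^d."""

    def __init__(self, d, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = lam * np.eye(d)       # A_t = C_t^T C_t + lambda * I
        self.b = np.zeros(d)           # C_t^T R_t

    def select(self, contexts):
        # contexts: array of shape (K, d), one context vector per arm.
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # ridge-regression estimate of theta*
        bonus = np.sqrt(np.einsum('kd,de,ke->k', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, context, reward):
        self.A += np.outer(context, context)
        self.b += reward * context

# Usage on a synthetic linear bandit.
rng = np.random.default_rng(0)
d, K, theta_star = 5, 4, np.random.default_rng(1).normal(size=5)
agent = LinUCB(d)
for t in range(2000):
    C = rng.normal(size=(K, d))
    arm = agent.select(C)
    reward = C[arm] @ theta_star + 0.1 * rng.normal()
    agent.update(C[arm], reward)
```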
Review: Thompson Sampling
Another popular strategy for MAB is Thompson Sampling¹
It can be applied to both contextual and non-contextual bandits
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ*) with prior θ* ∼ P(θ)
Idea: sample the estimate θ̂_t from the posterior distribution (a sketch follows below)
step 1. draw θ_t from the posterior P(θ | D = {(c_t, a_t, r_t)})
step 2. select the arm a_t = arg max_i E[r_{i,t} | c_{i,t}, θ_t]
The idea is simple, but it works well both in theory² and in practice³

¹ Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.
Biometrika, 1933.
² Agrawal et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
³ Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.
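A minimal non-contextual Thompson Sampling sketch with Bernoulli rewards and Beta(1, 1) priors; the reward model is an illustrative assumption, and the contextual case swaps in a posterior over θ as on the next slide.

```python
import random

class BernoulliTS:
    """Minimal Thompson Sampling sketch for Bernoulli rewards with Beta priors."""

    def __init__(self, n_arms):
        self.alpha = [1] * n_arms  # successes + 1
        self.beta = [1] * n_arms   # failures + 1

    def select(self):
        # Step 1: draw a mean estimate for each arm from its posterior.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        # Step 2: play the arm whose sampled mean is largest.
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Usage on a toy Bernoulli bandit.
probs = [0.2, 0.5, 0.7]
agent = BernoulliTS(len(probs))
for _ in range(10000):
    arm = agent.select()
    agent.update(arm, random.random() < probs[arm])
```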
LinTS
Assume r_{i,t} ∼ N(c_{i,t}^T θ*, v²) and θ* ∼ N(θ̂_t, v² B_t^{-1}), where

B_t = \sum_{\tau=1}^{t-1} c_{i,\tau} c_{i,\tau}^T + I_d, \qquad \hat{\theta}_t = B_t^{-1} \sum_{\tau=1}^{t-1} c_{i,\tau} r_{i,\tau}

r_{i,t} \in [\bar{r}_{i,t} - R, \ \bar{r}_{i,t} + R], \qquad v = R \sqrt{ \tfrac{24}{\epsilon} \, d \log \tfrac{t}{\delta} }

Then the posterior of θ* is N(θ̂_{t+1}, v² B_{t+1}^{-1})
LinTS¹: run Thompson Sampling under these assumptions
The regret bound (with probability 1 − δ) is

O\Big( \tfrac{d^2}{\epsilon} \sqrt{T^{1+\epsilon}} \, \log(Td) \log\tfrac{1}{\delta} \Big)

¹ Agrawal et al. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.
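A minimal LinTS sketch with a Gaussian likelihood and Gaussian prior; the constant v and the synthetic problem are illustrative choices (the analysis sets v from R, d, ε, and δ as above).

```python
import numpy as np

class LinTS:
    """Minimal Linear Thompson Sampling sketch: sample theta from N(theta_hat, v^2 B^{-1})."""

    def __init__(self, d, v=0.5):
        self.v = v
        self.B = np.eye(d)      # B_t = sum c c^T + I
        self.f = np.zeros(d)    # sum c * r

    def select(self, contexts, rng):
        B_inv = np.linalg.inv(self.B)
        theta_hat = B_inv @ self.f
        # Draw theta_t from the posterior, then act greedily with respect to it.
        theta_t = rng.multivariate_normal(theta_hat, self.v**2 * B_inv)
        return int(np.argmax(contexts @ theta_t))

    def update(self, context, reward):
        self.B += np.outer(context, context)
        self.f += reward * context

# Usage on a synthetic linear bandit.
rng = np.random.default_rng(0)
d, K, theta_star = 5, 4, np.random.default_rng(1).normal(size=5)
agent = LinTS(d)
for t in range(2000):
    C = rng.normal(size=(K, d))
    arm = agent.select(C, rng)
    agent.update(C[arm], C[arm] @ theta_star + 0.1 * rng.normal())
```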
UCB & TS: Nonlinear Case
Assume E[r_{i,t}] = f(c_{i,t}) for a general nonlinear function f
If we assume f is a member of an exponential family, we can use GLM-UCB¹
If we assume f is sampled from a Gaussian process, we can use GP-UCB²/CGP-UCB³
If we assume f is an element of a reproducing kernel Hilbert space, we can use KernelUCB⁴
Also, we can use Thompson Sampling if we know the form of the probability distribution

¹ Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.
² Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
³ Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.
⁴ Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.
Stochastic Contextual Bandit
Arbitrary Set of Policies
Epoch-Greedy
Assume the policy space H is finite¹
Idea: explore for T′ steps and exploit for the remaining T − T′ steps (epsilon-first)
issue 1. how do we get an unbiased estimate of the best policy?
issue 2. how do we balance exploration and exploitation if we don't know T?
trick 1: use the data D = {(c_t, a_t, r_t)} observed in the exploration steps to pick

\hat{\pi} = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} \frac{ r_{a_t} \, \mathbb{1}\{\pi(c_t) = a_t\} }{ 1/K }

trick 2: run epsilon-first in mini-batches (a partition of T)

¹ The infinite case with finite VC-dimension can be handled in a similar way
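A minimal sketch of trick 1: evaluating policies on uniformly-explored data with inverse-propensity weights (propensity 1/K). The toy data-generating process and the threshold policies are illustrative assumptions.

```python
import random

def ips_policy_value(policy, data, K):
    """Inverse-propensity estimate of a policy's value from uniform-exploration data."""
    return sum(r * (policy(c) == a) / (1.0 / K) for (c, a, r) in data)

def pick_best_policy(policies, data, K):
    """Return the policy maximizing the IPS estimate (the arg max in the slide)."""
    return max(policies, key=lambda pi: ips_policy_value(pi, data, K))

# Usage with toy data: contexts are floats, arms are 0/1, policies are threshold rules.
K = 2
data = []
for _ in range(5000):
    c = random.uniform(-1, 1)
    a = random.randrange(K)                   # uniform exploration
    p = 0.8 if a == int(c > 0) else 0.2       # the arm matching the sign of c is better
    data.append((c, a, float(random.random() < p)))
policies = [lambda c, th=th: int(c > th) for th in (-0.5, 0.0, 0.5)]
best = pick_best_policy(policies, data, K)    # should recover the threshold-at-0 rule
```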
Epoch-Greedy
Epoch-Greedy¹: combine trick 1 & trick 2
The regret bound is Õ(T^{2/3}) (not optimal!)

¹ Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.
RandomizedUCB
Idea: estimate a distribution P_t over the policy space H
RandomizedUCB¹:
The regret bound is Õ(√T), but the time complexity is O(T^6)

¹ Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.
ILOVECONBANDITS
Idea: similar to RandomizedUCB, but with improved time complexity
ILOVECONBANDITS¹ (Importance-weighted LOw-Variance
Epoch-Timed Oracleized CONtextual BANDITS):
The regret bound is Õ(√T), and the time complexity is O(T^{1.5})

¹ Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.
Adversarial Contextual Bandit
Review: EXP3
Assume r_{i,t} ∈ [0, 1] is selected by the environment
In the adversarial setting, the agent must select arms randomly
Idea: put more probability mass on arms whose observed rewards are higher
EXP3¹ (EXPonential-weight algorithm for EXPloration and
EXPloitation):
The regret bound is O(√(TK log K))

¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
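A minimal EXP3 sketch. The exploration rate γ is left as a free parameter here (the analysis ties it to K and T), and the fixed reward sequence in the usage example merely stands in for an adversary.

```python
import math
import random

class EXP3:
    """Minimal EXP3 sketch: exponential weights with importance-weighted rewards."""

    def __init__(self, n_arms, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def probs(self):
        total, K = sum(self.weights), len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / K for w in self.weights]

    def select(self):
        p = self.probs()
        return random.choices(range(len(p)), weights=p)[0], p

    def update(self, arm, reward, p):
        # Importance weighting keeps the reward estimate unbiased for unplayed arms.
        x_hat = reward / p[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / len(self.weights))

# Usage: rewards may be chosen adversarially; a fixed sequence here for illustration.
agent = EXP3(3)
for t in range(10000):
    arm, p = agent.select()
    reward = [0.3, 0.5, 0.7][arm]   # stands in for the adversary's reward r_{arm,t}
    agent.update(arm, reward, p)
```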
EXP4
Idea: run EXP3 on policies (experts) instead of arms, with importance weighting done at the arm level
EXP4¹ (EXPonential-weight algorithm for EXPloration and
EXPloitation using EXPert advice):
The regret bound is O(√(TK log N)), but the variance is high

¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
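A minimal EXP4 sketch under that description: weights live on experts, each expert supplies a probability vector over arms, and the importance-weighted reward of the played arm is propagated to every expert in proportion to how strongly it recommended that arm. The constant toy experts in the usage example are illustrative.

```python
import math
import random

class EXP4:
    """Minimal EXP4 sketch: N experts recommend distributions over K arms."""

    def __init__(self, n_experts, n_arms, gamma=0.1):
        self.gamma, self.K = gamma, n_arms
        self.weights = [1.0] * n_experts

    def select(self, advice):
        # advice: list of N probability vectors over the K arms.
        total = sum(self.weights)
        p = [(1 - self.gamma) *
             sum(w * xi[a] for w, xi in zip(self.weights, advice)) / total
             + self.gamma / self.K
             for a in range(self.K)]
        arm = random.choices(range(self.K), weights=p)[0]
        return arm, p

    def update(self, arm, reward, p, advice):
        x_hat = reward / p[arm]                # unbiased estimate for the played arm
        for j, xi in enumerate(advice):
            y_hat_j = xi[arm] * x_hat          # expert j's estimated reward
            self.weights[j] *= math.exp(self.gamma * y_hat_j / self.K)

# Usage: two toy "experts" that always recommend a fixed arm.
agent = EXP4(n_experts=2, n_arms=3)
for t in range(5000):
    advice = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    arm, p = agent.select(advice)
    reward = [0.2, 0.4, 0.8][arm]
    agent.update(arm, reward, p, advice)
```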
EXP4.P
Idea: run EXP4 with a more careful weight update, to make the algorithm stable
EXP4.P¹ (EXP4 with high Probability):
The regret bound is O(√(TK log N)), now with high probability

¹ Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.
Supervised Learning to Contextual Bandit
Supervised Learning to Contextual Bandit
Idea: a contextual bandit can be thought of as a supervised learning problem in which only the reward of the selected arm is observed
Trick: use a randomized algorithm (e.g., epsilon-greedy) and the unbiased reward estimator

\hat{r}_{a_t,t} = \frac{r_{a_t,t}}{p_{a_t}}

instead of the observed reward r_{a_t,t}. Then,

\mathbb{E}[\hat{r}_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}

Using this trick, any supervised learning algorithm can be converted into a contextual bandit algorithm (a sketch follows below)
Banditron and NeuralBandit are examples of this reduction (the latter using neural networks)
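A minimal sketch of the reduction: epsilon-greedy exploration around a per-arm least-squares regressor trained on the importance-weighted rewards. The regressor is a stand-in assumption; any supervised learner (e.g., a perceptron as in Banditron, or neural networks as in NeuralBandit) could be plugged in the same way.

```python
import random
import numpy as np

class EpsilonGreedyIPS:
    """Epsilon-greedy contextual bandit built on per-arm least squares with IPS rewards."""

    def __init__(self, d, n_arms, eps=0.1, lam=1.0):
        self.eps = eps
        self.A = [lam * np.eye(d) for _ in range(n_arms)]   # per-arm X^T X + lambda I
        self.b = [np.zeros(d) for _ in range(n_arms)]       # per-arm X^T y

    def select(self, context):
        K = len(self.A)
        preds = [np.linalg.solve(self.A[a], self.b[a]) @ context for a in range(K)]
        greedy = int(np.argmax(preds))
        arm = random.randrange(K) if random.random() < self.eps else greedy
        p = 1 - self.eps + self.eps / K if arm == greedy else self.eps / K
        return arm, p   # p = probability that the played arm was chosen

    def update(self, context, arm, reward, p):
        r_hat = reward / p                                   # unbiased IPS reward
        self.A[arm] += np.outer(context, context)
        self.b[arm] += r_hat * context

# Usage on a synthetic problem where arm quality depends linearly on the context.
rng = np.random.default_rng(0)
d, K = 5, 3
thetas = rng.normal(size=(K, d))
agent = EpsilonGreedyIPS(d, K)
for t in range(3000):
    c = rng.normal(size=d)
    arm, p = agent.select(c)
    reward = thetas[arm] @ c + 0.1 * rng.normal()
    agent.update(c, arm, reward, p)
```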
Banditron and NeuralBandit
Both Banditron¹ and NeuralBandit² use an epsilon-greedy algorithm with the unbiased reward estimator;
Banditron builds on the (linear) multiclass Perceptron, while NeuralBandit uses a committee of multi-layer perceptrons
Moreover, Banditron uses the 0-1 loss (classification) while NeuralBandit uses the L2 loss (regression)
The regret bound of the original Banditron is O(T^{2/3}), and a second-order variant³ reduced it to Õ(√T)
No theoretical guarantee has been proved for NeuralBandit yet

¹ Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.
² Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.
³ Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML, 2013.
Summary & Reference
Summary
Reference
[Zhou 2015] A Survey on Contextual Multi-armed Bandits. arXiv, 2015.
[Burtini 2015] A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit. arXiv, 2015.
[Bubeck 2012] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv, 2012.