Multi-Armed Bandits and Applications
Sangwoo Mo
KAIST
swmo@kaist.ac.kr
December 23, 2016
Sangwoo Mo (KAIST) Network Workshop December 23, 2016 1 / 28
Papers
Theory
Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem.
Machine Learning, 2002.
Application
Kveton et al. Cascading Bandits: Learning to Rank in the Cascade
Model. ICML, 2015.
Caron & Bhagat. Mixing Bandits: A Recipe for Improved Cold-Start
Recommendations in a Social Network. SNA-KDD, 2013.
Overview
1 Multi-Armed Bandit
2 UCB: The Optimal Algorithm
3 Application 1: Ranking
4 Application 2: Recommendation
Multi-Armed Bandit
What is Multi-Armed Bandit?
One-Armed Bandit = Slot Machine (English slang)
source: infoslotmachine.com
What is Multi-Armed Bandit?
Multi-Armed Bandit = Multiple Slot Machines
Objective: maximize reward in a casino
source: Microsoft Research
Real Motivation
A/B Test, Online Advertisement, etc.
Objective: maximize conversion rate, etc.
source: VWO
Real Motivation
Problem Setting
# of arms K, # of rounds T
For each round t = 1, ..., T:
1. the reward vector r_t = (r_{1,t}, ..., r_{K,t}) is generated
2. the agent chooses an arm i_t ∈ {1, ..., K}
3. the agent receives the reward r_{i_t,t}
Remark: the rewards r_{i,t} of the unchosen arms (i ≠ i_t) are not revealed
We call this partial-observability property the bandit setting
Problem Setting (Stochastic Bandit)
The reward r_{i,t} follows a probability distribution P_i with mean µ_i
Here, the agent should find the arm with the highest µ_i
source: Pandey et al.'s slide
Today, we will only consider the stochastic bandit
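To make the protocol concrete, here is a minimal simulation sketch of a stochastic (Bernoulli) bandit and the interaction loop above, with a uniformly random policy as a placeholder agent; the class and variable names are illustrative, not from the slides.

import random

class BernoulliBandit:
    """Stochastic bandit: arm i pays reward 1 with probability mu[i], else 0."""
    def __init__(self, mu):
        self.mu = mu          # true (hidden) means, one per arm
        self.K = len(mu)

    def pull(self, i):
        # Bandit feedback: only the chosen arm's reward is revealed.
        return 1.0 if random.random() < self.mu[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])
T = 1000
total_reward = 0.0
for t in range(1, T + 1):
    i_t = random.randrange(bandit.K)    # the agent chooses an arm
    total_reward += bandit.pull(i_t)    # and observes only that arm's reward
print(total_reward)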
Objective
Objective: minimize the (expected cumulative) regret

R_T = E[ Σ_{t=1}^{T} (r_{i*,t} − r_{i_t,t}) ] = Σ_{t=1}^{T} (µ* − µ_{i_t}) = Σ_{i=1}^{K} ∆_i n_i

where i* = arg max_i µ_i, ∆_i = µ* − µ_i, and n_i = Σ_{t=1}^{T} 1[i_t = i]

It is shown that the asymptotic lower bound [LR 85] of the regret is

lim_{T→∞} R_T / log T ≥ Σ_{i: ∆_i > 0} ∆_i / KL(P_i || P_{i*})

We say a bandit algorithm is optimal if its regret is O(log T)
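As a sanity check on the identity R_T = Σ_i ∆_i n_i, the regret can be computed directly from the true means and the pull counts; a small sketch (names are illustrative):

def expected_regret(mu, pull_counts):
    """R_T = sum_i Delta_i * n_i, where Delta_i = mu* - mu_i."""
    mu_star = max(mu)
    return sum((mu_star - m) * n for m, n in zip(mu, pull_counts))

# With mu = [0.3, 0.5, 0.7], pulling the arms 400 / 300 / 300 times gives
# 0.4*400 + 0.2*300 + 0.0*300 = 220.
print(expected_regret([0.3, 0.5, 0.7], [400, 300, 300]))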
Exploration-Exploitation Dilemma
Exploration vs Exploitation
exploration: gather more information
exploitation: make the best decision with the given information
Two Naïve Algorithms
Random (= full exploration): choose an arm uniformly at random
Greedy (= full exploitation): choose the empirically best arm
Both algorithms incur linear regret (why? Random never focuses on the best arm, and Greedy can lock onto a suboptimal arm after a few unlucky draws; see the Greedy sketch below)
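A minimal sketch of the Greedy baseline, reusing the BernoulliBandit sketch from the problem-setting slide (one initialization pull per arm; names are illustrative). Since it never explores after initialization, a few unlucky draws can freeze it on a suboptimal arm, which is what makes its regret grow linearly.

def greedy(bandit, T):
    """Full exploitation: always play the arm with the best empirical mean."""
    K = bandit.K
    counts = [1] * K
    sums = [bandit.pull(i) for i in range(K)]      # one initial pull per arm
    for t in range(K + 1, T + 1):
        i_t = max(range(K), key=lambda i: sums[i] / counts[i])   # no bonus
        r = bandit.pull(i_t)
        sums[i_t] += r
        counts[i_t] += 1
    return counts    # pull counts n_i; plug into expected_regret() above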
Exploration-Exploitation Dilemma
Fundamental question of bandit:
How to balance between exploration and exploitation?
source: RSM Discovery
UCB: The Optimal Algorithm
Motivation of UCB
Recall: the Greedy algorithm incurs linear regret
Reason: it chooses the wrong arm with overconfidence
Idea: add a confidence bonus to the estimated mean!
(An arm with a reliable estimate gets a small bonus and is chosen less; a rarely-tried arm gets a large bonus and is chosen more)
source: Garivier & Cappé's slide
UCB1 [ACF 02]
UCB1: choose the arm

i_t = arg max_i [ µ̂_i + √(c log t / n_i) ]

where the second term is the confidence bonus ucb_i.

Theorem
Let r_{i,t} be bounded in [0, 1] and let c = 2. Then the regret of UCB1 satisfies

R_t ≤ [ 8 Σ_{i: µ_i < µ*} (log t / ∆_i) ] + (1 + π²/3) Σ_{i=1}^{K} ∆_i .
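A hedged implementation sketch of UCB1 with c = 2, again reusing the BernoulliBandit sketch from above (each arm is pulled once before the index is used; variable names are illustrative):

import math

def ucb1(bandit, T, c=2.0):
    """UCB1 [ACF 02]: play arg max_i  mu_hat_i + sqrt(c * log t / n_i)."""
    K = bandit.K
    counts = [0] * K      # n_i: number of pulls of arm i
    sums = [0.0] * K      # cumulative reward of arm i
    for t in range(1, T + 1):
        if t <= K:
            i_t = t - 1   # initialization: pull every arm once
        else:
            i_t = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(c * math.log(t) / counts[i]))
        r = bandit.pull(i_t)
        counts[i_t] += 1
        sums[i_t] += r
    return counts

Running it against the [0.3, 0.5, 0.7] example, the pull counts concentrate on the best arm and the regret grows only logarithmically in T, matching the theorem.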
UCB1 [ACF 02]
Sketch of Proof.
If the agent chooses a suboptimal arm i, then

µ̂_i + ucb_i ≥ µ̂* + ucb* .

With some rearrangement, this implies A + B ≥ −C, where

A = µ̂_i − (µ_i + ucb_i),   B = (µ_i + 2 ucb_i) − µ*,   C = (µ* − ucb*) − µ̂* .

Here, at least one of A, B, or C should be nonnegative.
UCB1 [ACF 02]
Sketch of Proof (Cont.)
By the Chernoff–Hoeffding bound, Pr(A ≥ 0) and Pr(C ≥ 0) are both small:

Pr(A ≥ 0) = Pr(µ̂_i ≥ µ_i + ucb_i) ≤ exp(−2 · (2 log t / n_i) · n_i) = t^{−4}

Also, Pr(B ≥ 0) = 0 once n_i ≥ 8 log t / ∆_i², since then

µ_i + 2 ucb_i = µ_i + 2 √(2 log t / n_i) ≤ µ_i + ∆_i = µ*

Combining the two results (the failure events contribute only a constant when summed over t),

E[n_i] ≤ 8 log t / ∆_i² + O(1) = O(log t)
UCB variants
UCB1 achieved optimality in order, but not in the constant
Recall: the asymptotic lower bound [LR 85] of the regret is

lim_{T→∞} R_T / log T ≥ Σ_{i: ∆_i > 0} ∆_i / KL(P_i || P_{i*})

Many UCB variants were proposed to achieve this optimality
Finally, KL-UCB [GC 11] and Bayes-UCB [KCG 12] achieved it
The proof scheme is similar to UCB1:
(1) the UCB term goes to zero by a Chernoff–Hoeffding-type inequality (A/C of UCB1)
(2) the residual term after O(log T) rounds goes to zero (B of UCB1)
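For intuition on KL-UCB, a hedged sketch of its index for Bernoulli rewards: the index of arm i is the largest q ≥ µ̂_i with n_i · kl(µ̂_i, q) ≤ log t, computable by bisection since kl(p, ·) is increasing above p. (The exact exploration threshold in [GC 11] carries an extra log log t term; this simplified version and the function names are my own.)

import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, precision=1e-6):
    """Largest q >= mu_hat such that n * kl(mu_hat, q) <= log t (bisection)."""
    target = math.log(t) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if bernoulli_kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo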
Application 1: Ranking
Learning-to-Rank
In some applications, we need to find the top-K items, not just the single best one
source: Gupta’s tutorial slide
Relation to Multi-Armed Bandit
There is an exploration-exploitation dilemma:
exploration: gather more information about the user's preferences
exploitation: recommend the top-K list with the given information
It is natural to apply a bandit approach
Algorithm [KSWA 2015]
source: original paper
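Since the algorithm figure is not reproduced here, below is a rough sketch of a CascadeUCB1-style ranker in the spirit of [KSWA 2015]; the exploration constant and the helper names are my reconstruction, not taken verbatim from the paper. The idea: keep a UCB index per item, recommend the K items with the largest indices, and update only the items the user actually examined (everything up to and including the first click).

import math

def cascade_recommend(t, stats, K):
    """Return the K items with the largest UCB indices, best first.
    stats[e] = [n_e, s_e]: number of observations and clicks of item e."""
    def index(e):
        n, s = stats[e]
        if n == 0:
            return float("inf")     # unobserved items are tried first
        return s / n + math.sqrt(1.5 * math.log(t) / n)
    return sorted(stats, key=index, reverse=True)[:K]

def cascade_update(stats, ranked_list, click_pos):
    """Cascade feedback: the user scans top-down and clicks the first
    attractive item (click_pos, or None if no click). Items above the click
    are examined but not clicked (0); items below the click are unobserved."""
    last = click_pos if click_pos is not None else len(ranked_list) - 1
    for pos, e in enumerate(ranked_list[:last + 1]):
        stats[e][0] += 1
        if pos == click_pos:
            stats[e][1] += 1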
Application 2: Recommendation
Cold-Start Problem
Collaborative filtering is widely used for recommendation
It is highly effective when there is sufficient data, but suffers when a
new user enters; this is called the cold-start problem
source: Elahi’s survey slide
Relation to Multi-Armed Bandit
There is an exploration-exploitation dilemma:
exploration: gather more information about the user's preferences
exploitation: recommend the best item with the given information
It is natural to apply a bandit approach
Algorithm [CB 2013]
source: original paper
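The figure is again not reproduced; as a rough illustration of the mixing idea only (my own reconstruction of the spirit of [CB 2013], not the paper's exact strategies): for a cold-start user, the UCB statistics of each item can be warm-started by pooling feedback already collected from the user's social neighbors.

import math

def mixed_ucb_index(item, t, user_stats, neighbor_stats):
    """Illustrative 'mixing' index for a cold-start user: pool the user's own
    (pulls, successes) counts for this item with the neighbors' counts, then
    apply a standard UCB1-style index to the pooled statistics."""
    n, s = user_stats.get(item, (0, 0))
    for stats in neighbor_stats:          # one stats dict per social neighbor
        n_v, s_v = stats.get(item, (0, 0))
        n, s = n + n_v, s + s_v
    if n == 0:
        return float("inf")               # completely unseen item: try it
    return s / n + math.sqrt(2.0 * math.log(t) / n)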
Take Home Message
Bandits are an interesting topic for both theorists and practitioners
The core of the bandit problem is partial observability and the exploration-exploitation dilemma
UCB is one great idea for attacking the problem
Single message to keep in mind:
If your research encounters an exploration-exploitation dilemma,
consider applying a bandit approach!
