Multi-Armed Bandits and Applications
Sangwoo Mo
KAIST
swmo@kaist.ac.kr
December 23, 2016
Sangwoo Mo (KAIST) Network Workshop December 23, 2016 1 / 28
Papers
Theory
Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem.
Machine Learning, 2002.
Application
Kveton et al. Cascading Bandits: Learning to Rank in the Cascade
Model. ICML, 2015.
Caron & Bhagat. Mixing Bandits: A Recipe for Improved Cold-Start
Recommendations in a Social Network. SNA-KDD, 2013.
Overview
1 Multi-Armed Bandit
2 UCB: The Optimal Algorithm
3 Application 1: Ranking
4 Application 2: Recommendation
Multi-Armed Bandit
What is Multi-Armed Bandit?
One-Armed Bandit = Slot Machine (English slang)
source: infoslotmachine.com
What is Multi-Armed Bandit?
Multi-Armed Bandit = Multiple Slot Machines
Objective: maximize reward in a casino
source: Microsoft Research
Real Motivation
A/B Test, Online Advertisement, etc.
Objective: maximize conversion rate, etc.
source: VWO
Real Motivation
Problem Setting
# of arms K, # of rounds T
For each round t = 1, ..., T:
1. the reward vector r_t = (r_{1,t}, ..., r_{K,t}) is generated
2. the agent chooses an arm i_t ∈ {1, ..., K}
3. the agent receives the reward r_{i_t,t}
Remark: the rewards r_{i,t} of the unchosen arms (i ≠ i_t) are not revealed
We call this partial-observability property the bandit setting
Problem Setting (Stochastic Bandit)
The reward r_{i,t} follows a probability distribution P_i with mean µ_i
Here, the agent should find the arm with the highest µ_i
source: Pandey et al.'s slide
Today, we will only consider the stochastic bandit
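To make the protocol concrete, here is a minimal simulation sketch of a stochastic (Bernoulli) bandit and the interaction loop above, with a uniformly random policy as a placeholder agent; the class and variable names are illustrative, not from the slides.

import random

class BernoulliBandit:
    """Stochastic bandit: arm i pays reward 1 with probability mu[i], else 0."""
    def __init__(self, mu):
        self.mu = mu          # true (hidden) means, one per arm
        self.K = len(mu)

    def pull(self, i):
        # Bandit feedback: only the chosen arm's reward is revealed.
        return 1.0 if random.random() < self.mu[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])
T = 1000
total_reward = 0.0
for t in range(1, T + 1):
    i_t = random.randrange(bandit.K)    # the agent chooses an arm
    total_reward += bandit.pull(i_t)    # and observes only that arm's reward
print(total_reward)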
Objective
Objective: minimize the (expected cumulative) regret

R_T = E[ Σ_{t=1}^{T} (r_{i*,t} − r_{i_t,t}) ] = Σ_{t=1}^{T} (µ* − µ_{i_t}) = Σ_{i=1}^{K} ∆_i n_i

where i* = arg max_i µ_i, ∆_i = µ* − µ_i, and n_i = Σ_{t=1}^{T} 1[i_t = i]

It is shown that the asymptotic lower bound [LR 85] of the regret is

lim_{T→∞} R_T / log T ≥ Σ_{i: ∆_i > 0} ∆_i / KL(P_i || P_{i*})

We say a bandit algorithm is optimal if its regret is O(log T)
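As a sanity check on the identity R_T = Σ_i ∆_i n_i, the regret can be computed directly from the true means and the pull counts; a small sketch (names are illustrative):

def expected_regret(mu, pull_counts):
    """R_T = sum_i Delta_i * n_i, where Delta_i = mu* - mu_i."""
    mu_star = max(mu)
    return sum((mu_star - m) * n for m, n in zip(mu, pull_counts))

# With mu = [0.3, 0.5, 0.7], pulling the arms 400 / 300 / 300 times gives
# 0.4*400 + 0.2*300 + 0.0*300 = 220.
print(expected_regret([0.3, 0.5, 0.7], [400, 300, 300]))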
Exploration-Exploitation Dilemma
Exploration vs Exploitation
exploration: gather more information
exploitation: make the best decision with the given information
Two Naïve Algorithms
Random (= full exploration): choose an arm uniformly at random
Greedy (= full exploitation): choose the empirically best arm
Both algorithms incur linear regret (why? Random never focuses on the best arm, and Greedy can lock onto a suboptimal arm after a few unlucky draws; see the Greedy sketch below)
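A minimal sketch of the Greedy baseline, reusing the BernoulliBandit sketch from the problem-setting slide (one initialization pull per arm; names are illustrative). Since it never explores after initialization, a few unlucky draws can freeze it on a suboptimal arm, which is what makes its regret grow linearly.

def greedy(bandit, T):
    """Full exploitation: always play the arm with the best empirical mean."""
    K = bandit.K
    counts = [1] * K
    sums = [bandit.pull(i) for i in range(K)]      # one initial pull per arm
    for t in range(K + 1, T + 1):
        i_t = max(range(K), key=lambda i: sums[i] / counts[i])   # no bonus
        r = bandit.pull(i_t)
        sums[i_t] += r
        counts[i_t] += 1
    return counts    # pull counts n_i; plug into expected_regret() above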
Exploration-Exploitation Dilemma
Fundamental question of bandit:
How to balance between exploration and exploitation?
source: RSM Discovery
UCB: The Optimal Algorithm
Motivation of UCB
Recall: the Greedy algorithm incurs linear regret
Reason: it chooses the wrong arm with overconfidence
Idea: add a confidence bonus to the estimated mean!
(An arm with a reliable estimate gets a small bonus and is chosen less; a rarely-tried arm gets a large bonus and is chosen more)
source: Garivier & Cappé's slide
UCB1 [ACF 02]
UCB1: choose the arm

i_t = arg max_i [ µ̂_i + √(c log t / n_i) ]

where the second term is the confidence bonus ucb_i.

Theorem
Let r_{i,t} be bounded in [0, 1] and let c = 2. Then the regret of UCB1 satisfies

R_t ≤ [ 8 Σ_{i: µ_i < µ*} (log t / ∆_i) ] + (1 + π²/3) Σ_{i=1}^{K} ∆_i .
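A hedged implementation sketch of UCB1 with c = 2, again reusing the BernoulliBandit sketch from above (each arm is pulled once before the index is used; variable names are illustrative):

import math

def ucb1(bandit, T, c=2.0):
    """UCB1 [ACF 02]: play arg max_i  mu_hat_i + sqrt(c * log t / n_i)."""
    K = bandit.K
    counts = [0] * K      # n_i: number of pulls of arm i
    sums = [0.0] * K      # cumulative reward of arm i
    for t in range(1, T + 1):
        if t <= K:
            i_t = t - 1   # initialization: pull every arm once
        else:
            i_t = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(c * math.log(t) / counts[i]))
        r = bandit.pull(i_t)
        counts[i_t] += 1
        sums[i_t] += r
    return counts

Running it against the [0.3, 0.5, 0.7] example, the pull counts concentrate on the best arm and the regret grows only logarithmically in T, matching the theorem.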
UCB1 [ACF 02]
Sketch of Proof.
If the agent chooses a suboptimal arm i, then

µ̂_i + ucb_i ≥ µ̂* + ucb* .

With some rearrangement, this implies A + B ≥ −C, where

A = µ̂_i − (µ_i + ucb_i),   B = (µ_i + 2 ucb_i) − µ*,   C = (µ* − ucb*) − µ̂* .

Here, at least one of A, B, or C should be nonnegative.
UCB1 [ACF 02]
Sketch of Proof (Cont.)
By the Chernoff–Hoeffding bound, Pr(A ≥ 0) and Pr(C ≥ 0) are both small:

Pr(A ≥ 0) = Pr(µ̂_i ≥ µ_i + ucb_i) ≤ exp(−2 · (2 log t / n_i) · n_i) = t^{−4}

Also, Pr(B ≥ 0) = 0 once n_i ≥ 8 log t / ∆_i², since then

µ_i + 2 ucb_i = µ_i + 2 √(2 log t / n_i) ≤ µ_i + ∆_i = µ*

Combining the two results (the failure events contribute only a constant when summed over t),

E[n_i] ≤ 8 log t / ∆_i² + O(1) = O(log t)
UCB variants
UCB1 achieved optimality in order, but not in the constant
Recall: the asymptotic lower bound [LR 85] of the regret is

lim_{T→∞} R_T / log T ≥ Σ_{i: ∆_i > 0} ∆_i / KL(P_i || P_{i*})

Many UCB variants were proposed to achieve this optimality
Finally, KL-UCB [GC 11] and Bayes-UCB [KCG 12] achieved it
The proof scheme is similar to UCB1:
(1) the UCB term goes to zero by a Chernoff–Hoeffding-type inequality (A/C of UCB1)
(2) the residual term after O(log T) rounds goes to zero (B of UCB1)
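For intuition on KL-UCB, a hedged sketch of its index for Bernoulli rewards: the index of arm i is the largest q ≥ µ̂_i with n_i · kl(µ̂_i, q) ≤ log t, computable by bisection since kl(p, ·) is increasing above p. (The exact exploration threshold in [GC 11] carries an extra log log t term; this simplified version and the function names are my own.)

import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, precision=1e-6):
    """Largest q >= mu_hat such that n * kl(mu_hat, q) <= log t (bisection)."""
    target = math.log(t) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if bernoulli_kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo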
Application 1: Ranking
Learning-to-Rank
In some applications, we need to find the top-K items, not just the single best one
source: Gupta’s tutorial slide
Relation to Multi-Armed Bandit
There is an exploration-exploitation dilemma:
exploration: gather more information about the user's preferences
exploitation: recommend the top-K list with the given information
It is natural to apply a bandit approach
Algorithm [KSWA 2015]
source: original paper
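Since the algorithm figure is not reproduced here, below is a rough sketch of a CascadeUCB1-style ranker in the spirit of [KSWA 2015]; the exploration constant and the helper names are my reconstruction, not taken verbatim from the paper. The idea: keep a UCB index per item, recommend the K items with the largest indices, and update only the items the user actually examined (everything up to and including the first click).

import math

def cascade_recommend(t, stats, K):
    """Return the K items with the largest UCB indices, best first.
    stats[e] = [n_e, s_e]: number of observations and clicks of item e."""
    def index(e):
        n, s = stats[e]
        if n == 0:
            return float("inf")     # unobserved items are tried first
        return s / n + math.sqrt(1.5 * math.log(t) / n)
    return sorted(stats, key=index, reverse=True)[:K]

def cascade_update(stats, ranked_list, click_pos):
    """Cascade feedback: the user scans top-down and clicks the first
    attractive item (click_pos, or None if no click). Items above the click
    are examined but not clicked (0); items below the click are unobserved."""
    last = click_pos if click_pos is not None else len(ranked_list) - 1
    for pos, e in enumerate(ranked_list[:last + 1]):
        stats[e][0] += 1
        if pos == click_pos:
            stats[e][1] += 1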
Application 2: Recommendation
Cold-Start Problem
Collaborative filtering is widely used for recommendation
It is highly effective when there is sufficient data, but suffers when a
new user enters; this is called the cold-start problem
source: Elahi’s survey slide
Relation to Multi-Armed Bandit
There is an exploration-exploitation dilemma:
exploration: gather more information about the user's preferences
exploitation: recommend the best item with the given information
It is natural to apply a bandit approach
Algorithm [CB 2013]
source: original paper
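The figure is again not reproduced; as a rough illustration of the mixing idea only (my own reconstruction of the spirit of [CB 2013], not the paper's exact strategies): for a cold-start user, the UCB statistics of each item can be warm-started by pooling feedback already collected from the user's social neighbors.

import math

def mixed_ucb_index(item, t, user_stats, neighbor_stats):
    """Illustrative 'mixing' index for a cold-start user: pool the user's own
    (pulls, successes) counts for this item with the neighbors' counts, then
    apply a standard UCB1-style index to the pooled statistics."""
    n, s = user_stats.get(item, (0, 0))
    for stats in neighbor_stats:          # one stats dict per social neighbor
        n_v, s_v = stats.get(item, (0, 0))
        n, s = n + n_v, s + s_v
    if n == 0:
        return float("inf")               # completely unseen item: try it
    return s / n + math.sqrt(2.0 * math.log(t) / n)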
Take Home Message
Bandits are an interesting topic for both theorists and practitioners
The core of the bandit problem is partial observability and the exploration-exploitation dilemma
UCB is one great idea for attacking the problem
Single message to keep in mind:
If your research encounters an exploration-exploitation dilemma,
consider applying a bandit approach!
