Personalized list recommendation based on multi armed bandit algorithms

Personalized List Recommendation based on Multi-armed Bandit Algorithms
Weiwen LIU
Computer Science & Engineering
Chinese University of Hong Kong
wwliu@cse.cuhk.edu.hk

wwliu, Term Presentation, Term 1
Content
o Background
• Existing Methods
• Multi-armed Bandits
• Dependency Click Model
o Algorithm
o Results
o Experiments
o Conclusion and Future Work

Background
o For users:
• How to discover interesting items such as music, news, or apps among a large number of items.
o For companies:
• How to create economic opportunities.
• How to provide better personalized services.

Existing methods
o Content-based Method/Collaborative Filtering
• Pros: performs well when users have enough click or download records
• Cons: cold-start problem
o Context-based Method/Regression
• Pros: efficient and easy to implement
• Cons: lack of diversity
Exploration vs Exploitation?

Multi-armed bandits
o Rewards $x_{i,1}, x_{i,2}, \dots$ of machine $i$ are i.i.d. $[0,1]$-valued random variables
o An allocation policy prescribes which machine $I_t$ to play at time $t$ based on the realization of $x_{I_1,1}, \dots, x_{I_{t-1},t-1}$
o The target is to play as often as possible the machine with the largest reward expectation
$$\mu^* = \max_{i=1,\dots,K} \mathbb{E}[x_i]$$
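The allocation problem above can be illustrated with a small simulation. The sketch below uses UCB1, a standard policy for stochastic bandits that is not taken from these slides, and assumes Bernoulli rewards; the function name `ucb1` and the chosen means are hypothetical.

```python
import math
import random

def ucb1(means, n_rounds, seed=0):
    """Play Bernoulli machines with the UCB1 policy and return play counts.

    means[i] is the (hidden) reward expectation of machine i; it is used
    only to simulate rewards, never by the policy itself.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # number of plays per machine
    sums = [0.0] * K      # total reward per machine
    for t in range(1, n_rounds + 1):
        if t <= K:
            i = t - 1     # play every machine once to initialise
        else:
            # empirical mean plus an exploration bonus that shrinks with plays
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = ucb1([0.1, 0.9], n_rounds=1000)
```

With these means the second machine has the largest expectation, so a reasonable policy should end up playing it far more often than the first.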

Bandit Solutions
o Stochastic Bandits:
• Select items repeatedly and separately, one at each time step
• Limitations: ignores the underlying relations; high computational cost
o Combinatorial Cascade Bandits:
• Select a set or sequence of arms
• Limitations: can only deal with the single-click setting

Click Models
o Cascade Click Model:
• Stops when the first click occurs
• Can only model a single click
o Dependency Click Model:
• Introduces a set of termination parameters
• Can handle settings with multiple clicks

Dependency Click Model
o Allows the user to continue checking more items after a click.
o An extension of the Cascade Model (CM)
• Reduces to CM if the termination weights $\bar{v}(k) = 1$
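[Figure: DCM flowchart. Start; examine next item $a_k$; attracted by the item? (weight $\bar{w}(a_k)$); would like to terminate? (weight $\bar{v}(k)$); reach the end of the list? Each question branches on Yes/No, and the flow ends in either "Satisfied" or "Not satisfied".]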
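The examination process of the Dependency Click Model can be sketched as a short simulation. This is one reading of the model, assuming that the user clicks when attracted, and after each click stops with the termination probability of that position; the function name `simulate_dcm` and its arguments are hypothetical.

```python
import random

def simulate_dcm(attraction, termination, rng):
    """Simulate one user pass over a ranked list under the DCM.

    attraction[k]  : probability that the item at position k attracts a click
    termination[k] : probability that the user stops after a click at position k
    Returns (clicks, satisfied), where clicks holds one 0/1 entry per position.
    """
    clicks = [0] * len(attraction)
    for k, (w, v) in enumerate(zip(attraction, termination)):
        if rng.random() < w:        # attracted: the user clicks
            clicks[k] = 1
            if rng.random() < v:    # satisfied: the user stops here
                return clicks, True
    return clicks, False            # reached the end of the list

# With all termination weights equal to 1 the first click ends the session,
# which is the Cascade Model behaviour.
cascade_like = simulate_dcm([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], random.Random(0))
```

Setting all termination weights to 0 instead lets the user click every attractive item and leave unsatisfied, which shows how multiple clicks arise.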

Problem Formulation
o Given a ground item set $E = \{1, \dots, L\}$, a contextual vector $x_{t,a} \in \mathbb{R}^d$ for each item $a \in E$ is known to the agent at time $t$.
o Attraction weight $\mathbf{w}_t \in \{0,1\}^E$
• $\mathbf{w}_t(a)$ is a Bernoulli random variable with mean $w_t(a)$
• denotes whether the user is attracted by item $a$
• the attraction weights $\{\mathbf{w}_t(a)\}_{t=1}^{n}$ are i.i.d.
o Termination weight $\mathbf{v}_t \in \{0,1\}^K$
• $\mathbf{v}_t(k)$ is a Bernoulli random variable with mean $\bar{v}(k)$
• denotes whether the user wants to terminate examining the list at position $k$
• depends only on the position $k$
• the termination weights $\{\mathbf{v}_t(k)\}_{t=1}^{n}$ are i.i.d.
Recommended list: $\mathbf{A}_t = (\mathbf{a}_1^t, \dots, \mathbf{a}_K^t)$
Feedback: 010 ⋯ 100

Objective
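o The reward function is defined as
$$f(A, v, w) = 1 - \prod_{k=1}^{K} \bigl(1 - v(k)\, w(a_k)\bigr),$$
so that $f(\mathbf{A}_t, \mathbf{v}_t, \mathbf{w}_t) = 1$ if the user clicks on an item, feels satisfied, and terminates the examination.
o The pseudo-regret is defined as
$$\mathcal{R}(n) = \mathbb{E}\Bigl[\sum_{t=1}^{n} \bigl(f(A_t^*, \mathbf{v}_t, \mathbf{w}_t) - f(\mathbf{A}_t, \mathbf{v}_t, \mathbf{w}_t)\bigr)\Bigr]$$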
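The reward function can be evaluated directly from its product form. The sketch below is a minimal illustration; the function name `dcm_reward` is hypothetical, and the weight lists are assumed to be already ordered by list position.

```python
def dcm_reward(v, w):
    """f(A, v, w) = 1 - prod_k (1 - v(k) * w(a_k)).

    v[k] is the termination weight at position k and w[k] is the attraction
    weight of the item placed at position k.
    """
    no_stop = 1.0
    for v_k, w_k in zip(v, w):
        no_stop *= 1.0 - v_k * w_k   # the user does not stop satisfied at k
    return 1.0 - no_stop

# Realised 0/1 weights: click and termination coincide at the second position.
realised = dcm_reward([0, 1, 0], [1, 1, 0])          # 1.0

# Plugging in expected weights gives the per-round gap between two lists.
best = dcm_reward([0.5, 0.5], [0.5, 0.5])            # 0.4375
other = dcm_reward([0.5, 0.5], [0.5, 0.0])           # 0.25
gap = best - other                                   # 0.1875
```

Summing such gaps over the rounds gives the quantity that the pseudo-regret measures in expectation.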

Partial Knowledge
o The click sequence is the only feedback for the agent
• The termination position is unobserved
• The reward is not revealed
[Figure: the click feedback 010011000 over positions 1 to 9, shown twice: once with reward = 1 and once with reward = 0]
Use the feedback up to the last click to update the model
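The update rule above can be sketched as a small helper that keeps only the trustworthy part of a click vector. It reads the rule as "up to and including the last click", and assumes that a list with no click was examined in full, since under the DCM the user only stops after a click; the name `observed_prefix` is hypothetical.

```python
def observed_prefix(clicks):
    """Return the click bits that are valid attraction observations.

    Positions up to and including the last click were certainly examined.
    Positions after the last click are ambiguous: the user either stopped
    satisfied or kept examining without being attracted.
    """
    if 1 not in clicks:
        return list(clicks)                       # whole list was examined
    last = len(clicks) - 1 - clicks[::-1].index(1)
    return clicks[: last + 1]

usable = observed_prefix([0, 1, 0, 0, 1, 1, 0, 0, 0])   # [0, 1, 0, 0, 1, 1]
```

For the feedback 010011000 the last click is at position 6, so positions 7 to 9 are dropped from the update.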

Proposed Model: attraction weight
o Assume the expected attraction weight $w_t(a)$ follows
$$w_t(a) = \mathbb{E}[\mathbf{w}_t(a) \mid \mathcal{H}_t] = \mu(\theta_*^\top x_{t,a})$$
o Use the generalized linear model as a flexible extension
• Admits a wider range of distributions, e.g. Gaussian, binomial, Poisson, ...
Attracted or not?

Proposed Model: termination weight
o Due to the limited feedback, we assume that the order of the expected termination weights is known
• For simplicity of explanation, assume $\bar{v}(1) \ge \dots \ge \bar{v}(K)$
o The expected reward is maximized by placing the more attractive items at the higher positions.
Terminate or not?
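The ranking rule can be checked numerically on a toy instance. The sketch below compares the greedy list (items sorted by attraction) with a brute-force search over all ordered lists; the weights are made-up values, the termination weights are assumed non-increasing in the position, and the helper names are hypothetical.

```python
from itertools import permutations

def expected_reward(items, w_bar, v_bar):
    """1 - prod_k (1 - v_bar[k] * w_bar[items[k]]) for an ordered list."""
    no_stop = 1.0
    for k, a in enumerate(items):
        no_stop *= 1.0 - v_bar[k] * w_bar[a]
    return 1.0 - no_stop

def greedy_list(w_bar, K):
    """Put the K most attractive items first, most attractive on top."""
    return sorted(range(len(w_bar)), key=lambda a: w_bar[a], reverse=True)[:K]

w_bar = [0.1, 0.7, 0.4, 0.9, 0.2]   # made-up expected attraction weights
v_bar = [0.8, 0.5, 0.3]             # non-increasing termination weights

greedy = greedy_list(w_bar, len(v_bar))
brute = max(permutations(range(len(w_bar)), len(v_bar)),
            key=lambda A: expected_reward(A, w_bar, v_bar))
```

On this instance both approaches select items 3, 1, 2 in that order. The reason is a pairwise exchange argument: swapping two items so that the more attractive one sits at the position with the larger termination weight never decreases the expected reward.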

Proposed Model: parameter estimation
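o The model parameter $\theta$ can be estimated using MLE, i.e. by solving
$$\sum_{s=1}^{t} \sum_{k=1}^{C_s} \bigl(\mathbf{w}_s(\mathbf{a}_k^s) - \mu(\theta^\top x_{s,\mathbf{a}_k^s})\bigr)\, x_{s,\mathbf{a}_k^s} = 0,$$
where $C_s$ is the position of the last click in round $s$.
o Upper Confidence Bound (UCB):
$$U_t(a) = \min\bigl\{\mu(\tilde{\theta}_{t-1}^\top x_{t,a}) + \rho(t-1)\,\|x_{t,a}\|_{V_{t-1}^{-1}},\ 1\bigr\},$$
where $\beta_t^a(\delta) = \rho(t)\,\|x_{t,a}\|_{V_t^{-1}}$.
Lemma: For any $t \ge 1$ and $a \in E$, denote
$$\beta_t^a(\delta) = \frac{2k_\mu}{c_\mu}\,\|x_{t,a}\|_{V_t^{-1}} \sqrt{\log\frac{\bigl(1 + \frac{Kt}{\lambda d}\bigr)^{d}}{\delta^2}}.$$
For all $0 \le \delta \le 1$, with probability at least $1 - \delta$, it holds that
$$\bigl|\mu(\theta_*^\top x_{t,a}) - \mu(\tilde{\theta}_t^\top x_{t,a})\bigr| \le \beta_t^a(\delta), \quad \forall t \ge 1.$$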
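The MLE estimating equation can be solved numerically. The sketch below assumes the logistic link $\mu(z) = 1/(1+e^{-z})$ and uses plain gradient ascent on the log-likelihood, whose gradient is exactly the left-hand side of the equation; the toy data and the names `score` and `fit_mle` are hypothetical, and the slides do not say which solver the author used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(theta, data):
    """Left-hand side of the estimating equation: sum (w - mu(theta^T x)) x."""
    g = [0.0] * len(theta)
    for x, w in data:
        err = w - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            g[i] += err * xi
    return g

def fit_mle(data, d, step=1.0, iters=1000):
    """Gradient ascent on the mean log-likelihood until the score vanishes."""
    theta = [0.0] * d
    for _ in range(iters):
        g = score(theta, data)
        theta = [t + step * gi / len(data) for t, gi in zip(theta, g)]
    return theta

# Toy observations (context, click): click rate 1/3 for x = (1, 0)
# and 2/3 for x = (1, 1).
data = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([1.0, 0.0], 0),
        ([1.0, 1.0], 1), ([1.0, 1.0], 1), ([1.0, 1.0], 0)]
theta_hat = fit_mle(data, d=2)
```

On this toy data the solution can be verified by hand: the fitted click probabilities must equal the empirical rates 1/3 and 2/3, which gives $\theta = (\ln\tfrac12,\ 2\ln 2)$.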

Proposed Model: UCB
o Analyze the mean and a measure of uncertainty (variance) for each item
o Make decisions based on mean + variance
[Figure: items A, B and C on an axis with ticks at 0, 0.2 and 0.4]
Proposed Model: UCB
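o The confidence width $\rho(t)\,\|x_{t,a}\|_{V_t^{-1}}$ of an item decreases as the item is observed more often
o The uncertainty of item $A$ reduces after several time steps
o Automatically balances exploration and exploitation
[Figure: items A, B and C on an axis with ticks at 0, 0.2 and 0.4]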

Proposed Model: Algorithm
[Diagram: a loop over three steps: Recommend based on UCB, Estimate $\theta$, Update statistics]
Theoretical Results
o The regret upper bound is $O(d\sqrt{n}\log n)$, which depends linearly on the dimension $d$ of the feature space, but not on the number $L$ of base arms.
Theorem: If the reward function is given as $f(A, v, w) = 1 - \prod_{k=1}^{K} (1 - v(k)\, w(a_k))$, then the cumulative regret $\mathcal{R}(n)$ of the proposed algorithm has the following bound,
$$\mathcal{R}(n) \le \frac{4K\Delta_v k_\mu}{c_\mu p^*} \sqrt{dn(K+1)\, \log\frac{\bigl(1 + \frac{Kn}{\lambda d}\bigr)^{d}}{\delta^2}\; \log\Bigl(1 + \frac{Kn}{\lambda d}\Bigr)},$$
where $k_\mu$ is the Lipschitz constant of $\mu$ and $c_\mu = \inf \mu'$.

Experimental Results
o Synthetic data
• L=200, K=4 and d=10
• $\mu(x) = \frac{1}{1 + \exp(-x)}$
o GL-CDCM outperforms KL-DCM by 80.27% and Lin-CDCM by 49.04%.

Experimental Results
o Real-world data
• MovieLens 20M dataset
• L=200, K=5, d=100
o The result of GL-CDCM is 5.69 times that of KL-DCM and 1.45 times that of Lin-CDCM

Conclusion
o Conclusion
• Formulate the DCM bandits problem
• Incorporate contextual information
• Make a weaker assumption on the expected attraction weight function
• Prove an upper bound on the regret
o Future work
• Prove a tighter bound
• Consider other practical click models
• Verify the effectiveness on more real-world datasets
