Evaluating decision-aware recommender systems

Rus M. Mesas, Alejandro Bellogín
Universidad Autónoma de Madrid
Spain
RecSys, August 2017

Main idea
▪ How to balance coverage and precision
Method  Precision  Coverage  Best?
R1      0.093      100%
R2      0.094      97.8%

Method  Precision  Coverage  Best?
R1      0.037      100%
R2      0.133      100%
R3      0.245      99.7%

Method  Precision  Coverage  Best?
R1      0.093      100%
R2      0.181      95.6%     ?
R3      0.283      59.0%     ?
R4      0.326      28.2%     ?

Main idea
▪ To force different coverage levels, we allow
recommenders to decide if a recommendation is
worthy of being presented to the user or not
[Figure: estimations]

Balancing coverage and precision
▪ [Herlocker et al. 2004]: “there is no general
coverage metric that, at the same time, gives more
weight to relevant items when accounting for
coverage, and combines coverage and accuracy
measures”
▪ [Gunawardana & Shani 2015] leave the problem of
balancing coverage and precision as an open issue
in the area

Combination metrics
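
A minimal sketch of what a combination metric can look like, assuming two standard ways of merging a precision value and a coverage value: the harmonic mean (F-score style) and the geometric mean. Both formulas and all function names are assumptions chosen for illustration. The figures are those of R1 to R4 in the last table under Main idea, with coverage written as a fraction.

```python
from math import sqrt


def harmonic_mean(precision, coverage):
    """F-score style combination (assumed form); 0 when both inputs are 0."""
    total = precision + coverage
    return 0.0 if total == 0 else 2 * precision * coverage / total


def geometric_mean(precision, coverage):
    """Geometric mean of the two values (assumed form)."""
    return sqrt(precision * coverage)


# (precision, coverage) pairs for R1..R4, coverage as a fraction.
systems = {
    "R1": (0.093, 1.000),
    "R2": (0.181, 0.956),
    "R3": (0.283, 0.590),
    "R4": (0.326, 0.282),
}


def rank_systems(combine):
    """Order system names from best to worst under a combination function."""
    return sorted(systems, key=lambda name: combine(*systems[name]), reverse=True)


best_by_harmonic = rank_systems(harmonic_mean)[0]
best_by_geometric = rank_systems(geometric_mean)[0]
```

Under these assumed formulas the two combinations pick different winners (R3 for the harmonic mean, R2 for the geometric mean), so combining precision and coverage does not by itself settle which system is best.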

Our proposal: Correctness metric
▪ Adapted from Question Answering:
• Several questions to be answered by a system
• Each question has several options
• Only one option is correct
• If an answer is not given, it should not be counted as
an incorrect answer
• Hence, if two systems have the same number of correct
answers but one has failed fewer questions (because it decided
not to respond), it should be rated better than the other one
A. Peñas & Á. Rodrigo. 2011. A simple measure to assess non response. ACL.
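
The question-answering measure cited above is c@1, which credits unanswered questions at the system's own accuracy rate: c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correct answers, n_U the number of unanswered questions and n the total number of questions. A minimal sketch follows; the function and argument names are illustrative.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_R + n_U * n_R / n) / n.

    Unanswered questions are credited at the accuracy observed
    over all questions, so abstaining beats answering wrongly.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0:
        raise ValueError("counts must be non-negative")
    if n_correct + n_unanswered > n_total:
        raise ValueError("correct + unanswered cannot exceed the total")
    accuracy = n_correct / n_total
    return (n_correct + n_unanswered * accuracy) / n_total


# Two systems, 10 questions, 4 correct answers each.
answers_everything = c_at_1(n_correct=4, n_unanswered=0, n_total=10)
abstains_three_times = c_at_1(n_correct=4, n_unanswered=3, n_total=10)
```

With no unanswered questions c@1 reduces to plain accuracy (0.4 here), while the system that abstains on three of its six failures scores 0.52.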

Correctness metric for recommendation
▪ Each recommendation algorithm is a system
▪ Each candidate item to be ranked is a question
▪ If an item is recommended, it could be relevant or
not
▪ The same set of items is presented to each system
[Table with columns: Recommended list, Precision@5, Correctness]
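
A minimal sketch of how precision and a correctness-style score can differ on the same list. It assumes a simplified transposition of c@1 in which the N slots of a top-N list play the role of questions and every slot the system leaves empty counts as unanswered. This mapping, the function names and the example items are assumptions for illustration, not the definitions of the metrics proposed in the talk.

```python
def precision_at(recommended, relevant, cutoff):
    """Hits in the top `cutoff` divided by `cutoff` (empty slots count as misses)."""
    hits = sum(1 for item in recommended[:cutoff] if item in relevant)
    return hits / cutoff


def correctness_at(recommended, relevant, cutoff):
    """c@1-style score: empty slots are credited at the list's own hit rate."""
    shown = recommended[:cutoff]
    hits = sum(1 for item in shown if item in relevant)
    empty_slots = cutoff - len(shown)
    return (hits + empty_slots * hits / cutoff) / cutoff


relevant = {"a", "b"}
full_list = ["a", "x", "b", "y", "z"]   # two hits, three bad recommendations
short_list = ["a", "b"]                 # two hits, three slots left empty

p_full = precision_at(full_list, relevant, 5)
p_short = precision_at(short_list, relevant, 5)
c_full = correctness_at(full_list, relevant, 5)
c_short = correctness_at(short_list, relevant, 5)
```

Both lists obtain Precision@5 = 0.4, but the list that withholds its three doubtful items scores 0.64 under this sketch against 0.4 for the full list.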

Correctness metrics for recommendation
▪ Four instantiations:
• Based on users (UC, RUC)
• Based on items (IC, RIC)

What about the decision-aware
recommenders?
[Figure: estimations]

Decision-aware recommender systems
▪ Exploiting the confidence a system has in its own
recommendations
▪ Not completely new
• Significance weighting
• Support and confidence in case-based recommenders
▪ Focus on Collaborative Filtering algorithms
• Support of prediction score of nearest-neighbour
methods
• Uncertainty in prediction score of a probabilistic matrix
factorisation algorithm

Estimating confidence in
decision-aware recommendation
▪ For user-based KNN
• Have at least n (out of k) neighbours participated in
the rating estimation?
▪ For probabilistic MF
• Is the uncertainty in the prediction score small enough?
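
The support rule for user-based KNN can be sketched as follows, assuming a similarity-weighted average as the rating estimator. The function name, argument layout and example values are hypothetical; only the "at least n out of k neighbours" condition comes from the slide.

```python
def knn_predict_with_support(neighbour_ratings, similarities, min_support):
    """Predict a rating from k neighbours, or abstain.

    neighbour_ratings: rating each of the k neighbours gave the target
        item, or None when that neighbour has not rated it.
    similarities: similarity of each neighbour to the target user.
    min_support: n, the minimum number of neighbours that must have rated.
    Returns the weighted average, or None when the decision is to abstain.
    """
    rated = [(s, r) for s, r in zip(similarities, neighbour_ratings) if r is not None]
    if len(rated) < min_support:
        return None  # not enough support: do not recommend this item
    weight = sum(abs(s) for s, _ in rated)
    if weight == 0:
        return None
    return sum(s * r for s, r in rated) / weight


# k = 5 neighbours, only two of them rated the item.
ratings = [4, None, 5, None, None]
sims = [0.9, 0.8, 0.6, 0.5, 0.4]

with_n2 = knn_predict_with_support(ratings, sims, min_support=2)
with_n3 = knn_predict_with_support(ratings, sims, min_support=3)
```

With n = 2 the item receives an estimate of 4.4, while n = 3 makes the recommender abstain, which is how raising n trades coverage for confidence.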

Experimental setup
▪ Datasets
• MovieLens 100K, MovieLens 1M, Jester
• Random 5-fold training/test split
▪ Evaluation
• Generate a ranking with every item in the test set
• Metrics at cutoff 10: precision (P), user space coverage
(USC), item space coverage (ISC), correctness (UC, RUC,
IC, RIC), novelty (EPC), diversity (AggrDiv)
▪ Frameworks
• RankSys: evaluation metrics, KNN recommenders
• RiVal: data splitting
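
A minimal sketch of the two coverage metrics, assuming that user space coverage is the fraction of users who receive at least one recommendation and item space coverage is the fraction of catalogue items appearing in at least one top-N list. These definitions and the function names are assumptions for illustration, and the code does not use the RankSys or RiVal APIs.

```python
def user_space_coverage(recommendations, all_users):
    """Fraction of users with a non-empty recommendation list (assumed definition)."""
    served = sum(1 for user in all_users if recommendations.get(user))
    return served / len(all_users)


def item_space_coverage(recommendations, all_items, cutoff=10):
    """Fraction of catalogue items present in some top-`cutoff` list (assumed definition)."""
    recommended = {
        item for items in recommendations.values() for item in items[:cutoff]
    }
    return len(recommended & set(all_items)) / len(all_items)


users = ["u1", "u2", "u3", "u4"]
items = ["i1", "i2", "i3", "i4", "i5"]
# u3 gets an empty list and u4 gets no list at all.
recs = {"u1": ["i1", "i2"], "u2": ["i2"], "u3": []}

usc = user_space_coverage(recs, users)
isc = item_space_coverage(recs, items)
```

In this toy example two of four users are served (USC = 0.5) and two of five items are ever recommended (ISC = 0.4).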

Performance: prediction uncertainty

Impact on novelty and diversity
▪ Prediction uncertainty
• Stricter constraints (smaller uncertainty) decrease
novelty and diversity
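
A minimal sketch of the uncertainty constraint, assuming each prediction comes with a mean score and a standard deviation and that an item is kept only when its standard deviation does not exceed a threshold. The function name, the threshold parameter and the example values are hypothetical.

```python
def filter_by_uncertainty(predictions, max_std):
    """Keep items whose predicted score is certain enough, ranked by score.

    predictions: iterable of (item, mean_score, std) triples.
    max_std: largest standard deviation the recommender accepts.
    """
    kept = [(item, mean) for item, mean, std in predictions if std <= max_std]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in kept]


predictions = [
    ("a", 4.5, 0.9),
    ("b", 4.1, 0.3),
    ("c", 3.8, 0.2),
    ("d", 4.8, 1.5),
]

unconstrained = filter_by_uncertainty(predictions, max_std=float("inf"))
loose = filter_by_uncertainty(predictions, max_std=1.0)
strict = filter_by_uncertainty(predictions, max_std=0.5)
```

Tightening the threshold shortens the list from four items to three and then two, dropping the highest-scored but least certain item first.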

Conclusions
▪ We have proposed a family of metrics based on
the assumption that it is better to avoid a
recommendation than to provide a bad one
▪ We have shown that a balance between precision,
coverage, diversity, and novelty is critical
▪ We have proposed two strategies to decide if an
item should be presented to the user

Future work
▪ Extend the correctness metrics to combine other
evaluation dimensions
▪ Objective way to discriminate between systems:
which one is really the best one?
▪ Consider the psychological aspect of the
recommendation: the user is expecting to receive
N recommendations (better bad than none?)

Thank you

Performance: prediction support

Impact on novelty and diversity
▪ Prediction support
• Larger n decreases the
diversity and novelty of
the lists
• More popular items are
being recommended
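
A minimal sketch of the novelty and diversity measurements, assuming simplified forms of the two metrics: EPC as the average of (1 - popularity) over the recommended items, without rank discounting or relevance weighting, and aggregate diversity as the number of distinct items recommended across all users. The function names, the popularity values and the omission of discounting are assumptions for illustration.

```python
def epc(recommended, item_popularity):
    """Expected popularity complement of one list (simplified, assumed form).

    item_popularity maps each item to the fraction of users who have
    interacted with it; every recommended item must have an entry.
    """
    if not recommended:
        return 0.0
    return sum(1.0 - item_popularity[item] for item in recommended) / len(recommended)


def aggregate_diversity(recommendations):
    """Number of distinct items recommended to at least one user."""
    return len({item for items in recommendations.values() for item in items})


popularity = {"hit": 0.9, "mid": 0.5, "niche": 0.1}

epc_popular_list = epc(["hit", "mid"], popularity)
epc_novel_list = epc(["mid", "niche"], popularity)
distinct_items = aggregate_diversity({"u1": ["hit", "mid"], "u2": ["hit"]})
```

A list leaning on popular items scores 0.3 here against 0.7 for the list with a niche item, which is the direction of the effect reported when more popular items are recommended.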

Motivation
▪ Typical evaluation: it is better to fail than to avoid
a recommendation
• Assumption: not returning an item amounts to
treating that item as not relevant
▪ In this work: a recommender system may decide
not to recommend a specific item
• We need a metric where “no recommendation” means
neither relevant nor not relevant. If possible, it should
mean “better than not relevant”

Definition of uncertainty for PMF
▪ PMF: probabilistic matrix factorisation using a
Bayesian approximation proposed in [Lim & Teh
2007]
▪ The standard deviation of each prediction is derived
using mean-field variational inference
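
As a hedged illustration of where such a standard deviation can come from, assume the approximate posterior factorises over the user factor vector u_i and the item factor vector v_j, with means \bar{u}_i, \bar{v}_j and covariances \Phi_i, \Psi_j. These symbols are notation introduced here, and the expression below is the general variance of an inner product of two independent random vectors, not necessarily the exact formula used with [Lim & Teh 2007].

```latex
% Assumption: u_i and v_j are independent under the approximate posterior,
% with E[u_i] = \bar{u}_i, Cov[u_i] = \Phi_i, E[v_j] = \bar{v}_j, Cov[v_j] = \Psi_j.
\begin{align}
  \hat{r}_{ij} &= \mathbb{E}\left[u_i^{\top} v_j\right]
                = \bar{u}_i^{\top} \bar{v}_j \\
  \operatorname{Var}\left[u_i^{\top} v_j\right]
               &= \operatorname{tr}\left(\Phi_i \Psi_j\right)
                + \bar{u}_i^{\top} \Psi_j \, \bar{u}_i
                + \bar{v}_j^{\top} \Phi_i \, \bar{v}_j \\
  \sigma_{ij}  &= \sqrt{\operatorname{Var}\left[u_i^{\top} v_j\right]}
\end{align}
```

Any observation-noise variance of the rating model would be added inside the square root; it is left out here because the source does not state it.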
