Replicable Evaluation of
Recommender Systems
Alejandro Bellogín (Universidad Autónoma de Madrid, Spain)
Alan Said (Recorded Future, Sweden)
Tutorial at ACM RecSys 2015
[Slides 2–7: photos of Stephansdom, St. Stephen's Cathedral in Vienna, the host city of RecSys 2015; no technical content]
#EVALTUT
8
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
9
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
10
Background
• A recommender system aims to find and
suggest items of likely interest based on the
users’ preferences
11
Background
• A recommender system aims to find and
suggest items of likely interest based on the
users’ preferences
12
Background
• A recommender system aims to find and
suggest items of likely interest based on the
users’ preferences
• Examples:
– Netflix: TV shows and movies
– Amazon: products
– LinkedIn: jobs and colleagues
– Last.fm: music artists and tracks
– Facebook: friends
13
Background
• Typically, the interactions between user and
system are recorded in the form of ratings
– But also: clicks (implicit feedback)
• This is represented as a user-item matrix:
[User–item matrix: users u1 … un as rows, items i1 … im as columns; the cell for user uj and item ik is the unknown preference to be predicted, marked “?”]
14
Motivation
• Evaluation is an integral part of any
experimental research area
• It allows us to compare methods…
15
Motivation
• Evaluation is an integral part of any
experimental research area
• It allows us to compare methods…
• … and identify winners (in competitions)
16
Motivation
A proper evaluation culture allows the field to advance
… or at least, identify when there is a problem!
17
Motivation
In RecSys, we find inconsistent evaluation results,
for the “same”
– Dataset
– Algorithm
– Evaluation metric
Movielens 1M
[Cremonesi et al, 2010]
Movielens 100k
[Gorla et al, 2013]
Movielens 1M
[Yin et al, 2012]
Movielens 100k, SVD
[Jambor & Wang, 2010]
18
Motivation
In RecSys, we find inconsistent evaluation results,
for the “same”
– Dataset
– Algorithm
– Evaluation metric
[Figure: P@50 of three recommenders (SVD50, IB, UB50) under different candidate item selection strategies (TR 3, TR 4, TeI, TrI, AI, OPR); Bellogín et al, 2011]
19
Motivation
In RecSys, we find inconsistent evaluation results,
for the “same”
– Dataset
– Algorithm
– Evaluation metric
[Figure: the same P@50 comparison of SVD50, IB, and UB50 across candidate item selection strategies]
We need to understand why this happens
20
In this tutorial
• We will present the basics of evaluation
– Accuracy metrics: error-based, ranking-based
– Also coverage, diversity, and novelty
• We will focus on replication and reproducibility
– Define the context
– Present typical problems
– Propose some guidelines
21
Replicability
• Why do we need to
replicate?
22
Reproducibility
Why do we need to
reproduce?
Because these two are not
the same
23
NOT in this tutorial
• In-depth analysis of evaluation metrics
– See chapter 9 of the handbook [Shani & Gunawardana, 2011]
• Novel evaluation dimensions
– See tutorials at WSDM ’14 and SIGIR ’13 on
diversity and novelty
• User evaluation
– See tutorial at RecSys 2012
• Comparison of evaluation results in research
– See RepSys workshop at RecSys 2013
– See [Said & Bellogín 2014]
24
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
25
Recommender Systems Evaluation
Typically: as a black box
[Diagram: a Dataset is split into Train / Validation / Test; the Recommender generates a ranking (for a user) or a prediction for a given item (and user), which is then scored with metrics such as precision, error, coverage, …]
26
Recommender Systems Evaluation
The reproducible way: as black boxes
[Diagram: the same evaluation pipeline, with each stage (splitting, recommendation, candidate item generation, metric computation) drawn as its own black box]
27
Recommender as a black box
What do you do when a recommender cannot
predict a score?
This has an impact on coverage
[Said & Bellogín, 2014] 28
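A common way to quantify that impact is prediction coverage (a standard formulation, cf. [Herlocker et al, 2004; Ge et al, 2010], not necessarily the exact definition used in the cited study):

\mathrm{coverage} = \frac{\left|\{(u,i) \in T : \hat{r}_{ui} \text{ is defined}\}\right|}{|T|}

where T is the set of user–item pairs the recommender is asked to score and \hat{r}_{ui} is its prediction.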
Candidate item generation
as a black box
How do you select the candidate items to be
ranked?
Solid triangle represents the target user.
Boxed ratings denote test set.
[Figures: a toy example with scores between 0 and 1.0 illustrating the candidate selection, and the P@50 comparison of SVD50, IB, and UB50 across candidate item strategies (TR 3, TR 4, TeI, TrI, AI, OPR)]
29
Candidate item generation
as a black box
How do you select the candidate items to be
ranked?
[Said & Bellogín, 2014] 30
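As an illustration only (this is not the tutorial's code), the sketch below builds candidate sets under three strategies that are often contrasted; mapping them to the abbreviations in the figures above (TeI = test items, AI = all items, OPR = one-plus-random) is my reading of [Bellogín et al, 2011] and may not match the original setup exactly.

import java.util.*;

/** Minimal sketch of candidate item generation strategies for one target user. */
public class CandidateItems {

    // All items (AI): rank every catalog item the user has not rated in training.
    static Set<String> allItems(Set<String> catalog, Set<String> trainedByUser) {
        Set<String> candidates = new HashSet<>(catalog);
        candidates.removeAll(trainedByUser);
        return candidates;
    }

    // Test items (TeI): rank only items that appear in the test split.
    static Set<String> testItems(Set<String> itemsInTestSplit, Set<String> trainedByUser) {
        Set<String> candidates = new HashSet<>(itemsInTestSplit);
        candidates.removeAll(trainedByUser);
        return candidates;
    }

    // One-plus-random (OPR): rank one relevant test item against n random unrated items.
    static List<String> onePlusRandom(String relevantItem, Set<String> catalog,
                                      Set<String> trainedByUser, int n, Random rnd) {
        List<String> pool = new ArrayList<>(catalog);
        pool.removeAll(trainedByUser);
        pool.remove(relevantItem);
        Collections.shuffle(pool, rnd);
        List<String> candidates = new ArrayList<>(pool.subList(0, Math.min(n, pool.size())));
        candidates.add(relevantItem);
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> catalog = new HashSet<>(Arrays.asList("i1", "i2", "i3", "i4", "i5"));
        Set<String> train = new HashSet<>(Arrays.asList("i1", "i2"));
        Set<String> test = new HashSet<>(Collections.singletonList("i3"));
        System.out.println("AI:  " + allItems(catalog, train));
        System.out.println("TeI: " + testItems(test, train));
        System.out.println("OPR: " + onePlusRandom("i3", catalog, train, 2, new Random(42)));
    }
}

Whichever strategy is chosen has to be reported explicitly, because, as the figures above illustrate, the measured precision changes substantially with this choice alone.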
Evaluation metric computation
as a black box
What do you do when a recommender cannot
predict a score?
– This has an impact on coverage
– It can also affect error-based metrics
MAE = Mean Absolute Error
RMSE = Root Mean Squared Error 31
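The formulas were shown as images on the original slide; the standard definitions over the set T of test user–item pairs for which a prediction \hat{r}_{ui} is produced are:

\mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left| \hat{r}_{ui} - r_{ui} \right|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( \hat{r}_{ui} - r_{ui} \right)^{2}}

The open question is which pairs end up in T when some predictions are missing, which is exactly what the next slide illustrates.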
Evaluation metric computation
as a black box
What do you do when a recommender cannot
predict a score?
– This has an impact on coverage
– It can also affect error-based metrics
User–item pair              Real   Rec1   Rec2   Rec3
(u1, i1)                      5      4     NaN     4
(u1, i2)                      3      2      4     NaN
(u1, i3)                      1      1     NaN     1
(u2, i1)                      3      2      4     NaN
MAE/RMSE, ignoring NaNs          0.75/0.87   2.00/2.00   0.50/0.70
MAE/RMSE, NaNs as 0              0.75/0.87   2.00/2.65   1.75/2.18
MAE/RMSE, NaNs as 3              0.75/0.87   1.50/1.58   0.25/0.50
32
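A minimal sketch (not the tutorial's code) of how the NaN-handling policy alone changes the reported error, applied to the Rec3 column above; treating a missing prediction as 3 assumes that 3 is the midpoint of a 1–5 rating scale.

import java.util.*;

/** MAE/RMSE under different policies for missing (NaN) predictions. */
public class ErrorMetrics {

    enum NanPolicy { IGNORE, AS_ZERO, AS_MIDPOINT }

    static double[] maeRmse(double[] real, double[] pred, NanPolicy policy, double midpoint) {
        double absSum = 0, sqSum = 0;
        int n = 0;
        for (int k = 0; k < real.length; k++) {
            double p = pred[k];
            if (Double.isNaN(p)) {
                if (policy == NanPolicy.IGNORE) continue;              // drop uncovered pairs
                p = (policy == NanPolicy.AS_ZERO) ? 0.0 : midpoint;    // impute a default value
            }
            double err = p - real[k];
            absSum += Math.abs(err);
            sqSum += err * err;
            n++;
        }
        return new double[] { absSum / n, Math.sqrt(sqSum / n) };      // assumes n > 0
    }

    public static void main(String[] args) {
        double[] real = { 5, 3, 1, 3 };
        double[] rec3 = { 4, Double.NaN, 1, Double.NaN };
        for (NanPolicy p : NanPolicy.values()) {
            double[] m = maeRmse(real, rec3, p, 3.0);
            System.out.printf("%s: MAE=%.2f RMSE=%.2f%n", p, m[0], m[1]);
        }
    }
}

Running it reproduces the Rec3 row of the table up to rounding (e.g. RMSE 0.71 when ignoring NaNs, shown as 0.70 above).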
Evaluation metric computation
as a black box
Using the internal evaluation methods of Mahout (AM), LensKit (LK), and MyMediaLite (MML)
[Said & Bellogín, 2014] 33
Evaluation metric computation
as a black box
Variations on metrics:
Error-based metrics can be normalized or averaged per user:
– Normalize RMSE or MAE by the range of the ratings (divide by rmax – rmin)
– Average RMSE or MAE to compensate for unbalanced distributions of items or users
34
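In formula form (one common way to write these variants; the per-user average applies equally to RMSE):

\mathrm{NMAE} = \frac{\mathrm{MAE}}{r_{\max} - r_{\min}}
\qquad
\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{r_{\max} - r_{\min}}
\qquad
\mathrm{MAE}_{\mathrm{user}} = \frac{1}{|U|} \sum_{u \in U} \mathrm{MAE}(u)

Reporting plain MAE when the paper being compared against used the per-user average (or vice versa) is enough to make the numbers incomparable.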
Evaluation metric computation
as a black box
Variations on metrics:
nDCG has at least two discounting functions (linear and exponential decay)
35
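The slide's formulas were not preserved; the two variants most commonly meant by this distinction differ in the gain applied to the graded relevance rel_i (this is my interpretation):

\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)}
\quad \text{(linear gain)}
\qquad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}
\quad \text{(exponential gain)}

\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}

where IDCG@k is the DCG@k of the ideal reordering of the same items. The two variants can rank systems differently, so the one used must be stated.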
Evaluation metric computation
as a black box
Variations on metrics:
Ranking-based metrics are usually computed up to a ranking position or cutoff k
P = Precision (Precision at k)
R = Recall (Recall at k)
MAP = Mean Average Precision
36
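One common formulation for a user u with relevant set Rel(u) and top-k list L_k(u) (details such as the cutoff used inside AP vary between implementations, which is part of the problem):

P@k = \frac{|L_k(u) \cap \mathrm{Rel}(u)|}{k}
\qquad
R@k = \frac{|L_k(u) \cap \mathrm{Rel}(u)|}{|\mathrm{Rel}(u)|}

\mathrm{AP}(u) = \frac{1}{|\mathrm{Rel}(u)|} \sum_{i \in \mathrm{Rel}(u)} P@\mathrm{rank}_u(i)
\qquad
\mathrm{MAP} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}(u)

where rank_u(i) is the position of item i in u's ranking, and relevant items that are not retrieved contribute 0 to the sum.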
Evaluation metric computation
as a black box
If ties are present in the ranking scores, results may depend on the implementation [Bellogín et al, 2013]
37
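A minimal illustration (not from the cited paper): with tied scores the final order, and therefore P@k or nDCG@k, can depend on an unspecified or non-deterministic sort. An explicit secondary key makes it reproducible.

import java.util.*;

/** Deterministic tie-breaking when sorting items by predicted score. */
public class TieBreaking {
    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("i1", 0.9);
        scores.put("i2", 0.9);   // tied with i1
        scores.put("i3", 0.8);

        List<String> ranking = new ArrayList<>(scores.keySet());
        ranking.sort(Comparator.comparingDouble((String i) -> scores.get(i)).reversed()
                               .thenComparing(Comparator.naturalOrder())); // tie-break by item id
        System.out.println(ranking); // [i1, i2, i3] regardless of HashMap iteration order
    }
}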
Evaluation metric computation
as a black box
It is not clear how to measure diversity/novelty in offline experiments (they are measured directly in online experiments):
– Using a taxonomy (items about novel topics) [Weng et al, 2007]
– New items over time [Lathia et al, 2010]
– Based on entropy, self-information and Kullback-Leibler divergence [Bellogín et al, 2010; Zhou et al, 2010; Filippone & Sanguinetti, 2010]
38
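As one concrete instantiation of the self-information idea (a common choice in the literature, not necessarily the exact one used in the cited works):

\mathrm{novelty}(i) = -\log_2 \frac{|U_i|}{|U|}
\qquad
\mathrm{novelty}(L_u) = \frac{1}{|L_u|} \sum_{i \in L_u} \mathrm{novelty}(i)

where U_i is the set of users who have interacted with item i, so rarely seen items score as more novel.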
Recommender Systems Evaluation:
Summary
• Usually, evaluation is seen as a black box
• The evaluation process involves everything:
splitting, recommendation, candidate item
generation, and metric computation
• We should agree on standard implementations,
parameters, instantiations, …
– Example: trec_eval in IR
39
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
40
Reproducible Experimental Design
• We need to distinguish
– Replicability
– Reproducibility
• Different aspects:
– Algorithmic
– Published results
– Experimental design
• Goal: have a reproducible experimental
environment
41
Definition:
Replicability
To copy something
• The results
• The data
• The approach
Being able to evaluate
in the same setting
and obtain the same
results
42
Definition:
Reproducibility
To recreate something
• The (complete) set
of experiments
• The (complete) set
of results
• The (complete)
experimental setup
To (re) launch it in
production with the
same results
43
Comparing against the state-of-the-art
[Flowchart: starting from “Your settings are not exactly like those in paper X, but it is a relevant paper”, replicate and reproduce the results of paper X and check whether they match or agree with the original paper. Depending on the outcome: congrats, you’re done; congrats, you have shown that paper X behaves differently in the new setting; or sorry, there is something wrong/incomplete in the experimental design.]
44
What about Reviewer 3?
• “It would be interesting to see this done on a
different dataset…”
– Repeatability
– The same person doing the whole pipeline over
again
• “How does your approach compare to
[Reviewer 3 et al. 2003]?”
– Reproducibility or replicability (depending on how
similar the two papers are)
45
Repeat vs. replicate vs. reproduce vs. reuse
46
Motivation for reproducibility
In order to ensure that our experiments,
settings, and results are:
– Valid
– Generalizable
– Of use for others
– etc.
we must make sure that others can reproduce
our experiments in their setting
47
Making reproducibility easier
• Description, description,
description
• No magic numbers
• Specify values for all parameters
• Motivate!
• Keep a detailed protocol
• Describe process clearly
• Use standards
• Publish code (nobody expects
you to be an awesome
developer, you’re a researcher)
48
Replicability, reproducibility, and progress
• Can there be actual progress if no valid
comparison can be done?
• What is the point of comparing two
approaches if the comparison is flawed?
• How do replicability and reproducibility
facilitate actual progress in the field?
49
Summary
• Important issues in recommendation
– Validity of results (replicability)
– Comparability of results (reproducibility)
– Validity of experimental setup (repeatability)
• We need to incorporate reproducibility and
replication to facilitate the progress in the field
50
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
51
Replication by Example
• Demo time!
• Check
– http://www.recommenders.net/tutorial
• Checkout
– https://github.com/recommenders/tutorial.git
52
The things we write
mvn exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
53
The things we forget to write
mvn -o exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
-Dexec.args="-u false"
54
mvn exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
The things we forget to write
mvn -o exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
-Dexec.args="-t 4.0"
55
mvn -o exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
-Dexec.args="-u false"
mvn exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
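One low-tech safeguard against “the things we forget to write” is to have the experiment print its own effective configuration, so the reported command line can be checked against what was actually run. A sketch (not part of the tutorial project; the class name is illustrative):

import java.util.Arrays;

/** Print the exact configuration this run was launched with. */
public class RunLogger {
    public static void main(String[] args) {
        System.out.println("args: " + Arrays.toString(args));
        System.out.println("java.version: " + System.getProperty("java.version"));
        System.out.println("user.dir: " + System.getProperty("user.dir"));
        // Dump every system property, including the -D flags passed on the command line.
        System.getProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}

Calling something like this at the start of the real main class, and storing the output next to the results, records the arguments and JVM properties that the slides above show are easy to omit from a paper.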
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
56
Key Takeaways
• Every decision has an impact
– We should log every step taken in the
experimental part and report that log
• There are more things besides papers
– Source code, web appendix, etc. are very useful to
provide additional details not present in the paper
• You should not fool yourself
– You have to be critical about what you measure
and not trust intermediate “black boxes”
57
We must avoid this
From http://dilbert.com/strips/comic/2010-11-07/
58
Next steps?
• Agree on standard implementations
• Replicable badges for journals / conferences
59
Next steps?
• Agree on standard implementations
• Replicable badges for journals / conferences
http://validation.scienceexchange.com
The aim of the Reproducibility Initiative is to identify and reward high
quality reproducible research via independent validation of key
experimental results
60
Next steps?
• Agree on standard implementations
• Replicable badges for journals / conferences
• Investigate how to improve reproducibility
61
Next steps?
• Agree on standard implementations
• Replicable badges for journals / conferences
• Investigate how to improve reproducibility
• Benchmark, report, and store results
62
Pointers
• Email and Twitter
– Alejandro Bellogín
• alejandro.bellogin@uam.es
• @abellogin
– Alan Said
• alansaid@acm.org
• @alansaid
• Slides:
• In Slideshare... soon!
63
RiVal
Recommender System Evaluation Toolkit
http://rival.recommenders.net
http://github.com/recommenders/rival
64
Thank you!
65
References and Additional reading
• [Armstrong et al, 2009] Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since
1998. CIKM
• [Bellogín et al, 2010] A Study of Heterogeneity in Recommendations for a Social Music
Service. HetRec
• [Bellogín et al, 2011] Precision-Oriented Evaluation of Recommender Systems: an Algorithm
Comparison. RecSys
• [Bellogín et al, 2013] An Empirical Comparison of Social, Collaborative Filtering, and Hybrid
Recommenders. ACM TIST
• [Ben-Shimon et al, 2015] RecSys Challenge 2015 and the YOOCHOOSE Dataset. RecSys
• [Cremonesi et al, 2010] Performance of Recommender Algorithms on Top-N
Recommendation Tasks. RecSys
• [Filippone & Sanguinetti, 2010] Information Theoretic Novelty Detection. Pattern
Recognition
• [Fleder & Hosanagar, 2009] Blockbuster Culture’s Next Rise or Fall: The Impact of
Recommender Systems on Sales Diversity. Management Science
• [Ge et al, 2010] Beyond accuracy: evaluating recommender systems by coverage and
serendipity. RecSys
• [Gorla et al, 2013] Probabilistic Group Recommendation via Information Matching. WWW 66
References and Additional reading
• [Herlocker et al, 2004] Evaluating Collaborative Filtering Recommender Systems. ACM
Transactions on Information Systems
• [Jambor & Wang, 2010] Goal-Driven Collaborative Filtering. ECIR
• [Knijnenburg et al, 2011] A Pragmatic Procedure to Support the User-Centric Evaluation of
Recommender Systems. RecSys
• [Koren, 2008] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering
Model. KDD
• [Lathia et al, 2010] Temporal Diversity in Recommender Systems. SIGIR
• [Li et al, 2010] Improving One-Class Collaborative Filtering by Incorporating Rich User
Information. CIKM
• [Pu et al, 2011] A User-Centric Evaluation Framework for Recommender Systems. RecSys
• [Said & Bellogín, 2014] Comparative Recommender System Evaluation: Benchmarking
Recommendation Frameworks. RecSys
• [Schein et al, 2002] Methods and Metrics for Cold-Start Recommendations. SIGIR
• [Shani & Gunawardana, 2011] Evaluating Recommendation Systems. Recommender Systems
Handbook
• [Steck & Xin, 2010] A Generalized Probabilistic Framework and its Variants for Training Top-k
Recommender Systems. PRSAT 67
References and Additional reading
• [Tikk et al, 2014] Comparative Evaluation of Recommender Systems for Digital Media. IBC
• [Vargas & Castells, 2011] Rank and Relevance in Novelty and Diversity Metrics for
Recommender Systems. RecSys
• [Weng et al, 2007] Improving Recommendation Novelty Based on Topic Taxonomy. WI-IAT
• [Yin et al, 2012] Challenging the Long Tail Recommendation. VLDB
• [Zhang & Hurley, 2008] Avoiding Monotony: Improving the Diversity of Recommendation
Lists. RecSys
• [Zhang & Hurley, 2009] Statistical Modeling of Diversity in Top-N Recommender Systems. WI-
IAT
• [Zhou et al, 2010] Solving the Apparent Diversity-Accuracy Dilemma of Recommender
Systems. PNAS
• [Ziegler et al, 2005] Improving Recommendation Lists Through Topic Diversification. WWW
68
Rank-score (Half-Life Utility)
69
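The formula was an image on the slide; a commonly cited formulation (following Breese et al., 1998) is:

R_u = \sum_{j} \frac{\max(r_{u,i_j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}
\qquad
R = 100 \, \frac{\sum_u R_u}{\sum_u R_u^{\max}}

where i_j is the item ranked at position j for user u, d is a neutral (“default”) rating, and α is the half-life, the rank at which the utility of a hit is halved.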
Mean Reciprocal Rank
70
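Standard definition (the slide showed it as an image):

\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u}

where rank_u is the position of the first relevant item in user u's recommendation list.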
Mean Percentage Ranking
[Li et al, 2010]
71
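As usually defined for implicit feedback (cf. [Li et al, 2010]; lower is better, and roughly 50% corresponds to a random ranking):

\mathrm{MPR} = \frac{\sum_{u,i} r_{ui} \, \overline{\mathrm{rank}}_{ui}}{\sum_{u,i} r_{ui}}

where \overline{rank}_{ui} ∈ [0, 1] is the percentile rank of item i in user u's recommendation list (0 for the top position).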
Global ROC
[Schein et al, 2002]
72
Customer ROC
[Schein et al, 2002]
73
Popularity-stratified recall
[Steck & Xin, 2010]
74