RecSys Boston, Sept 17, 2016
Contrasting Offline and Online
Results when Evaluating
Recommendation Algorithms
Marco Rossetti
Trainline Ltd., London
(previously University of Milano-Bicocca)
Fabio Stella
Department of Informatics, Systems and Communication
University of Milano-Bicocca
Markus Zanker
Faculty of Computer Science
Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity
becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with
internal validity (results are not reproducible across different
open-source frameworks).
• Offline and online evaluations have also been found to produce
different results, which puts question marks on external validity (e.g.
Cremonesi et al. 2012, Beel et al. 2013, Garcin et al. 2014, Ekstrand et
al. 2014, Maksai et al. 2015).
• Proposition:
  • Compare algorithm performance in an offline experiment with
  performance in an online evaluation.
  • Use a within-users experimental design, which allows testing for
  differences in paired samples.
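A within-users design yields one score per user and algorithm, so two algorithms can be compared on paired samples. The slides do not say which statistical test was used. The sketch below assumes a two-sided sign-flip permutation test, and the per-user scores and the function name `paired_permutation_test` are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-user scores."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each user's difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-user precision of two algorithms shown to the same six users.
algo_a = [0.8, 0.6, 0.6, 1.0, 0.4, 0.8]
algo_b = [0.6, 0.4, 0.6, 0.6, 0.2, 0.6]
p_value = paired_permutation_test(algo_a, algo_b)
print(f"p = {p_value:.3f}")
```

Because every user sees all algorithms, differences between users cancel out in the per-user differences, which is the main appeal of the paired design.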
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy
measurements predict their relative ranking according to an accuracy
measurement in a user-centric evaluation?
2. Does the relative ranking of algorithms based on offline measurements of
predictive accuracy for long-tail items produce results comparable to
a user-centric evaluation?
3. Do offline accuracy measurements allow us to predict the utility of
recommendations in a user-centric evaluation?
Study Design
Step 1
• Collected likes on MovieLens (ML) movies from 241 users
• On average 137 ratings per user
Step 2
• The same users evaluated 4 algorithms, 5 recommendations each
• On average 17.4 + 2 recommendations
• 122 users returned, 100 after cleaning
Offline and Online Evaluations
[Diagram: the algorithms are trained on ML1M; the offline evaluation uses all-but-1 validation, the online evaluation uses the users' answers]
Algorithms
• POP: Popularity
• MF80: Matrix Factorization with 80 factors
• MF400: Matrix Factorization with 400 factors
• I2I: Item-to-Item K-Nearest Neighbors
Metrics
• Precision on all items (offline and online)
• Precision on long tail (offline and online)
• Useful recommendations (online only)
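The three online metrics can be sketched from per-recommendation user answers. This is a minimal illustration, not the authors' implementation. The `Answer` schema and the helper names are hypothetical. It also assumes that long-tail precision is measured only among recommended long-tail items, which the slides do not spell out. "Useful" follows the deck's own wording, relevant and novel.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """One user's feedback on one recommended item (hypothetical schema)."""
    liked: bool      # the user judged the item relevant
    novel: bool      # the user did not know the item before
    long_tail: bool  # the item lies outside the most popular items


def fraction(answers, hit, scope=lambda a: True):
    """Share of in-scope answers that count as hits; 0.0 if the scope is empty."""
    scoped = [a for a in answers if scope(a)]
    if not scoped:
        return 0.0
    return sum(1 for a in scoped if hit(a)) / len(scoped)


def online_metrics(answers):
    return {
        "precision_all": fraction(answers, hit=lambda a: a.liked),
        # Assumption: only recommendations of long-tail items are in scope here.
        "precision_long_tail": fraction(
            answers, hit=lambda a: a.liked, scope=lambda a: a.long_tail
        ),
        # Useful means relevant and novel.
        "useful": fraction(answers, hit=lambda a: a.liked and a.novel),
    }


# Hypothetical answers of one user to five recommendations from one algorithm.
answers = [
    Answer(liked=True, novel=False, long_tail=False),
    Answer(liked=True, novel=True, long_tail=True),
    Answer(liked=False, novel=True, long_tail=True),
    Answer(liked=True, novel=False, long_tail=False),
    Answer(liked=False, novel=False, long_tail=False),
]
metrics = online_metrics(answers)
print(metrics)
```

For this toy user the sketch gives precision 0.6 on all items, 0.5 on long-tail items and 0.2 for useful recommendations.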
Precision All Items
[Figure: two graphs over MF80, MF400, POP and I2I, titled "Offline precision all items" and "Online precision all items"; one has three edges labeled p = 0.05, the other has edges labeled p = 0.05, p = 0.05 and p = 0.1]
Algorithm   Offline   Online
I2I         0.438     0.546
MF80        0.504     0.598
MF400       0.454     0.604
POP         0.340     0.516
Precision on Long Tail Items
[Figure: one graph over MF80, MF400, POP and I2I, titled "Offline = Online precision long tail items", with six edges each labeled p = 0.05]
Algorithm   Offline   Online
I2I         0.280     0.356
MF80        0.018     0.054
MF400       0.360     0.628
POP         0.000     0.000
Useful Recommendations
[Figure: one graph over MF400, I2I, POP and MF80, titled "Useful recommendations", with five edges each labeled p = 0.05]
Algorithm   Online
I2I         0.126
MF80        0.082
MF400       0.116
POP         0.026
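One way to read the three tables against the research questions is to compare algorithm rankings directly. The sketch below copies the point estimates from the slides and computes Kendall's tau between rankings. The choice of Kendall's tau is an assumption, since the slides compare algorithms through pairwise significance tests, and this calculation ignores whether individual differences are significant.

```python
from itertools import combinations

# Point estimates copied from the tables on the slides.
offline_all = {"I2I": 0.438, "MF80": 0.504, "MF400": 0.454, "POP": 0.340}
online_all = {"I2I": 0.546, "MF80": 0.598, "MF400": 0.604, "POP": 0.516}
offline_tail = {"I2I": 0.280, "MF80": 0.018, "MF400": 0.360, "POP": 0.000}
online_tail = {"I2I": 0.356, "MF80": 0.054, "MF400": 0.628, "POP": 0.000}
online_useful = {"I2I": 0.126, "MF80": 0.082, "MF400": 0.116, "POP": 0.026}


def kendall_tau(x, y):
    """Kendall tau-a between two score dictionaries that share the same keys."""
    concordant = discordant = 0
    for a, b in combinations(sorted(x), 2):
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


tau_all = kendall_tau(offline_all, online_all)
tau_tail = kendall_tau(offline_tail, online_tail)
tau_useful = kendall_tau(offline_all, online_useful)
print(f"all items: {tau_all:.3f}")     # 0.667, MF80 and MF400 swap places
print(f"long tail: {tau_tail:.3f}")    # 1.000, identical ranking
print(f"useful:    {tau_useful:.3f}")  # 0.000, three of six pairs flip
```

On these point estimates the long-tail ranking carries over from offline to online unchanged, the all-items ranking differs only in the order of the two matrix factorization variants, and offline precision on all items does not predict the ranking by useful recommendations.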
Conclusions
• Compared different algorithms online and offline using
a within-users experimental design.
• The algorithm that performed best according to a traditional offline
accuracy measurement was significantly worse when it comes
to useful (i.e. relevant and novel) recommendations measured
online.
• Academia and industry should keep investigating this topic
to find the best possible way to validate offline
evaluations.
Thank you!
Marco	Rossetti
Trainline Ltd.,	London
@ross85
