COMPARING ON-LINE AND OFF-LINE EVALUATION RESULTS
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Ladislav Peška
Department of Software Engineering
Charles University in Prague, Czech Republic
ABSTRACT
Recommending in the context of small e-commerce enterprises is
rather challenging due to the lower volume of interactions and low
user loyalty, which rarely extends beyond a single session. On the other
hand, we usually deal with lower volumes of objects, which users can
discover more easily via search or the browsing GUI.
The main goal of this paper is to determine the applicability of off-line
evaluation metrics for estimating the usability of recommender systems
(evaluated on-line in A/B testing). In total, 800 variants of
recommending algorithms were evaluated off-line w.r.t. 18 metrics
covering rating-based, ranking-based, novelty, and diversity
evaluation.
Off-line results were compared with the on-line evaluation of 12
selected recommender variants. Off-line results showed great
variance in performance w.r.t. different metrics, with the Pareto
front covering 68% of the approaches. On-line results varied
considerably with the volume of objects visited by the user. Ranking-
based metrics provided the best estimates for novel users. We further
trained two regressors to predict on-line results from the off-line
metrics and to estimate the performance of recommenders not directly
evaluated in A/B testing.
RESULTS and FUTURE WORK
• CTR and VRR grew gradually less consistent as users visited more
objects.
o Possibly, some users tend to visit all objects from a given category ->
modify the VRR metric to incorporate elapsed time.
o Evaluate more business-oriented metrics in the future (conversions,
revenue, actions after click).
• For users with a lower volume of visited objects (the majority of the
dataset), ranking-based metrics are the best estimators of on-line
performance.
o Intra-list diversity (ILD) seems to gain some importance for users with
longer histories.
o Rating-based and novelty metrics were mostly negatively correlated or
indifferent throughout the dataset.
o Aim at a finer-grained classification of users in the future.
• Evaluate metrics based on knowledge of the user’s choices (missing
not at random, MNAR).
o However, the high ratio of VRR to CTR scores indicates a potentially
lower effect of missing-not-at-random data.
• Evaluate other off-line metrics, such as object popularity.
• Evaluate regression/ranking methods aimed at predicting on-line
results from off-line metrics.
• Verify results on additional small e-commerce vendors and for
additional recommending algorithms.
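The future-work item on predicting on-line results from off-line metrics can be sketched as a plain least-squares regression: fit on the few A/B-tested variants, then score variants evaluated only off-line. The data below are synthetic placeholders (random metric vectors and an assumed linear relation), not the study's actual values:

```python
import numpy as np

# Sketch: map off-line metric vectors of 12 A/B-tested variants to their
# observed CTR, then score variants that were never tested on-line.
rng = np.random.default_rng(0)
offline = rng.random((12, 4))               # e.g. [AUC, MRR, nDCG, ILD] per variant
true_w = np.array([0.02, 0.05, 0.04, 0.01]) # assumed ground-truth relation
ctr = offline @ true_w + 0.001 * rng.standard_normal(12)

X = np.hstack([offline, np.ones((12, 1))])  # add intercept column
w, *_ = np.linalg.lstsq(X, ctr, rcond=None)

untested = rng.random((3, 4))               # variants evaluated only off-line
pred = np.hstack([untested, np.ones((3, 1))]) @ w
print(pred)                                 # estimated CTR for untested variants
```

The paper's actual regressors could be non-linear; linear least squares is the minimal illustration of the idea.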
DOMAIN: Czech travel agency
• Approx. 300-800 visitors daily; several hundred to a few thousand objects.
• However, only a few visited objects per user and low user loyalty.
• Over 2 years of historic data, 560K records, complex feedback available.
ALGORITHMS: Item-to-Item models
Low user loyalty and high fluctuation of users would prevent effective use of
user-based algorithms, such as matrix factorization.
• Word2vec (item2vec): CF based on the stream of object visits.
• Doc2vec: CB based on the textual descriptions of tours.
• VSM (Cosine): CB based on the descriptive features of tours (length, price,
destination, meal plan, etc.)
Hyperparameters: embeddings size, context window size, diversity and novelty
enhancements, user profile.
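The VSM (Cosine) variant above can be sketched as item-to-item cosine similarity over descriptive feature vectors. The feature encoding below is a hypothetical illustration, not the actual tour representation used in the study:

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between object feature vectors (rows of X)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)    # guard against zero vectors
    return Xn @ Xn.T

# Hypothetical tour features: [length_days, normalized_price, is_seaside]
tours = np.array([
    [7.0, 0.8, 1.0],
    [7.0, 0.7, 1.0],
    [3.0, 0.2, 0.0],
])

S = cosine_sim_matrix(tours)
# Item-to-item recommendation: objects most similar to tour 0, excluding itself
ranked = [i for i in np.argsort(-S[0]) if i != 0]
print(ranked)  # tour 1 (a very similar seaside trip) ranks above tour 2
```

The Word2vec/Doc2vec variants produce the embedding vectors differently (from visit streams or tour texts), but the same similarity-and-rank step applies on top of them.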
USER PROFILE: (Which objects were used to represent the user?)
• Mean: All visited objects are used; similarities per visited object are averaged.
• Max: All visited objects are used; the maximum similarity w.r.t. any visited object is taken.
• Last(-k): Only the last (k) objects are used, with linearly decreasing weights.
• Temporal(-k): The last k (or all) objects are used, with weights decreasing with
the real time (days) elapsed since the feedback was observed.
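The four profile strategies can be sketched as aggregation rules over the similarities between a candidate object and the user's visited objects. The exponential `decay` constant in the temporal variant is an assumed parameter, not taken from the paper:

```python
import numpy as np

def profile_score(sim_to_candidate, days_ago, mode="mean", k=None, decay=0.1):
    """Score one candidate object from similarities to the user's visited objects.

    sim_to_candidate: similarity of each visited object to the candidate,
    ordered oldest -> newest; days_ago: age of each visit in days.
    Modes follow the poster: mean, max, last(-k), temporal(-k).
    """
    s = np.asarray(sim_to_candidate, dtype=float)
    t = np.asarray(days_ago, dtype=float)
    if mode == "mean":
        return s.mean()
    if mode == "max":
        return s.max()
    if mode == "last":                  # last k visits, linearly decreasing weights
        k = k or len(s)
        w = np.arange(1, k + 1)         # the newest visit gets the largest weight
        return np.average(s[-k:], weights=w)
    if mode == "temporal":              # weights decay with elapsed real time
        k = k or len(s)
        w = np.exp(-decay * t[-k:])     # assumed exponential decay over days
        return np.average(s[-k:], weights=w)
    raise ValueError(mode)

sims = [0.2, 0.4, 0.9]                  # user drifted toward objects like this candidate
ages = [30.0, 10.0, 1.0]
print(profile_score(sims, ages, "mean"))        # 0.5
print(profile_score(sims, ages, "last", k=2))   # weights recent visits more
```

With a drifting user like this, last-k and temporal scores exceed the plain mean, which is the intended effect of recency weighting.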
EVALUATION:
Off-line phase: June 1 – July 19, 2018
- 970 users (with visited objects in both train and test set)
- 800 variants of [algorithm, hyperparameters, user profile] were evaluated.
On-line phase: July 19 – August 17, 2018
• Selected 12 algorithms (best & worst w.r.t. each off-line metric)
• 4287 users (with some visited objects to create a user profile)
o One RS variant assigned to each user
• In total, 928 click-throughs (CTR)
• In total, 10,961 visits after recommendation (VRR)
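The two on-line metrics can be sketched as simple ratios over a recommendation log. The event-log schema below (`shown`, `clicked`, `visited_after_rec`) is illustrative, not the study's actual logging format:

```python
# Minimal sketch of the two on-line metrics over a per-impression event log.
def online_metrics(events):
    shown = sum(e["shown"] for e in events)
    clicks = sum(e["clicked"] for e in events)
    visits = sum(e["visited_after_rec"] for e in events)
    ctr = clicks / shown if shown else 0.0  # click-through rate
    vrr = visits / shown if shown else 0.0  # visit-after-recommendation rate
    return ctr, vrr

log = [
    {"shown": 1, "clicked": 1, "visited_after_rec": 1},
    {"shown": 1, "clicked": 0, "visited_after_rec": 1},  # visited later, no click
    {"shown": 1, "clicked": 0, "visited_after_rec": 0},
]
ctr, vrr = online_metrics(log)
print(ctr, vrr)
```

The key distinction, consistent with the poster's counts (10,961 visits vs. 928 clicks), is that VRR also credits recommended objects the user reaches without clicking the recommendation itself.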
Peter Vojtáš
Department of Software Engineering
Charles University in Prague, Czech Republic
OFF-LINE METRICS and ON-LINE RESULTS
(Pearson's correlation panel: only the title survived this export.)
Metric groups: rating (MAE) | ranking (AUC, MRR, nDCG100) | novelty (Nov10_t, Nov10_u) | diversity (ILD10) | on-line (CTR, VRR)
Id Algorithm MAE AUC MRR nDCG100 Nov10_t Nov10_u ILD10 CTR VRR
0 Doc2vec; e:128, w:1, last, nov. 0.292 0.617 0.031 0.057 0.234 1.000 0.800 0.0070 0.050
1 Doc2vec; e:128, w:1, temp., div. 0.362 0.679 0.031 0.075 0.221 0.999 0.838 0.0084 0.075
2 Doc2vec; e:32, w:5, mean 0.455 0.555 0.028 0.050 0.211 0.997 0.786 0.0089 0.054
3 Doc2vec; e:32, w:5, mean, div. 0.455 0.555 0.025 0.046 0.214 0.998 0.859 0.0062 0.060
4 Doc2vec; e:128, w:5, max, nov. 0.214 0.526 0.012 0.031 0.229 0.995 0.741 0.0077 0.052
5 Cosine; temp., nov. 0.406 0.797 0.146 0.215 0.255 0.994 0.270 0.0057 0.020
6 Cosine; mean, nov. 0.400 0.795 0.149 0.214 0.229 0.994 0.223 0.0119 0.088
7 Cosine; last-10 0.390 0.783 0.127 0.205 0.218 0.996 0.208 0.0075 0.055
8 Word2vec; e:64, w:5, mean, div. 0.414 0.809 0.103 0.182 0.215 0.973 0.683 0.0090 0.062
9 Word2vec; e:32, w:5, temp., nov. 0.438 0.816 0.102 0.195 0.244 0.977 0.495 0.0095 0.065
10 Word2vec; e:128, w:3, last 0.290 0.734 0.097 0.168 0.212 0.997 0.534 0.0077 0.056
11 Word2vec; e:32, w:3, last-10 0.432 0.814 0.134 0.229 0.214 0.988 0.443 0.0080 0.089
Comparing on-line and off-line results for users with 1-5 visited objects. Parameters: e = embedding size,
w = context window size; nov. and div. denote novelty and diversity enhancements. The best results w.r.t. each
metric are in bold green, the worst results in red, and the best per algorithm type in italic.
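Given the table above, the agreement between any off-line column and an on-line column can be checked directly via Pearson correlation across the 12 variants; this sketch uses the nDCG100 and CTR columns:

```python
import numpy as np

# nDCG100 (off-line) and CTR (on-line) for variants 0-11 from the table above.
ndcg = np.array([0.057, 0.075, 0.050, 0.046, 0.031, 0.215,
                 0.214, 0.205, 0.182, 0.195, 0.168, 0.229])
ctr = np.array([0.0070, 0.0084, 0.0089, 0.0062, 0.0077, 0.0057,
                0.0119, 0.0075, 0.0090, 0.0095, 0.0077, 0.0080])

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient.
r = np.corrcoef(ndcg, ctr)[0, 1]
print(round(r, 3))
```

Repeating this per off-line metric and per user segment is what underlies the poster's observation that ranking-based metrics correlate best for users with short histories.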
KEY FINDINGS
• One of the surprising findings was that results are highly dependent
on the volume of objects visited by the user. For users with few
visited objects, both on-line metrics were highly correlated
and ranking-based metrics provided the most relevant estimates, while CTR
and VRR grew gradually less consistent for users with longer histories.
This is further illustrated by the volume of interactions: while per-user
CTR gradually decreases with the number of visited objects, VRR increases.
• A reasonable level of diversity seems important for users with more
visited objects. This was also indicated by the regression methods
trained to predict on-line results from off-line metrics.
• As for the recommending algorithms, the trained regressors prefer Cosine
and Word2vec models over Doc2vec, which is also observable in
the actual on-line results.
Evaluation site:
www.slantour.cz
Code & full results:
github.com/lpeska/REVEAL2018
Contact:
peska@ksi.mff.cuni.cz
vojtas@ksi.mff.cuni.cz
Figure: Users with 1-2 visited objects.
Figure: Comparison of total and per-user CTR and VRR scores.
