COMPARING ON-LINE AND OFF-LINE EVALUATION RESULTS
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Ladislav Peška
Department of Software Engineering
Charles University in Prague, Czech Republic
ABSTRACT
Recommending in the context of small e-commerce enterprises is
rather challenging due to the lower volume of interactions and low
user loyalty, which rarely extends beyond a single session. On the other
hand, we usually deal with lower volumes of objects, which users can
discover more easily via search or the browsing GUI.
The main goal of this paper is to determine the applicability of off-line
evaluation metrics for estimating the usability of recommender systems
(evaluated on-line in A/B testing). In total, 800 variants of
recommending algorithms were evaluated off-line w.r.t. 18 metrics
covering rating-based, ranking-based, novelty, and diversity
evaluation.
Off-line results were compared with the on-line evaluation of 12
selected recommender variants. Off-line results showed great
variance in performance w.r.t. different metrics, with the Pareto
front covering 68% of the approaches. On-line results varied
considerably with the volume of objects visited by the user. Ranking-
based metrics provided the best estimates for novel users. We further
trained two regressors to predict on-line results from the off-line
metrics and to estimate the performance of recommenders not directly
evaluated in A/B testing.
RESULTS and FUTURE WORK
• CTR and VRR grew gradually less consistent as users visited more
objects.
o Possibly, some users tend to visit all objects from a given category ->
modify the VRR metric to incorporate elapsed time.
o Evaluate more business-oriented metrics in the future (conversions,
revenue, actions after click).
• For users with a lower volume of visited objects (the majority of the
dataset), ranking-based metrics are the best estimators of on-line
performance.
o Intra-list diversity (ILD) seems to gain some importance for users with
longer histories.
o Rating-based and novelty metrics were mostly negatively correlated or
indifferent throughout the dataset.
o Aim at a finer-grained classification of users in the future.
• Evaluate metrics based on knowledge of the user’s choices (missing
not at random, MNAR).
o However, the high ratio of VRR to CTR scores indicates a potentially
lower effect of missing-not-at-random data.
• Evaluate other off-line metrics, such as object popularity.
• Evaluate regression/ranking methods aimed at predicting on-line
results from off-line metrics.
• Verify results on additional small e-commerce vendors and for
additional recommending algorithms.
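The future-work item on predicting on-line results from off-line metrics can be sketched as a plain least-squares regression: fit on the few A/B-tested variants, then score variants evaluated only off-line. The data below are synthetic placeholders (random metric vectors and an assumed linear relation), not the study's actual values:

```python
import numpy as np

# Sketch: map off-line metric vectors of 12 A/B-tested variants to their
# observed CTR, then score variants that were never tested on-line.
rng = np.random.default_rng(0)
offline = rng.random((12, 4))               # e.g. [AUC, MRR, nDCG, ILD] per variant
true_w = np.array([0.02, 0.05, 0.04, 0.01]) # assumed ground-truth relation
ctr = offline @ true_w + 0.001 * rng.standard_normal(12)

X = np.hstack([offline, np.ones((12, 1))])  # add intercept column
w, *_ = np.linalg.lstsq(X, ctr, rcond=None)

untested = rng.random((3, 4))               # variants evaluated only off-line
pred = np.hstack([untested, np.ones((3, 1))]) @ w
print(pred)                                 # estimated CTR for untested variants
```

The paper's actual regressors could be non-linear; linear least squares is the minimal illustration of the idea.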
DOMAIN: Czech travel agency
• Approx. 300-800 visitors daily; several hundred to a few thousand objects.
• However, only a few visited objects per user and low user loyalty.
• Over 2 years of historic data, 560K records, complex feedback available.
ALGORITHMS: Item-to-Item models
Low user loyalty and high fluctuation of users would prevent effective use of
user-based algorithms, such as matrix factorization.
• Word2vec (item2vec): CF based on the stream of object visits.
• Doc2vec: CB based on the textual descriptions of tours.
• VSM (Cosine): CB based on the descriptive features of tours (length, price,
destination, meal plan, etc.)
Hyperparameters: embeddings size, context window size, diversity and novelty
enhancements, user profile.
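The VSM (Cosine) variant above can be sketched as item-to-item cosine similarity over descriptive feature vectors. The feature encoding below is a hypothetical illustration, not the actual tour representation used in the study:

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between object feature vectors (rows of X)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)    # guard against zero vectors
    return Xn @ Xn.T

# Hypothetical tour features: [length_days, normalized_price, is_seaside]
tours = np.array([
    [7.0, 0.8, 1.0],
    [7.0, 0.7, 1.0],
    [3.0, 0.2, 0.0],
])

S = cosine_sim_matrix(tours)
# Item-to-item recommendation: objects most similar to tour 0, excluding itself
ranked = [i for i in np.argsort(-S[0]) if i != 0]
print(ranked)  # tour 1 (a very similar seaside trip) ranks above tour 2
```

The Word2vec/Doc2vec variants produce the embedding vectors differently (from visit streams or tour texts), but the same similarity-and-rank step applies on top of them.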
USER PROFILE: (Which objects were used to represent the user?)
• Mean: All visited objects are used; similarities per visited object are averaged.
• Max: All visited objects are used; the maximum similarity w.r.t. any visited object is taken.
• Last(-k): Only the last (k) objects are used, with linearly decreasing weights.
• Temporal(-k): The last k (or all) objects are used, with weights decreasing with
the real time (days) elapsed since the feedback was observed.
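The four profile strategies can be sketched as aggregation rules over the similarities between a candidate object and the user's visited objects. The exponential `decay` constant in the temporal variant is an assumed parameter, not taken from the paper:

```python
import numpy as np

def profile_score(sim_to_candidate, days_ago, mode="mean", k=None, decay=0.1):
    """Score one candidate object from similarities to the user's visited objects.

    sim_to_candidate: similarity of each visited object to the candidate,
    ordered oldest -> newest; days_ago: age of each visit in days.
    Modes follow the poster: mean, max, last(-k), temporal(-k).
    """
    s = np.asarray(sim_to_candidate, dtype=float)
    t = np.asarray(days_ago, dtype=float)
    if mode == "mean":
        return s.mean()
    if mode == "max":
        return s.max()
    if mode == "last":                  # last k visits, linearly decreasing weights
        k = k or len(s)
        w = np.arange(1, k + 1)         # the newest visit gets the largest weight
        return np.average(s[-k:], weights=w)
    if mode == "temporal":              # weights decay with elapsed real time
        k = k or len(s)
        w = np.exp(-decay * t[-k:])     # assumed exponential decay over days
        return np.average(s[-k:], weights=w)
    raise ValueError(mode)

sims = [0.2, 0.4, 0.9]                  # user drifted toward objects like this candidate
ages = [30.0, 10.0, 1.0]
print(profile_score(sims, ages, "mean"))        # 0.5
print(profile_score(sims, ages, "last", k=2))   # weights recent visits more
```

With a drifting user like this, last-k and temporal scores exceed the plain mean, which is the intended effect of recency weighting.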
EVALUATION:
Off-line phase: June 1 – July 19, 2018
- 970 users (with visited objects in both train and test set)
- 800 variants of [algorithm, hyperparameters, user profile] were evaluated.
On-line phase: July 19 – August 17, 2018
• Selected 12 algorithms (best & worst w.r.t. each off-line metric)
• 4287 users (with some visited objects to create a user profile)
o One RS variant assigned to each user
• In total, 928 click-throughs (CTR)
• In total, 10,961 visits after recommendation (VRR)
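The two on-line metrics can be sketched as simple ratios over a recommendation log. The event-log schema below (`shown`, `clicked`, `visited_after_rec`) is illustrative, not the study's actual logging format:

```python
# Minimal sketch of the two on-line metrics over a per-impression event log.
def online_metrics(events):
    shown = sum(e["shown"] for e in events)
    clicks = sum(e["clicked"] for e in events)
    visits = sum(e["visited_after_rec"] for e in events)
    ctr = clicks / shown if shown else 0.0  # click-through rate
    vrr = visits / shown if shown else 0.0  # visit-after-recommendation rate
    return ctr, vrr

log = [
    {"shown": 1, "clicked": 1, "visited_after_rec": 1},
    {"shown": 1, "clicked": 0, "visited_after_rec": 1},  # visited later, no click
    {"shown": 1, "clicked": 0, "visited_after_rec": 0},
]
ctr, vrr = online_metrics(log)
print(ctr, vrr)
```

The key distinction, consistent with the poster's counts (10,961 visits vs. 928 clicks), is that VRR also credits recommended objects the user reaches without clicking the recommendation itself.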
Peter Vojtáš
Department of Software Engineering
Charles University in Prague, Czech Republic
OFF-LINE METRICS and ON-LINE RESULTS
(Pearson's correlation panel: only the title survived this export.)
Metric groups: rating (MAE) | ranking (AUC, MRR, nDCG100) | novelty (Nov10_t, Nov10_u) | diversity (ILD10) | on-line (CTR, VRR)
Id Algorithm MAE AUC MRR nDCG100 Nov10_t Nov10_u ILD10 CTR VRR
0 Doc2vec; e:128, w:1, last, nov. 0.292 0.617 0.031 0.057 0.234 1.000 0.800 0.0070 0.050
1 Doc2vec; e:128, w:1, temp., div. 0.362 0.679 0.031 0.075 0.221 0.999 0.838 0.0084 0.075
2 Doc2vec; e:32, w:5, mean 0.455 0.555 0.028 0.050 0.211 0.997 0.786 0.0089 0.054
3 Doc2vec; e:32, w:5, mean, div. 0.455 0.555 0.025 0.046 0.214 0.998 0.859 0.0062 0.060
4 Doc2vec; e:128, w:5, max, nov. 0.214 0.526 0.012 0.031 0.229 0.995 0.741 0.0077 0.052
5 Cosine; temp., nov. 0.406 0.797 0.146 0.215 0.255 0.994 0.270 0.0057 0.020
6 Cosine; mean, nov. 0.400 0.795 0.149 0.214 0.229 0.994 0.223 0.0119 0.088
7 Cosine; last-10 0.390 0.783 0.127 0.205 0.218 0.996 0.208 0.0075 0.055
8 Word2vec; e:64, w:5, mean, div. 0.414 0.809 0.103 0.182 0.215 0.973 0.683 0.0090 0.062
9 Word2vec; e:32, w:5, temp., nov. 0.438 0.816 0.102 0.195 0.244 0.977 0.495 0.0095 0.065
10 Word2vec; e:128, w:3, last 0.290 0.734 0.097 0.168 0.212 0.997 0.534 0.0077 0.056
11 Word2vec; e:32, w:3, last-10 0.432 0.814 0.134 0.229 0.214 0.988 0.443 0.0080 0.089
Comparing on-line and off-line results for users with 1-5 visited objects. Parameters: e = embedding size,
w = context window size; nov. and div. denote novelty and diversity enhancements. The best results w.r.t. each
metric are in bold green, the worst results in red, and the best per algorithm type in italic.
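Given the table above, the agreement between any off-line column and an on-line column can be checked directly via Pearson correlation across the 12 variants; this sketch uses the nDCG100 and CTR columns:

```python
import numpy as np

# nDCG100 (off-line) and CTR (on-line) for variants 0-11 from the table above.
ndcg = np.array([0.057, 0.075, 0.050, 0.046, 0.031, 0.215,
                 0.214, 0.205, 0.182, 0.195, 0.168, 0.229])
ctr = np.array([0.0070, 0.0084, 0.0089, 0.0062, 0.0077, 0.0057,
                0.0119, 0.0075, 0.0090, 0.0095, 0.0077, 0.0080])

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient.
r = np.corrcoef(ndcg, ctr)[0, 1]
print(round(r, 3))
```

Repeating this per off-line metric and per user segment is what underlies the poster's observation that ranking-based metrics correlate best for users with short histories.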
KEY FINDINGS
• One of the surprising findings was that results are highly dependent
on the volume of objects visited by the user. For users with few
visited objects, both on-line metrics were highly correlated
and ranking-based metrics provided the most relevant estimates, while CTR
and VRR grew gradually less consistent for users with longer histories.
This is further illustrated by the volume of interactions: while per-user
CTR gradually decreases with the number of visited objects, VRR increases.
• A reasonable level of diversity seems important for users with more
visited objects. This was also indicated by the regression methods
trained to predict on-line results from off-line metrics.
• As for the recommending algorithms, the trained regressors prefer Cosine
and Word2vec models over Doc2vec, which is also observable in
the actual on-line results.
Evaluation site:
www.slantour.cz
Code & full results:
github.com/lpeska/REVEAL2018
Contact:
peska@ksi.mff.cuni.cz
vojtas@ksi.mff.cuni.cz
Figure: Users with 1-2 visited objects.
Figure: Comparison of total and per-user CTR and VRR scores.
