1
Practical lessons in mining and evaluating information systems
Hung-Hsuan Chen, National Central University
Data Analytics Research Team (DART)
• Discover the problems or needs (need)
• Have the programming and math skills, and the domain knowledge, to solve the problem (skill)
• Have the passion to realize the plan (passion)
2
https://ncu-dart.github.io/
My background
• An engineer wearing a scientist’s hat?
• Deep learning and ensemble learning on recommender systems (2014 – 2018)
• Academic search engine CiteSeerX (2008 – 2013)
§ 4M+ documents
§ 87M+ citations
§ 2M – 4M hits per day
§ 300K+ monthly downloads
§ 100K documents added monthly
3
Outline
• I will present 4 common pitfalls in training and evaluating recommender systems
• These pitfalls have appeared in many previous studies on recommender systems and information systems
• Details are in the following paper:
§ Chen, H.-H., Chung, C.-A., Huang, H.-C., & Tsui, W. (2017). Common pitfalls in training and evaluating recommender systems. ACM SIGKDD Explorations Newsletter, 19(1), 37–45.
4
A typical flow to build a recommender system
5
[Timeline: t0 → ts → t1 → t2]
• [t0, ts): no-recommendation period. The logs (e.g., clickstream) of this period are used to train the initial recommendation algorithm Rorig.
• [ts, t2): the initial recommendation algorithm Rorig is applied online. The logs of this period are used to train and compare the initial algorithm Rorig and the new algorithm Rnew:
§ [ts, t1): data used to train the new recommendation algorithm Rnew and to re-train the original algorithm Rorig.
§ [t1, t2): test data to compare Rorig and Rnew.
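To make this timeline concrete, below is a minimal Python sketch of the time-based split; the Event schema, field names, and timestamps are illustrative assumptions, not the original log format.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical log schema; real clickstream records carry more fields.
Event = namedtuple("Event", ["timestamp", "user_id", "item_id"])

logs = [
    Event(datetime(2015, 1, 15), "u1", "p42"),
    Event(datetime(2015, 2, 10), "u2", "p7"),
    Event(datetime(2015, 3, 20), "u1", "p13"),
]

t0 = datetime(2015, 1, 1)  # logging starts; no recommendation shown yet
ts = datetime(2015, 2, 1)  # R_orig goes online
t1 = datetime(2015, 3, 1)  # training window for R_new ends
t2 = datetime(2015, 4, 1)  # test window ends

# [t0, ts): logs of the no-recommendation period train R_orig.
bootstrap = [e for e in logs if t0 <= e.timestamp < ts]
# [ts, t1): logs train R_new and re-train R_orig.
train = [e for e in logs if ts <= e.timestamp < t1]
# [t1, t2): held-out logs compare R_orig against R_new.
test = [e for e in logs if t1 <= e.timestamp < t2]
```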
Issue 1: the trained model may be biased toward highly reachable products
7
Clicks resulting from in-page direct links

Day          Day 1      Day 2
Percentage   19.3150%   21.2812%
8
• If we use the clickstreams to generate the positive samples, then rearranging the layout of the pages or the link targets in the pages would change approximately 1/5 of the positive training instances.
Percentage of promoted products in the recommendation list

Method      MC      CategoryTP   TotalTP   ICF-U2I   ICF-I2I   NMF-U2I   NMF-I2I
train-all   100%    1.48%        1.84%     93.22%    1.40%     1.48%     1.34%
train-sel   1.08%   0.86%        0.98%     14.46%    1.28%     1.32%     1.24%
9
• When using train-all as the training data, several algorithms recommend many of the “promoted products”
§ We seem to learn the “layout” of the product page (i.e., the direct links from one product page to another) instead of the intrinsic relatedness between products; a sketch of the train-sel idea follows
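As a hedged sketch of how train-sel could be built (the field names, especially via_direct_link, are hypothetical): drop the clicks that arrived through an in-page direct link, so the positive pairs reflect user intent rather than page layout.

```python
# Hypothetical click records: each click stores the source page, the
# destination page, and whether it followed an in-page direct link
# (e.g., a promoted-product slot on the source page).
clicks = [
    {"src": "p1", "dst": "p2", "via_direct_link": True},
    {"src": "p1", "dst": "p9", "via_direct_link": False},  # e.g., via search
    {"src": "p3", "dst": "p2", "via_direct_link": False},
]

# train-all: every observed transition becomes a positive pair.
train_all = [(c["src"], c["dst"]) for c in clicks]

# train-sel: keep only transitions not explained by the page layout itself,
# i.e., clicks the user made through search, category pages, and so on.
train_sel = [(c["src"], c["dst"]) for c in clicks if not c["via_direct_link"]]
```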
Lessons learned
• The common wisdom that the clickstream represents a user’s interest/habit could be problematic
§ Clickstreams are highly influenced by the reachability of the products and the layouts of the product pages
• Training a recommender system on clickstreams is likely to learn
§ The “layout” of the pages
§ The recommendation rules of the online recommender system
• We need to select the training data more carefully
10
Issue 2: the online recommendation algorithm affects the distribution of the test data
11
CTRs when using different online recommendation algorithms
12
Lessons learned
• Previous studies sometimes use all the available test data as the ground truth for evaluation
• Unfortunately, such an evaluation process inevitably favors the algorithms that suggest products similar to those of the online recommendation algorithm
• We should select the test dataset carefully to perform a fairer evaluation; a minimal sketch follows
13
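A minimal sketch of one such selection, assuming a hypothetical source field on each logged click: exclude test clicks that originated from the online recommendation panel, so the ground truth is not pre-filtered by Rorig.

```python
# Hypothetical test events; "source" marks where each click originated.
test_events = [
    {"user": "u1", "item": "p3", "source": "recommendation_panel"},
    {"user": "u1", "item": "p8", "source": "search"},
    {"user": "u2", "item": "p5", "source": "category_page"},
]

# Clicks mediated by the online recommender R_orig would make the
# evaluation favor algorithms that mimic R_orig, so we drop them.
fair_test = [e for e in test_events if e["source"] != "recommendation_panel"]
```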
Issue 3: click-through rates are a mediocre proxy for recommendation revenue
14
CTR vs recommendation revenue
15
[Scatter plot: CTR (x-axis) vs. recommendation revenue (y-axis)]
• Based on ~1 year of logs
• The coefficient of determination (R²) is only 0.089
§ A weak positive relationship
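As a sketch of the metric itself, the coefficient of determination can be computed from daily (CTR, revenue) pairs; the numbers below are illustrative, not the original data.

```python
# Illustrative daily (CTR, revenue) pairs; not the original log data.
ctr = [0.012, 0.015, 0.011, 0.018, 0.014]
revenue = [1020.0, 980.0, 1110.0, 1250.0, 990.0]

n = len(ctr)
mean_x = sum(ctr) / n
mean_y = sum(revenue) / n

# Pearson correlation r, then R^2 = r ** 2.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ctr, revenue))
var_x = sum((x - mean_x) ** 2 for x in ctr)
var_y = sum((y - mean_y) ** 2 for y in revenue)
r = cov / (var_x * var_y) ** 0.5
print(f"R^2 = {r ** 2:.3f}")  # the talk reports R^2 = 0.089 on ~1 year of logs
```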
Lessons learned
• Comparing recommendation algorithms by user-centric metrics (e.g., CTR) may fail to capture the business owner’s satisfaction (e.g., revenue)
• Unfortunately, studies on recommender systems mostly perform comparisons based on user-centric metrics
• Even if a recommendation algorithm attracts many clicks, we cannot be sure that it will bring a large amount of revenue to the website
16
Issue 4: evaluating recommendation revenue is not straightforward
17
Comparing the number of purchases
[Two time-series panels, 1/25 – 2/18: total orders (top) and orders via recommendation (bottom)]
18
Green line: the channel with a recommendation panel
Blue line: the channel without a recommendation panel
Lessons learned
• Although a recommendation module may help users discover their needs, these users, even without the recommendations, may still be able to locate the desired products through other processes
• It is not clear whether a recommendation module brings extra purchases or simply redirects users from other purchasing processes to the recommendations
• A/B testing might be necessary; a minimal sketch follows
19
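A minimal A/B-testing sketch with hypothetical counts: compare the purchase rates of a channel with the recommendation panel against a channel without it, using a two-proportion z-test.

```python
from math import erf, sqrt

# Hypothetical counts: (purchases, visits) per channel.
purchases_a, visits_a = 530, 20000  # channel with the recommendation panel
purchases_b, visits_b = 490, 20000  # channel without the panel

p_a = purchases_a / visits_a
p_b = purchases_b / visits_b
p_pool = (purchases_a + purchases_b) / (visits_a + visits_b)

# Two-proportion z-test: does the panel bring *extra* purchases,
# or does it merely redirect purchases from other processes?
se = sqrt(p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
print(f"z = {z:.2f}, p = {p_value:.3f}")
```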
Discussion
• We discussed 4 pitfalls in training and evaluating recommender systems
• The first two issues arise from biased data collection for the training and test datasets
• The third issue concerns the proper selection of evaluation metrics
• The fourth issue distinguishes the extra purchases from the redirected purchases of recommender systems
20
21
• Hung-Hsuan Chen
• https://www.ncu.edu.tw/~hhchen/
Questions?