Recommender Systems 102
Beyond the (usual) user-item matrix—implementation & results
DataScience SG Meetup Jan 2020
About me
§ Lead Data Scientist @ health-tech startup
- Early detection of preventable diseases
- Healthcare resource allocation
§ Previously: VP, Data Science @ Lazada
- E-commerce ML systems
- Facilitated integration with Alibaba
§ More at https://eugeneyan.com
RecSys
Overview
Figure 1. Obligatory (cliché) recsys representation
Definition: Use
behavior data to
predict what other
users will like based on
user/item similarity
Topics*
§ Data Acquisition, Preparation, Split, etc.
§ Conventional Baseline
§ Applying Graph and NLP approaches
* Implementation and results discussed throughout
Laying the Groundwork
Data acquisition, preparation, train-val-split, etc.
Data
Acquisition
http://jmcauley.ucsd.edu/data/amazon/links.html
{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
"related”:
{ "also_bought":[ "B00JHONN1S",
"B002BZX8Z6",
"B00D2K1M3O",
...
"B007R2RM8W"
],
"also_viewed":[ "B002BZX8Z6",
"B00JHONN1S",
"B008F0SU0Y",
...
"B00BFXLZ8M"
],
"bought_together":[ "B002BZX8Z6"
]
},
"salesRank":
{ "Toys & Games":211836
},
"brand": "Coxlures",
"categories":[
[ "Sports & Outdoors",
"Other Sports",
"Dance"
]
]
}
Parsing
json
§ Requires parsing JSON into tabular form
§ Fairly large, with the largest having 142.8
million rows and 20 GB on disk
§ Not able to load fully into RAM on a regular
laptop (16 GB RAM)
import csv
import logging

logger = logging.getLogger(__name__)

def parse_json_to_csv(read_path: str, write_path: str) -> None:
    with open(write_path, 'w', newline='') as f:
        csv_writer = csv.writer(f)
        for i, d in enumerate(parse(read_path)):  # parse() yields one dict per record in the JSON file
            if i == 0:
                csv_writer.writerow(d.keys())  # header from the first record
            csv_writer.writerow([str(v).lower() for v in d.values()])
            if (i + 1) % 10000 == 0:
                logger.info('Rows processed: {:,}'.format(i + 1))
    logger.info('Csv saved to {}'.format(write_path))
Getting
product-
pairs
§ Evaluate string and convert to dictionary
§ Get product-pairs for each relationship
§ Explode each product-pair into a row
product1 | product2 | relationship
--------------------------------------
B001T9NUFS | B003AVEU6G | also_viewed
0000031895 | B002R0FA24 | also_viewed
B007ZN5Y56 | B005C4Y4F6 | also_viewed
0000031909 | B00538F5OK | also_bought
B00CYBULSO | B00B608000 | also_bought
B004FOEEHC | B00D9C32NI | bought_together
Table 1. Product-pairs and relationships (sample)
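A minimal sketch of the explode step described above (the function name and the sample string are illustrative; the real code reads the parsed CSV):

import ast

def explode_related(asin: str, related_str: str) -> list:
    """Evaluate the stringified `related` dict and explode it into
    (product1, product2, relationship) rows."""
    related = ast.literal_eval(related_str)
    return [(asin, other, relationship)
            for relationship, others in related.items()
            for other in others]

rows = explode_related(
    '0000031852',
    "{'also_bought': ['B00JHONN1S', 'B002BZX8Z6'], 'bought_together': ['B002BZX8Z6']}",
)
# [('0000031852', 'B00JHONN1S', 'also_bought'),
#  ('0000031852', 'B002BZX8Z6', 'also_bought'),
#  ('0000031852', 'B002BZX8Z6', 'bought_together')]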
Scoring
product-
pairs
§ Simple way: Assign 1.0 if product-pair has
any/multiple relationships, 0.0 otherwise
§ My approach: Score relationships differently*
- Bought together: 1.2, Also bought: 1.0, Also viewed: 0.5
product1 | product2 | weight
--------------------------------
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.0
B004FOEEHC | B00D9C32NI | 1.2
Table 2. Product-pairs and weights (sample)
* Assume relationships are symmetrical
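A sketch of the weighting step with pandas (sample rows are from Table 1; collapsing a pair that appears under several relationships to its highest weight is my assumption, not stated above):

import pandas as pd

RELATIONSHIP_WEIGHTS = {'bought_together': 1.2, 'also_bought': 1.0, 'also_viewed': 0.5}

pairs = pd.DataFrame(
    [('B001T9NUFS', 'B003AVEU6G', 'also_viewed'),
     ('0000031909', 'B00538F5OK', 'also_bought'),
     ('B004FOEEHC', 'B00D9C32NI', 'bought_together')],
    columns=['product1', 'product2', 'relationship'],
)
pairs['weight'] = pairs['relationship'].map(RELATIONSHIP_WEIGHTS)
# If a pair shows up under several relationships, keep its highest weight (an assumption)
pairs = pairs.groupby(['product1', 'product2'], as_index=False)['weight'].max()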
                | Electronics | Books
-------------------------------------------
Unique products | 418,749     | 1,948,370
Product-pairs   | 4,005,262   | 26,595,848
Sparsity        | 0.9999      | 0.9999

Sparsity = 1 − count(nonzero elements) / count(total elements)

Table 3. Unique products and sparsity for electronics and books
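As a quick check on Table 3, the electronics sparsity can be reproduced from the counts above (treating each product-pair as one nonzero element of the product × product matrix):

n_products = 418_749
n_pairs = 4_005_262

sparsity = 1 - n_pairs / (n_products * n_products)
print(round(sparsity, 6))  # 0.999977, reported as 0.9999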
Train-Validation Split
Or how to create negative samples (at scale)
Splitting
the data
§ Random split: 2/3 train, 1/3 validation
§ Easy, right?
§ Not so fast! Our dataset only has positive
product-pairs—how do we validate?
Creating
negative
samples
§ Direct approach: Random sampling
- To create 1 million negative product-pairs, call
random 2 million times—very slow!
§ Hack: Add products to an array, shuffle, slice to
sample; re-shuffle when exhausted—fast!
products
----------
B001T9NUFS
0000031895
B007ZN5Y56
0000031909
B00CYBULSO
B004FOEEHC
Negative product-pairs 1, 2, 3, ... are read off the shuffled array two products at a time (see the sketch below)
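A sketch of the shuffle-and-slice hack with numpy (the class name, seed, and tiny product list are illustrative):

import numpy as np

class NegativeSampler:
    """Draw 'negative' products by shuffling once and slicing,
    instead of calling the random number generator per sample."""

    def __init__(self, product_ids, seed=42):
        self.products = np.array(product_ids)
        self.rng = np.random.default_rng(seed)
        self.rng.shuffle(self.products)
        self.pos = 0

    def sample(self, n):
        if self.pos + n > len(self.products):  # exhausted: re-shuffle and start over
            self.rng.shuffle(self.products)
            self.pos = 0
        batch = self.products[self.pos:self.pos + n]
        self.pos += n
        return batch

sampler = NegativeSampler(['B001T9NUFS', '0000031895', 'B007ZN5Y56',
                           '0000031909', 'B00CYBULSO', 'B004FOEEHC'])
negative_pair = sampler.sample(2)  # two products = one negative product-pair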
Matrix Factorization
Let’s start with a baseline
Batch MF
§ Common approach 1: Load matrix in
memory; apply Python package (e.g.,
scipy.svd, surprise, etc.)
§ Common approach 2: Run on cluster with
SparkML Alternating Least Squares
§ Very resource intensive!
- Is there a smarter way, given the sparse data?
Iterative
MF
§ Only load (or read from disk) product-pairs,
instead of entire matrix that contains zeros
§ Matrix factorization by iterating through
each product-pair
Iterative
MF
(numeric
labels)
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (multiply embeddings and sum)
    prediction = sum(product1_emb * product2_emb, dim=1)
    # Minimize loss
    loss = MeanSquaredErrorLoss(prediction, label)
    loss.backward()
    optimizer.step()
Iterative
MF
(binary
labels)
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (interaction term and sum)
    prediction = sig(sum(product1_emb * product2_emb, dim=1))
    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss.backward()
    optimizer.step()
Regularize!
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (interaction term and sum)
    prediction = sig(sum(product1_emb * product2_emb, dim=1))
    l2_reg = l2_lambda * sum(embedding.weight ** 2)  # `lambda` is a Python keyword, so name it l2_lambda
    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss += l2_reg
    loss.backward()
    optimizer.step()
Training
Schedule
Figure 2. Cosine Annealing training schedule
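A minimal, runnable version of the loop above with binary labels, explicit L2 regularization, and a cosine-annealing schedule (hyperparameters, sizes, and the synthetic batch are illustrative, not the settings used for the results below):

import torch
import torch.nn as nn

n_products, emb_dim, l2_lambda = 1_000, 64, 1e-5

embedding = nn.Embedding(n_products, emb_dim)
optimizer = torch.optim.Adam(embedding.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=100)
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one op

# Illustrative batch: (product1, product2) index pairs and 0/1 labels
product1 = torch.randint(0, n_products, (256,))
product2 = torch.randint(0, n_products, (256,))
label = torch.randint(0, 2, (256,)).float()

for step in range(100):
    optimizer.zero_grad()
    # Interaction term: dot product of the two product embeddings
    logits = (embedding(product1) * embedding(product2)).sum(dim=1)
    loss = loss_fn(logits, label) + l2_lambda * (embedding.weight ** 2).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()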
Results
(MF)
Binary labels
AUC-ROC = 0.8083
Time for 5 epochs = 45 min
Continuous labels
AUC-ROC = 0.9225
Time for 5 epochs = 45 min
Figure 3a and 3b. Precision recall curves for Matrix Factorization; note the "Cliff of Death" where precision drops off sharply
Learning
curve
(MF)
Figure 4. AUC-ROC across epochs for matrix factorization; each time the learning rate is
reset, the model seems to "forget", causing AUC-ROC to revert to ~0.5.
Also, a single epoch seems sufficient
Matrix Factorization + bias
Incremental improvement on the baseline
Adding
bias
§ What if a product is generally popular or
unpopular?
§ Learn a bias factor (i.e., single number for
each product)
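One way the bias factor might be wired in (a sketch; the class name and zero initialization are my assumptions):

import torch
import torch.nn as nn

class MFWithBias(nn.Module):
    def __init__(self, n_products, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(n_products, emb_dim)
        self.bias = nn.Embedding(n_products, 1)  # one scalar per product
        nn.init.zeros_(self.bias.weight)

    def forward(self, product1, product2):
        interaction = (self.embedding(product1) * self.embedding(product2)).sum(dim=1)
        # Generally popular products get a boost regardless of what they are paired with
        return interaction + self.bias(product1).squeeze(1) + self.bias(product2).squeeze(1)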
Results
(MF-bias)
Binary labels
AUC-ROC = 0.7951
Time for 5 epochs = 45 min
Continuous labels
AUC-ROC = 0.8319
Time for 5 epochs = 45 min
Figure 5a and 5b. Precision recall curves for Matrix Factorization with bias
(more "production friendly" than the MF curves above)
Off the Beaten Path
Natural language processing (“NLP”) and Graphs in RecSys
Word2Vec
§ In 2013, two seminal papers by Tomas
Mikolov on Word2Vec ("w2v")
§ Demonstrated w2v could learn semantic
and syntactic word vector representations
§ TL;DR: Converts words into numbers (arrays)
DeepWalk
§ Unsupervised learning of representations of
nodes (i.e., vertices) in a social network
§ Generate sequences from random walks
on (social) graph
§ Learn vector representations of nodes
(e.g., profiles, content)
How do
NLP and
Graphs
matter?
§ Create graph from product-pairs + weights
§ Generate sequences from graph (via
random walk)
§ Learn product embeddings (via word2vec)
§ Recommend based on embedding similarity
(e.g., cosine similarity, dot product; see the
sketch below)
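A sketch of the last step, recommending by cosine similarity over learned embeddings (the random matrix stands in for real product vectors):

import numpy as np

def top_k_similar(embeddings: np.ndarray, product_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar products by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[product_idx]  # cosine similarity to every product
    scores[product_idx] = -np.inf          # exclude the query product itself
    return np.argsort(-scores)[:k]

embeddings = np.random.rand(1_000, 128)  # illustrative: 1,000 products, 128-dim vectors
recommendations = top_k_similar(embeddings, product_idx=0, k=10)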
More groundwork
Generating graphs and sequences
Creating a
product
graph
§ We have product-pairs and weights
- These are our graph edges
§ Create a weighted graph with networkx
- Each graph edge is given a numerical weight,
instead of all edges having same weight
product1 | product2 | weight
--------------------------------
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.1
B004FOEEHC | B00D9C32NI | 1.2
Table 2. Product-pairs and weights
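A sketch of the graph construction with networkx (sample rows from Table 2; column names as above):

import networkx as nx
import pandas as pd

pairs = pd.DataFrame(
    [('B001T9NUFS', 'B003AVEU6G', 0.5),
     ('0000031909', 'B00538F5OK', 1.0),
     ('B004FOEEHC', 'B00D9C32NI', 1.2)],
    columns=['product1', 'product2', 'weight'],
)

graph = nx.Graph()  # undirected, since relationships are assumed symmetrical
graph.add_weighted_edges_from(pairs.itertuples(index=False, name=None))
# graph['B001T9NUFS']['B003AVEU6G']['weight'] -> 0.5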
Random
Walks
§ Direct approach: Traverse networkx graph
- For 10 sequences of length 10 for a starting node,
need to traverse 100 times
- 2 mil nodes for books graph = 200 mil queries
- Very slow and memory intensive
§ Hack: Work directly on transition probabilities
Random
Walks
(Nodes and
edges)
[Figure: toy graph of 5 product nodes connected by weighted edges (weights 1, 1, 1, 2, 3), encoded by the adjacency matrix below]
Random
Walks
(Weighted-
adjacency
matrix)
         | Product1 | Product2 | Product3 | Product4 | Product5
Product1 |          | 1        | 1        | 3        |
Product2 | 1        |          |          |          | 1
Product3 | 1        |          |          | 2        |
Product4 | 3        |          | 2        |          |
Product5 |          | 1        |          |          |
Random
Walks
(Transition
matrix)
         | Product1 | Product2 | Product3 | Product4 | Product5
Product1 |          | .2       | .2       | .6       |
Product2 | .5       |          |          |          | .5
Product3 | .33      |          |          | .67      |
Product4 | .6       |          | .4       |          |
Product5 |          | 1.0      |          |          |
Transition-probability(Product3) is simply the Product3 row of this matrix
B001T9NUFS B003AVEU6G B005C4Y4F6 B007ZN5Y56 ... B007ZN5Y56
0000031895 B00538F5OK B004FOEEHC B001T9NUFS ... 0000031895
B005C4Y4F6 0000031909 B00CYBULSO B003AVEU6G ... B00D9C32NI
B00CYBULSO B001T9NUFS B002R0FA24 B00CYBULSO ... B007ZN5Y56
B004FOEEHC B00CYBULSO B001T9NUFS B002R0FA24 ... B00B608000
...
0000031909 B00B608000 B00D9C32NI B00CYBULSO ... B007ZN5Y56
Each row is one random-walk sequence of length 10; there are no. of nodes (420k) × samples per node (10) ≈ 4.2 million rows in total
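A sketch of sampling walks straight from the transition probabilities (adjacency values follow the toy matrices above; at full scale this would be a sparse matrix):

import numpy as np

# Weighted adjacency for the toy 5-node graph (rows/columns = Product1..Product5)
adj = np.array([
    [0, 1, 1, 3, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 2, 0],
    [3, 0, 2, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
transition = adj / adj.sum(axis=1, keepdims=True)  # row-normalize into probabilities

def random_walk(start: int, length: int, rng: np.random.Generator) -> list:
    walk = [start]
    for _ in range(length - 1):
        walk.append(int(rng.choice(len(transition), p=transition[walk[-1]])))
    return walk

rng = np.random.default_rng(42)
walks = [random_walk(node, 10, rng) for node in range(len(transition)) for _ in range(10)]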
Pre-canned Node2Vec
Readily available open-source implementations
Node2Vec
§ Seemed to work out of the box
- Just need to provide edges
- Uses networkx and gensim under the hood
§ But very memory intensive and slow
- Could not run to completion even with 64 GB of RAM
https://github.com/aditya-grover/node2vec
Gensim Word2Vec
Using a trusted package as baseline
Gensim
w2v
§ Very easy to use
- Takes in a list of sequences
- Can be multithreaded
- CPU-only
§ Fastest to complete 5 epochs
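A minimal gensim sketch (gensim 4.x API; the toy walks and hyperparameters are illustrative):

from gensim.models import Word2Vec

walks = [['B001T9NUFS', 'B003AVEU6G', 'B005C4Y4F6'],
         ['0000031895', 'B00538F5OK', 'B004FOEEHC']]  # random-walk sequences of product IDs

model = Word2Vec(
    sentences=walks,
    vector_size=128,  # embedding dimension
    window=5,
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    min_count=1,
    workers=4,        # multithreaded, CPU-only
    epochs=5,
)
similar = model.wv.most_similar('B001T9NUFS', topn=10)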
Results
(gensim
w2v)
All products
AUC-ROC = 0.9082
Time for 5 epochs = 2.58 min
Seen products only
AUC-ROC = 0.9735
Time for 5 epochs = 2.58 min
Figure 6a and 6b. Precision recall curves for gensim.word2vec; unseen products without embeddings account for the gap between the two curves
Building w2v from Scratch
To plot learning curves and extend it
Data
Loader
§ Input sequences instead of product-pairs
§ Implements two features from w2v papers
- Subsampling of frequent words
- Negative sampling
Data
Loader
(sub-
sampling)
§ Drop out words of higher frequency
- Frequency of 0.0026 = 0.0 dropout
- Frequency of 0.00746 = 0.5 dropout
- Frequency of 1.0 = 0.977 dropout
§ Accelerated learning and improved
vectors of rare words
Dropout Prob(word) = 1 − (√(Freq(word) / 0.001) + 1) × (0.001 / Freq(word))
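The subsampling formula as a small function (t = 0.001 as in the word2vec paper); it reproduces the first two example dropouts above:

def drop_probability(freq: float, t: float = 0.001) -> float:
    """Probability of dropping a token with relative frequency `freq`."""
    p = 1 - ((freq / t) ** 0.5 + 1) * (t / freq)
    return max(0.0, p)  # clamp: very rare tokens are never dropped

print(drop_probability(0.0026))   # ~0.0
print(drop_probability(0.00746))  # ~0.5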
Data
Loader
(Negative
sampling)
§ Original skip-gram ends with SoftMax
- If vocab = 10k words, embedding dim = 128,
1.28 million weights to update—expensive!
- In RecSys, the "vocab" is in the millions
§ Negative sampling
- Only modify weights of negative pair samples
- If 6 pairs (1 pos, 5 neg) and 1 mil products, only
update 0.0006% of weights—very efficient!
PyTorch
Word2Vec
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, emb_size, emb_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)
        self.context_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)

    def forward(self, center, context, neg_context):
        # Look up embeddings for the center product, its context, and the negative samples
        emb_center = self.center_embeddings(center)
        emb_context = self.context_embeddings(context)
        emb_neg_context = self.context_embeddings(neg_context)
        # Get score for positive pairs (interaction term and sum)
        score = torch.sum(emb_center * emb_context, dim=1)
        score = -F.logsigmoid(score)
        # Get score for negative pairs (batch interaction term and sum)
        neg_score = torch.bmm(emb_neg_context, emb_center.unsqueeze(2)).squeeze()
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)
        # Return combined score
        return torch.mean(score + neg_score)
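A hedged sketch of how the SkipGram module above might be trained; the loader yielding (center, context, neg_context) index tensors is assumed, and SparseAdam pairs with the sparse=True embeddings:

import torch

model = SkipGram(emb_size=418_749, emb_dim=128)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.003)

for epoch in range(5):
    for center, context, neg_context in train_loader:  # assumed DataLoader of index tensors
        optimizer.zero_grad()
        loss = model(center, context, neg_context)  # forward() already returns the mean loss
        loss.backward()
        optimizer.step()

# After training, the center embeddings are the product vectors
product_vectors = model.center_embeddings.weight.data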
Results
(w2v)
Figure 7a and 7b. Precision recall curves for PyTorch Word2Vec
All products
AUC-ROC = 0.9554
Time for 5 epochs = 23.63 min
Seen products only
AUC-ROC = 0.9855
Time for 5 epochs = 23.63 min
Learning
curve
(w2v)
Figure 8. AUC-ROC across epochs for word2vec; a single epoch seems sufficient
Overall
results so
far
§ Improvement on gensim.word2vec and
Alibaba paper
               | All products | Seen products only
PyTorch MF     | 0.7951       | -
Gensim w2v     | 0.9082       | 0.9735
PyTorch w2v    | 0.9554       | 0.9855
Alibaba Paper* | 0.9327       | -
* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 4. AUC-ROC across various implementations
Adding side info to w2v
To help solve the cold start problem
Extending
w2v
§ For each product, we have information like
category, brand, price group, etc.
- Why not add this when learning embeddings?
§ Alibaba paper reported AUC-ROC
improvement from 0.9327 to 0.9575
B001T9NUFS -> B003AVEU6G -> B007ZN5Y56 ... -> B007ZN5Y56
Television Sound bar Lamp Standing Fan
Sony Sony Phillips Dyson
500 – 600 200 – 300 50 – 75 300 - 400
Weighting
side info
§ Two versions were implemented
§ 1: Equal-weighted average of embeddings
§ 2: Learn a weight for each embedding and
apply a weighted average (sketch below)
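A sketch of version 2, learning one weight per embedding type (module name and the softmax mixing are my assumptions; version 1 is simply embs.mean(dim=1)):

import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Combine product, category, and brand embeddings with learned weights."""

    def __init__(self, n_products, n_categories, n_brands, emb_dim):
        super().__init__()
        self.product_emb = nn.Embedding(n_products, emb_dim)
        self.category_emb = nn.Embedding(n_categories, emb_dim)
        self.brand_emb = nn.Embedding(n_brands, emb_dim)
        self.mix = nn.Parameter(torch.zeros(3))  # one learnable weight per embedding type

    def forward(self, product_idx, category_idx, brand_idx):
        embs = torch.stack([self.product_emb(product_idx),
                            self.category_emb(category_idx),
                            self.brand_emb(brand_idx)], dim=1)  # (batch, 3, emb_dim)
        weights = torch.softmax(self.mix, dim=0)                # sums to 1
        return (weights.unsqueeze(0).unsqueeze(-1) * embs).sum(dim=1)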
Learning
curve
(w2v with
side info)
Figure 9. AUC-ROC across epochs for word2vec with side information
Why
doesn’t it
work?!
§ Perhaps due to sparsity of metadata
- Of 418,749 electronics, metadata available for
162,023 (39%); Of these, brand was 51% empty
§ But I assumed the weights of the (useless)
embeddings would be learnt— ¯_(ツ)_/¯
§ An example of more data ≠ better
Why w2v > MF?
Is it skip-gram? Or sequences?
Mixing it
up to pull
it apart
§ Why does w2v perform so much better?
§ For the fun of it, let's use the MF-bias model
with sequence data (used in w2v)
Results &
learning
curve
Figure 10a and 10b. Precision recall curve and learning curve
for PyTorch MF-bias with sequences
All products
AUC-ROC = 0.9320
Time for 5 epochs = 70.39 min
Further Extensions
What Airbnb, Facebook, and Uber are doing
Embed
everything
§ Building user embeddings in the same vector
space as products (Airbnb)
- Train user embeddings based on interactions with
products (e.g., click, ignore, purchase)
§ Embed all discrete features and just learn
similarities (Facebook)
§ Graph Neural Networks for embeddings;
node neighbors as representation (Uber Eats)
Key Takeaways
Last two tables, I promise
Overall
results
(electronics)
                           | All products | Seen products only | Runtime (min)
PyTorch MF                 | 0.7951       | -                  | 45
Gensim w2v                 | 0.9082       | 0.9735             | 2.58
PyTorch w2v                | 0.9554       | 0.9855             | 23.63
PyTorch w2v with side info | NA           | NA                 | NA
PyTorch MF with sequences  | 0.9320       | -                  | 70.39
Alibaba Paper*             | 0.9327       | -                  | -
* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 5. AUC-ROC across various implementations (electronics)
Overall
results
(books)
                           | All products | Seen products only | Runtime (min)
PyTorch MF                 | 0.4996       | -                  | 1353.12
Gensim w2v                 | 0.9701       | 0.9892             | 16.24
PyTorch w2v                | 0.9775       | -                  | 122.66
PyTorch w2v with side info | NA           | NA                 | NA
PyTorch MF with sequences  | 0.7196       | -                  | 1393.08
Table 6. AUC-ROC across various implementations (books)
§ Don’t just look at numeric metrics—plot some curves!
- Especially if you need some arbitrary threshold (i.e., classification)
§ Matrix Factorization is an okay-ish baseline
§ Word2vec is a great baseline
§ Training on sequences is epic
Thank you!
eugene@eugeneyan.com
References
McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015, August). Image-based
recommendations on styles and substitutes. In Proceedings of the 38th International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 43-52). ACM.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 701-710). ACM.
Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 855-864). ACM.
References
Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., & Lee, D. L. (2018, July). Billion-scale
commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp.
839-848). ACM.
Grbovic, M., & Cheng, H. (2018, July). Real-time personalization using embeddings for search
ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (pp. 311-320). ACM.
Wu, L. Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., & Weston, J. (2018, April). Starspace:
Embed all the things!. In Thirty-Second AAAI Conference on Artificial Intelligence.
Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations,
https://eng.uber.com/uber-eats-graph-learning/, retrieved 10 Jan 2020