Recommender Systems 102
Beyond the (usual) user-item matrix—implementation & results
DataScience SG Meetup Jan 2020
About me
§ Lead Data Scientist @ health-tech startup
- Early detection of preventable diseases
- Healthcare resource allocation
§ Previously: VP, Data Science @ Lazada
- E-commerce ML systems
- Facilitated integration with Alibaba
§ More at https://eugeneyan.com
RecSys
Overview
Figure 1. Obligatory (cliché) recsys representation
Definition: Use
behavior data to
predict what other
users will like based on
user/item similarity
Topics*
§ Data Acquisition, Preparation, Split, etc.
§ Conventional Baseline
§ Applying Graph and NLP approaches
* Implementation and results discussed throughout
Laying the Groundwork
Data acquisition, preparation, train-val-split, etc.
Data
Acquisition
http://jmcauley.ucsd.edu/data/amazon/links.html
{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
"related”:
{ "also_bought":[ "B00JHONN1S",
"B002BZX8Z6",
"B00D2K1M3O",
...
"B007R2RM8W"
],
"also_viewed":[ "B002BZX8Z6",
"B00JHONN1S",
"B008F0SU0Y",
...
"B00BFXLZ8M"
],
"bought_together":[ "B002BZX8Z6"
]
},
"salesRank":
{ "Toys & Games":211836
},
"brand": "Coxlures",
"categories":[
[ "Sports & Outdoors",
"Other Sports",
"Dance"
]
]
}
Parsing
json
§ Requires parsing JSON into tabular form
§ Fairly large, with the largest having 142.8
million rows and 20 GB on disk
§ Not able to load fully into RAM on a regular
laptop (16 GB RAM)
import csv
import logging

logger = logging.getLogger(__name__)

def parse_json_to_csv(read_path: str, write_path: str) -> None:
    with open(write_path, 'w', newline='') as f:
        csv_writer = csv.writer(f)
        for i, d in enumerate(parse(read_path)):  # parse() yields one dict per record in the JSON file
            if i == 0:
                csv_writer.writerow(d.keys())  # header from the first record
            csv_writer.writerow([str(v).lower() for v in d.values()])
            if (i + 1) % 10000 == 0:
                logger.info('Rows processed: {:,}'.format(i + 1))
    logger.info('Csv saved to {}'.format(write_path))
Getting
product-
pairs
§ Evaluate string and convert to dictionary
§ Get product-pairs for each relationship
§ Explode each product-pair into a row
product1 | product2 | relationship
--------------------------------------
B001T9NUFS | B003AVEU6G | also_viewed
0000031895 | B002R0FA24 | also_viewed
B007ZN5Y56 | B005C4Y4F6 | also_viewed
0000031909 | B00538F5OK | also_bought
B00CYBULSO | B00B608000 | also_bought
B004FOEEHC | B00D9C32NI | bought_together
Table 1. Product-pairs and relationships (sample)
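A minimal sketch of the explode step described above (the function name and the sample string are illustrative; the real code reads the parsed CSV):

import ast

def explode_related(asin: str, related_str: str) -> list:
    """Evaluate the stringified `related` dict and explode it into
    (product1, product2, relationship) rows."""
    related = ast.literal_eval(related_str)
    return [(asin, other, relationship)
            for relationship, others in related.items()
            for other in others]

rows = explode_related(
    '0000031852',
    "{'also_bought': ['B00JHONN1S', 'B002BZX8Z6'], 'bought_together': ['B002BZX8Z6']}",
)
# [('0000031852', 'B00JHONN1S', 'also_bought'),
#  ('0000031852', 'B002BZX8Z6', 'also_bought'),
#  ('0000031852', 'B002BZX8Z6', 'bought_together')]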
Scoring
product-
pairs
§ Simple way: Assign 1.0 if product-pair has
any/multiple relationships, 0.0 otherwise
§ My approach: Score relationships differently*
- Bought together: 1.2, Also bought: 1.0, Also viewed: 0.5
product1 | product2 | weight
--------------------------------
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.0
B004FOEEHC | B00D9C32NI | 1.2
Table 2. Product-pairs and weights (sample)
* Assume relationships are symmetrical
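A sketch of the weighting step with pandas (sample rows are from Table 1; collapsing a pair that appears under several relationships to its highest weight is my assumption, not stated above):

import pandas as pd

RELATIONSHIP_WEIGHTS = {'bought_together': 1.2, 'also_bought': 1.0, 'also_viewed': 0.5}

pairs = pd.DataFrame(
    [('B001T9NUFS', 'B003AVEU6G', 'also_viewed'),
     ('0000031909', 'B00538F5OK', 'also_bought'),
     ('B004FOEEHC', 'B00D9C32NI', 'bought_together')],
    columns=['product1', 'product2', 'relationship'],
)
pairs['weight'] = pairs['relationship'].map(RELATIONSHIP_WEIGHTS)
# If a pair shows up under several relationships, keep its highest weight (an assumption)
pairs = pairs.groupby(['product1', 'product2'], as_index=False)['weight'].max()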
                | Electronics | Books
-------------------------------------------
Unique products | 418,749     | 1,948,370
Product-pairs   | 4,005,262   | 26,595,848
Sparsity        | 0.9999      | 0.9999

Sparsity = 1 − count(nonzero elements) / count(total elements)

Table 3. Unique products and sparsity for electronics and books
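As a quick check on Table 3, the electronics sparsity can be reproduced from the counts above (treating each product-pair as one nonzero element of the product × product matrix):

n_products = 418_749
n_pairs = 4_005_262

sparsity = 1 - n_pairs / (n_products * n_products)
print(round(sparsity, 6))  # 0.999977, reported as 0.9999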
Train-Validation Split
Or how to create negative samples (at scale)
Splitting
the data
§ Random split: 2/3 train, 1/3 validation
§ Easy, right?
§ Not so fast! Our dataset only has positive
product-pairs—how do we validate?
Creating
negative
samples
§ Direct approach: Random sampling
- To create 1 million negative product-pairs, call
random 2 million times—very slow!
§ Hack: Add products to an array, shuffle, slice to
sample; re-shuffle when exhausted—fast!
products
----------
B001T9NUFS
0000031895
B007ZN5Y56
0000031909
B00CYBULSO
B004FOEEHC
Negative product-pairs 1, 2, 3, ... are read off the shuffled array two products at a time (see the sketch below)
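A sketch of the shuffle-and-slice hack with numpy (the class name, seed, and tiny product list are illustrative):

import numpy as np

class NegativeSampler:
    """Draw 'negative' products by shuffling once and slicing,
    instead of calling the random number generator per sample."""

    def __init__(self, product_ids, seed=42):
        self.products = np.array(product_ids)
        self.rng = np.random.default_rng(seed)
        self.rng.shuffle(self.products)
        self.pos = 0

    def sample(self, n):
        if self.pos + n > len(self.products):  # exhausted: re-shuffle and start over
            self.rng.shuffle(self.products)
            self.pos = 0
        batch = self.products[self.pos:self.pos + n]
        self.pos += n
        return batch

sampler = NegativeSampler(['B001T9NUFS', '0000031895', 'B007ZN5Y56',
                           '0000031909', 'B00CYBULSO', 'B004FOEEHC'])
negative_pair = sampler.sample(2)  # two products = one negative product-pair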
Matrix Factorization
Let’s start with a baseline
Batch MF
§ Common approach 1: Load matrix in
memory; apply Python package (e.g.,
scipy.svd, surprise, etc.)
§ Common approach 2: Run on cluster with
SparkML Alternating Least Squares
§ Very resource intensive!
- Is there a smarter way, given the sparse data?
Iterative
MF
§ Only load (or read from disk) product-pairs,
instead of entire matrix that contains zeros
§ Matrix factorization by iterating through
each product-pair
Iterative
MF
(numeric
labels)
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (multiply embeddings and sum)
    prediction = sum(product1_emb * product2_emb, dim=1)
    # Minimize loss
    loss = MeanSquaredErrorLoss(prediction, label)
    loss.backward()
    optimizer.step()
Iterative
MF
(binary
labels)
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (interaction term and sum)
    prediction = sig(sum(product1_emb * product2_emb, dim=1))
    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss.backward()
    optimizer.step()
Regularize!
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (interaction term and sum)
    prediction = sig(sum(product1_emb * product2_emb, dim=1))
    l2_reg = l2_lambda * sum(embedding.weight ** 2)  # `lambda` is a Python keyword, so name it l2_lambda
    # Minimize loss
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss += l2_reg
    loss.backward()
    optimizer.step()
Training
Schedule
Figure 2. Cosine Annealing training schedule
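A minimal, runnable version of the loop above with binary labels, explicit L2 regularization, and a cosine-annealing schedule (hyperparameters, sizes, and the synthetic batch are illustrative, not the settings used for the results below):

import torch
import torch.nn as nn

n_products, emb_dim, l2_lambda = 1_000, 64, 1e-5

embedding = nn.Embedding(n_products, emb_dim)
optimizer = torch.optim.Adam(embedding.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=100)
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one op

# Illustrative batch: (product1, product2) index pairs and 0/1 labels
product1 = torch.randint(0, n_products, (256,))
product2 = torch.randint(0, n_products, (256,))
label = torch.randint(0, 2, (256,)).float()

for step in range(100):
    optimizer.zero_grad()
    # Interaction term: dot product of the two product embeddings
    logits = (embedding(product1) * embedding(product2)).sum(dim=1)
    loss = loss_fn(logits, label) + l2_lambda * (embedding.weight ** 2).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()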
Results
(MF)
Binary labels
AUC-ROC = 0.8083
Time for 5 epochs = 45 min
Continuous labels
AUC-ROC = 0.9225
Time for 5 epochs = 45 min
Figure 3a and 3b. Precision recall curves for Matrix Factorization; note the "Cliff of Death" where precision drops off sharply
Learning
curve
(MF)
Figure 4. AUC-ROC across epochs for matrix factorization; each time the learning rate is
reset, the model seems to "forget", causing AUC-ROC to revert to ~0.5.
Also, a single epoch seems sufficient
Matrix Factorization + bias
Incremental improvement on the baseline
Adding
bias
§ What if a product is generally popular or
unpopular?
§ Learn a bias factor (i.e., single number for
each product)
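One way the bias factor might be wired in (a sketch; the class name and zero initialization are my assumptions):

import torch
import torch.nn as nn

class MFWithBias(nn.Module):
    def __init__(self, n_products, emb_dim):
        super().__init__()
        self.embedding = nn.Embedding(n_products, emb_dim)
        self.bias = nn.Embedding(n_products, 1)  # one scalar per product
        nn.init.zeros_(self.bias.weight)

    def forward(self, product1, product2):
        interaction = (self.embedding(product1) * self.embedding(product2)).sum(dim=1)
        # Generally popular products get a boost regardless of what they are paired with
        return interaction + self.bias(product1).squeeze(1) + self.bias(product2).squeeze(1)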
Results
(MF-bias)
Binary labels
AUC-ROC = 0.7951
Time for 5 epochs = 45 min
Continuous labels
AUC-ROC = 0.8319
Time for 5 epochs = 45 min
Figure 5a and 5b. Precision recall curves for Matrix Factorization with bias
(more "production friendly" than the MF curves above)
Off the Beaten Path
Natural language processing (“NLP”) and Graphs in RecSys
Word2Vec
§ In 2013, two seminal papers by Tomas
Mikolov on Word2Vec ("w2v")
§ Demonstrated w2v could learn semantic
and syntactic word vector representations
§ TL;DR: Converts words into numbers (arrays)
DeepWalk
§ Unsupervised learning of representations of
nodes (i.e., vertices) in a social network
§ Generate sequences from random walks
on (social) graph
§ Learn vector representations of nodes
(e.g., profiles, content)
How do
NLP and
Graphs
matter?
§ Create graph from product-pairs + weights
§ Generate sequences from graph (via
random walk)
§ Learn product embeddings (via word2vec)
§ Recommend based on embedding similarity
(e.g., cosine similarity, dot product; see the
sketch below)
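A sketch of the last step, recommending by cosine similarity over learned embeddings (the random matrix stands in for real product vectors):

import numpy as np

def top_k_similar(embeddings: np.ndarray, product_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar products by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[product_idx]  # cosine similarity to every product
    scores[product_idx] = -np.inf          # exclude the query product itself
    return np.argsort(-scores)[:k]

embeddings = np.random.rand(1_000, 128)  # illustrative: 1,000 products, 128-dim vectors
recommendations = top_k_similar(embeddings, product_idx=0, k=10)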
More groundwork
Generating graphs and sequences
Creating a
product
graph
§ We have product-pairs and weights
- These are our graph edges
§ Create a weighted graph with networkx
- Each graph edge is given a numerical weight,
instead of all edges having same weight
product1 | product2 | weight
--------------------------------
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.1
B004FOEEHC | B00D9C32NI | 1.2
Table 2. Product-pairs and weights
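A sketch of the graph construction with networkx (sample rows from Table 2; column names as above):

import networkx as nx
import pandas as pd

pairs = pd.DataFrame(
    [('B001T9NUFS', 'B003AVEU6G', 0.5),
     ('0000031909', 'B00538F5OK', 1.0),
     ('B004FOEEHC', 'B00D9C32NI', 1.2)],
    columns=['product1', 'product2', 'weight'],
)

graph = nx.Graph()  # undirected, since relationships are assumed symmetrical
graph.add_weighted_edges_from(pairs.itertuples(index=False, name=None))
# graph['B001T9NUFS']['B003AVEU6G']['weight'] -> 0.5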
Random
Walks
§ Direct approach: Traverse networkx graph
- For 10 sequences of length 10 for a starting node,
need to traverse 100 times
- 2 mil nodes for books graph = 200 mil queries
- Very slow and memory intensive
§ Hack: Work directly on transition probabilities
Random
Walks
(Nodes and
edges)
[Figure: toy graph of 5 product nodes connected by weighted edges (weights 1, 1, 1, 2, 3), encoded by the adjacency matrix below]
Random
Walks
(Weighted-
adjacency
matrix)
         | Product1 | Product2 | Product3 | Product4 | Product5
Product1 |          | 1        | 1        | 3        |
Product2 | 1        |          |          |          | 1
Product3 | 1        |          |          | 2        |
Product4 | 3        |          | 2        |          |
Product5 |          | 1        |          |          |
Random
Walks
(Transition
matrix)
         | Product1 | Product2 | Product3 | Product4 | Product5
Product1 |          | .2       | .2       | .6       |
Product2 | .5       |          |          |          | .5
Product3 | .33      |          |          | .67      |
Product4 | .6       |          | .4       |          |
Product5 |          | 1.0      |          |          |
Transition-probability(Product3) is simply the Product3 row of this matrix
B001T9NUFS B003AVEU6G B005C4Y4F6 B007ZN5Y56 ... B007ZN5Y56
0000031895 B00538F5OK B004FOEEHC B001T9NUFS ... 0000031895
B005C4Y4F6 0000031909 B00CYBULSO B003AVEU6G ... B00D9C32NI
B00CYBULSO B001T9NUFS B002R0FA24 B00CYBULSO ... B007ZN5Y56
B004FOEEHC B00CYBULSO B001T9NUFS B002R0FA24 ... B00B608000
...
0000031909 B00B608000 B00D9C32NI B00CYBULSO ... B007ZN5Y56
Each row is one random-walk sequence of length 10; there are no. of nodes (420k) × samples per node (10) ≈ 4.2 million rows in total
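A sketch of sampling walks straight from the transition probabilities (adjacency values follow the toy matrices above; at full scale this would be a sparse matrix):

import numpy as np

# Weighted adjacency for the toy 5-node graph (rows/columns = Product1..Product5)
adj = np.array([
    [0, 1, 1, 3, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 2, 0],
    [3, 0, 2, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
transition = adj / adj.sum(axis=1, keepdims=True)  # row-normalize into probabilities

def random_walk(start: int, length: int, rng: np.random.Generator) -> list:
    walk = [start]
    for _ in range(length - 1):
        walk.append(int(rng.choice(len(transition), p=transition[walk[-1]])))
    return walk

rng = np.random.default_rng(42)
walks = [random_walk(node, 10, rng) for node in range(len(transition)) for _ in range(10)]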
Pre-canned Node2Vec
Readily available open-source implementations
Node2Vec
§ Seemed to work out of the box
- Just need to provide edges
- Uses networkx and gensim under the hood
§ But very memory intensive and slow
- Could not run to completion even with 64 GB of RAM
https://github.com/aditya-grover/node2vec
Gensim Word2Vec
Using a trusted package as baseline
Gensim
w2v
§ Very easy to use
- Takes in a list of sequences
- Can be multithreaded
- CPU-only
§ Fastest to complete 5 epochs
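A minimal gensim sketch (gensim 4.x API; the toy walks and hyperparameters are illustrative):

from gensim.models import Word2Vec

walks = [['B001T9NUFS', 'B003AVEU6G', 'B005C4Y4F6'],
         ['0000031895', 'B00538F5OK', 'B004FOEEHC']]  # random-walk sequences of product IDs

model = Word2Vec(
    sentences=walks,
    vector_size=128,  # embedding dimension
    window=5,
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    min_count=1,
    workers=4,        # multithreaded, CPU-only
    epochs=5,
)
similar = model.wv.most_similar('B001T9NUFS', topn=10)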
Results
(gensim
w2v)
All products
AUC-ROC = 0.9082
Time for 5 epochs = 2.58 min
Seen products only
AUC-ROC = 0.9735
Time for 5 epochs = 2.58 min
Figure 6a and 6b. Precision recall curves for gensim.word2vec; unseen products without embeddings account for the gap between the two curves
Building w2v from Scratch
To plot learning curves and extend it
Data
Loader
§ Input sequences instead of product-pairs
§ Implements two features from w2v papers
- Subsampling of frequent words
- Negative sampling
Data
Loader
(sub-
sampling)
§ Drop out words of higher frequency
- Frequency of 0.0026 = 0.0 dropout
- Frequency of 0.00746 = 0.5 dropout
- Frequency of 1.0 = 0.977 dropout
§ Accelerated learning and improved
vectors of rare words
Dropout Prob(word) = 1 − (√(Freq(word) / 0.001) + 1) × (0.001 / Freq(word))
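The subsampling formula as a small function (t = 0.001 as in the word2vec paper); it reproduces the first two example dropouts above:

def drop_probability(freq: float, t: float = 0.001) -> float:
    """Probability of dropping a token with relative frequency `freq`."""
    p = 1 - ((freq / t) ** 0.5 + 1) * (t / freq)
    return max(0.0, p)  # clamp: very rare tokens are never dropped

print(drop_probability(0.0026))   # ~0.0
print(drop_probability(0.00746))  # ~0.5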
Data
Loader
(Negative
sampling)
§ Original skip-gram ends with SoftMax
- If vocab = 10k words, embedding dim = 128,
1.28 million weights to update—expensive!
- In RecSys, the "vocab" is in the millions
§ Negative sampling
- Only modify weights of negative pair samples
- If 6 pairs (1 pos, 5 neg) and 1 mil products, only
update 0.0006% of weights—very efficient!
PyTorch
Word2Vec
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, emb_size, emb_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)
        self.context_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)

    def forward(self, center, context, neg_context):
        # Look up embeddings for the center product, its context, and the negative samples
        emb_center = self.center_embeddings(center)
        emb_context = self.context_embeddings(context)
        emb_neg_context = self.context_embeddings(neg_context)
        # Get score for positive pairs (interaction term and sum)
        score = torch.sum(emb_center * emb_context, dim=1)
        score = -F.logsigmoid(score)
        # Get score for negative pairs (batch interaction term and sum)
        neg_score = torch.bmm(emb_neg_context, emb_center.unsqueeze(2)).squeeze()
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)
        # Return combined score
        return torch.mean(score + neg_score)
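A hedged sketch of how the SkipGram module above might be trained; the loader yielding (center, context, neg_context) index tensors is assumed, and SparseAdam pairs with the sparse=True embeddings:

import torch

model = SkipGram(emb_size=418_749, emb_dim=128)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.003)

for epoch in range(5):
    for center, context, neg_context in train_loader:  # assumed DataLoader of index tensors
        optimizer.zero_grad()
        loss = model(center, context, neg_context)  # forward() already returns the mean loss
        loss.backward()
        optimizer.step()

# After training, the center embeddings are the product vectors
product_vectors = model.center_embeddings.weight.data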
Results
(w2v)
Figure 7a and 7b. Precision recall curves for PyTorch Word2Vec
All products
AUC-ROC = 0.9554
Time for 5 epochs = 23.63 min
Seen products only
AUC-ROC = 0.9855
Time for 5 epochs = 23.63 min
Learning
curve
(w2v)
Figure 8. AUC-ROC across epochs for word2vec; a single epoch seems sufficient
Overall
results so
far
§ Improvement on gensim.word2vec and
Alibaba paper
               | All products | Seen products only
PyTorch MF     | 0.7951       | -
Gensim w2v     | 0.9082       | 0.9735
PyTorch w2v    | 0.9554       | 0.9855
Alibaba Paper* | 0.9327       | -
* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 4. AUC-ROC across various implementations
Adding side info to w2v
To help solve the cold start problem
Extending
w2v
§ For each product, we have information like
category, brand, price group, etc.
- Why not add this when learning embeddings?
§ Alibaba paper reported AUC-ROC
improvement from 0.9327 to 0.9575
B001T9NUFS -> B003AVEU6G -> B007ZN5Y56 ... -> B007ZN5Y56
Television Sound bar Lamp Standing Fan
Sony Sony Phillips Dyson
500 – 600 200 – 300 50 – 75 300 - 400
Weighting
side info
§ Two versions were implemented
§ 1: Equal-weighted average of embeddings
§ 2: Learn a weight for each embedding and
apply a weighted average (sketch below)
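A sketch of version 2, learning one weight per embedding type (module name and the softmax mixing are my assumptions; version 1 is simply embs.mean(dim=1)):

import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Combine product, category, and brand embeddings with learned weights."""

    def __init__(self, n_products, n_categories, n_brands, emb_dim):
        super().__init__()
        self.product_emb = nn.Embedding(n_products, emb_dim)
        self.category_emb = nn.Embedding(n_categories, emb_dim)
        self.brand_emb = nn.Embedding(n_brands, emb_dim)
        self.mix = nn.Parameter(torch.zeros(3))  # one learnable weight per embedding type

    def forward(self, product_idx, category_idx, brand_idx):
        embs = torch.stack([self.product_emb(product_idx),
                            self.category_emb(category_idx),
                            self.brand_emb(brand_idx)], dim=1)  # (batch, 3, emb_dim)
        weights = torch.softmax(self.mix, dim=0)                # sums to 1
        return (weights.unsqueeze(0).unsqueeze(-1) * embs).sum(dim=1)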
Learning
curve
(w2v with
side info)
Figure 9. AUC-ROC across epochs for word2vec with side information
Why
doesn’t it
work?!
§ Perhaps due to sparsity of metadata
- Of 418,749 electronics, metadata available for
162,023 (39%); Of these, brand was 51% empty
§ But I assumed the weights of the (useless)
embeddings would be learnt— ¯_(ツ)_/¯
§ An example of more data ≠ better
Why w2v > MF?
Is it skip-gram? Or sequences?
Mixing it
up to pull
it apart
§ Why does w2v perform so much better?
§ For the fun of it, let's use the MF-bias model
with sequence data (used in w2v)
Results &
learning
curve
Figure 10a and 10b. Precision recall curve and learning curve
for PyTorch MF-bias with sequences
All products
AUC-ROC = 0.9320
Time for 5 epochs = 70.39 min
Further Extensions
What Airbnb, Facebook, and Uber are doing
Embed
everything
§ Building user embeddings in the same vector
space as products (Airbnb)
- Train user embeddings based on interactions with
products (e.g., click, ignore, purchase)
§ Embed all discrete features and just learn
similarities (Facebook)
§ Graph Neural Networks for embeddings;
node neighbors as representation (Uber Eats)
Key Takeaways
Last two tables, I promise
Overall
results
(electronics)
                           | All products | Seen products only | Runtime (min)
PyTorch MF                 | 0.7951       | -                  | 45
Gensim w2v                 | 0.9082       | 0.9735             | 2.58
PyTorch w2v                | 0.9554       | 0.9855             | 23.63
PyTorch w2v with side info | NA           | NA                 | NA
PyTorch MF with sequences  | 0.9320       | -                  | 70.39
Alibaba Paper*             | 0.9327       | -                  | -
* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 5. AUC-ROC across various implementations (electronics)
Overall
results
(books)
                           | All products | Seen products only | Runtime (min)
PyTorch MF                 | 0.4996       | -                  | 1353.12
Gensim w2v                 | 0.9701       | 0.9892             | 16.24
PyTorch w2v                | 0.9775       | -                  | 122.66
PyTorch w2v with side info | NA           | NA                 | NA
PyTorch MF with sequences  | 0.7196       | -                  | 1393.08
Table 6. AUC-ROC across various implementations (books)
§ Don’t just look at numeric metrics—plot some curves!
- Especially if you need some arbitrary threshold (i.e., classification)
§ Matrix Factorization is an okay-ish baseline
§ Word2vec is a great baseline
§ Training on sequences is epic
Thank you!
eugene@eugeneyan.com
References
McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015, August). Image-based
recommendations on styles and substitutes. In Proceedings of the 38th International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 43-52). ACM.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 701-710). ACM.
Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 855-864). ACM.
References
Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., & Lee, D. L. (2018, July). Billion-scale
commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp.
839-848). ACM.
Grbovic, M., & Cheng, H. (2018, July). Real-time personalization using embeddings for search
ranking at airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (pp. 311-320). ACM.
Wu, L. Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., & Weston, J. (2018, April). Starspace:
Embed all the things!. In Thirty-Second AAAI Conference on Artificial Intelligence.
Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations,
https://eng.uber.com/uber-eats-graph-learning/, retrieved 10 Jan 2020