Fast ALS-based matrix factorization for explicit and implicit feedback datasets
István Pilászy, Dávid Zibriczky, Domonkos Tikk
Gravity R&D Ltd., www.gravityrd.com
28 September 2010
Collaborative filtering
Problem setting: a sparse user-item rating matrix in which only a few ratings are known (example cells 5, 4, 3, 4, 4, 2, 4, 1 shown on the slide).
Ridge Regression
Ridge Regression: optimal solution w* = (X^T X + λI)^(-1) X^T y
Ridge Regression: computing the optimal solution w* = (X^T X + λI)^(-1) X^T y requires inverting a K x K matrix, which is costly (on the order of K^3 operations). Sum of squared errors of the optimal solution: 0.055.
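As a reference point, here is a minimal numpy sketch of the closed-form ridge solution described above; X, y and lam are illustrative names for the example matrix, the targets and the regularization weight, not identifiers from the slides.

import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y."""
    K = X.shape[1]
    A = X.T @ X + lam * np.eye(K)        # K x K matrix; solving it costs O(K^3)
    return np.linalg.solve(A, X.T @ y)   # solve is cheaper and more stable than an explicit inverse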
RR1: RR with coordinate descent. Idea: optimize only one variable of w at a time. Start with zero. Sum of squared errors: 24.6
RR1: start with zero, then optimize w1. Sum of squared errors: 7.5
RR1: start with zero, then optimize w1, then w2. Sum of squared errors: 6.2
RR1: start with zero, then optimize w1, then w2, then w3. Sum of squared errors: 5.7
RR1: … then w4. Sum of squared errors: 5.4
RR1: … then w5. Sum of squared errors: 5.0
RR1: … then w1 again. Sum of squared errors: 3.4
RR1: … then w2 again. Sum of squared errors: 2.9
RR1: … then w3 again. Sum of squared errors: 2.7
RR1: … after a while the sum of squared errors reaches 0.055, no remarkable difference from the exact RR solution. Cost: n examples, e epochs, i.e. about e·n·K operations.
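A hedged sketch of RR1 in this spirit: coordinate descent on the ridge objective, keeping a residual vector so that each single-weight update costs O(n); one pass over the K weights is one epoch. Function and argument names are illustrative.

def rr1(X, y, lam, w=None, epochs=1):
    """Coordinate-descent ridge regression (RR1): optimize one weight at a time."""
    n, K = X.shape
    w = np.zeros(K) if w is None else w.copy()   # start from zero, or warm-start from a given w
    r = y - X @ w                                # current residual
    for _ in range(epochs):
        for k in range(K):
            xk = X[:, k]
            r = r + xk * w[k]                    # remove w[k]'s contribution from the residual
            w[k] = xk @ r / (xk @ xk + lam)      # refit the single weight w[k]
            r = r - xk * w[k]                    # put the new contribution back
    return w

With n examples and e epochs this takes roughly e·n·K scalar operations, versus the O(K^3 + n·K^2) cost of the closed form.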
Matrix factorization: the rating matrix R of size (M x N) is approximated as the product of two lower-rank matrices, R ≈ P Q^T. P: user feature matrix of size (M x K). Q: item (movie) feature matrix of size (N x K). K: number of features.
Matrix factorization for explicit feedback: a small numeric example of P, Q and the resulting approximation of the known ratings in R (figure on the slide).
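To make the shapes concrete, a tiny illustrative numpy sketch of the approximation (random P and Q, made-up sizes):

M, N, K = 4, 5, 2         # users, items, features (illustrative)
P = np.random.rand(M, K)  # user feature matrix
Q = np.random.rand(N, K)  # item feature matrix
R_hat = P @ Q.T           # predicted rating of user u for item i is P[u] @ Q[i]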
Finding P and Q: initialize Q randomly, then find p1. The slide's figure shows the example Q, user 1's known ratings in R, and the still unknown row p1 of P.
Finding p1 with RR. Optimal solution: p1 = (Q(1)^T Q(1) + λI)^(-1) Q(1)^T r1, where Q(1) holds the feature vectors of the items user 1 rated and r1 her ratings.
Finding p1 with RR: on the example data the optimal p1 is (2.3, 3.2) (figure on the slide).
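A sketch of this single step, reusing the ridge_closed_form helper from above; the item indices and ratings are made up for illustration.

rated_items = [0, 2, 3]                    # items user 1 rated (illustrative indices)
X = Q[rated_items]                         # one example row per rated item
y = np.array([5.0, 4.0, 3.0])              # user 1's ratings of those items (illustrative)
p1 = ridge_closed_form(X, y, lam=0.1)      # recomputed feature vector of user 1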
Alternating Least Squares (ALS): initialize Q randomly, then repeat: recompute P (compute p1 with RR, compute p2 with RR, … for each user), then recompute Q (compute q1 with RR, … for each item).
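A compact sketch of this loop for explicit feedback, built on the ridge_closed_form helper above; R holds the ratings and mask marks which cells are known (both names are assumptions, not from the slides).

def als(R, mask, K, lam, n_iters):
    """Alternating Least Squares sketch: alternately refit user and item vectors with RR."""
    M, N = R.shape
    P = np.zeros((M, K))
    Q = np.random.rand(N, K)                    # initialize Q randomly
    for _ in range(n_iters):
        for u in range(M):                      # recompute P, one user at a time
            items = np.nonzero(mask[u])[0]
            P[u] = ridge_closed_form(Q[items], R[u, items], lam)
        for i in range(N):                      # recompute Q, one item at a time
            users = np.nonzero(mask[:, i])[0]
            Q[i] = ridge_closed_form(P[users], R[users, i], lam)
    return P, Q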
ALS1: ALS with RR1. ALS relies on RR: each vector is recomputed with RR, so when recomputing p1 its previously computed value is ignored. ALS1 relies on RR1 instead: it optimizes the previously computed p1 one scalar at a time, so the previous value is not lost, and RR1 is run for only one epoch. ALS is just an approximation method; so is ALS1.
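In code the change is small: inside the same two loops of the ALS sketch, the closed-form RR call is replaced by a single warm-started RR1 epoch, so the previous vector is refined rather than recomputed from scratch (using the rr1 helper above).

# ALS1: same loops as ALS, but one RR1 epoch warm-started from the previous vector.
P[u] = rr1(Q[items], R[u, items], lam, w=P[u], epochs=1)
Q[i] = rr1(P[users], R[users, i], lam, w=Q[i], epochs=1)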
Implicit feedback: R is a binary watched/not-watched matrix that is approximated by P and Q (numeric example on the slide).
Implicit feedback: IALS. The matrix is fully specified: every user either watched or did not watch every item. Zeros are less important, but still important; there are many 0s and few 1s. Recall that RR only needs X^T X and X^T y. Idea (Hu, Koren, Volinsky): consider a user who watched nothing (the null-user) and compute X^T X and X^T y for her; when recomputing p1, compare the user to the null-user based on the cached X^T X and X^T y and update them according to the differences. In this way only the number of 1s affects performance, not the number of 0s. IALS: alternating least squares with this trick.
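A hedged numpy sketch of this half-step: the null-user's X^T X is just Q^T Q, which is computed once and shared; each real user then only adds corrections for the items she watched. The confidence weighting c = 1 + alpha for watched items follows Hu, Koren and Volinsky; alpha itself is an assumption, the slide only says that zeros matter less.

def ials_recompute_P(Rb, P, Q, lam, alpha):
    """One IALS half-step: recompute all user vectors; Rb is the binary watch matrix."""
    K = Q.shape[1]
    QtQ = Q.T @ Q                                        # cached X^T X of the null-user
    for u in range(P.shape[0]):
        watched = np.nonzero(Rb[u])[0]
        Qw = Q[watched]
        A = QtQ + alpha * (Qw.T @ Qw) + lam * np.eye(K)  # correction only over watched items
        b = (1.0 + alpha) * Qw.sum(axis=0)               # X^T y: the zeros contribute nothing
        P[u] = np.linalg.solve(A, b)
    return P

The per-user cost grows with the number of watched items, not with the total number of items.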
Implicit feedback: IALS1. The RR1 trick cannot be applied here… but wait!
Implicit feedback: IALS1. X^T X is just a matrix: no matter how many items we have, its size stays (K x K). If we are lucky, we can find K items that generate this matrix. What if we are unlucky? We can still create synthetic items and assume that the null-user did not watch these K items. X^T X and X^T y remain the same if the synthetic items are created appropriately.
Implicit feedback: IALS1. Can we find a matrix Z that is small, only (K x K), such that Z^T Z = X^T X? We can, by eigenvalue decomposition.
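A sketch of that construction: since X^T X = Q^T Q is symmetric positive semi-definite, an eigendecomposition yields K synthetic item rows whose Gram matrix reproduces it exactly (function name is illustrative).

def synthetic_items(Q):
    """Return a K x K matrix Z with Z^T Z = Q^T Q, via eigenvalue decomposition."""
    G = Q.T @ Q                              # K x K, symmetric positive semi-definite
    vals, vecs = np.linalg.eigh(G)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative round-off
    return np.sqrt(vals)[:, None] * vecs.T   # row k is sqrt(eigenvalue_k) times eigenvector_k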
Implicit feedback: IALS1. If a user watched N items, we can run RR1 with N+K examples. To recompute p_u we need on the order of (N+K)·K steps (assuming 1 epoch). Is it better in practice than the O(K^3 + N·K^2) of IALS?
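A hedged sketch of the resulting per-user IALS1 step: the K synthetic items stand in for all the zeros (target 0), and each watched item is folded in as one extra example whose row and target are scaled so that X^T X and X^T y match the IALS values under the c = 1 + alpha weighting; this scaling is one possible bookkeeping, not necessarily the paper's exact one.

def ials1_user_update(p_u, Q, Z, watched, lam, alpha):
    """Refine one user vector with a single RR1 epoch over N watched + K synthetic examples."""
    c = 1.0 + alpha                                   # confidence of a watched item (assumed)
    Xw = np.sqrt(c - 1.0) * Q[watched]                # watched items, re-weighted by c - 1
    yw = np.full(len(watched), c / np.sqrt(c - 1.0))  # shifted target so X^T y matches IALS
    X = np.vstack([Z, Xw])                            # K synthetic rows replace all the zeros
    y = np.concatenate([np.zeros(Z.shape[0]), yw])
    return rr1(X, y, lam, w=p_u, epochs=1)            # warm start from the previous p_u

One epoch over these N+K examples costs on the order of (N+K)·K operations, which is exactly the comparison the slide asks about.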
Evaluation of ALS vs. ALS1: Probe10 RMSE on the Netflix Prize dataset, after 25 epochs.
Evaluation of ALS vs. ALS1: time-accuracy tradeoff.
Evaluation of IALS vs. IALS1: Average Relative Position on the test subset of a proprietary implicit feedback dataset, after 20 epochs. Lower is better.
Evaluation of IALS vs. IALS1: time-accuracy tradeoff.
Conclusions. We learned two tricks: ALS1: RR1 can be used instead of RR in ALS. IALS1: we can create a few synthetic examples to replace the many not-watched examples. ALS and IALS are approximation algorithms, so why not make them even more approximate? ALS1 and IALS1 offer a better time-accuracy tradeoff, especially when K is large; they can be 10x faster (or even 100x faster, for unrealistically large K). TODO: precision, recall, other datasets.
Thank you for your attention. Questions?
