Using Spark’s Machine Learning Library to Make Product Recommendations
Sorin Pește
Technology Solutions Professional, Data & AI
Microsoft
source: xkcd.com
(demo)
APACHE SPARK
A unified, distributed, open source engine for large-scale data processing
[Diagram: the Spark Core Engine, with the YARN, Mesos, and Standalone schedulers, and the libraries Spark SQL (interactive queries), Spark Streaming and Spark Structured Streaming (stream processing), Spark MLlib (machine learning), and GraphX (graph computation)]
SPARK: A BRIEF HISTORY
SPARK DATAFRAMES
A distributed collection of data that’s conceptually equivalent to a table
SPARK MACHINE LEARNING (MLLIB)
• Offers a set of parallelized machine learning algorithms
• Supports model selection (hyperparameter tuning) using cross-validation and train-validation split
• Supports Java, Scala, or Python apps through a DataFrame-based API
Enables parallel, distributed ML for large datasets on Spark clusters
SPARK MLLIB ALGORITHMS
[Diagram: Spark MLlib algorithms]
SPARK MLLIB PIPELINES
COLLABORATIVE FILTERING
User Latent Factors
Item Latent Factors
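A minimal sketch of how the two sets of latent factors combine, assuming a rank-2 model in which a predicted score is the dot product of a user's factor vector and an item's factor vector. The user names, item names, and factor values are invented for illustration and were not learned from data.

```python
# Hypothetical rank-2 latent factors; the values are illustrative only.
user_factors = {"alice": [1.0, 0.5], "bob": [0.2, 1.5]}
item_factors = {"laptop": [2.0, 1.0], "novel": [0.5, 2.0]}


def predict(user, item):
    """Predicted affinity = dot product of user and item latent factors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


# Score every (user, item) pair; a higher score suggests stronger affinity.
scores = {(u, i): predict(u, i) for u in user_factors for i in item_factors}
```

Recommending then amounts to sorting each user's unseen items by score.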
ALTERNATING LEAST SQUARES (ALS)
https://github.com/neaorin/databricks-demos/
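To make the alternating idea concrete, here is a small single-machine sketch in plain Python. It is not Spark's implementation: it assumes explicit ratings, a plain L2 penalty, and a tiny hand-rolled linear solver, and the function names (`solve`, `update`, `als`, `rmse`) and sample ratings are made up for this example. Each pass fixes the item factors and solves a regularized least-squares problem per user, then fixes the user factors and does the same per item.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(ratings_by_row, fixed, rank, reg):
    """Solve each row's factors by least squares while `fixed` stays constant."""
    out = {}
    for key, rated in ratings_by_row.items():
        # Normal equations: (sum f f^T + reg * I) x = sum r * f
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in rated.items():
            f = fixed[other]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        out[key] = solve(a, b)
    return out


def als(ratings, rank=2, max_iter=10, reg=0.1, seed=0):
    """ratings: {user: {item: rating}}. Returns (user_factors, item_factors)."""
    rng = random.Random(seed)
    by_item = {}
    for u, rated in ratings.items():
        for i, r in rated.items():
            by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(max_iter):
        users = update(ratings, items, rank, reg)  # fix items, solve users
        items = update(by_item, users, rank, reg)  # fix users, solve items
    return users, items


def rmse(ratings, users, items):
    """Root-mean-square error of the factor model on the given ratings."""
    errs = [(r - sum(p * q for p, q in zip(users[u], items[i]))) ** 2
            for u, rated in ratings.items() for i, r in rated.items()]
    return (sum(errs) / len(errs)) ** 0.5


# Illustrative ratings only.
ratings = {
    "u1": {"a": 5.0, "b": 3.0},
    "u2": {"a": 4.0, "c": 1.0},
    "u3": {"b": 2.0, "c": 5.0},
}
users, items = als(ratings, rank=2, max_iter=10, reg=0.1)
print(round(rmse(ratings, users, items), 3))
```

Because every per-user and per-item solve is independent of the others, this structure parallelizes naturally, which is what makes the approach a good fit for a cluster.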
ALS: EXPLICIT VS IMPLICIT FEEDBACK
• Explicit feedback: the user rates items
• Implicit feedback: the system records user activity
  • Browses a product page
  • Watches a movie trailer
  • Plays a song
  • Shares on social media
  • etc.
Implicit feedback is generally used in real-world implementations
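One common way to feed implicit signals into a factorization model is the weighting from Hu, Koren, and Volinsky's paper on implicit feedback datasets: an interaction count becomes a binary preference plus a confidence that grows with the count. The sketch below assumes that formulation. The function name, the event data, and the `alpha` value are illustrative assumptions.

```python
def to_preference_confidence(counts, alpha=1.0):
    """Map raw interaction counts to (preference, confidence) pairs.

    preference = 1 if the user interacted at all, else 0
    confidence = 1 + alpha * count (more activity, more trust in the signal)
    """
    return {
        key: (1.0 if c > 0 else 0.0, 1.0 + alpha * c)
        for key, c in counts.items()
    }


# Hypothetical play counts per (user, item) pair.
events = {("alice", "song_1"): 12, ("alice", "song_2"): 1, ("bob", "song_1"): 0}
weighted = to_preference_confidence(events, alpha=2.0)
```

The model then fits the preferences, weighting each squared error by its confidence, so one stray click counts for far less than repeated plays.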
ALS: HYPERPARAMETER TUNING
• Hyperparameters that can be adjusted:
  • rank = the number of latent factors in the model
  • maxIter = the maximum number of iterations
  • regParam = the regularization parameter
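Tuning these three hyperparameters usually means trying combinations and keeping the one with the lowest validation error. Below is a minimal grid-search sketch in plain Python. The candidate values are arbitrary, and `fake_validation_error` is a stand-in for "train ALS with these settings, then measure error on held-out ratings", not a real evaluation.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score).

    `evaluate` takes a dict of hyperparameters and returns a validation
    error, where lower is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative only.
param_grid = {
    "rank": [5, 10, 20],
    "maxIter": [5, 10],
    "regParam": [0.01, 0.1, 1.0],
}


def fake_validation_error(params):
    """Hypothetical stand-in for training a model and scoring held-out data."""
    return (abs(params["rank"] - 10)
            + abs(params["regParam"] - 0.1)
            + 1.0 / params["maxIter"])


best, score = grid_search(param_grid, fake_validation_error)
```

The grid here has 3 x 2 x 3 = 18 combinations. The number of models to train grows multiplicatively with each added hyperparameter, so the grid is usually kept small.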
ALS: WHAT ABOUT REAL-TIME?
• Near-real-time computation of the ALS algorithm may be infeasible
• Streaming variant of ALS, using Stochastic Gradient Descent:
  https://github.com/brkyvz/streaming-matrix-factorization
• The Oryx framework (http://oryx.io) also offers streaming ALS
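The appeal of stochastic gradient descent here is that each new rating triggers one cheap update instead of a full re-fit. The sketch below shows that idea for a generic matrix factorization model. It is not the code from the linked repository or from Oryx, and the class name, learning rate, and regularization value are assumptions chosen for the example.

```python
import random


class StreamingMF:
    """Matrix factorization updated one rating at a time with SGD."""

    def __init__(self, rank=2, lr=0.05, reg=0.01, seed=0):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.users, self.items = {}, {}

    def _factors(self, table, key):
        # Unseen users or items get small random factors on first sight.
        if key not in table:
            table[key] = [self.rng.uniform(0.0, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        p = self._factors(self.users, user)
        q = self._factors(self.items, item)
        return sum(a * b for a, b in zip(p, q))

    def update(self, user, item, rating):
        """Take one gradient step on the squared error of a single rating."""
        p = self._factors(self.users, user)
        q = self._factors(self.items, item)
        err = rating - sum(a * b for a, b in zip(p, q))
        for k in range(self.rank):
            pk, qk = p[k], q[k]  # use the old values for both updates
            p[k] += self.lr * (err * qk - self.reg * pk)
            q[k] += self.lr * (err * pk - self.reg * qk)
        return err


model = StreamingMF()
for _ in range(200):  # simulate the same event arriving repeatedly
    model.update("alice", "song_1", 4.0)
```

Only the factors of the user and item involved in the event are touched, so the cost of an update does not depend on the size of the catalog.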
BEYOND ALS
• ALS-learned latent factors can be useful as input for other algorithms
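One example of such reuse, offered as an assumption rather than something the slide specifies, is an item-to-item "similar products" lookup: treat each item's factor vector as an embedding and rank the other items by cosine similarity. The item names and factor values below are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(item, item_factors, top_n=2):
    """Rank the other items by cosine similarity of their latent factors."""
    target = item_factors[item]
    scored = [(other, cosine(target, f))
              for other, f in item_factors.items() if other != item]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical item factors, as might come out of a trained ALS model.
item_factors = {
    "laptop": [0.9, 0.1],
    "tablet": [0.8, 0.3],
    "novel": [0.1, 0.9],
}
neighbours = most_similar("laptop", item_factors)
```

The same vectors could equally serve as features for a clustering or classification model.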
DEEP LEARNING
• A set of machine learning techniques that use multiple layers of non-linear processing units to learn useful representations of the input data
DEEP LEARNING WITH SPARK
• Integrations with existing DL libraries
  • Microsoft CNTK (mmlspark)
  • TensorFlow (TensorFlowOnSpark)
  • DeepLearning4J
  • Caffe (CaffeOnSpark)
  • Keras (Elephas)
  • MXNet
  • Paddle
  • and more…
• Implementations of DL on Spark
  • BigDL
  • DeepDist
  • SparkCL
  • SparkNet
  • Deep Learning Pipelines (Databricks)
  • and more…
Distributed Hyperparameter Tuning
DEEP LEARNING FOR RECOMMENDERS
• Neural Collaborative Filtering (He et al., 2017)
  https://arxiv.org/abs/1708.05031
  https://github.com/hexiangnan/neural_collaborative_filtering
DEEP LEARNING FOR RECOMMENDERS
Recommendations as sequence prediction
• Predict the next item the user will want to interact with
Training pairs built from the interaction sequence a, b, c, d:
[a] -> b
[a, b] -> c
[a, b, c] -> d
The same pairs, left-padded with zeros to a fixed length:
[0, 0, 0, a] -> b
[0, 0, a, b] -> c
[0, a, b, c] -> d
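The padded pairs above can be generated mechanically from a session. The helper below is a sketch with a hypothetical name, `padded_examples`, and it assumes a window of 4 and `0` as the padding value, matching the slide.

```python
def padded_examples(session, window=4, pad=0):
    """Turn one interaction sequence into (left-padded prefix, next item) pairs."""
    examples = []
    for t in range(1, len(session)):
        prefix = session[max(0, t - window):t]            # most recent items only
        padded = [pad] * (window - len(prefix)) + prefix  # left-pad to fixed length
        examples.append((padded, session[t]))
    return examples


examples = padded_examples(["a", "b", "c", "d"], window=4)
```

Fixed-length inputs are what let the pairs be batched into a single tensor for a sequence model. Sessions longer than the window keep only their most recent items.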
DEEP LEARNING FOR RECOMMENDERS
Recommendations as sequence prediction
• Session-based Recommendations with Recurrent Neural Networks (Hidasi et al., 2015)
  https://arxiv.org/abs/1511.06939
  https://github.com/hidasib/GRU4Rec
DEEP LEARNING FOR RECOMMENDERS
• Featurize product images
  https://arxiv.org/pdf/1510.01784.pdf
Spark for Recommender Systems
