Spark Technology Center
Oct 27, 2016
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch
Jean-François Puget
Nick Pentreath
About
§ @JFPuget
§ Distinguished Engineer, IBM Machine Learning & Optimization
§ @MLnick
§ Principal Engineer, IBM Spark Technology Center
§ Apache Spark PMC
Agenda
§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark & Elasticsearch?
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models
§ Demo
Recommender Systems & the ML Workflow
Recommender Systems
Overview
The Machine Learning Workflow
Perception
Data → ??? → Machine Learning → ??? → $$$
The Machine Learning Workflow
Reality
Data
• Historical
• Streaming
Ingest
Data Processing
• Feature transformation & engineering
Model Training
• Model selection & evaluation
Deploy
• Pipelines, not just models
• Versioning
Live System
• Predict given new data
• Monitoring & live evaluation
Feedback Loop
Spark DataFrames
Spark ML
Various ???
Stream (Kafka)
Missing piece!
The Machine Learning Workflow
Recommender Version
Data Ingest
Data Processing
• Aggregation
• Handle implicit data
Model Training
• ALS
• Ranking-style evaluation
Deploy
• Model size & complexity
Live System
• User & item recommendations
• Monitoring, filters
Feedback => another Event Type
Spark DataFrames
Spark ML
Elasticsearch
• User & Item Metadata
• Events
Elasticsearch
Stream (Kafka)
Data Modeling for Recommender Systems
User and Item Metadata
Data model
User and Item Metadata
System Requirements
Filtering & Grouping
Business Rules
Anatomy of a User Event
User Interactions
Implicit preference data
• Page view
• eCommerce - cart, purchase
• Media - preview, watch, listen
Intent data
• Search query
Explicit preference data
• Rating
• Review
Social network interactions
• Like
• Share
• Follow
Anatomy of a User Event
Data model
Anatomy of a User Event
How to handle implicit feedback?
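One common way to handle implicit feedback is to aggregate each user's events per item into a single preference weight that a model can consume. The following is a minimal sketch in plain Python, not the presenters' implementation: the event types, the weights in `EVENT_WEIGHTS`, and the function name `aggregate_events` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weights per event type (assumption, not from the slides).
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(events):
    """Sum weighted events into one implicit preference per (user, item)."""
    prefs = defaultdict(float)
    for user_id, item_id, event_type in events:
        # Event types without a configured weight contribute nothing.
        prefs[(user_id, item_id)] += EVENT_WEIGHTS.get(event_type, 0.0)
    return dict(prefs)


events = [
    ("u1", "i1", "view"),
    ("u1", "i1", "view"),
    ("u1", "i1", "purchase"),
    ("u2", "i1", "cart"),
    ("u2", "i2", "search"),
]
prefs = aggregate_events(events)
```

The aggregated weight is treated as a confidence signal, not as an explicit rating; more views or a purchase simply increase the confidence that the user likes the item.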
Why Spark & Elasticsearch?
Why Spark?
DataFrames
§ Events & metadata are “lightly structured” data
§ Suited to DataFrames
§ Pluggable external data source support
Spark ML
§ Spark ML pipelines
§ Scalable ALS algorithm, supporting implicit feedback & NMF
§ Cross-validation
§ Custom transformers & algorithms
Why Elasticsearch?
Storage
§ Native JSON
§ Scalable
§ Good support for time-series / event data
§ Kibana for data visualisation
§ Integration with Spark DataFrames
Scoring
§ Full-text search
§ Filtering
§ Aggregations (grouping)
§ Search ~== recommendation (more later)
Spark ML for Collaborative Filtering
Collaborative Filtering
Matrix Factorization
[Figure: a sparse ratings matrix alongside factor matrices with three latent factors per row]
Collaborative Filtering
Prediction
[Figure: the same sparse ratings matrix and factor matrices, used for prediction]
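In a matrix factorization model, a predicted preference is the dot product of a user factor vector and an item factor vector, and recommendations are the items with the highest predicted values. A small sketch, assuming made-up factor values (not the numbers from the slide) and hypothetical helper names `predict` and `recommend`:

```python
def predict(user_factors, item_factors):
    """Predicted preference: dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def recommend(user_factors, items, k=2):
    """Return the ids of the k items with the highest predicted preference."""
    ranked = sorted(items, key=lambda i: predict(user_factors, items[i]), reverse=True)
    return ranked[:k]


# Hypothetical factors with three latent dimensions each.
user = [1.0, 0.5, -1.0]
items = {
    "a": [2.0, 0.0, 1.0],
    "b": [0.5, 2.0, 0.0],
    "c": [-1.0, 1.0, 1.0],
}
top_items = recommend(user, items)
```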
Alternating Least Squares
Loading Data
Alternating Least Squares
Implicit Preference Data
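The two steps above can be sketched with PySpark roughly as follows. This is an illustration under assumptions, not the code shown in the talk: the column names, the `events` resource name, and the hyperparameter values are hypothetical, and reading from Elasticsearch assumes the elasticsearch-hadoop connector is on the Spark classpath.

```python
# Hypothetical column names and hyperparameters; adjust to the real schema.
ALS_PARAMS = {
    "userCol": "userId",      # ALS expects numeric user ids
    "itemCol": "itemId",      # ALS expects numeric item ids
    "ratingCol": "weight",    # aggregated implicit preference weight
    "implicitPrefs": True,    # treat weights as confidence, not ratings
    "alpha": 1.0,             # confidence scaling for implicit data
    "rank": 20,               # number of latent factors
    "regParam": 0.01,
}


def load_events(spark, resource="events"):
    """Read events into a DataFrame through the Elasticsearch data source."""
    return spark.read.format("org.elasticsearch.spark.sql").load(resource)


def train_als(events_df):
    """Fit an ALS model on (userId, itemId, weight) rows."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.ml.recommendation import ALS

    return ALS(**ALS_PARAMS).fit(events_df)
```

With `implicitPrefs=True`, ALS optimizes the implicit-feedback formulation, in which `alpha` controls how strongly a larger weight increases confidence in a user's preference.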
Deploying & Scoring Recommendation Models
Prelude: Search
Full-text Search & Similarity
[Figure: the query “cat videos” passes through Analysis, Term vectors, Scoring (similarity) and Ranking (sort results); query and documents are shown as binary term vectors]
Recommendation
Can we use the same machinery?
[Figure: the same pipeline (Analysis, Term vectors, Scoring via similarity, Ranking via sorted results) with a user (or item) vector 1.2 ⋯ −0.2 0.3 in place of the text query; one step is marked “?”]
Dot product & cosine similarity
… the same as we need for recommendations!
Elasticsearch Term Vectors
Delimited Payload Filter
Raw vector
1.2 ⋯ −0.2 0.3
Term vector with payloads
0|1.2 ⋯ 3|-0.2 4|0.3
Custom analyzer
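The encoding above can be sketched as a small helper that turns a raw factor vector into `index|value` tokens, together with index settings for a custom analyzer. This is a hedged illustration: the names `to_payload_string`, `payload_analyzer` and `payload_filter` are hypothetical, and the token filter type is `delimited_payload_filter` in older Elasticsearch releases but `delimited_payload` in newer ones.

```python
def to_payload_string(vector):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}|{v}" for i, v in enumerate(vector))


# Sketch of index settings (analyzer and filter names are assumptions).
INDEX_SETTINGS = {
    "analysis": {
        "analyzer": {
            "payload_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",   # one token per 'index|value' pair
                "filter": ["payload_filter"],
            }
        },
        "filter": {
            "payload_filter": {
                "type": "delimited_payload_filter",
                "delimiter": "|",            # splits the term from its payload
                "encoding": "float",         # store the payload as a float
            }
        },
    }
}

encoded = to_payload_string([1.2, -0.2, 0.3])
```

After analysis, each vector index becomes a term and the corresponding factor value becomes that term's payload.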
Elasticsearch Scoring
Custom scoring function
• Native script (Java), compiled for speed
• Scoring function computes dot product by:
§ For each document vector index (“term”), retrieve payload
§ score += payload * query(i)
• Normalize with query vector norm and document vector norm for cosine similarity (“similar items”)
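The scoring logic described above can be illustrated in a few lines. The actual scoring function in the talk is a native Java script running inside Elasticsearch; the Python below is only a sketch of the same arithmetic, with hypothetical function names and a payload-encoded string standing in for the stored term vector.

```python
import math


def parse_payloads(term_vector):
    """Turn '0|1.2 1|-0.2' into {0: 1.2, 1: -0.2}."""
    payloads = {}
    for token in term_vector.split():
        index, value = token.split("|")
        payloads[int(index)] = float(value)
    return payloads


def score(term_vector, query, cosine=False):
    """Dot product of document payloads and query; optionally cosine-normalized."""
    payloads = parse_payloads(term_vector)
    dot = 0.0
    for i, payload in payloads.items():
        dot += payload * query[i]  # score += payload * query(i)
    if not cosine:
        return dot
    doc_norm = math.sqrt(sum(p * p for p in payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # avoid division by zero for empty vectors
    return dot / (doc_norm * query_norm)
```

Using a user vector as the query and the plain dot product gives item recommendations for that user; using an item vector with cosine normalization gives “similar items”.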
Recommendation
Can we use the same machinery?
[Figure: the search pipeline applied to recommendation: the user (or item) vector 1.2 ⋯ −0.2 0.3 goes through the delimited payload filter at the Analysis step, the stored factor vectors act as term vectors, the custom scoring function performs Scoring, and Ranking sorts the results]
Elasticsearch Scoring
We get search engine functionality for free!
Alternating Least Squares
Deploying to Elasticsearch
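Deployment can be sketched as writing the trained ALS factor vectors to Elasticsearch in payload-encoded form. This is an assumption-laden illustration, not the code from the talk: the `items` resource name, the `factor` field name and the helper names are hypothetical, and the write assumes the elasticsearch-hadoop connector is available (older Elasticsearch versions expect an `index/type` resource).

```python
def to_payload_string(features):
    """Encode a factor vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}|{v}" for i, v in enumerate(features))


def export_item_factors(model, resource="items"):
    """Write ALS item factors to Elasticsearch as payload-encoded strings."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    encode = udf(to_payload_string, StringType())
    # ALSModel.itemFactors has the columns 'id' and 'features'.
    docs = model.itemFactors.select("id", encode("features").alias("factor"))
    (docs.write
         .format("org.elasticsearch.spark.sql")
         .option("es.mapping.id", "id")           # item id becomes the document id
         .option("es.write.operation", "upsert")  # update existing item documents
         .mode("append")
         .save(resource))
```

Upserting by item id keeps the factor vector alongside the item metadata that is already indexed, so filters and business rules can be applied in the same query that scores the vectors.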
Monitoring & Feedback
Demo
References
Elasticsearch
Elasticsearch Spark Integration
Spark ML ALS for Collaborative Filtering
Collaborative Filtering for Implicit Feedback Datasets
Elasticsearch Term Vectors & Payloads
Delimited Payload Filter
Vector Scoring Plugin
Kibana
Thanks!
https://github.com/MLnick/sseu16-meetup
https://github.com/MLnick/elasticsearch-vector-scoring
