Introduction to Recommender Systems
Machine Learning 101 Tutorial

Strata + Hadoop World, NYC, 2015

Chris DuBois, Dato
Outline
• Motivation
• Fundamentals
  • Collaborative filtering
  • Content-based recommendations
  • Hybrid methods
• Practical considerations
  • Feedback
  • Evaluation
  • Tuning
  • Deployment
ML for building data products
• Products that produce and consume data.

• Products that improve as they produce and
consume data.

• Products that use data to provide a
personalized experience.

• Personalized experiences increase
engagement and retention.
Recommender systems
• Personalized experiences through
recommendations

• Recommend products, social network
connections, events, songs, and more

• Implicitly and explicitly drive many of the
experiences you’re familiar with
Recommender uses
• Netflix, Spotify, LinkedIn, and Facebook are the most
visible examples

• “You May Also Like”, “People You May Know”, “People to Follow”

• Also silently power many other experiences

• Quora/FB/Stitchfix: given interest in A, what
else might they be interested in?

• Product listings, up-sell options, etc.
Basic idea
• Data

• past behavior

• similarity between items

• current context

• Machine learning models

• Input: data about users and items

• Output: a function that provides a list of items for a given
context
[Figure: a list of films (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown) with the question “What do I recommend?”. Two approaches are labeled: Collaborative filtering and Content-based similarity.]
What data do you need?
• Required for collaborative filtering

• User identifier

• Product identifier

• Required for content-based recommendations
• Information about each item
• Further customization

• Ratings (explicit data), counts

• Side data
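
As a rough illustration of these inputs, the sketch below builds a tiny interaction log and some item metadata in plain Python. The user names, item identifiers, and metadata fields are all hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical interaction log: one (user_id, item_id) pair per event.
events = [
    ("alice", "city_of_god"),
    ("alice", "la_dolce_vita"),
    ("bob", "city_of_god"),
    ("alice", "city_of_god"),  # repeated event for the same pair
]

# Collaborative filtering needs only the two identifiers.
# Counting repeated events gives an optional implicit-strength signal.
counts = Counter(events)

# Hypothetical item metadata, needed only for content-based methods.
item_info = {
    "city_of_god": {"year": 2002, "genre": {"crime", "drama"}},
    "la_dolce_vita": {"year": 1960, "genre": {"comedy", "drama"}},
}

for (user, item), n in sorted(counts.items()):
    print(user, item, n)
```

Explicit ratings, when available, would replace or sit alongside the counts as a target column.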
Implicit data
• User × product interactions

• Consumed / used / clicked / etc.
Item-based CF: training
Item-based CF: predictions
Create a ranked list for a given user using the list of previously seen items

• For each item, i, compute the average similarity
between i and the items in the list

• Compute a list of the top N items ranked by score

Alternatives

• Incorporate ratings, e.g., cosine distance

• Other similarity measures, e.g., Pearson correlation
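
The training and prediction steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Jaccard similarity over the sets of users who interacted with each item. The function names and toy data are hypothetical, and it is not the GraphLab Create implementation.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two sets of users."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def train_item_similarity(interactions):
    """Return {item: {other_item: similarity}} from (user, item) pairs."""
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)
    items = list(users_by_item)
    return {
        i: {j: jaccard(users_by_item[i], users_by_item[j]) for j in items if j != i}
        for i in items
    }

def recommend(seen, sims, n=3):
    """Rank unseen items by their average similarity to the seen items."""
    if not seen:
        return []
    scores = {
        item: sum(row.get(s, 0.0) for s in seen) / len(seen)
        for item, row in sims.items()
        if item not in seen
    }
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]

# Toy implicit data: (user, item) pairs.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "C"), ("u3", "D"),
]
sims = train_item_similarity(interactions)
top = recommend({"A"}, sims, n=2)
print(top)
```

Swapping `jaccard` for a rating-aware measure such as cosine or Pearson changes only the similarity function. The ranking step stays the same.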
Demo!
Matrix factorization
• Treat users and products as a giant matrix
with (very) many missing values

• Users have latent factors that describe
how much they like various genres

• Items have latent factors that describe
how strongly they belong to each genre
Matrix factorization
• Turn this into a fill-in-the-missing-value
exercise by learning the latent factors

• Implicit or explicit data

• Part of the winning formula for the Netflix
Prize

• Predict ratings or rankings
[Figure, built up over several slides: a ratings matrix for four users (Alex, Bob, Alice, Barbara) and five titles (Game of Thrones, Vikings, House of Cards, True Detective, Usual Suspects). The observed ratings appear in four rows: 5 5 5 3 / 5 4 5 / 1 5 4 / 3 5 5. The positions of the missing entries were lost in extraction. Successive builds label the model parameters and the latent factors “HBO people”, “Violent historical”, and “Kevin Spacey fans”.]
Matrix factorization
Fill in the blanks
• Learn the latent factors that minimize
prediction error on the observed values

• Fill in the missing values

• Sort the list by predicted rating and recommend the unseen items
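
The three steps above can be illustrated with a small stochastic gradient descent sketch in plain Python. The ratings, the number of latent factors, and the learning-rate and regularization values are assumptions picked for this toy example. Production libraries use more elaborate solvers.

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=2000, lr=0.02, reg=0.02, seed=0):
    """Learn user (P) and item (Q) latent factors from observed ratings only."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))

def recommend_unseen(P, Q, u, seen, n_items):
    """Fill in the blanks for user u and sort unseen items by predicted rating."""
    unseen = [i for i in range(n_items) if i not in seen]
    return sorted(unseen, key=lambda i: -predict(P, Q, u, i))

# Hypothetical (user, item, rating) triples; missing pairs are the blanks.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 4), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
]
P, Q = factorize(ratings, n_users=3, n_items=4)
mse = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)
recs_user0 = recommend_unseen(P, Q, 0, seen={0, 1, 2}, n_items=4)
print(round(mse, 4), recs_user0)
```

Only the observed entries contribute to the error, which is what lets the learned factors be used to estimate the missing ones.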
Demo!
import graphlab as gl  # GraphLab Create

>>> talks
+------------+------------+-------------------------------+--------------------------------+
| date | start_time | title | tech_tags |
+------------+------------+-------------------------------+--------------------------------+
| 02/20/2015 | 10:40am | The IoT P2P Backbone | [MapReduce, Storm, Docker,... |
| 02/20/2015 | 10:40am | Practical Problems in Dete... | [Storm, Docker, Impala, R,... |
| 02/19/2015 | 1:30pm | From MapReduce to Programm... | [MapReduce, Spark, Apache,... |
| 02/19/2015 | 2:20pm | Drill into Drill: How Prov... | [JAVA, Docker, R, Hadoop, SQL] |
| 02/19/2015 | 4:50pm | Maintaining Low Latency wh... | [Apache, Hadoop, HBase, YA... |
| 02/20/2015 | 4:00pm | Top Ten Pitfalls to Avoid ... | [MapReduce, Hadoop, JAVA, ... |
| 02/20/2015 | 4:00pm | Using Data to Help Farmers... | [MapReduce, Spark, Storm, ... |
| 02/19/2015 | 1:30pm | Sears Hometown and Outlet... | [Hadoop, Spark, Docker, R,... |
| 02/20/2015 | 11:30am | Search Evolved: Unraveling... | [Docker, R, Hadoop, SQL, R... |
| 02/19/2015 | 4:00pm | Data Dexterity: Immediate ... | [Hadoop, NoSQL, Spark, Sto... |
| ... | ... | ... | ... |
+------------+------------+-------------------------------+--------------------------------+
[195 rows x 4 columns]

>>> historical
+------------+----------+------------------+---------+------------+
| date | time | user | item_id | event_type |
+------------+----------+------------------+---------+------------+
| 2015-02-12 | 07:05:37 | 809c0dc2548cbbc3 | 38825 | like |
| 2015-02-12 | 07:05:39 | 809c0dc2548cbbc3 | 38825 | like |

# Bag-of-words and TF-IDF features from each talk abstract
talks['bow'] = gl.text_analytics.count_words(talks['abstract'])
talks['tfidf'] = gl.text_analytics.tf_idf(talks['bow'])

# Nearest talks according to the TF-IDF features
nn_model = gl.nearest_neighbors.create(talks, 'id', features=['tfidf'])
nbrs = nn_model.query(talks, label='id', k=50)

# Item-similarity recommender that uses the content-based neighbors
sim_model = gl.item_similarity_recommender.create(historical, nearest=nbrs)
recs = sim_model.recommend()

>>> nn_model
Class : NearestNeighborsModel
Distance : jaccard
Method : brute force
Number of examples : 195
Number of feature columns : 1
Number of unpacked features : 5170
Total training time (seconds) : 0.0318
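
For readers without GraphLab Create, the same idea (bag of words, TF-IDF weighting, then nearest neighbors between items) can be sketched with the standard library alone. The abstracts below are invented, and cosine similarity is an assumption made for this sketch. The model summary on the slide reports a Jaccard distance instead.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each doc id to {word: count * log(N / document frequency)}."""
    bows = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(w for bow in bows.values() for w in bow)
    n = len(docs)
    return {
        d: {w: c * math.log(n / df[w]) for w, c in bow.items()}
        for d, bow in bows.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(doc_id, vectors, k=2):
    """Return the k most similar other documents, most similar first."""
    others = [(cosine(vectors[doc_id], v), d) for d, v in vectors.items() if d != doc_id]
    return [d for _, d in sorted(others, key=lambda t: (-t[0], t[1]))[:k]]

# Hypothetical talk abstracts keyed by talk id.
talks = {
    1: "streaming data with storm and kafka",
    2: "low latency streaming with storm",
    3: "deep learning for image analysis",
}
vectors = tf_idf(talks)
neighbors = nearest(1, vectors, k=1)
print(neighbors)
```

Because the neighbors depend only on item text, this approach can recommend talks that have no interaction history yet.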
Side features
• Include information about users

• Geographic, demographic, time of day,
etc.

• Include information about products

• Product subtypes, geographic
availability, etc.
Demo!
Collaborative filtering
[Figure: diagram relating Users and Items]
Content-based
[Figure: diagram relating Items and Features]
Hybrid methods
[Figure: diagram relating Users, Items, and Features]
Hybrid methods
Current approaches
• Linear model + matrix factorization
• Factorization machines with side data
• Ensembles
Downsides
• Black box
• Hard to tune
• Hard to explain
Alternatives
• Composite distance + nearest neighbors
• Directly tune the notion of distance
• Easy to explain
Hybrid methods
Benefits
• Cold start situations
• Incorporating context
[Figure: diagram relating Users, Items, and Features]
Composite distances
Features        Distance   Weight
year            Euclidean  1.0
description     Jaccard    0.5
genre           Jaccard    1.5
latent factors  cosine     1.5
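
One way to read the table is as a weighted sum of per-feature distances. The sketch below assumes that interpretation and uses the weights from the table. The two example items and the helper names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two scalars."""
    return abs(a - b)

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def cosine_distance(a, b):
    """1 - cosine similarity for two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

# (feature, distance function, weight), taken from the table above.
COMPOSITE = [
    ("year", euclidean, 1.0),
    ("description", jaccard_distance, 0.5),
    ("genre", jaccard_distance, 1.5),
    ("latent_factors", cosine_distance, 1.5),
]

def composite_distance(x, y, spec=COMPOSITE):
    """Weighted sum of the component distances."""
    return sum(weight * dist(x[f], y[f]) for f, dist, weight in spec)

# Two hypothetical items.
a = {"year": 2002, "description": {"crime", "rio"}, "genre": {"drama"},
     "latent_factors": [1.0, 0.0]}
b = {"year": 2004, "description": {"crime", "city"}, "genre": {"drama"},
     "latent_factors": [1.0, 0.0]}
d = composite_distance(a, b)
print(d)
```

In this toy example the raw year difference dominates the total, which is why the weights (or a rescaling of the features) are the natural knobs for tuning what "similar" means.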
Demo!
Feedback
Core assumption: past behavior will help predict future behavior.

Collaborative filtering data often comes from log data.

Plan ahead!

• value elicitation, e.g., like, watch, etc.

• ratings, stars, etc.

• critique, e.g., “Improve the system’s recommendations!”

• preference, e.g., “Which do you prefer?”

Preprocessing

• Item deduplication

Relationship to information retrieval

• position bias

• source of the event
Evaluating Models
[Figure: pipeline diagram with Historical Data, Trained Model, Live Data, Deployed Model, and Predictions, annotated with Offline Evaluation and Online Evaluation]
Evaluation
• Train on a portion of your data

• Test on a held-out portion

• Ratings: RMSE

• Ranking: Precision, recall

• Business metrics

• Evaluate against popularity
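
A minimal hold-out evaluation might look like the sketch below. The ratings are invented, and the global-mean predictor is only a stand-in baseline for rating prediction, in the same spirit as comparing a ranking model against popularity.

```python
import math
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical (user, item, rating) triples.
ratings = [
    ("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4), ("u2", "C", 1),
    ("u3", "B", 2), ("u3", "C", 5), ("u4", "A", 4), ("u4", "B", 3),
]
train, test = train_test_split(ratings)

# Baseline: predict the mean training rating for every held-out pair.
global_mean = sum(r for _, _, r in train) / len(train)
baseline_rmse = rmse((r, global_mean) for _, _, r in test)
print(len(train), len(test), round(baseline_rmse, 3))
```

A candidate model is worth keeping only if its held-out error beats a simple baseline like this one.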
Rankings?
• Often less concerned with predicting
precise scores

• Just want to get the first few items right

• Screen real estate is precious

• Ranking factorization recommender
Evaluation: Example
Suppose we serve a ranked list of 20 recommendations.

“relevant” == the user actually likes an item

“retrieved” == the set of recommendations

Precision@5: % of the top-5 recommendations that the user likes

Precision@20: % of all 20 recommendations that the user likes

Questions

• What if only 5 are visible?

• How do things vary based on the number of events?
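
The definitions above translate directly into code. In this sketch the item names and the set of relevant items are made up for illustration.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user likes."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# A ranked list of 20 hypothetical recommendations: i0 (best) to i19.
recommended = ["i%d" % n for n in range(20)]
# Items this user actually likes.
relevant = {"i0", "i2", "i7", "i15"}

p5 = precision_at_k(recommended, relevant, 5)
p20 = precision_at_k(recommended, relevant, 20)
print(p5, p20)
```

If only five slots are visible, Precision@5 is the number that reflects what the user sees, and it can differ a lot from Precision@20 for the same list.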
Demo!
Model parameter search
• Searching for which model performs best
at your metric

• Strategies
• grid search

• random search

• Bayesian optimization
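
The first two strategies are simple enough to sketch directly. The `score` function below is a hypothetical stand-in for training a model and measuring it on validation data, and the parameter names are assumptions for this example.

```python
import itertools
import random

def grid_search(score, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[n] for n in names))
    ]
    return max(candidates, key=score)

def random_search(score, sample, n_trials=20, seed=0):
    """Evaluate n_trials randomly sampled settings and return the best one."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n_trials)]
    return max(candidates, key=score)

def score(params):
    """Hypothetical validation score, best at 8 factors and 0.01 regularization."""
    return (-abs(params["num_factors"] - 8)
            - 100 * abs(params["regularization"] - 0.01))

grid = {"num_factors": [4, 8, 16], "regularization": [0.001, 0.01, 0.1]}
best_grid = grid_search(score, grid)

def sample(rng):
    # Sample regularization on a log scale between 1e-3 and 1e-1.
    return {"num_factors": rng.choice([4, 8, 16, 32]),
            "regularization": 10 ** rng.uniform(-3, -1)}

best_random = random_search(score, sample)
print(best_grid, best_random)
```

Grid search is exhaustive over a fixed set of values, while random search spends the same budget on a wider range. Bayesian optimization goes further by using earlier results to choose the next setting to try.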
How to choose which model?
• Select the appropriate model for your data
(implicit or explicit), decide whether you want side
features, then select hyperparameters and tune
them…

• … or let GraphLab Create do it for you and
automatically tune hyperparameters
Monitoring & Management
[Figure: pipeline diagram with Historical Data, Trained Model, Live Data, Deployed Model, and Predictions, with a Feedback loop]
Models over time
[Figure: a Deployed Model receiving Feedback along a Time axis, tracked with Offline Metrics and Online Metrics]
[Figure: Historical Data and a Predictive Service, with the labels “User activity logged”, “Request for Strata event data”, and “Personalized recommendations”]
Summary
• Motivation
• Fundamentals
  • Collaborative filtering
  • Content-based recommendations
  • Hybrid methods
• Practical considerations
  • Feedback
  • Evaluation
  • Tuning
  • Deployment
Thank you!
Email: chris@dato.com
Twitter: @chrisdubois
