Ensemble Methods
+
Recommender Systems
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 28
Apr. 29, 2019
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 9: Learning Paradigms
– Out: Wed, Apr 24
– Due: Wed, May 1 at 11:59pm
– Can only be submitted up to 3 days late,
so we can return grades before final exam
• Today’s In-Class Poll
– http://p28.mlcourse.org
2
Q&A
3
Q: In k-Means, since we don’t have a validation set, how do we
pick k?
A: Look at the training objective function as a function of k
and pick the value at the “elbow” of the curve.
Q: What if our random initialization for k-Means gives us poor
performance?
A: Do random restarts: that is, run k-means from scratch, say, 10
times and pick the run that gives the lowest training objective
function value.
The objective function is nonconvex, so we’re just looking for
the best local minimum.
[Plot: training objective J(c, z) as a function of k.]
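A minimal sketch of both answers, assuming a data matrix X and scikit-learn; KMeans's n_init performs the random restarts, and inertia_ is the training objective J(c, z).

```python
# Sketch of both answers above, assuming a data matrix X (n_samples x n_features).
# scikit-learn's KMeans already does random restarts (n_init) and exposes the
# training objective J(c, z) as inertia_.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # placeholder data, purely illustrative

objectives = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # 10 random restarts
    km.fit(X)
    objectives[k] = km.inertia_  # best (lowest) training objective over the restarts

# Pick k at the "elbow": the point where adding another cluster stops
# giving a large drop in the objective.
for k, j in objectives.items():
    print(k, j)
```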
ML Big Picture
5
Learning Paradigms:
What data is available and
when? What form of prediction?
• supervised learning
• unsupervised learning
• semi-supervised learning
• reinforcement learning
• active learning
• imitation learning
• domain adaptation
• online learning
• density estimation
• recommender systems
• feature learning
• manifold learning
• dimensionality reduction
• ensemble learning
• distant supervision
• hyperparameter optimization
Problem Formulation:
What is the structure of our output prediction?
boolean → Binary Classification
categorical → Multiclass Classification
ordinal → Ordinal Classification
real → Regression
ordering → Ranking
multiple discrete → Structured Prediction
multiple continuous → (e.g. dynamical systems)
both discrete & cont. → (e.g. mixed graphical models)
Theoretical Foundations:
What principles guide learning?
• probabilistic
• information theoretic
• evolutionary search
• ML as optimization
Facets of Building ML
Systems:
How to build systems that are
robust, efficient, adaptive,
effective?
1. Data prep
2. Model selection
3. Training (optimization /
search)
4. Hyperparameter tuning on
validation data
5. (Blind) Assessment on test
data
Big Ideas in ML:
Which are the ideas driving
development of the field?
• inductive bias
• generalization / overfitting
• bias-variance decomposition
• generative vs. discriminative
• deep nets, graphical models
• PAC learning
• distant rewards
Application Areas
Key challenges?
NLP, Speech, Computer Vision, Robotics, Medicine, Search
Outline for Today
We’ll talk about two distinct topics:
1. Ensemble Methods: combine or learn multiple
classifiers into one
(i.e. a family of algorithms)
2. Recommender Systems: produce
recommendations of what a user will like
(i.e. the solution to a particular type of task)
We’ll use a prominent example of a recommender
system (the Netflix Prize) to motivate both
topics…
6
RECOMMENDER SYSTEMS
7
Recommender Systems
A Common Challenge:
– Assume you’re a company
selling items of some sort:
movies, songs, products,
etc.
– Company collects millions
of ratings from users of
their items
– To maximize profit / user
happiness, you want to
recommend items that
users are likely to want
8
Recommender Systems
9
Recommender Systems
10
Recommender Systems
11
Recommender Systems
12
Problem Setup
• 500,000 users
• 20,000 movies
• 100 million ratings
• Goal: To obtain lower root mean squared error (RMSE)
than Netflix’s existing system on 3 million held out ratings
ENSEMBLE METHODS
13
Recommender Systems
14
Top performing systems
were ensembles
Weighted Majority Algorithm
• Given: pool A of binary classifiers (that
you know nothing about)
• Data: stream of examples (i.e. online
learning setting)
• Goal: design a new learner that uses
the predictions of the pool to make
new predictions
• Algorithm:
– Initially weight all classifiers equally
– Receive a training example and predict
the (weighted) majority vote of the
classifiers in the pool
– Down-weight classifiers that contribute
to a mistake by a factor of β
15
(Littlestone & Warmuth, 1994)
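A minimal sketch of this algorithm, assuming the pool is a list of already-trained prediction functions returning +1/-1 and examples arrive one at a time; penalizing every classifier that predicts the wrong label is one common form of the down-weighting step.

```python
# Minimal sketch of the Weighted Majority Algorithm described above.
# `pool` is a list of already-trained binary classifiers (predicting +1 or -1);
# beta in (0, 1) is the down-weighting factor. Names here are illustrative.
import numpy as np

def weighted_majority(pool, stream, beta=0.5):
    w = np.ones(len(pool))                 # initially weight all classifiers equally
    mistakes = 0
    for x, y in stream:                    # online setting: examples arrive one at a time
        preds = np.array([h(x) for h in pool])
        y_hat = 1 if w[preds == 1].sum() >= w[preds == -1].sum() else -1
        if y_hat != y:
            mistakes += 1
        w[preds != y] *= beta              # down-weight classifiers that erred by a factor of beta
    return w, mistakes
```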
Weighted Majority Algorithm
17
(Littlestone & Warmuth, 1994)
Weighted Majority Algorithm
18
(Littlestone & Warmuth, 1994)
This is a “mistake bound”
of the variety we saw for
the Perceptron algorithm
ADABOOST
19
Comparison
Weighted Majority Algorithm
• an example of an
ensemble method
• assumes the classifiers are
learned ahead of time
• only learns a (majority vote)
weight for each classifier
AdaBoost
• an example of a boosting
method
• simultaneously learns:
– the classifiers themselves
– a (majority vote) weight for
each classifier
20
Toy Example
D1
weak classifiers = vertical or horizontal half-planes
AdaBoost: Toy Example
23
Slide from Schapire NIPS Tutorial
Round 1
h1: ε1 = 0.30, α1 = 0.42 → D2
AdaBoost: Toy Example
24
Slide from Schapire NIPS Tutorial
Round 2
h2: ε2 = 0.21, α2 = 0.65 → D3
AdaBoost: Toy Example
25
Slide from Schapire NIPS Tutorial
Round 3
h3: ε3 = 0.14, α3 = 0.92
AdaBoost: Toy Example
26
Slide from Schapire NIPS Tutorial
Final Classifier
H_final = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
AdaBoost: Toy Example
27
Slide from Schapire NIPS Tutorial
AdaBoost
28
Given: (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}.
Initialize D_1(i) = 1/m.
For t = 1, ..., T:
  Train weak learner using distribution D_t.
  Get weak hypothesis h_t : X → {-1, +1} with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
  Choose α_t = (1/2) ln((1 - ε_t) / ε_t).
  Update:
    D_{t+1}(i) = (D_t(i) / Z_t) × exp(-α_t)  if h_t(x_i) = y_i
    D_{t+1}(i) = (D_t(i) / Z_t) × exp(α_t)   if h_t(x_i) ≠ y_i
  where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis: H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).
Figure 1: The boosting algorithm AdaBoost.
Algorithm from (Freund & Schapire, 1999)
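Below is a compact sketch of the loop in Figure 1. Using scikit-learn decision stumps as the weak learner (matching the half-plane classifiers of the toy example) and numpy-array inputs with labels in {-1, +1} are assumptions for illustration, not part of the figure.

```python
# Compact sketch of the AdaBoost loop in Figure 1, with decision stumps as the
# weak learner. Assumes X is an (n, d) numpy array and y has labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = max(D[pred != y].sum(), 1e-12)   # weighted training error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D *= np.exp(-alpha * y * pred)         # exp(-alpha) if correct, exp(+alpha) if wrong
        D /= D.sum()                           # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):                              # final hypothesis: sign of the weighted vote
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return H
```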
AdaBoost
30
Figure from (Freund & Schapire, 1999)
[Plot: left panel, training and test error vs. number of rounds of boosting (10 to 1000); right panel, cumulative distribution of margins (-1 to 1).]
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as
reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves,
respectively) of the combined classifier as a function of the number of rounds of boosting. The
horizontal lines indicate the test error rate of the base classifier as well as the test error of the final
combined classifier. Right: The cumulative distribution of margins of the training examples after 5,
100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves,
respectively.
Analyzing the training error
Learning Objectives
Ensemble Methods / Boosting
You should be able to…
1. Implement the Weighted Majority Algorithm
2. Implement AdaBoost
3. Distinguish what is learned in the Weighted
Majority Algorithm vs. AdaBoost
4. Contrast the theoretical result for the
Weighted Majority Algorithm to that of
Perceptron
5. Explain a surprisingly common empirical result
regarding AdaBoost train/test curves
31
Outline
• Recommender Systems
– Content Filtering
– Collaborative Filtering (CF)
– CF: Neighborhood Methods
– CF: Latent Factor Methods
• Matrix Factorization
– Background: Low-rank Factorizations
– Residual matrix
– Unconstrained Matrix Factorization
• Optimization problem
• Gradient Descent, SGD, Alternating Least Squares
• User/item bias terms (matrix trick)
– Singular Value Decomposition (SVD)
– Non-negative Matrix Factorization
32
RECOMMENDER SYSTEMS
33
Recommender Systems
38
Problem Setup
• 500,000 users
• 20,000 movies
• 100 million ratings
• Goal: To obtain lower root mean squared error (RMSE)
than Netflix’s existing system on 3 million held out ratings
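For reference, the competition metric is simple to compute; a hypothetical snippet, where pred and true are aligned arrays of predicted and actual held-out ratings:

```python
# Root mean squared error over held-out ratings, the metric in the problem setup.
# `pred` and `true` are illustrative names for aligned arrays of ratings.
import numpy as np

def rmse(pred, true):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))
```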
Recommender Systems
39
Recommender Systems
• Setup:
– Items:
movies, songs, products, etc.
(often many thousands)
– Users:
watchers, listeners, purchasers, etc.
(often many millions)
– Feedback:
5-star ratings, not-clicking ‘next’,
purchases, etc.
• Key Assumptions:
– Can represent ratings numerically
as a user/item matrix
– Users only rate a small number of
items (the matrix is sparse)
40
Doctor Strange | Star Trek: Beyond | Zootopia
Alice 1 5
Bob 3 4
Charlie 3 5 2
Two Types of Recommender Systems
Content Filtering
• Example: Pandora.com
music recommendations
(Music Genome Project)
• Con: Assumes access to
side information about
items (e.g. properties of a
song)
• Pro: Got a new item to
add? No problem, just be
sure to include the side
information
Collaborative Filtering
• Example: Netflix movie
recommendations
• Pro: Does not assume
access to side information
about items (e.g. does not
need to know about movie
genres)
• Con: Does not work on
new items that have no
ratings
41
COLLABORATIVE FILTERING
43
Collaborative Filtering
• Everyday Examples of Collaborative Filtering...
– Bestseller lists
– Top 40 music lists
– The “recent returns” shelf at the library
– Unmarked but well-used paths thru the woods
– The printer room at work
– “Read any good books lately?”
– …
• Common insight: personal tastes are correlated
– If Alice and Bob both like X and Alice likes Y then
Bob is more likely to like Y
– especially (perhaps) if Bob knows Alice
44
Slide from William Cohen
Two Types of Collaborative Filtering
1. Neighborhood Methods 2. Latent Factor Methods
45
Figures from Koren et al. (2009)
Two Types of Collaborative Filtering
1. Neighborhood Methods
46
In the figure, assume that
a green line indicates the
movie was watched
Algorithm:
1. Find neighbors based
on similarity of movie
preferences
2. Recommend movies
that those neighbors
watched
Figures from Koren et al. (2009)
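A rough sketch of a user-user neighborhood method along these lines; the binary watch matrix, the cosine similarity measure, and the choice of k are illustrative assumptions rather than details from the slide.

```python
# Sketch of the neighborhood method outlined above, on a binary user x movie
# "watched" matrix R (1 = watched, 0 = not watched).
import numpy as np

def recommend_neighborhood(R, user, k=5, n_rec=3):
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    sims = (R / norms) @ (R[user] / norms[user])       # cosine similarity to every user
    sims[user] = -np.inf                               # exclude the user themself
    neighbors = np.argsort(sims)[-k:]                  # 1. find the k most similar users
    scores = R[neighbors].sum(axis=0).astype(float)    # 2. count the neighbors' watches
    scores[R[user] > 0] = -np.inf                      # don't re-recommend movies already seen
    return np.argsort(scores)[-n_rec:][::-1]           # indices of the top recommendations
```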
Two Types of Collaborative Filtering
2. Latent Factor Methods
47
Figures from Koren et al. (2009)
• Assume that both
movies and users
live in some low-
dimensional space
describing their
properties
• Recommend a
movie based on its
proximity to the
user in the latent
space
• Example Algorithm:
Matrix Factorization
MATRIX FACTORIZATION
48
Recommending Movies
Question:
Applied to the Netflix Prize
problem, which of the
following methods always
requires side information
about the users and movies?
Select all that apply
A. collaborative filtering
B. latent factor methods
C. ensemble methods
D. content filtering
E. neighborhood methods
F. recommender systems
49
Answer:
Matrix Factorization
• Many different ways of factorizing a matrix
• We’ll consider three:
1. Unconstrained Matrix Factorization
2. Singular Value Decomposition
3. Non-negative Matrix Factorization
• MF is just another example of a common
recipe:
1. define a model
2. define an objective function
3. optimize with SGD
50
Matrix Factorization
Whiteboard
– Background: Low-rank Factorizations
– Residual matrix
52
Example: MF for Netflix Problem
53
Figures from Aggarwal (2016)
[Figure 3.7 from Aggarwal (2016): (a) example of a rank-2 matrix factorization R ≈ U V^T of a 7-user × 6-movie ratings matrix (Nero, Julius Caesar, Cleopatra, Sleepless in Seattle, Pretty Woman, Casablanca), with user and movie factors labeled HISTORY, ROMANCE, and BOTH; (b) the corresponding residual matrix.]
Regression vs. Collaborative Filtering
54
[Figure: (a) Classification/regression: designated independent variables, a dependent variable, and a clear demarcation between training rows and test rows. (b) Collaborative filtering: no demarcation between dependent and independent variables, and none between training and test rows.]
Figures from Aggarwal (2016)
Regression | Collaborative Filtering
UNCONSTRAINED MATRIX
FACTORIZATION
55
Unconstrained Matrix Factorization
Whiteboard
– Optimization problem
– SGD
– SGD with Regularization
– Alternating Least Squares
– User/item bias terms (matrix trick)
56
Unconstrained Matrix Factorization
In-Class Exercise
Derive a block coordinate descent algorithm
for the Unconstrained Matrix Factorization
problem.
57
• User vectors: w_u ∈ R^r
• Item vectors: h_i ∈ R^r
• Rating prediction: v_ui = w_u^T h_i
• Set of non-missing entries: Z
• Objective: argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2
Matrix Factorization
(with matrices)
• User vectors: (W_u)^T ∈ R^r (the u-th row of W)
• Item vectors: H_i ∈ R^r (the i-th column of H)
• Rating prediction: V_ui = W_u H_i = [WH]_ui
58
Figures from Koren et al. (2009) and Gemulla et al. (2011)
(with matrices)
Matrix Factorization
(with vectors)
• User vectors: w_u ∈ R^r
• Item vectors: h_i ∈ R^r
• Rating prediction: v_ui = w_u^T h_i
59
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Set of non-missing entries: Z
• Objective: argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2
60
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Regularized Objective:
  argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2 + λ( Σ_i ||h_i||^2 + Σ_u ||w_u||^2 )
• SGD update for random (u,i):
61
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Regularized Objective:
  argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2 + λ( Σ_i ||h_i||^2 + Σ_u ||w_u||^2 )
• SGD update for random (u,i):
  e_ui = v_ui - w_u^T h_i
  w_u ← w_u + γ (e_ui h_i - λ w_u)
  h_i ← h_i + γ (e_ui w_u - λ h_i)
62
Figures from Koren et al. (2009)
Matrix Factorization
(with matrices)
• User vectors: (W_u)^T ∈ R^r (the u-th row of W)
• Item vectors: H_i ∈ R^r (the i-th column of H)
• Rating prediction: V_ui = W_u H_i = [WH]_ui
63
Figures from Koren et al. (2009) and Gemulla et al. (2011)
Matrix Factorization
(with matrices)
• SGD
[Figure from Gemulla et al. (2011): “Matrix factorization as SGD - why does this work?”, illustrating the SGD update and its step size.]
64
Figures from Koren et al. (2009)
Matrix Factorization
65
Example Factors
[Figure 3 from Koren et al. (2009): the first two vectors from a matrix decomposition of the Netflix Prize data, plotted as Factor vector 1 vs. Factor vector 2. Selected movies are placed at the appropriate spot based on their factor vectors in two dimensions. The plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.]
Matrix Factorization
66
ALS = alternating least squares
Comparison of Optimization Algorithms
Figure from Gemulla et al. (2011)
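Since the figure compares SGD against ALS, here is a rough sketch of ALS for the same regularized objective: with H held fixed, each w_u has a closed-form ridge-regression solution, and vice versa. The ratings matrix V, the boolean observation mask M, and the hyperparameters below are illustrative assumptions.

```python
# Sketch of alternating least squares (ALS) for the regularized MF objective.
# V is the ratings matrix and M is a boolean mask of observed entries.
import numpy as np

def mf_als(V, M, rank=10, lam=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = V.shape
    W = 0.1 * rng.standard_normal((n_users, rank))
    H = 0.1 * rng.standard_normal((n_items, rank))
    reg = lam * np.eye(rank)
    for _ in range(iters):
        for u in range(n_users):                   # update each user vector with H fixed
            obs = M[u]
            A = H[obs].T @ H[obs] + reg
            W[u] = np.linalg.solve(A, H[obs].T @ V[u, obs])
        for i in range(n_items):                   # update each item vector with W fixed
            obs = M[:, i]
            A = W[obs].T @ W[obs] + reg
            H[i] = np.linalg.solve(A, W[obs].T @ V[obs, i])
    return W, H
```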
SVD FOR COLLABORATIVE
FILTERING
67
Singular Value Decomposition
for Collaborative Filtering
69
Theorem: If R is fully observed and there is no regularization, the optimal UV^T from SVD equals the optimal UV^T from Unconstrained MF.
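A small numeric illustration of the theorem, assuming a fully observed toy matrix R and rank r = 2; the truncated SVD yields the best rank-r fit, which is what unregularized, fully observed Unconstrained MF would also recover. The data below is random and purely illustrative.

```python
# Rank-r approximation of a fully observed matrix via truncated SVD.
import numpy as np

R = np.random.rand(7, 6)        # a fully observed (toy) ratings matrix
r = 2

U, s, Vt = np.linalg.svd(R, full_matrices=False)
U_r = U[:, :r] * s[:r]          # fold the top-r singular values into the user factors
V_r = Vt[:r]                    # item factors
R_hat = U_r @ V_r               # best rank-r approximation of R

print(np.linalg.norm(R - R_hat, 'fro'))   # residual error of the rank-r fit
```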
NON-NEGATIVE MATRIX
FACTORIZATION
70
Implicit Feedback Datasets
• What information does a five-star rating contain?
• Implicit Feedback Datasets:
– In many settings, users don’t have a way of expressing dislike for an
item (e.g. can’t provide negative ratings)
– The only mechanism for feedback is to “like” something
• Examples:
– Facebook has a “Like” button, but no “Dislike” button
– Google’s “+1” button
– Pinterest pins
– Purchasing an item on Amazon indicates a preference for it, but
there are many reasons you might not purchase an item (besides
dislike)
– Search engines collect click data but don’t have a clear mechanism
for observing dislike of a webpage
71
Examples from Aggarwal (2016)
Constrained Optimization Problem:
Non-negative Matrix Factorization
72
Multiplicative Updates: a simple iterative
algorithm for solving the problem; each update just
multiplies a few entries together
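A sketch of Lee-and-Seung-style multiplicative updates for the squared-error NMF objective ||V - WH||_F^2 with W, H ≥ 0; the fully observed, squared-error setting is an assumption, since the slide does not spell out which NMF objective is used.

```python
# Multiplicative updates for non-negative matrix factorization, minimizing
# ||V - W H||_F^2 with W, H >= 0. Each update only multiplies and divides
# entries elementwise, as noted above.
import numpy as np

def nmf(V, rank=2, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H <- H * (W^T V) / (W^T W H)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W <- W * (V H^T) / (W H H^T)
    return W, H
```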
Summary
• Recommender systems solve many real-
world (often large-scale) problems
• Collaborative filtering by Matrix
Factorization (MF) is an efficient and
effective approach
• MF is just another example of a common
recipe:
1. define a model
2. define an objective function
3. optimize with SGD
82