Ensemble Methods
+
Recommender Systems
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 28
Apr. 29, 2019
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 9: Learning Paradigms
– Out: Wed, Apr 24
– Due: Wed, May 1 at 11:59pm
– Can only be submitted up to 3 days late,
so we can return grades before final exam
• Today’s In-Class Poll
– http://p28.mlcourse.org
2
Q&A
3
Q: In k-Means, since we don’t have a validation set, how do we
pick k?
A: Look at the training objective function as a function of k
and pick the value at the “elbow” of the curve.
Q: What if our random initialization for k-Means gives us poor
performance?
A: Do random restarts: that is, run k-means from scratch, say, 10
times and pick the run that gives the lowest training objective
function value.
The objective function is nonconvex, so we’re just looking for
the best local minimum.
[Plot: training objective J(c, z) as a function of k.]
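A minimal sketch of both answers, assuming a data matrix X and scikit-learn; KMeans's n_init performs the random restarts, and inertia_ is the training objective J(c, z).

```python
# Sketch of both answers above, assuming a data matrix X (n_samples x n_features).
# scikit-learn's KMeans already does random restarts (n_init) and exposes the
# training objective J(c, z) as inertia_.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # placeholder data, purely illustrative

objectives = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # 10 random restarts
    km.fit(X)
    objectives[k] = km.inertia_  # best (lowest) training objective over the restarts

# Pick k at the "elbow": the point where adding another cluster stops
# giving a large drop in the objective.
for k, j in objectives.items():
    print(k, j)
```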
ML Big Picture
5
Learning Paradigms:
What data is available and
when? What form of prediction?
• supervised learning
• unsupervised learning
• semi-supervised learning
• reinforcement learning
• active learning
• imitation learning
• domain adaptation
• online learning
• density estimation
• recommender systems
• feature learning
• manifold learning
• dimensionality reduction
• ensemble learning
• distant supervision
• hyperparameter optimization
Problem Formulation:
What is the structure of our output prediction?
boolean → Binary Classification
categorical → Multiclass Classification
ordinal → Ordinal Classification
real → Regression
ordering → Ranking
multiple discrete → Structured Prediction
multiple continuous → (e.g. dynamical systems)
both discrete & cont. → (e.g. mixed graphical models)
Theoretical Foundations:
What principles guide learning?
• probabilistic
• information theoretic
• evolutionary search
• ML as optimization
Facets of Building ML
Systems:
How to build systems that are
robust, efficient, adaptive,
effective?
1. Data prep
2. Model selection
3. Training (optimization /
search)
4. Hyperparameter tuning on
validation data
5. (Blind) Assessment on test
data
Big Ideas in ML:
Which are the ideas driving
development of the field?
• inductive bias
• generalization / overfitting
• bias-variance decomposition
• generative vs. discriminative
• deep nets, graphical models
• PAC learning
• distant rewards
Application Areas
Key challenges?
NLP, Speech, Computer Vision, Robotics, Medicine, Search
Outline for Today
We’ll talk about two distinct topics:
1. Ensemble Methods: combine or learn multiple
classifiers into one
(i.e. a family of algorithms)
2. Recommender Systems: produce
recommendations of what a user will like
(i.e. the solution to a particular type of task)
We’ll use a prominent example of a recommender
system (the Netflix Prize) to motivate both
topics…
6
RECOMMENDER SYSTEMS
7
Recommender Systems
A Common Challenge:
– Assume you’re a company
selling items of some sort:
movies, songs, products,
etc.
– Company collects millions
of ratings from users of
their items
– To maximize profit / user
happiness, you want to
recommend items that
users are likely to want
8
Recommender Systems
9
Recommender Systems
10
Recommender Systems
11
Recommender Systems
12
Problem Setup
• 500,000 users
• 20,000 movies
• 100 million ratings
• Goal: To obtain lower root mean squared error (RMSE)
than Netflix’s existing system on 3 million held out ratings
ENSEMBLE METHODS
13
Recommender Systems
14
Top performing systems
were ensembles
Weighted Majority Algorithm
• Given: pool A of binary classifiers (that
you know nothing about)
• Data: stream of examples (i.e. online
learning setting)
• Goal: design a new learner that uses
the predictions of the pool to make
new predictions
• Algorithm:
– Initially weight all classifiers equally
– Receive a training example and predict
the (weighted) majority vote of the
classifiers in the pool
– Down-weight classifiers that contribute
to a mistake by a factor of β
15
(Littlestone & Warmuth, 1994)
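A minimal sketch of this algorithm, assuming the pool is a list of already-trained prediction functions returning +1/-1 and examples arrive one at a time; penalizing every classifier that predicts the wrong label is one common form of the down-weighting step.

```python
# Minimal sketch of the Weighted Majority Algorithm described above.
# `pool` is a list of already-trained binary classifiers (predicting +1 or -1);
# beta in (0, 1) is the down-weighting factor. Names here are illustrative.
import numpy as np

def weighted_majority(pool, stream, beta=0.5):
    w = np.ones(len(pool))                 # initially weight all classifiers equally
    mistakes = 0
    for x, y in stream:                    # online setting: examples arrive one at a time
        preds = np.array([h(x) for h in pool])
        y_hat = 1 if w[preds == 1].sum() >= w[preds == -1].sum() else -1
        if y_hat != y:
            mistakes += 1
        w[preds != y] *= beta              # down-weight classifiers that erred by a factor of beta
    return w, mistakes
```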
Weighted Majority Algorithm
17
(Littlestone & Warmuth, 1994)
Weighted Majority Algorithm
18
(Littlestone & Warmuth, 1994)
This is a “mistake bound”
of the variety we saw for
the Perceptron algorithm
ADABOOST
19
Comparison
Weighted Majority Algorithm
• an example of an
ensemble method
• assumes the classifiers are
learned ahead of time
• only learns a (majority vote)
weight for each classifier
AdaBoost
• an example of a boosting
method
• simultaneously learns:
– the classifiers themselves
– a (majority vote) weight for
each classifier
20
Toy Example
D1
weak classifiers = vertical or horizontal half-planes
AdaBoost: Toy Example
23
Slide from Schapire NIPS Tutorial
Round 1
h1: ε1 = 0.30, α1 = 0.42 → D2
AdaBoost: Toy Example
24
Slide from Schapire NIPS Tutorial
Round 2
h2: ε2 = 0.21, α2 = 0.65 → D3
AdaBoost: Toy Example
25
Slide from Schapire NIPS Tutorial
Round 3
h3: ε3 = 0.14, α3 = 0.92
AdaBoost: Toy Example
26
Slide from Schapire NIPS Tutorial
Final Classifier
H_final = sign(0.42 h1 + 0.65 h2 + 0.92 h3)
AdaBoost: Toy Example
27
Slide from Schapire NIPS Tutorial
AdaBoost
28
Given: (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}.
Initialize D_1(i) = 1/m.
For t = 1, ..., T:
  Train weak learner using distribution D_t.
  Get weak hypothesis h_t : X → {-1, +1} with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
  Choose α_t = (1/2) ln((1 - ε_t) / ε_t).
  Update:
    D_{t+1}(i) = (D_t(i) / Z_t) × exp(-α_t)  if h_t(x_i) = y_i
    D_{t+1}(i) = (D_t(i) / Z_t) × exp(α_t)   if h_t(x_i) ≠ y_i
  where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis: H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).
Figure 1: The boosting algorithm AdaBoost.
Algorithm from (Freund & Schapire, 1999)
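Below is a compact sketch of the loop in Figure 1. Using scikit-learn decision stumps as the weak learner (matching the half-plane classifiers of the toy example) and numpy-array inputs with labels in {-1, +1} are assumptions for illustration, not part of the figure.

```python
# Compact sketch of the AdaBoost loop in Figure 1, with decision stumps as the
# weak learner. Assumes X is an (n, d) numpy array and y has labels in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = max(D[pred != y].sum(), 1e-12)   # weighted training error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D *= np.exp(-alpha * y * pred)         # exp(-alpha) if correct, exp(+alpha) if wrong
        D /= D.sum()                           # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):                              # final hypothesis: sign of the weighted vote
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return H
```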
AdaBoost
30
Figure from (Freund & Schapire, 1999)
[Plot: left panel, training and test error vs. number of rounds of boosting (10 to 1000); right panel, cumulative distribution of margins (-1 to 1).]
Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as
reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves,
respectively) of the combined classifier as a function of the number of rounds of boosting. The
horizontal lines indicate the test error rate of the base classifier as well as the test error of the final
combined classifier. Right: The cumulative distribution of margins of the training examples after 5,
100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves,
respectively.
Analyzing the training error
Learning Objectives
Ensemble Methods / Boosting
You should be able to…
1. Implement the Weighted Majority Algorithm
2. Implement AdaBoost
3. Distinguish what is learned in the Weighted
Majority Algorithm vs. AdaBoost
4. Contrast the theoretical result for the
Weighted Majority Algorithm to that of
Perceptron
5. Explain a surprisingly common empirical result
regarding AdaBoost train/test curves
31
Outline
• Recommender Systems
– Content Filtering
– Collaborative Filtering (CF)
– CF: Neighborhood Methods
– CF: Latent Factor Methods
• Matrix Factorization
– Background: Low-rank Factorizations
– Residual matrix
– Unconstrained Matrix Factorization
• Optimization problem
• Gradient Descent, SGD, Alternating Least Squares
• User/item bias terms (matrix trick)
– Singular Value Decomposition (SVD)
– Non-negative Matrix Factorization
32
RECOMMENDER SYSTEMS
33
Recommender Systems
38
Problem Setup
• 500,000 users
• 20,000 movies
• 100 million ratings
• Goal: To obtain lower root mean squared error (RMSE)
than Netflix’s existing system on 3 million held out ratings
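For reference, the competition metric is simple to compute; a hypothetical snippet, where pred and true are aligned arrays of predicted and actual held-out ratings:

```python
# Root mean squared error over held-out ratings, the metric in the problem setup.
# `pred` and `true` are illustrative names for aligned arrays of ratings.
import numpy as np

def rmse(pred, true):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))
```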
Recommender Systems
39
Recommender Systems
• Setup:
– Items:
movies, songs, products, etc.
(often many thousands)
– Users:
watchers, listeners, purchasers, etc.
(often many millions)
– Feedback:
5-star ratings, not-clicking ‘next’,
purchases, etc.
• Key Assumptions:
– Can represent ratings numerically
as a user/item matrix
– Users only rate a small number of
items (the matrix is sparse)
40
Doctor Strange | Star Trek: Beyond | Zootopia
Alice 1 5
Bob 3 4
Charlie 3 5 2
Two Types of Recommender Systems
Content Filtering
• Example: Pandora.com
music recommendations
(Music Genome Project)
• Con: Assumes access to
side information about
items (e.g. properties of a
song)
• Pro: Got a new item to
add? No problem, just be
sure to include the side
information
Collaborative Filtering
• Example: Netflix movie
recommendations
• Pro: Does not assume
access to side information
about items (e.g. does not
need to know about movie
genres)
• Con: Does not work on
new items that have no
ratings
41
COLLABORATIVE FILTERING
43
Collaborative Filtering
• Everyday Examples of Collaborative Filtering...
– Bestseller lists
– Top 40 music lists
– The “recent returns” shelf at the library
– Unmarked but well-used paths thru the woods
– The printer room at work
– “Read any good books lately?”
– …
• Common insight: personal tastes are correlated
– If Alice and Bob both like X and Alice likes Y then
Bob is more likely to like Y
– especially (perhaps) if Bob knows Alice
44
Slide from William Cohen
Two Types of Collaborative Filtering
1. Neighborhood Methods 2. Latent Factor Methods
45
Figures from Koren et al. (2009)
Two Types of Collaborative Filtering
1. Neighborhood Methods
46
In the figure, assume that
a green line indicates the
movie was watched
Algorithm:
1. Find neighbors based
on similarity of movie
preferences
2. Recommend movies
that those neighbors
watched
Figures from Koren et al. (2009)
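A rough sketch of a user-user neighborhood method along these lines; the binary watch matrix, the cosine similarity measure, and the choice of k are illustrative assumptions rather than details from the slide.

```python
# Sketch of the neighborhood method outlined above, on a binary user x movie
# "watched" matrix R (1 = watched, 0 = not watched).
import numpy as np

def recommend_neighborhood(R, user, k=5, n_rec=3):
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    sims = (R / norms) @ (R[user] / norms[user])       # cosine similarity to every user
    sims[user] = -np.inf                               # exclude the user themself
    neighbors = np.argsort(sims)[-k:]                  # 1. find the k most similar users
    scores = R[neighbors].sum(axis=0).astype(float)    # 2. count the neighbors' watches
    scores[R[user] > 0] = -np.inf                      # don't re-recommend movies already seen
    return np.argsort(scores)[-n_rec:][::-1]           # indices of the top recommendations
```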
Two Types of Collaborative Filtering
2. Latent Factor Methods
47
Figures from Koren et al. (2009)
• Assume that both
movies and users
live in some low-
dimensional space
describing their
properties
• Recommend a
movie based on its
proximity to the
user in the latent
space
• Example Algorithm:
Matrix Factorization
MATRIX FACTORIZATION
48
Recommending Movies
Question:
Applied to the Netflix Prize
problem, which of the
following methods always
requires side information
about the users and movies?
Select all that apply
A. collaborative filtering
B. latent factor methods
C. ensemble methods
D. content filtering
E. neighborhood methods
F. recommender systems
49
Answer:
Matrix Factorization
• Many different ways of factorizing a matrix
• We’ll consider three:
1. Unconstrained Matrix Factorization
2. Singular Value Decomposition
3. Non-negative Matrix Factorization
• MF is just another example of a common
recipe:
1. define a model
2. define an objective function
3. optimize with SGD
50
Matrix Factorization
Whiteboard
– Background: Low-rank Factorizations
– Residual matrix
52
Example: MF for Netflix Problem
53
Figures from Aggarwal (2016)
[Figure 3.7 from Aggarwal (2016): (a) example of a rank-2 matrix factorization R ≈ U V^T of a 7-user × 6-movie ratings matrix (Nero, Julius Caesar, Cleopatra, Sleepless in Seattle, Pretty Woman, Casablanca), with user and movie factors labeled HISTORY, ROMANCE, and BOTH; (b) the corresponding residual matrix.]
Regression vs. Collaborative Filtering
54
[Figure: (a) Classification/regression: designated independent variables, a dependent variable, and a clear demarcation between training rows and test rows. (b) Collaborative filtering: no demarcation between dependent and independent variables, and none between training and test rows.]
Figures from Aggarwal (2016)
Regression | Collaborative Filtering
UNCONSTRAINED MATRIX
FACTORIZATION
55
Unconstrained Matrix Factorization
Whiteboard
– Optimization problem
– SGD
– SGD with Regularization
– Alternating Least Squares
– User/item bias terms (matrix trick)
56
Unconstrained Matrix Factorization
In-Class Exercise
Derive a block coordinate descent algorithm
for the Unconstrained Matrix Factorization
problem.
57
• User vectors: w_u ∈ R^r
• Item vectors: h_i ∈ R^r
• Rating prediction: v_ui = w_u^T h_i
• Set of non-missing entries: Z
• Objective: argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2
Matrix Factorization
(with matrices)
• User vectors: (W_u)^T ∈ R^r (the u-th row of W)
• Item vectors: H_i ∈ R^r (the i-th column of H)
• Rating prediction: V_ui = W_u H_i = [WH]_ui
58
Figures from Koren et al. (2009) and Gemulla et al. (2011)
(with matrices)
Matrix Factorization
(with vectors)
• User vectors: w_u ∈ R^r
• Item vectors: h_i ∈ R^r
• Rating prediction: v_ui = w_u^T h_i
59
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Set of non-missing entries: Z
• Objective: argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2
60
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Regularized Objective:
  argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2 + λ( Σ_i ||h_i||^2 + Σ_u ||w_u||^2 )
• SGD update for random (u,i):
61
Figures from Koren et al. (2009)
Matrix Factorization
(with vectors)
• Regularized Objective:
  argmin_{W,H} Σ_{(u,i) ∈ Z} (v_ui - w_u^T h_i)^2 + λ( Σ_i ||h_i||^2 + Σ_u ||w_u||^2 )
• SGD update for random (u,i):
  e_ui = v_ui - w_u^T h_i
  w_u ← w_u + γ (e_ui h_i - λ w_u)
  h_i ← h_i + γ (e_ui w_u - λ h_i)
62
Figures from Koren et al. (2009)
Matrix Factorization
(with matrices)
• User vectors: (W_u)^T ∈ R^r (the u-th row of W)
• Item vectors: H_i ∈ R^r (the i-th column of H)
• Rating prediction: V_ui = W_u H_i = [WH]_ui
63
Figures from Koren et al. (2009) and Gemulla et al. (2011)
Matrix Factorization
(with matrices)
• SGD
[Figure from Gemulla et al. (2011): “Matrix factorization as SGD - why does this work?”, illustrating the SGD update and its step size.]
64
Figures from Koren et al. (2009)
Matrix Factorization
65
Example Factors
[Figure 3 from Koren et al. (2009): the first two vectors from a matrix decomposition of the Netflix Prize data, plotted as Factor vector 1 vs. Factor vector 2. Selected movies are placed at the appropriate spot based on their factor vectors in two dimensions. The plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.]
Matrix Factorization
66
ALS = alternating least squares
Comparison of Optimization Algorithms
Figure from Gemulla et al. (2011)
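Since the figure compares SGD against ALS, here is a rough sketch of ALS for the same regularized objective: with H held fixed, each w_u has a closed-form ridge-regression solution, and vice versa. The ratings matrix V, the boolean observation mask M, and the hyperparameters below are illustrative assumptions.

```python
# Sketch of alternating least squares (ALS) for the regularized MF objective.
# V is the ratings matrix and M is a boolean mask of observed entries.
import numpy as np

def mf_als(V, M, rank=10, lam=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = V.shape
    W = 0.1 * rng.standard_normal((n_users, rank))
    H = 0.1 * rng.standard_normal((n_items, rank))
    reg = lam * np.eye(rank)
    for _ in range(iters):
        for u in range(n_users):                   # update each user vector with H fixed
            obs = M[u]
            A = H[obs].T @ H[obs] + reg
            W[u] = np.linalg.solve(A, H[obs].T @ V[u, obs])
        for i in range(n_items):                   # update each item vector with W fixed
            obs = M[:, i]
            A = W[obs].T @ W[obs] + reg
            H[i] = np.linalg.solve(A, W[obs].T @ V[obs, i])
    return W, H
```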
SVD FOR COLLABORATIVE
FILTERING
67
Singular Value Decomposition
for Collaborative Filtering
69
Theorem: If R is fully observed and there is no regularization, the optimal UV^T from SVD equals the optimal UV^T from Unconstrained MF.
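A small numeric illustration of the theorem, assuming a fully observed toy matrix R and rank r = 2; the truncated SVD yields the best rank-r fit, which is what unregularized, fully observed Unconstrained MF would also recover. The data below is random and purely illustrative.

```python
# Rank-r approximation of a fully observed matrix via truncated SVD.
import numpy as np

R = np.random.rand(7, 6)        # a fully observed (toy) ratings matrix
r = 2

U, s, Vt = np.linalg.svd(R, full_matrices=False)
U_r = U[:, :r] * s[:r]          # fold the top-r singular values into the user factors
V_r = Vt[:r]                    # item factors
R_hat = U_r @ V_r               # best rank-r approximation of R

print(np.linalg.norm(R - R_hat, 'fro'))   # residual error of the rank-r fit
```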
NON-NEGATIVE MATRIX
FACTORIZATION
70
Implicit Feedback Datasets
• What information does a five-star rating contain?
• Implicit Feedback Datasets:
– In many settings, users don’t have a way of expressing dislike for an
item (e.g. can’t provide negative ratings)
– The only mechanism for feedback is to “like” something
• Examples:
– Facebook has a “Like” button, but no “Dislike” button
– Google’s “+1” button
– Pinterest pins
– Purchasing an item on Amazon indicates a preference for it, but
there are many reasons you might not purchase an item (besides
dislike)
– Search engines collect click data but don’t have a clear mechanism
for observing dislike of a webpage
71
Examples from Aggarwal (2016)
Constrained Optimization Problem:
Non-negative Matrix Factorization
72
Multiplicative Updates: a simple iterative
algorithm for solving the problem; each update just
multiplies a few entries together
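A sketch of Lee-and-Seung-style multiplicative updates for the squared-error NMF objective ||V - WH||_F^2 with W, H ≥ 0; the fully observed, squared-error setting is an assumption, since the slide does not spell out which NMF objective is used.

```python
# Multiplicative updates for non-negative matrix factorization, minimizing
# ||V - W H||_F^2 with W, H >= 0. Each update only multiplies and divides
# entries elementwise, as noted above.
import numpy as np

def nmf(V, rank=2, iters=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H <- H * (W^T V) / (W^T W H)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W <- W * (V H^T) / (W H H^T)
    return W, H
```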
Summary
• Recommender systems solve many real-
world (often large-scale) problems
• Collaborative filtering by Matrix
Factorization (MF) is an efficient and
effective approach
• MF is just another example of a common
recipe:
1. define a model
2. define an objective function
3. optimize with SGD
82