Higher-order Factorization Machines
Mathieu Blondel
NTT Communication Science Laboratories
Kyoto, Japan
Joint work with M. Ishihata, A. Fujino and
N. Ueda
2016/9/28
Regression analysis
• Variables
y ∈ R: target variable
x ∈ R^d: explanatory variables (features)
• Training data
y = [y_1, . . . , y_n]^T ∈ R^n
X = [x_1, . . . , x_n] ∈ R^{d×n}
• Goal
◦ Learn model parameters
◦ Compute prediction y for a new x
Linear regression
• Model
ŷ_LR(x; w) := ⟨w, x⟩ = Σ_{j=1}^d w_j x_j
• Parameters
w ∈ R^d: feature weights
• Pros and cons
O(d) predictions
Learning w can be cast as a convex optimization problem
Does not use feature interactions
Polynomial regression
• Model
ŷ_PR(x; w, W) := ⟨w, x⟩ + x^T W x = ⟨w, x⟩ + Σ_{j,j'=1}^d w_{j,j'} x_j x_{j'}
• Parameters
w ∈ R^d: feature weights
W ∈ R^{d×d}: weight matrix
• Pros and cons
Learning w and W can be cast as a convex optimization problem
O(d^2) time and memory cost
Kernel regression
• Model
ŷ_KR(x; α) := Σ_{i=1}^n α_i K(x_i, x)
• Parameters
α ∈ R^n: instance weights
• Pros and cons
Can use non-linear kernels (RBF, polynomial, etc...)
Learning α can be cast as a convex optimization problem
O(dn) predictions (linear dependence on training set size)
Factorization Machines (FMs) (Rendle, ICDM 2010)
• Model
ŷ_FM(x; w, P) := ⟨w, x⟩ + Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}   (p̄_j: jth row of P)
• Parameters
w ∈ R^d: feature weights
P ∈ R^{d×k}: weight matrix
• Pros and cons
Takes into account feature combinations
O(2dk) predictions (linear time) instead of O(d^2) (see the sketch below)
Parameter estimation involves a non-convex optimization problem
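The O(2dk) prediction cost follows from the standard rewriting of the pairwise term, Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'} = ½(‖P^T x‖² − Σ_j ‖p̄_j‖² x_j²) (Rendle, ICDM 2010). A minimal NumPy sketch of this computation (an illustration, not the original implementation):

```python
import numpy as np

def fm_predict(x, w, P):
    """Degree-2 FM prediction in O(dk) time.

    Uses the identity
        sum_{j'>j} <p_j, p_j'> x_j x_j'
            = 0.5 * (||P^T x||^2 - sum_j ||p_j||^2 x_j^2)
    to avoid the explicit O(d^2) double sum.
    """
    linear = w @ x
    Px = P.T @ x                                          # shape (k,)
    pairwise = 0.5 * (Px @ Px - np.sum((P ** 2).T @ (x ** 2)))
    return linear + pairwise
```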
Application 1: recsys without features
• Formulate it as a matrix completion problem
          Movie 1   Movie 2   Movie 3   Movie 4
Alice        ·         ?         ·         ?
Bob          ·         ?         ·         ?
Charlie      ·         ?         ?         ·
(· = observed rating, ? = missing entry to predict)
• Matrix factorization: find U, V that approximately
reconstruct the rating matrix
R ≈ UV^T
Conversion to a regression problem
          Movie 1   Movie 2   Movie 3   Movie 4
Alice        ·         ?         ·         ?
Bob          ·         ?         ·         ?
Charlie      ·         ?         ?         ·
⇓ one-hot encoding

y = vector of observed ratings

X =
1 0 0 1 0 0 0
1 0 0 0 0 1 0
0 1 0 1 0 0 0
0 1 0 0 0 1 0
0 0 1 1 0 0 0
0 0 1 0 0 0 1

(first three columns: one-hot user indicator Alice/Bob/Charlie; last four columns: one-hot movie indicator Movie 1-4)
Using this representation, FMs are equivalent to MF! (A toy encoding sketch follows.)
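To make the conversion concrete, here is a toy NumPy sketch with made-up (user, movie, rating) triples; the indices and values are purely illustrative.

```python
import numpy as np

# Hypothetical toy data: (user index, movie index, rating) triples.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 0, 2.0), (2, 3, 5.0)]
n_users, n_movies = 3, 4

# One row per observed rating: [user one-hot | movie one-hot].
X = np.zeros((len(ratings), n_users + n_movies))
y = np.zeros(len(ratings))
for row, (u, m, r) in enumerate(ratings):
    X[row, u] = 1.0                # user block
    X[row, n_users + m] = 1.0      # movie block
    y[row] = r
```

Training an FM on (X, y) then plays the role of the matrix factorization above.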
Generalization ability of FMs
• The weight of x_j x_{j'} is ⟨p̄_j, p̄_{j'}⟩, compared to w_{j,j'} for PR
• The same parameters p̄_j are shared across the weights of x_j x_{j'} ∀j' > j
• This increases the amount of data used to estimate p̄_j, at the cost of introducing some bias (low-rank assumption)
• This makes it possible to generalize to feature interactions that were not observed in the training set
Application 2: recsys with features
Rating   User (Gender, Age)   Movie (Genre, Director)
  ·      M, 20-30             Adventure, S. Spielberg
  ·      F, 0-10              Anime, H. Miyazaki
  ·      M, 20-30             Drama, A. Kurosawa
  ...
• Interactions between categorical variables
◦ Gender × genre: {M, F} × {Adventure, Anime, Drama, ...}
◦ Age × director: {0-10, 10-20, ...} × {S. Spielberg, H. Miyazaki, A. Kurosawa, ...}
• In practice, the number of interactions can be huge!
Conversion to regression
Rating   User (Gender, Age)   Movie (Genre, Director)
  ·      M, 20-30             Adventure, S. Spielberg
  ·      F, 0-10              Anime, H. Miyazaki
  ·      M, 20-30             Drama, A. Kurosawa
  ...
⇓ one-hot encoding

y = vector of ratings

X =
1 0 0 0 1 1 0 0 . . .
0 1 1 0 0 0 1 0 . . .
1 0 0 0 1 0 0 1 . . .
. . .
very sparse
binary data!
FMs revisited (Blondel+, ICML 2016)
• ANOVA kernel of degree m = 2 (Stitson+, 1997; Vapnik, 1998)
A^2(p, x) := Σ_{j'>j} (p_j x_j)(p_{j'} x_{j'})
• Then
ŷ_FM(x; w, P) = ⟨w, x⟩ + Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}
              = ⟨w, x⟩ + Σ_{s=1}^k A^2(p_s, x)   (p_s: sth column of P)
ANOVA kernel (arbitrary-order case)
• ANOVA kernel of degree 2 ≤ m ≤ d
A^m(p, x) := Σ_{j_m > ··· > j_1} (p_{j_1} x_{j_1}) · · · (p_{j_m} x_{j_m})
(the sum is over all possible m-combinations of {1, . . . , d})
• Intuitively, the kernel uses all m-combinations of features without replacement: x_{j_1} · · · x_{j_m} for j_1 ≠ · · · ≠ j_m
• Computing A^m(p, x) naively takes O(d^m) (see the naive sketch below)
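As a reference point, the definition can be evaluated directly, in exponential O(d^m) time, with a few lines of NumPy; this naive version (not part of the paper) is useful mainly for checking the fast algorithms introduced below.

```python
import itertools
import numpy as np

def anova_naive(p, x, m):
    """Naive O(d^m) evaluation of the ANOVA kernel A^m(p, x):
    sum over all m-combinations j_1 < ... < j_m of prod_t p_{j_t} x_{j_t}."""
    z = np.asarray(p) * np.asarray(x)
    return sum(np.prod([z[j] for j in comb])
               for comb in itertools.combinations(range(len(z)), m))
```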
Higher-order FMs (HOFMs)
• Model
ŷ_HOFM(x; w, {P^t}_{t=2}^m) := ⟨w, x⟩ + Σ_{t=2}^m Σ_{s=1}^k A^t(p_s^t, x)
• Parameters
w ∈ R^d: feature weights
P^2, . . . , P^m ∈ R^{d×k}: weight matrices
• Pros and cons
Takes into account higher-order feature combinations
O(dkm^2) prediction cost using our proposed algorithms (see the sketch below)
More complex than 2nd-order FMs
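Given any routine for evaluating A^t, the HOFM prediction is a double sum over degrees and columns; a small sketch reusing the naive ANOVA evaluation above (plugging in the O(dt) DP routine of Algorithm 1 instead yields the O(dkm^2) prediction cost). Names and signatures here are illustrative, not the authors' API.

```python
def hofm_predict(x, w, P_list, anova=anova_naive):
    """HOFM prediction: linear term plus ANOVA terms of degrees 2..m.
    P_list[t - 2] is the d x k weight matrix P^t for degree t."""
    pred = w @ x
    for t, P in enumerate(P_list, start=2):
        for s in range(P.shape[1]):          # columns p_s^t of P^t
            pred += anova(P[:, s], x, t)
    return pred
```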
Learning HOFMs (1/2)
• We use alternating minimization w.r.t. w, P^2, . . . , P^m
• Learning w alone reduces to linear regression
• Learning P^m can be cast as minimizing
F(P) := (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{s=1}^k A^m(p_s, x_i) + o_i) + (β/2) ‖P‖²
where ℓ is a loss function and o_i is the contribution of degrees other than m
Learning HOFMs (2/2)
• Stochastic gradient update
p_s ← p_s − η ℓ'(y_i, ŷ_i) ∇A^m(p_s, x_i) − η β p_s
where η is a learning rate hyper-parameter, ℓ' is the derivative of the loss w.r.t. the prediction, and
ŷ_i := Σ_{s=1}^k A^m(p_s, x_i) + o_i
• We propose O(dm) (linear time) DP algorithms for
◦ Evaluating the ANOVA kernel A^m(p, x) ∈ R
◦ Computing the gradient ∇A^m(p, x) ∈ R^d
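To make the update concrete, here is a one-step sketch assuming the squared loss ℓ(y, ŷ) = ½(y − ŷ)², the loss used in the experiments later; anova and anova_grad stand for the O(dm) DP routines sketched after Algorithms 1 and 2 below, and all names are placeholders rather than the authors' API.

```python
def sgd_step_Pm(P, x_i, y_i, o_i, eta, beta, anova, anova_grad, m):
    """One stochastic gradient step on the degree-m matrix P (d x k),
    assuming the squared loss l(y, yhat) = 0.5 * (y - yhat)^2."""
    k = P.shape[1]
    y_hat = sum(anova(P[:, s], x_i, m) for s in range(k)) + o_i
    dloss = y_hat - y_i                        # l'(y_i, yhat_i) for the squared loss
    for s in range(k):
        grad = anova_grad(P[:, s], x_i, m)     # gradient of A^m(p_s, x_i), shape (d,)
        P[:, s] -= eta * dloss * grad + eta * beta * P[:, s]
    return P
```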
Evaluating the ANOVA kernel (1/3)
• Recursion (Blondel+, ICML 2016)
A^m(p, x) = A^m(p_{¬j}, x_{¬j}) + p_j x_j A^{m−1}(p_{¬j}, x_{¬j})   ∀j
where p_{¬j}, x_{¬j} ∈ R^{d−1} are vectors with the jth element removed
• We can use this recursion to remove features until computing the kernel becomes trivial (a memoized sketch follows below)
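Here is a small top-down (memoized) sketch of that recursion, always removing the last remaining feature; it assumes the base cases A^0 = 1 and A^t = 0 whenever fewer than t features remain, which is exactly the DP table shown on the next slides.

```python
from functools import lru_cache
import numpy as np

def anova_recursive(p, x, m):
    """Top-down evaluation of A^m(p, x), memoized on (prefix length j, degree t)."""
    z = np.asarray(p, dtype=float) * np.asarray(x, dtype=float)

    @lru_cache(maxsize=None)
    def a(j, t):                     # a(j, t) = A^t(p_{1:j}, x_{1:j})
        if t == 0:
            return 1.0
        if j < t:
            return 0.0
        return a(j - 1, t) + z[j - 1] * a(j - 1, t - 1)

    return a(len(z), m)
```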
Evaluating the ANOVA kernel (2/3)
(Figure: recursion tree. Expanding a_{d,m} into a_{d−1,m} and a_{d−1,m−1} (edge weight p_d x_d), and each of those again with weight p_{d−1} x_{d−1}, quickly leads to redundant computation of the same subproblems. The value we want to compute is at the root; the shortcut is to continue until all features have been eliminated.)

Notation: p_{1:j} := [p_1, . . . , p_j]^T, x_{1:j} := [x_1, . . . , x_j]^T, and a_{j,t} := A^t(p_{1:j}, x_{1:j}). Then

a_{j,t} = 1                                   if t = 0
a_{j,t} = 0                                   if j < t
a_{j,t} = a_{j−1,t} + p_j x_j a_{j−1,t−1}     otherwise

and A^m(p, x) = a_{d,m}. These values can be arranged in a DP table (next slide).
Evaluating the ANOVA kernel (3/3)
Ways to avoid redundant computations:
• Top-down approach with memory table
• Bottom-up dynamic programming (DP)

        j = 0   j = 1    j = 2    . . .   j = d
t = 0     1       1        1      . . .     1       ← start
t = 1     0     a_{1,1}  a_{2,1}  . . .   a_{d,1}
t = 2     0       0      a_{2,2}  . . .   a_{d,2}
...
t = m     0       0        0      . . .   a_{d,m}   ← goal

(fill the table row by row, left to right; each entry combines the entry to its left and the one above it to the left, e.g. a_{2,2} = a_{1,2} + p_2 x_2 a_{1,1})
Algorithm 1: Evaluating A^m(p, x) in O(dm)
Input: p ∈ R^d, x ∈ R^d
a_{j,t} ← 0  ∀t ∈ {1, . . . , m}, j ∈ {0, 1, . . . , d}
a_{j,0} ← 1  ∀j ∈ {0, 1, . . . , d}
for t := 1, . . . , m do
  for j := t, . . . , d do
    a_{j,t} ← a_{j−1,t} + p_j x_j a_{j−1,t−1}
  end for
end for
Output: A^m(p, x) = a_{d,m}
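A direct NumPy transcription of Algorithm 1 (a sketch for illustration, not the authors' reference code); a[t, j] holds a_{j,t} = A^t(p_{1:j}, x_{1:j}).

```python
import numpy as np

def anova_dp(p, x, m):
    """Bottom-up DP evaluation of A^m(p, x) in O(dm) time (Algorithm 1)."""
    d = len(x)
    a = np.zeros((m + 1, d + 1))
    a[0, :] = 1.0                      # A^0 = 1 for every prefix
    for t in range(1, m + 1):
        for j in range(t, d + 1):      # entries with j < t stay 0
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]
    return a[m, d]
```

On small inputs this agrees with the anova_naive and anova_recursive sketches above.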
Backpropagation (chain rule)
Ex: compute derivatives of composite function f (g(h(p)))
• Forward pass
a = h(p)
b = g(a)
c = f (b)
• Backward pass
∂c/∂p_j = (∂c/∂b) (∂b/∂a) (∂a/∂p_j) = f'(b) g'(a) h'_j(p)
(only the last factor depends on j)
Can compute all derivatives in one pass!
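A toy numeric illustration with made-up choices of f, g and h (not taken from the paper), showing how one forward pass and one backward pass yield the full gradient at once:

```python
import numpy as np

p = np.array([0.5, -1.0, 2.0])

# Forward pass: c = f(g(h(p))) with h(p) = sum_j p_j^2, g(a) = sin(a), f(b) = exp(b)
a = np.sum(p ** 2)
b = np.sin(a)
c = np.exp(b)

# Backward pass: one sweep gives dc/dp_j for every j
dc_db = np.exp(b)        # f'(b)
db_da = np.cos(a)        # g'(a)
da_dp = 2.0 * p          # h'_j(p) for all j at once
grad = dc_db * db_da * da_dp
```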
Gradient computation (1/2)
• We want to compute ∇A^m(p, x) = [p̃_1, . . . , p̃_d]^T
• Using the chain rule, we have
p̃_j := ∂a_{d,m}/∂p_j = Σ_{t=1}^m (∂a_{d,m}/∂a_{j,t}) (∂a_{j,t}/∂p_j) = Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j
where ã_{j,t} := ∂a_{d,m}/∂a_{j,t} and ∂a_{j,t}/∂p_j = a_{j−1,t−1} x_j,
since p_j influences a_{j,t} ∀t ∈ [m]
• ã_{j,t} can be computed recursively in reverse order
ã_{j,t} = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}
Gradient computation (2/2)
        j = 1    j = 2    . . .   j = d − 1    j = d
t = 1   ã_{1,1}  ã_{2,1}  . . .      0           0
t = 2     0      ã_{2,2}  . . .      0           0
...                               ã_{d−1,t−1}    0
t = m     0        0        0        1           1    ← start (ã_{d,m} = 1)
t = m+1   0        0        0        0           0

(fill the table right to left using ã_{j,t} = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}, e.g. ã_{d−1,m−1} = p_d x_d; the goal is the full set of ã_{j,t})
Algorithm 2: Computing ∇A^m(p, x) in O(dm)
Input: p ∈ R^d, x ∈ R^d, {a_{j,t}}_{j,t=0}^{d,m}
ã_{j,t} ← 0  ∀t ∈ [m + 1], j ∈ [d]
ã_{d,m} ← 1
for t := m, . . . , 1 do
  for j := d − 1, . . . , t do
    ã_{j,t} ← ã_{j+1,t} + ã_{j+1,t+1} p_{j+1} x_{j+1}
  end for
end for
p̃_j := Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j  ∀j ∈ [d]
Output: ∇A^m(p, x) = [p̃_1, . . . , p̃_d]^T
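A NumPy sketch of Algorithm 2 (again an illustration, not the reference implementation); for self-containedness it rebuilds the forward table of Algorithm 1 instead of taking {a_{j,t}} as an input.

```python
import numpy as np

def anova_grad_dp(p, x, m):
    """DP computation of the gradient of A^m(p, x) w.r.t. p in O(dm) (Algorithm 2)."""
    d = len(x)

    # Forward table a[t, j] = a_{j,t} (Algorithm 1)
    a = np.zeros((m + 1, d + 1))
    a[0, :] = 1.0
    for t in range(1, m + 1):
        for j in range(t, d + 1):
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]

    # Reverse table a_tilde[t, j] = ~a_{j,t}, filled right to left
    a_tilde = np.zeros((m + 2, d + 1))
    a_tilde[m, d] = 1.0
    for t in range(m, 0, -1):
        for j in range(d - 1, t - 1, -1):
            a_tilde[t, j] = a_tilde[t, j + 1] + a_tilde[t + 1, j + 1] * p[j] * x[j]

    # p_tilde_j = sum_t ~a_{j,t} a_{j-1,t-1} x_j
    grad = np.zeros(d)
    for j in range(1, d + 1):
        grad[j - 1] = sum(a_tilde[t, j] * a[t - 1, j - 1]
                          for t in range(1, m + 1)) * x[j - 1]
    return grad
```

A quick sanity check is to compare the output against a finite-difference approximation of anova_dp on random p and x.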
Summary so far
• HOFMs can be expressed using the ANOVA kernel A^m
• We proposed O(dm) time algorithms for computing A^m(p, x) and ∇A^m(p, x)
• The cost per epoch of stochastic gradient algorithms for learning P^m is therefore O(dnkm)
• The prediction cost is O(dkm^2)
Other contributions
• Coordinate-descent algorithm for learning P^m based on a different recursion
◦ Cost per epoch is O(dnkm^2)
◦ However, no learning rate to tune!
• HOFMs with shared parameters: P^2 = · · · = P^m
◦ Total prediction cost is O(dkm) instead of O(dkm^2)
◦ Corresponds to using new kernels derived from the ANOVA kernel
Experiments
Application to link prediction
Goal: predict missing links between nodes in a graph

Graph:
• Co-author network
• Enzyme network

Bipartite graph:
• User-movie
• Gene-disease
Application to link prediction
• We assume two sets of nodes A (e.g., users) and B (e.g., movies) of sizes n_A and n_B
• Nodes in A are represented by feature vectors a_i ∈ R^{d_A}
• Nodes in B are represented by feature vectors b_j ∈ R^{d_B}
• We are given a matrix Y ∈ {−1, +1}^{n_A × n_B} such that y_{i,j} = +1 if there is a link between a_i and b_j
• The number of positive samples is n_+
Datasets
Dataset    n_+      Columns of A   n_A     d_A      Columns of B   n_B      d_B
NIPS       4,140    Authors        2,037   13,649
Enzyme     2,994    Enzymes        668     325
GD         3,954    Diseases       3,209   3,209    Genes          12,331   25,275
ML 100K    21,201   Users          943     49       Movies         1,682    29
Features:
• NIPS: word occurrence in author publications
• Enzyme: phylogenetic information, gene expression information and
gene location information
• GD: MimMiner similarity scores (diseases) and HumanNet similarity
scores (genes)
• ML 100K: age, gender, occupation, living area (users); release year, genre (movies)
Models compared
Goal: predict if there is a link between ai and bj
• HOFM: ŷ_{i,j} = ŷ_HOFM(a_i ⊕ b_j; w, {P^t}_{t=2}^m), where ⊕ denotes vector concatenation (toy sketch below)
• HOFM-shared: same but with P^2 = · · · = P^m
• Polynomial network (PN): replace the ANOVA kernel by the polynomial kernel
• Bilinear regression (BLR): ŷ_{i,j} = a_i U V^T b_j
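A toy sketch of the input construction for one node pair; the feature values are entirely hypothetical, and hofm_predict refers to the prediction sketch given earlier.

```python
import numpy as np

# Hypothetical feature vectors for one node pair (a_i: user, b_j: movie).
a_i = np.array([1.0, 0.0, 25.0])          # e.g. gender one-hot + age
b_j = np.array([0.0, 1.0, 0.0, 1999.0])   # e.g. genre one-hot + release year

x = np.concatenate([a_i, b_j])            # a_i ⊕ b_j (vector concatenation)
# The HOFM score for the pair is hofm_predict(x, w, [P2, ..., Pm]);
# the pair is predicted to be linked when the score is high enough.
```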
Experimental protocol
• We sample n_− = n_+ negative samples (missing edges are treated as negative samples)
• We use 50% for training and 50% for testing
• We use ROC-AUC (area under the ROC curve) for evaluation
• β tuned by CV, k fixed to 30
• P^2, . . . , P^m initialized randomly
• ℓ is set to the squared loss
(Figure: ROC-AUC of HOFM, HOFM-shared, PN and BLR for degrees m = 2 to 5 on (a) NIPS, (b) Enzyme, (c) GD, (d) ML 100K.)
Solver comparison
• Coordinate descent
• AdaGrad
• L-BFGS
AdaGrad and L-BFGS use the proposed DP algorithm to compute ∇A^m(p, x)
(Figure, NIPS dataset: objective value minus best vs. CPU time in seconds for CD, AdaGrad and L-BFGS; (a) convergence when m = 2, (b) when m = 3, (c) when m = 4; (d) time to complete one epoch (sec.) as a function of the degree m.)
