Higher-order Factorization Machines
Mathieu Blondel
NTT Communication Science Laboratories
Kyoto, Japan
Joint work with M. Ishihata, A. Fujino and
N. Ueda
2016/9/28
Regression analysis
• Variables
y ∈ R: target variable
x ∈ R^d: explanatory variables (features)
• Training data
y = [y_1, . . . , y_n]^T ∈ R^n
X = [x_1, . . . , x_n] ∈ R^{d×n}
• Goal
◦ Learn model parameters
◦ Compute prediction y for a new x
Linear regression
• Model
ŷ_LR(x; w) := ⟨w, x⟩ = Σ_{j=1}^d w_j x_j
• Parameters
w ∈ R^d: feature weights
• Pros and cons
O(d) predictions
Learning w can be cast as a convex optimization problem
Does not use feature interactions
Polynomial regression
• Model
ŷ_PR(x; w, W) := ⟨w, x⟩ + x^T W x = ⟨w, x⟩ + Σ_{j,j'=1}^d w_{j,j'} x_j x_{j'}
• Parameters
w ∈ R^d: feature weights
W ∈ R^{d×d}: weight matrix
• Pros and cons
Learning w and W can be cast as a convex optimization problem
O(d^2) time and memory cost
Kernel regression
• Model
ŷ_KR(x; α) := Σ_{i=1}^n α_i K(x_i, x)
• Parameters
α ∈ R^n: instance weights
• Pros and cons
Can use non-linear kernels (RBF, polynomial, etc...)
Learning α can be cast as a convex optimization problem
O(dn) predictions (linear dependence on training set size)
Factorization Machines (FMs) (Rendle, ICDM 2010)
• Model
ŷ_FM(x; w, P) := ⟨w, x⟩ + Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}   (p̄_j: jth row of P)
• Parameters
w ∈ R^d: feature weights
P ∈ R^{d×k}: weight matrix
• Pros and cons
Takes into account feature combinations
O(2dk) predictions (linear time) instead of O(d^2) (see the sketch below)
Parameter estimation involves a non-convex optimization problem
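The O(2dk) prediction cost follows from the standard rewriting of the pairwise term, Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'} = ½(‖P^T x‖² − Σ_j ‖p̄_j‖² x_j²) (Rendle, ICDM 2010). A minimal NumPy sketch of this computation (an illustration, not the original implementation):

```python
import numpy as np

def fm_predict(x, w, P):
    """Degree-2 FM prediction in O(dk) time.

    Uses the identity
        sum_{j'>j} <p_j, p_j'> x_j x_j'
            = 0.5 * (||P^T x||^2 - sum_j ||p_j||^2 x_j^2)
    to avoid the explicit O(d^2) double sum.
    """
    linear = w @ x
    Px = P.T @ x                                          # shape (k,)
    pairwise = 0.5 * (Px @ Px - np.sum((P ** 2).T @ (x ** 2)))
    return linear + pairwise
```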
Application 1: recsys without features
• Formulate it as a matrix completion problem
          Movie 1   Movie 2   Movie 3   Movie 4
Alice        ·         ?         ·         ?
Bob          ·         ?         ·         ?
Charlie      ·         ?         ?         ·
(· = observed rating, ? = missing entry to predict)
• Matrix factorization: find U, V that approximately
reconstruct the rating matrix
R ≈ UV^T
Conversion to a regression problem
          Movie 1   Movie 2   Movie 3   Movie 4
Alice        ·         ?         ·         ?
Bob          ·         ?         ·         ?
Charlie      ·         ?         ?         ·
⇓ one-hot encoding

y = vector of observed ratings

X =
1 0 0 1 0 0 0
1 0 0 0 0 1 0
0 1 0 1 0 0 0
0 1 0 0 0 1 0
0 0 1 1 0 0 0
0 0 1 0 0 0 1

(first three columns: one-hot user indicator Alice/Bob/Charlie; last four columns: one-hot movie indicator Movie 1-4)
Using this representation, FMs are equivalent to MF! (A toy encoding sketch follows.)
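To make the conversion concrete, here is a toy NumPy sketch with made-up (user, movie, rating) triples; the indices and values are purely illustrative.

```python
import numpy as np

# Hypothetical toy data: (user index, movie index, rating) triples.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 0, 2.0), (2, 3, 5.0)]
n_users, n_movies = 3, 4

# One row per observed rating: [user one-hot | movie one-hot].
X = np.zeros((len(ratings), n_users + n_movies))
y = np.zeros(len(ratings))
for row, (u, m, r) in enumerate(ratings):
    X[row, u] = 1.0                # user block
    X[row, n_users + m] = 1.0      # movie block
    y[row] = r
```

Training an FM on (X, y) then plays the role of the matrix factorization above.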
Generalization ability of FMs
• The weight of x_j x_{j'} is ⟨p̄_j, p̄_{j'}⟩, compared to w_{j,j'} for PR
• The same parameters p̄_j are shared across the weights of x_j x_{j'} ∀j' > j
• This increases the amount of data used to estimate p̄_j, at the cost of introducing some bias (low-rank assumption)
• This makes it possible to generalize to feature interactions that were not observed in the training set
Application 2: recsys with features
Rating   User (Gender, Age)   Movie (Genre, Director)
  ·      M, 20-30             Adventure, S. Spielberg
  ·      F, 0-10              Anime, H. Miyazaki
  ·      M, 20-30             Drama, A. Kurosawa
  ...
• Interactions between categorical variables
◦ Gender × genre: {M, F} × {Adventure, Anime, Drama, ...}
◦ Age × director: {0-10, 10-20, ...} × {S. Spielberg, H. Miyazaki, A. Kurosawa, ...}
• In practice, the number of interactions can be huge!
Conversion to regression
Rating   User (Gender, Age)   Movie (Genre, Director)
  ·      M, 20-30             Adventure, S. Spielberg
  ·      F, 0-10              Anime, H. Miyazaki
  ·      M, 20-30             Drama, A. Kurosawa
  ...
⇓ one-hot encoding

y = vector of ratings

X =
1 0 0 0 1 1 0 0 . . .
0 1 1 0 0 0 1 0 . . .
1 0 0 0 1 0 0 1 . . .
. . .
very sparse
binary data!
FMs revisited (Blondel+, ICML 2016)
• ANOVA kernel of degree m = 2 (Stitson+, 1997; Vapnik, 1998)
A^2(p, x) := Σ_{j'>j} (p_j x_j)(p_{j'} x_{j'})
• Then
ŷ_FM(x; w, P) = ⟨w, x⟩ + Σ_{j'>j} ⟨p̄_j, p̄_{j'}⟩ x_j x_{j'}
              = ⟨w, x⟩ + Σ_{s=1}^k A^2(p_s, x)   (p_s: sth column of P)
ANOVA kernel (arbitrary-order case)
• ANOVA kernel of degree 2 ≤ m ≤ d
A^m(p, x) := Σ_{j_m > ··· > j_1} (p_{j_1} x_{j_1}) · · · (p_{j_m} x_{j_m})
(the sum is over all possible m-combinations of {1, . . . , d})
• Intuitively, the kernel uses all m-combinations of features without replacement: x_{j_1} · · · x_{j_m} for j_1 ≠ · · · ≠ j_m
• Computing A^m(p, x) naively takes O(d^m) (see the naive sketch below)
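As a reference point, the definition can be evaluated directly, in exponential O(d^m) time, with a few lines of NumPy; this naive version (not part of the paper) is useful mainly for checking the fast algorithms introduced below.

```python
import itertools
import numpy as np

def anova_naive(p, x, m):
    """Naive O(d^m) evaluation of the ANOVA kernel A^m(p, x):
    sum over all m-combinations j_1 < ... < j_m of prod_t p_{j_t} x_{j_t}."""
    z = np.asarray(p) * np.asarray(x)
    return sum(np.prod([z[j] for j in comb])
               for comb in itertools.combinations(range(len(z)), m))
```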
Higher-order FMs (HOFMs)
• Model
ŷ_HOFM(x; w, {P^t}_{t=2}^m) := ⟨w, x⟩ + Σ_{t=2}^m Σ_{s=1}^k A^t(p_s^t, x)
• Parameters
w ∈ R^d: feature weights
P^2, . . . , P^m ∈ R^{d×k}: weight matrices
• Pros and cons
Takes into account higher-order feature combinations
O(dkm^2) prediction cost using our proposed algorithms (see the sketch below)
More complex than 2nd-order FMs
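Given any routine for evaluating A^t, the HOFM prediction is a double sum over degrees and columns; a small sketch reusing the naive ANOVA evaluation above (plugging in the O(dt) DP routine of Algorithm 1 instead yields the O(dkm^2) prediction cost). Names and signatures here are illustrative, not the authors' API.

```python
def hofm_predict(x, w, P_list, anova=anova_naive):
    """HOFM prediction: linear term plus ANOVA terms of degrees 2..m.
    P_list[t - 2] is the d x k weight matrix P^t for degree t."""
    pred = w @ x
    for t, P in enumerate(P_list, start=2):
        for s in range(P.shape[1]):          # columns p_s^t of P^t
            pred += anova(P[:, s], x, t)
    return pred
```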
Learning HOFMs (1/2)
• We use alternating minimization w.r.t. w, P^2, . . . , P^m
• Learning w alone reduces to linear regression
• Learning P^m can be cast as minimizing
F(P) := (1/n) Σ_{i=1}^n ℓ(y_i, Σ_{s=1}^k A^m(p_s, x_i) + o_i) + (β/2) ‖P‖²
where ℓ is a loss function and o_i is the contribution of degrees other than m
Learning HOFMs (2/2)
• Stochastic gradient update
p_s ← p_s − η ℓ'(y_i, ŷ_i) ∇A^m(p_s, x_i) − η β p_s
where η is a learning rate hyper-parameter, ℓ' is the derivative of the loss w.r.t. the prediction, and
ŷ_i := Σ_{s=1}^k A^m(p_s, x_i) + o_i
• We propose O(dm) (linear time) DP algorithms for
◦ Evaluating the ANOVA kernel A^m(p, x) ∈ R
◦ Computing the gradient ∇A^m(p, x) ∈ R^d
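To make the update concrete, here is a one-step sketch assuming the squared loss ℓ(y, ŷ) = ½(y − ŷ)², the loss used in the experiments later; anova and anova_grad stand for the O(dm) DP routines sketched after Algorithms 1 and 2 below, and all names are placeholders rather than the authors' API.

```python
def sgd_step_Pm(P, x_i, y_i, o_i, eta, beta, anova, anova_grad, m):
    """One stochastic gradient step on the degree-m matrix P (d x k),
    assuming the squared loss l(y, yhat) = 0.5 * (y - yhat)^2."""
    k = P.shape[1]
    y_hat = sum(anova(P[:, s], x_i, m) for s in range(k)) + o_i
    dloss = y_hat - y_i                        # l'(y_i, yhat_i) for the squared loss
    for s in range(k):
        grad = anova_grad(P[:, s], x_i, m)     # gradient of A^m(p_s, x_i), shape (d,)
        P[:, s] -= eta * dloss * grad + eta * beta * P[:, s]
    return P
```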
Evaluating the ANOVA kernel (1/3)
• Recursion (Blondel+, ICML 2016)
A^m(p, x) = A^m(p_{¬j}, x_{¬j}) + p_j x_j A^{m−1}(p_{¬j}, x_{¬j})   ∀j
where p_{¬j}, x_{¬j} ∈ R^{d−1} are vectors with the jth element removed
• We can use this recursion to remove features until computing the kernel becomes trivial (a memoized sketch follows below)
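Here is a small top-down (memoized) sketch of that recursion, always removing the last remaining feature; it assumes the base cases A^0 = 1 and A^t = 0 whenever fewer than t features remain, which is exactly the DP table shown on the next slides.

```python
from functools import lru_cache
import numpy as np

def anova_recursive(p, x, m):
    """Top-down evaluation of A^m(p, x), memoized on (prefix length j, degree t)."""
    z = np.asarray(p, dtype=float) * np.asarray(x, dtype=float)

    @lru_cache(maxsize=None)
    def a(j, t):                     # a(j, t) = A^t(p_{1:j}, x_{1:j})
        if t == 0:
            return 1.0
        if j < t:
            return 0.0
        return a(j - 1, t) + z[j - 1] * a(j - 1, t - 1)

    return a(len(z), m)
```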
Evaluating the ANOVA kernel (2/3)
(Figure: recursion tree. Expanding a_{d,m} into a_{d−1,m} and a_{d−1,m−1} (edge weight p_d x_d), and each of those again with weight p_{d−1} x_{d−1}, quickly leads to redundant computation of the same subproblems. The value we want to compute is at the root; the shortcut is to continue until all features have been eliminated.)

Notation: p_{1:j} := [p_1, . . . , p_j]^T, x_{1:j} := [x_1, . . . , x_j]^T, and a_{j,t} := A^t(p_{1:j}, x_{1:j}). Then

a_{j,t} = 1                                   if t = 0
a_{j,t} = 0                                   if j < t
a_{j,t} = a_{j−1,t} + p_j x_j a_{j−1,t−1}     otherwise

and A^m(p, x) = a_{d,m}. These values can be arranged in a DP table (next slide).
Evaluating the ANOVA kernel (3/3)
Ways to avoid redundant computations:
• Top-down approach with memory table
• Bottom-up dynamic programming (DP)

        j = 0   j = 1    j = 2    . . .   j = d
t = 0     1       1        1      . . .     1       ← start
t = 1     0     a_{1,1}  a_{2,1}  . . .   a_{d,1}
t = 2     0       0      a_{2,2}  . . .   a_{d,2}
...
t = m     0       0        0      . . .   a_{d,m}   ← goal

(fill the table row by row, left to right; each entry combines the entry to its left and the one above it to the left, e.g. a_{2,2} = a_{1,2} + p_2 x_2 a_{1,1})
Algorithm 1: Evaluating A^m(p, x) in O(dm)
Input: p ∈ R^d, x ∈ R^d
a_{j,t} ← 0  ∀t ∈ {1, . . . , m}, j ∈ {0, 1, . . . , d}
a_{j,0} ← 1  ∀j ∈ {0, 1, . . . , d}
for t := 1, . . . , m do
  for j := t, . . . , d do
    a_{j,t} ← a_{j−1,t} + p_j x_j a_{j−1,t−1}
  end for
end for
Output: A^m(p, x) = a_{d,m}
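A direct NumPy transcription of Algorithm 1 (a sketch for illustration, not the authors' reference code); a[t, j] holds a_{j,t} = A^t(p_{1:j}, x_{1:j}).

```python
import numpy as np

def anova_dp(p, x, m):
    """Bottom-up DP evaluation of A^m(p, x) in O(dm) time (Algorithm 1)."""
    d = len(x)
    a = np.zeros((m + 1, d + 1))
    a[0, :] = 1.0                      # A^0 = 1 for every prefix
    for t in range(1, m + 1):
        for j in range(t, d + 1):      # entries with j < t stay 0
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]
    return a[m, d]
```

On small inputs this agrees with the anova_naive and anova_recursive sketches above.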
Backpropagation (chain rule)
Ex: compute derivatives of composite function f (g(h(p)))
• Forward pass
a = h(p)
b = g(a)
c = f (b)
• Backward pass
∂c/∂p_j = (∂c/∂b) (∂b/∂a) (∂a/∂p_j) = f'(b) g'(a) h'_j(p)
(only the last factor depends on j)
Can compute all derivatives in one pass!
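A toy numeric illustration with made-up choices of f, g and h (not taken from the paper), showing how one forward pass and one backward pass yield the full gradient at once:

```python
import numpy as np

p = np.array([0.5, -1.0, 2.0])

# Forward pass: c = f(g(h(p))) with h(p) = sum_j p_j^2, g(a) = sin(a), f(b) = exp(b)
a = np.sum(p ** 2)
b = np.sin(a)
c = np.exp(b)

# Backward pass: one sweep gives dc/dp_j for every j
dc_db = np.exp(b)        # f'(b)
db_da = np.cos(a)        # g'(a)
da_dp = 2.0 * p          # h'_j(p) for all j at once
grad = dc_db * db_da * da_dp
```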
Gradient computation (1/2)
• We want to compute ∇A^m(p, x) = [p̃_1, . . . , p̃_d]^T
• Using the chain rule, we have
p̃_j := ∂a_{d,m}/∂p_j = Σ_{t=1}^m (∂a_{d,m}/∂a_{j,t}) (∂a_{j,t}/∂p_j) = Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j
where ã_{j,t} := ∂a_{d,m}/∂a_{j,t} and ∂a_{j,t}/∂p_j = a_{j−1,t−1} x_j,
since p_j influences a_{j,t} ∀t ∈ [m]
• ã_{j,t} can be computed recursively in reverse order
ã_{j,t} = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}
Gradient computation (2/2)
        j = 1    j = 2    . . .   j = d − 1    j = d
t = 1   ã_{1,1}  ã_{2,1}  . . .      0           0
t = 2     0      ã_{2,2}  . . .      0           0
...                               ã_{d−1,t−1}    0
t = m     0        0        0        1           1    ← start (ã_{d,m} = 1)
t = m+1   0        0        0        0           0

(fill the table right to left using ã_{j,t} = ã_{j+1,t} + p_{j+1} x_{j+1} ã_{j+1,t+1}, e.g. ã_{d−1,m−1} = p_d x_d; the goal is the full set of ã_{j,t})
Algorithm 2: Computing ∇A^m(p, x) in O(dm)
Input: p ∈ R^d, x ∈ R^d, {a_{j,t}}_{j,t=0}^{d,m}
ã_{j,t} ← 0  ∀t ∈ [m + 1], j ∈ [d]
ã_{d,m} ← 1
for t := m, . . . , 1 do
  for j := d − 1, . . . , t do
    ã_{j,t} ← ã_{j+1,t} + ã_{j+1,t+1} p_{j+1} x_{j+1}
  end for
end for
p̃_j := Σ_{t=1}^m ã_{j,t} a_{j−1,t−1} x_j  ∀j ∈ [d]
Output: ∇A^m(p, x) = [p̃_1, . . . , p̃_d]^T
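A NumPy sketch of Algorithm 2 (again an illustration, not the reference implementation); for self-containedness it rebuilds the forward table of Algorithm 1 instead of taking {a_{j,t}} as an input.

```python
import numpy as np

def anova_grad_dp(p, x, m):
    """DP computation of the gradient of A^m(p, x) w.r.t. p in O(dm) (Algorithm 2)."""
    d = len(x)

    # Forward table a[t, j] = a_{j,t} (Algorithm 1)
    a = np.zeros((m + 1, d + 1))
    a[0, :] = 1.0
    for t in range(1, m + 1):
        for j in range(t, d + 1):
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]

    # Reverse table a_tilde[t, j] = ~a_{j,t}, filled right to left
    a_tilde = np.zeros((m + 2, d + 1))
    a_tilde[m, d] = 1.0
    for t in range(m, 0, -1):
        for j in range(d - 1, t - 1, -1):
            a_tilde[t, j] = a_tilde[t, j + 1] + a_tilde[t + 1, j + 1] * p[j] * x[j]

    # p_tilde_j = sum_t ~a_{j,t} a_{j-1,t-1} x_j
    grad = np.zeros(d)
    for j in range(1, d + 1):
        grad[j - 1] = sum(a_tilde[t, j] * a[t - 1, j - 1]
                          for t in range(1, m + 1)) * x[j - 1]
    return grad
```

A quick sanity check is to compare the output against a finite-difference approximation of anova_dp on random p and x.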
Summary so far
• HOFMs can be expressed using the ANOVA kernel A^m
• We proposed O(dm) time algorithms for computing A^m(p, x) and ∇A^m(p, x)
• The cost per epoch of stochastic gradient algorithms for learning P^m is therefore O(dnkm)
• The prediction cost is O(dkm^2)
Other contributions
• Coordinate-descent algorithm for learning P^m based on a different recursion
◦ Cost per epoch is O(dnkm^2)
◦ However, no learning rate to tune!
• HOFMs with shared parameters: P^2 = · · · = P^m
◦ Total prediction cost is O(dkm) instead of O(dkm^2)
◦ Corresponds to using new kernels derived from the ANOVA kernel
Experiments
Application to link prediction
Goal: predict missing links between nodes in a graph

Graph:
• Co-author network
• Enzyme network

Bipartite graph:
• User-movie
• Gene-disease
Application to link prediction
• We assume two sets of nodes A (e.g., users) and B (e.g., movies) of sizes n_A and n_B
• Nodes in A are represented by feature vectors a_i ∈ R^{d_A}
• Nodes in B are represented by feature vectors b_j ∈ R^{d_B}
• We are given a matrix Y ∈ {−1, +1}^{n_A × n_B} such that y_{i,j} = +1 if there is a link between a_i and b_j
• The number of positive samples is n_+
Datasets
Dataset    n_+      Columns of A   n_A     d_A      Columns of B   n_B      d_B
NIPS       4,140    Authors        2,037   13,649
Enzyme     2,994    Enzymes        668     325
GD         3,954    Diseases       3,209   3,209    Genes          12,331   25,275
ML 100K    21,201   Users          943     49       Movies         1,682    29
Features:
• NIPS: word occurrence in author publications
• Enzyme: phylogenetic information, gene expression information and
gene location information
• GD: MimMiner similarity scores (diseases) and HumanNet similarity
scores (genes)
• ML 100K: age, gender, occupation, living area (users); release year, genre (movies)
Models compared
Goal: predict if there is a link between ai and bj
• HOFM: ŷ_{i,j} = ŷ_HOFM(a_i ⊕ b_j; w, {P^t}_{t=2}^m), where ⊕ denotes vector concatenation (toy sketch below)
• HOFM-shared: same but with P^2 = · · · = P^m
• Polynomial network (PN): replace the ANOVA kernel by the polynomial kernel
• Bilinear regression (BLR): ŷ_{i,j} = a_i U V^T b_j
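A toy sketch of the input construction for one node pair; the feature values are entirely hypothetical, and hofm_predict refers to the prediction sketch given earlier.

```python
import numpy as np

# Hypothetical feature vectors for one node pair (a_i: user, b_j: movie).
a_i = np.array([1.0, 0.0, 25.0])          # e.g. gender one-hot + age
b_j = np.array([0.0, 1.0, 0.0, 1999.0])   # e.g. genre one-hot + release year

x = np.concatenate([a_i, b_j])            # a_i ⊕ b_j (vector concatenation)
# The HOFM score for the pair is hofm_predict(x, w, [P2, ..., Pm]);
# the pair is predicted to be linked when the score is high enough.
```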
Experimental protocol
• We sample n_− = n_+ negative samples (missing edges are treated as negative samples)
• We use 50% for training and 50% for testing
• We use ROC-AUC (area under the ROC curve) for evaluation
• β tuned by CV, k fixed to 30
• P^2, . . . , P^m initialized randomly
• ℓ is set to the squared loss
(Figure: ROC-AUC of HOFM, HOFM-shared, PN and BLR for degrees m = 2 to 5 on (a) NIPS, (b) Enzyme, (c) GD, (d) ML 100K.)
Solver comparison
• Coordinate descent
• AdaGrad
• L-BFGS
AdaGrad and L-BFGS use the proposed DP algorithm to compute ∇A^m(p, x)
(Figure, NIPS dataset: objective value minus best vs. CPU time in seconds for CD, AdaGrad and L-BFGS; (a) convergence when m = 2, (b) when m = 3, (c) when m = 4; (d) time to complete one epoch (sec.) as a function of the degree m.)
