A review on structure learning in GNN
Nathalie Vialaneix, INRAE/MIAT
WG GNN, May 25, 2021
1 / 18
Overview of the structure learning problem
2 / 18
Problem framework
Question: Given a GNN $f_\Theta$ ($\Theta$ are the network parameters learned during training) with input node features $X \in \mathbb{R}^{N \times F}$ and (unknown) graph $G$ with $N$ nodes $i \in \{1, \ldots, N\}$ and adjacency matrix $A \in \{0, 1\}^{N \times N}$,
find the GNN parameters $\Theta$ and learn the graph $G$ so as to minimize
$\sum_{i=1}^{N} \mathrm{Loss}(f_\Theta(x_i, G), y_i)$
Problem: Learning $G$ means learning edges $A_{ii'} \in \{0, 1\}$ between nodes (discrete optimization problem).
3 / 18
Possible routes to a solution
- in what Thomas and Marianne presented: the output of the NN is itself discrete and thus the loss is piecewise constant ⇒ approximation of the loss by a continuous loss to compute a gradient
- in the more general case presented before: unsure what the shape of the loss is, with respect to the graph $A$ ⇒ description of a typology of methods, as proposed by [Zhu et al., 2021]
4 / 18
Short reminder of the general learning process
Rk:
- Iteration between forward pass (prediction and update of the embedding) and backward pass (backpropagation and update of GNN parameters and graph)
- the backward pass is often based on gradient descent, and the gradient with respect to $A^*$ is possibly automatically computed through a standard NN library (almost never fully described)
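A minimal sketch of what "letting a standard NN library compute the gradient with respect to $A^*$" could look like, assuming PyTorch; the sigmoid surrogate `A_param`, the toy two-layer propagation and all shapes are illustrative choices, not taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

# Toy setup: both the GNN weights and a dense surrogate of the adjacency matrix
# are torch Parameters, so the backward pass updates the graph through autodiff.
N, d_in, n_classes = 100, 16, 3
X = torch.randn(N, d_in)
y = torch.randint(0, n_classes, (N,))

lin1 = torch.nn.Linear(d_in, 32)
lin2 = torch.nn.Linear(32, n_classes)
A_param = torch.nn.Parameter(torch.randn(N, N))   # unconstrained surrogate of A*

optim = torch.optim.Adam(
    list(lin1.parameters()) + list(lin2.parameters()) + [A_param], lr=1e-2)

for epoch in range(200):
    A_soft = torch.sigmoid(A_param)               # soft adjacency with entries in (0, 1)
    A_soft = 0.5 * (A_soft + A_soft.T)            # symmetrize
    H = torch.relu(A_soft @ lin1(X))              # forward pass: one toy propagation step
    logits = A_soft @ lin2(H)
    loss = F.cross_entropy(logits, y)
    optim.zero_grad()
    loss.backward()                               # gradients wrt GNN weights AND A_param
    optim.step()
```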
5 / 18
First class of methods: metric-based learning
6 / 18
General principle
Instead of a graph in $\{0, 1\}^{N \times N}$, a similarity matrix $S^* \in [0, +\infty[^{N \times N}$ (symmetric with null diagonal) is learned using a smoothness assumption on the node labels/embeddings $Z$ wrt the graph: edges correspond to closeness (in some sense) between the corresponding values of $Z$.
- simplest approach: Gaussian weights $s^*_{ii'} = e^{-\gamma \|z_i - z_{i'}\|^2}$
- the final graph is often
  - learned with additional constraints ($\ell_1$ or nuclear norm on $S^*$)
  - combined with the input graph (like in gated NN): $S^{**} = \alpha A + (1 - \alpha) S^*$ for the simplest form
  - post-processed to ensure sparsity (e.g., hard thresholding at $\epsilon$)

Important remark: in this approach, the learned parameters for the graph are related to the similarity $S$ (how to compute it?) and not to the adjacency matrix itself (directly deduced from $S$).
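As an illustration of the bullets above (my own toy code, not taken from any cited paper), a small numpy sketch that builds the Gaussian similarity, mixes it with an input adjacency matrix as $S^{**} = \alpha A + (1-\alpha) S^*$ and hard-thresholds the result; the function name and default values are made up for the example.

```python
import numpy as np

def metric_based_graph(Z, A_input, gamma=1.0, alpha=0.5, eps=0.1):
    """Toy metric-based graph learner: Gaussian similarity on embeddings Z,
    mixed with the input graph, then sparsified by hard thresholding."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # ||z_i - z_i'||^2
    S_star = np.exp(-gamma * sq_dists)                                # Gaussian weights
    np.fill_diagonal(S_star, 0.0)                                     # null diagonal
    S_2star = alpha * A_input + (1 - alpha) * S_star                  # mix with input graph
    return np.where(S_2star >= eps, S_2star, 0.0)                     # hard thresholding

Z = np.random.randn(50, 8)                                  # node embeddings (or labels)
A_input = (np.random.rand(50, 50) < 0.05).astype(float)
A_input = np.maximum(A_input, A_input.T)                    # symmetric input graph
A_learned = metric_based_graph(Z, A_input)
```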
7 / 18
An example: [Chen et al., 2020]
The forward step iteratively computes the adjacency matrices and the output of the two GNNs $T$ times ⇒ the loss is the average of the $R$ losses combining graph loss and prediction loss.

1. Graph definition when $\Theta^{(t)}$ is given

The graph at substep $r + 1$ is learned through an adjacency matrix with:
- compute $m$ entries of adjacency matrices between nodes $i$ and $i'$ with $s^{(r+1),l}_{ii'}(v^{(r)}) = \cos\big(w^{(t)}_l \otimes z^{(r)}_i,\; w^{(t)}_l \otimes z^{(r)}_{i'}\big)$ ($v$ is either the input features on the nodes or the prediction of the first GNN as obtained in substep $r$) ⇒ the $w$ are learned parameters (using GD during the backward step)
- average the $s^{l}_{ii'}(v^{(r+1)})$ (three parameters, looks a bit like Gated NN) and threshold values below $\epsilon$: $A^{(r+1)}$. Question: How to compute a gradient here?
- define the adjacency matrix $\tilde{A}^{(r+1)}$ as a mix between $A^{(r+1)}$, $A^{(0)}$ and $A^{(1)}$
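A numpy sketch of the weighted cosine similarity step described above; the number of heads, the shapes and the helper name are illustrative assumptions, and in [Chen et al., 2020] the weight vectors $w_l$ are learned by gradient descent rather than drawn at random as here.

```python
import numpy as np

def weighted_cosine_adjacency(Z, W, eps=0.3):
    """Z: (N, d) node vectors (features or previous-substep embeddings),
    W: (m, d) weight vectors, one per similarity head."""
    m = W.shape[0]
    S = np.zeros((m, Z.shape[0], Z.shape[0]))
    for l in range(m):
        V = W[l] * Z                                          # elementwise reweighting w_l ⊗ z_i
        V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
        S[l] = V @ V.T                                        # cosine similarities for head l
    S_avg = S.mean(axis=0)                                    # average the heads
    return np.where(S_avg > eps, S_avg, 0.0)                  # threshold values below eps

Z = np.random.randn(40, 16)
W = np.random.randn(3, 16)                                    # learned by GD in practice
A_next = weighted_cosine_adjacency(Z, W)
```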
8 / 18
Prediction once the graph is given: two GNNs
$z^{(r+1)}_i = \mathrm{GNN}_1\big(\tilde{A}^{(r+1)}, x_i\big)$
$\hat{y}_i = \mathrm{GNN}_2\big(\tilde{A}^{(r+1)}, z^{(r+1)}_i\big)$
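To make the two-stage prediction concrete, a toy sketch with two single-layer message-passing "GNNs" sharing the learned adjacency $\tilde{A}^{(r+1)}$; the row normalisation and layer sizes are my own choices, not those of [Chen et al., 2020].

```python
import numpy as np

def gnn_layer(A_tilde, H, W):
    """One toy message-passing layer: row-normalised aggregation, linear map, ReLU."""
    deg = A_tilde.sum(axis=1, keepdims=True) + 1e-12
    return np.maximum((A_tilde / deg) @ H @ W, 0.0)

N, d, h, c = 40, 16, 32, 3
A_tilde = np.random.rand(N, N)                 # learned adjacency from the current substep
X = np.random.randn(N, d)
W1, W2 = np.random.randn(d, h), np.random.randn(h, c)

Z_next = gnn_layer(A_tilde, X, W1)             # z_i^{(r+1)} = GNN1(Ã^{(r+1)}, x_i)
y_hat = gnn_layer(A_tilde, Z_next, W2)         # ŷ_i = GNN2(Ã^{(r+1)}, z_i^{(r+1)})
```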
9 / 18
Loss at substep $r + 1$
- $L^{r+1}_{\mathrm{pred}} = \sum_i \mathrm{Loss}(y_i, \hat{y}_i)$
- $L^{r+1}_{\mathrm{graph}}$: loss composed of three terms: adequation to $x_i$ (Laplacian like), $\|\cdot\|^2_F$ (ridge penalty) and a "log barrier" (prevents disconnected graphs and controls sparsity)

Global loss at forward step $t$: $L^t = \sum_r L^r$. The gradient is also the sum of gradients that are back-propagated from each other (I guess).
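The numpy sketch below shows one plausible three-term version of the graph loss (Laplacian smoothness with respect to $x_i$, squared Frobenius norm, log barrier on node degrees); the weights and the exact form may differ from [Chen et al., 2020].

```python
import numpy as np

def graph_loss(A, X, alpha=1.0, beta=0.1, gamma=0.1):
    """Three-term graph regularizer: smoothness wrt x_i, ridge (Frobenius), log barrier."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                                   # graph Laplacian
    smooth = alpha * np.trace(X.T @ L @ X)                 # adequation to x_i (Laplacian-like)
    ridge = beta * np.sum(A ** 2)                          # ||A||_F^2
    barrier = -gamma * np.sum(np.log(deg + 1e-12))         # penalizes (near-)disconnected nodes
    return smooth + ridge + barrier

A = np.random.rand(30, 30)
A = 0.5 * (A + A.T)
np.fill_diagonal(A, 0.0)
X = np.random.randn(30, 8)
print(graph_loss(A, X))
```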
10 / 18
Second class of methods: Probabilistic models
11 / 18
General principle
Assume a distribution on $G$ (or equivalently on $w$) and solve the expected problem wrt this distribution.
Ex: [Franceschi et al., ICML 2019]; the learned parameters are the distribution parameters (and not the graph itself).

Optimization over a distribution
Initial problem to solve: $\min_{\Theta(a),\, a} \sum_{\text{nodes}} \mathrm{Loss}(f_{\Theta(a)}(x_i, G), y_i)$ (can contain a regularization term for $\Theta(a)$)

Problem transformation: suppose $a_{ii'} \sim_{\mathrm{iid}} P_\theta = \mathcal{B}(\theta)$ (with unknown $\theta$); then, instead, solve
$\min_{\Theta(a),\, \theta} E_{P_\theta}\big[\mathrm{Loss}(f_{\Theta(a)}(x_i, G), y_i)\big]$

Remarks:
- it does not solve the graph learning problem directly
- but it gives information on the graph by learning $\theta$ (here, not very informative because the model is too simple)
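To fix ideas, here is a toy Monte-Carlo estimate of the expected loss under independent Bernoulli edges $a_{ii'} \sim \mathcal{B}(\theta_{ii'})$; the GNN is replaced by a trivial linear placeholder, so this only illustrates what "optimizing over a distribution" means, not [Franceschi et al.]'s actual bilevel procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def placeholder_loss(A, X, y, Theta):
    """Stand-in for Loss(f_Theta(x_i, G), y_i): a linear model on aggregated features."""
    preds = (A @ X) @ Theta
    return np.mean((preds.ravel() - y) ** 2)

def expected_loss(theta, X, y, Theta, n_samples=100):
    """Monte-Carlo estimate of E_{P_theta}[Loss], sampling a_{ii'} ~ Bernoulli(theta_{ii'})."""
    losses = []
    for _ in range(n_samples):
        A = (rng.random(theta.shape) < theta).astype(float)   # sample one graph
        losses.append(placeholder_loss(A, X, y, Theta))
    return np.mean(losses)

N, d = 30, 5
theta = np.full((N, N), 0.1)                 # distribution parameters (what is actually learned)
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
Theta = rng.normal(size=(d, 1))
print(expected_loss(theta, X, y, Theta))
```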
12 / 18
Learning in practice
Denote $F(\Theta(a), a) = E_{P_\theta}\big[\mathrm{Loss}(f_{\Theta(a)}(x_i, G), y_i)\big]$ and iterate over:
- update of $\Theta$: we need to compute $\nabla_\Theta F$
  - easy: $\nabla_\Theta F$ is computed for a given $a \sim P_\theta$ (SG) and SGD steps are performed $R$ times to update $\Theta$
- update of $a$: we need to compute $\nabla_a F$
  - hard (because $a$ is involved via the expectation over a distribution)
  - magic trick: setting $a = \theta$ ($a$ is set to its expectation), $\nabla_\theta F \simeq E_{P_\theta}\big[\nabla_\Theta F(\Theta(a), a)\,\nabla_a \Theta(a) + \nabla_a F(\Theta(a), a)\big]$, and update $\theta$ with projected gradient descent
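A schematic PyTorch version of this alternation, with a linear placeholder for $f_\Theta$: sampling-based SGD on $\Theta$, then one projected gradient step on $\theta$ evaluated at $a = \theta$. Learning rates, the number of inner steps $R$ and the placeholder model are assumptions, not the authors' settings.

```python
import torch

N, d = 30, 5
X = torch.randn(N, d)
y = torch.randn(N)
Theta = torch.nn.Parameter(torch.randn(d))               # GNN parameters (placeholder)
theta = torch.full((N, N), 0.1, requires_grad=True)      # Bernoulli edge probabilities

opt_Theta = torch.optim.SGD([Theta], lr=1e-2)

def loss_fn(A):
    """Stand-in for sum_i Loss(f_Theta(x_i, G), y_i)."""
    return ((A @ X) @ Theta - y).pow(2).mean()

for outer in range(50):
    # 1) update Theta: R SGD steps, each on a graph sampled from P_theta
    for _ in range(5):                                    # R = 5 inner steps (assumption)
        A = torch.bernoulli(theta.detach())               # a ~ P_theta
        opt_Theta.zero_grad()
        loss_fn(A).backward()
        opt_Theta.step()
    # 2) update theta: set a = theta (its expectation), one projected gradient step
    if theta.grad is not None:
        theta.grad.zero_()
    loss_fn(theta).backward()                             # gradient wrt theta at a = theta
    with torch.no_grad():
        theta -= 1e-2 * theta.grad
        theta.clamp_(0.0, 1.0)                            # projection onto [0, 1]
```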
13 / 18
My concern
A bit rough:
- the distribution is too simple
- averaging over the distribution is probably not really informative
14 / 18
Third class of methods: Direct optimization
15 / 18
Main principles
Learn the graph directly (via the adjacency matrix → optimization over $A$) by targeting:
$L_{\mathrm{graph}}(A) + L_{\mathrm{prediction}}(A)$
- $L_{\mathrm{graph}}$ is usually Laplacian smoothing (minimize differences in values between edges with strong weights): $\frac{1}{2} \sum_{ii'} A^*_{ii'} \|z_i - z_{i'}\|^2 = \mathrm{Tr}\big(Z^\top L_{A^*} Z\big)$
- $L_{\mathrm{graph}}$ often includes penalties to ensure:
  - graph connectivity (ex: log barrier penalty)
  - low rank (ex: nuclear norm penalty)
  - sparsity (ex: $\ell_1$ penalty)
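A quick numerical check (my own code) of the Laplacian smoothing identity used above, $\frac{1}{2}\sum_{ii'} A^*_{ii'}\|z_i - z_{i'}\|^2 = \mathrm{Tr}(Z^\top L_{A^*} Z)$ with $L_{A^*} = D - A^*$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 4
A = rng.random((N, N))
A = 0.5 * (A + A.T)
np.fill_diagonal(A, 0.0)                                   # weighted adjacency A*
Z = rng.normal(size=(N, d))                                # node embeddings

L = np.diag(A.sum(axis=1)) - A                             # Laplacian L_{A*} = D - A*

pairwise = 0.5 * sum(A[i, j] * np.sum((Z[i] - Z[j]) ** 2)
                     for i in range(N) for j in range(N))  # (1/2) Σ A*_{ii'} ||z_i - z_{i'}||²
trace_form = np.trace(Z.T @ L @ Z)                         # Tr(Zᵀ L_{A*} Z)

assert np.isclose(pairwise, trace_form)
```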
16 / 18
An example: [Jin et al., 2020]
- fix the graph $A$ and update $\theta$ by backpropagation of the error $\mathrm{Err}(\mathrm{GNN}(A, \theta)(X, Y))$ (SGD)
- fix the GNN parameters $\theta$ and update the graph: $\arg\min_A \|A - A^{(t-1)}\|^2 + \mathrm{Err}(\mathrm{GNN}(A, \theta)(X, Y)) + \lambda_1 \mathrm{Tr}(X^\top L(A) X) + \lambda_2 \|A\|_1$ (FBS algorithm: alternate between a gradient descent step and a proximal step)

($X$: node features)
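A sketch of one forward-backward splitting (FBS) step for the graph update, assuming the $\lambda_2$ term is an $\ell_1$ penalty handled by its proximal operator (soft-thresholding); the gradient of the prediction error with respect to $A$ is a placeholder, so this illustrates the alternation only, not [Jin et al., 2020]'s implementation.

```python
import numpy as np

def soft_threshold(M, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def fbs_graph_step(A, A_prev, X, grad_err, lam1=0.1, lam2=0.01, lr=0.05):
    """One FBS step on A: gradient step on the smooth terms, proximal step for the l1 term.
    grad_err stands in for the gradient of Err(GNN(A, theta)(X, Y)) with respect to A."""
    # Tr(X^T L(A) X) = 0.5 * sum_{ii'} A_{ii'} ||x_i - x_{i'}||^2, so its gradient
    # with respect to entry (i, i') is 0.5 * ||x_i - x_{i'}||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    smooth_grad = 2.0 * (A - A_prev) + grad_err + 0.5 * lam1 * sq_dists
    A_new = A - lr * smooth_grad                  # forward (gradient) step on the smooth part
    A_new = soft_threshold(A_new, lr * lam2)      # backward (proximal) step for lam2 * ||A||_1
    return np.clip(A_new, 0.0, 1.0)               # keep adjacency entries in [0, 1]

N, d = 25, 6
A = np.full((N, N), 0.1)
A_prev = A.copy()
X = np.random.randn(N, d)
grad_err = np.zeros((N, N))                       # placeholder for the GNN error gradient
A = fbs_graph_step(A, A_prev, X, grad_err)
```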
17 / 18
References
Chen Y, Wu L, Zaki MJ (2020) Iterative deep graph learning for graph neural networks: better and robust node embeddings. NeurIPS.
Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M (2018) Bilevel programming for hyperparameter optimization and meta-learning. ICML.
Jin W, Ma Y, Liu X, Tang X, Wang S, Tang J (2020) Graph structure learning for robust graph neural networks. KDD.
Zhu Y, Xu W, Zhang J, Liu Q, Wu S, Wang L (2021) Deep graph structure learning for robust representations: a survey. arXiv preprint arXiv:2103.03036v1.
18 / 18
