From RNN to neural networks for cyclic undirected graphs
Nathalie Vialaneix, INRAE/MIAT
WG GNN, May 7th, 2020
1 / 32
To start: a brief overview of this talk...
2 / 32
Topic
(What is this presentation about?)
How to use (deep) NN for processing relational (graph) data?
I will first describe Recurrent Neural Networks (RNN) and their limits for
processing graphs
Then, I will present two alternatives able to address these limits
I will try to stay non-technical (so potentially too vague) to focus on the
most important take-home messages
3 / 32
Description of the purpose of the methods
Data: a graph $G$, with a set of vertices $V = \{x_1, \ldots, x_n\}$ with labels
$l(x_i) \in \mathbb{R}^p$, and a set of edges $E = (e_j)_{j=1,\ldots,m}$ that can also be
labelled $l(e_j) \in \mathbb{R}^q$. The graph can be directed or undirected, with or without
cycles.
Purpose: find a method (a neural network $\phi_w$ with weights $w$) that is able
to process these data (using the information about the relations / edges
between vertices) to obtain:
a prediction $\phi_w(x_i)$ for every node in the graph,
or a prediction $\phi_w(G)$ for the graph itself,
learning dataset: a collection of graphs or a graph (that can be
disconnected) associated to predictions $y(G)$ or $y_i$
4 / 32
The basis of the work: RNN for structured data
5 / 32
Framework
Reference: [Sperduti & Starita, 1997]
basic description of standard RNN
adaptations to deal with directed acyclic graphs (DAG)
output is obtained at the graph level ($\phi_w(G)$)
The article also mentions ways to deal with cycles and other types of learning
than the standard back-propagation that I'll describe
6 / 32
From standard neuron to recurrent neuron
standard neuron
$$o = f\left(\sum_{j=1}^{r} w_j v_j\right)$$
where the $v_j$ are the inputs of the neuron (often: the neurons in the previous
layer).
7 / 32
From standard neuron to recurrent neuron
recurrent neuron
$$o(t) = f\left(\sum_{j=1}^{r} w_j v_j + w_S\, o(t-1)\right)$$
where $w_S$ is the self weight.
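To make the difference concrete, here is a minimal numerical sketch of both neuron types (an illustrative toy in Python/NumPy, not code from the paper; the function names, the choice of tanh for $f$, and the convention that a new input vector arrives at each time step are my own assumptions).

```python
import numpy as np

def standard_neuron(w, v, f=np.tanh):
    # o = f(sum_j w_j v_j): depends only on the current inputs
    return f(np.dot(w, v))

def recurrent_neuron(w, w_self, v_sequence, f=np.tanh):
    # o(t) = f(sum_j w_j v_j + w_S * o(t-1)): the previous output is fed back
    o = 0.0
    for v in v_sequence:          # inputs presented at t = 1, 2, ...
        o = f(np.dot(w, v) + w_self * o)
    return o

rng = np.random.default_rng(0)
w = rng.normal(size=3)
print(standard_neuron(w, rng.normal(size=3)))
print(recurrent_neuron(w, w_self=0.5, v_sequence=rng.normal(size=(4, 3))))
```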
8 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
$$o(x_i) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_i) + \sum_{x_i \to x_{i'}} \hat{w}_{n(i')}\, o(x_{i'})\right)$$
where $n(i')$ is the position of the vertex $x_{i'}$ within the children of $x_i$ (it means
that the DAG is a positional DAG).
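The formula translates into a recursion over the DAG; the sketch below is a toy Python illustration of it (the child-list representation, the memoisation, and the use of tanh for $f$ are my own choices, not details from [Sperduti & Starita, 1997]).

```python
import numpy as np

def encode_dag(children, labels, w, w_hat, f=np.tanh):
    """Compute o(x_i) for every node of a positional DAG.

    children[i] : ordered list of the children of node i (positions matter)
    labels[i]   : label vector l(x_i), shape (p,)
    w           : weights for the label part, shape (p,)
    w_hat       : weights hat{w}_1, hat{w}_2, ... indexed by child position
    """
    memo = {}

    def o(i):
        if i in memo:                      # each node is encoded only once
            return memo[i]
        s = np.dot(w, labels[i])           # sum_j w_j l_j(x_i)
        for pos, child in enumerate(children[i]):
            s += w_hat[pos] * o(child)     # hat{w}_{n(i')} o(x_{i'}), children first
        memo[i] = f(s)
        return memo[i]

    return {i: o(i) for i in children}

# Tiny example: node 0 is the supersource; node 3 is shared by two parents.
children = {0: [1, 2], 1: [3], 2: [3], 3: []}
labels = {i: np.full(2, float(i)) for i in children}
out = encode_dag(children, labels, w=np.array([0.1, -0.2]), w_hat=np.array([0.5, 0.3]))
print(out[0])    # the encoding at the supersource plays the role of phi_w(G)
```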
9 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
We have:
$$o(x_8) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_8)\right)$$
(and similarly for $x_7$ and $x_2$)
10 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
We have:
$$o(x_9) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_9) + \hat{w}_1\, o(x_8)\right)$$
(and similarly for $x_3$)
11 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
We have:
$$o(x_{10}) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_{10}) + \hat{w}_1\, o(x_3)\right)$$
12 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
We have:
$$o(x_{11}) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_{11}) + \hat{w}_1\, o(x_2) + \hat{w}_2\, o(x_7) + \hat{w}_3\, o(x_9) + \hat{w}_4\, o(x_{10})\right)$$
13 / 32
Using that type of recurrent neuron for DAG encoding
(for a DAG with a supersource, here $x_5$)
We have:
$$o(x_5) = f\left(\sum_{j=1}^{p} w_j\, l_j(x_5) + \hat{w}_1\, o(x_{11})\right)$$
14 / 32
Using that type of recurrent neuron for DAG encoding
Learning can be performed by back-propagation:
for a given set of weights $(w, \hat{w})$, recursively compute the outputs on the
graph structure
reciprocally, compute the gradient from the output, recursively on the
graph structure
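As an illustration of this idea (and not the original 1997 implementation), one can let an automatic-differentiation library unroll the same recursion; the sketch below assumes PyTorch, reuses the toy `children`/`labels` layout from the previous example, and puts a squared-error loss on the supersource output.

```python
import torch

def encode_dag_torch(children, labels, w, w_hat):
    memo = {}
    def o(i):
        if i in memo:
            return memo[i]
        s = torch.dot(w, labels[i])
        for pos, child in enumerate(children[i]):
            s = s + w_hat[pos] * o(child)
        memo[i] = torch.tanh(s)
        return memo[i]
    return o

children = {0: [1, 2], 1: [3], 2: [3], 3: []}
labels = {i: torch.full((2,), float(i)) for i in children}
w = torch.randn(2, requires_grad=True)
w_hat = torch.randn(2, requires_grad=True)

loss = (encode_dag_torch(children, labels, w, w_hat)(0) - 0.7) ** 2
loss.backward()                  # gradients flow back along the DAG structure
print(w.grad, w_hat.grad)
```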
15 / 32
Generalization: cascade correlation for networks
Idea: make several layers of outputs $o^1(x), \ldots, o^r(x)$ such that $o^l(x)$ depends
on $l(x)$ and $(o^{l'}(x'))_{x \to x',\, l' \leq l}$ (as for the previous case) and also on
$(o^{l'}(x))_{l' < l}$ (but these values are "frozen").
16 / 32
Main limits
Since the approach explicitly relies on the DAG order to successively
compute the output of the nodes, it is not adapted to undirected or cyclic
graphs
Also, the positional assumption on the neighbors of a given node (that an
objective "order" exists between neighbors) is not easily met in real-world
applications
Can only compute predictions for graphs (not for nodes)
Note: The method is tested (in this paper) on logic problems (not described
here)
17 / 32
A first approach using contraction maps by Scarselli et al., 2009
18 / 32
Overview of the method
is able to deal with undirected and cyclic graphs
does not require a positional assumption on the neighbors of a given
node
can be used to make a prediction at the graph and node levels
Main idea: use a "time"-dependent update of the neurons and use restrictions
on the weights to constrain the NN to be a contraction map so that the fixed
point theorem can be applied
19 / 32
Basic neuron equations
For each node $x_i$, we define:
a neuron value expressed as:
$$v_i = f_w\left(l(x_i),\ \{l(x_i, x_u)\}_{x_u \in \mathcal{N}(x_i)},\ \{v_u\}_{x_u \in \mathcal{N}(x_i)},\ \{l(x_u)\}_{x_u \in \mathcal{N}(x_i)}\right)$$
an output value obtained from this neuron value as:
$$o_i = g_w(v_i, l(x_i))$$
(that can be combined into a graph output value if needed)
20 / 32
Basic neuron equations
For each node $x_i$, we define:
a neuron value expressed as:
$$v_i = f_w\left(l(x_i),\ \{l(x_i, x_u)\}_{x_u \in \mathcal{N}(x_i)},\ \{v_u\}_{x_u \in \mathcal{N}(x_i)},\ \{l(x_u)\}_{x_u \in \mathcal{N}(x_i)}\right)$$
an output value obtained from this neuron value as:
$$o_i = g_w(v_i, l(x_i))$$
(that can be combined into a graph output value if needed)
In a compressed version, this gives: $V = F_w(V, l)$ and $O = G_w(V, l)$.
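To fix ideas, the two equations can be sketched as plain functions; the Python toy below is my own stand-in (single tanh layers, arbitrary dimensions, sum aggregation over the neighborhood), whereas the actual $f_w$ and $g_w$ of the paper are described on the next slides.

```python
import numpy as np

P, Q, D = 3, 2, 4     # node-label, edge-label and state dimensions (arbitrary)

def f_w(label_i, edge_labels, neighbor_states, neighbor_labels, W):
    """v_i = f_w(l(x_i), {l(x_i,x_u)}, {v_u}, {l(x_u)}): toy single-layer version."""
    agg = np.zeros(Q + D + P)
    for e, v, l in zip(edge_labels, neighbor_states, neighbor_labels):
        agg += np.concatenate([e, v, l])            # aggregate the neighborhood
    return np.tanh(W @ np.concatenate([label_i, agg]))

def g_w(v_i, label_i, U):
    """o_i = g_w(v_i, l(x_i))"""
    return np.tanh(U @ np.concatenate([v_i, label_i]))

rng = np.random.default_rng(1)
W = rng.normal(size=(D, P + Q + D + P))
U = rng.normal(size=(1, D + P))
v_i = f_w(rng.normal(size=P), [rng.normal(size=Q)], [np.zeros(D)],
          [rng.normal(size=P)], W)
print(g_w(v_i, rng.normal(size=P), U))
```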
21 / 32
Making the process recurrent...
The neuron value is made "time" dependent with:
$$v_i^t = f_w\left(l(x_i),\ \{l(x_i, x_u)\}_{x_u \in \mathcal{N}(x_i)},\ \{v_u^{t-1}\}_{x_u \in \mathcal{N}(x_i)},\ \{l(x_u)\}_{x_u \in \mathcal{N}(x_i)}\right)$$
Equivalently, $V^{t+1} = F_w(V^t, l)$, so, provided that $F_w$ is a contraction map,
$(V^t)_t$ converges to a fixed point (a sufficient condition is that the norm of
$\nabla_V F_w(V, l)$ is bounded by $\mu < 1$).
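In practice this gives a simple synchronous iteration that is stopped when the states no longer move. Below is a schematic Python loop (the tolerance, the maximum number of iterations, and the linear-plus-tanh update used in the example are arbitrary choices of mine, just meant to show convergence of a contraction).

```python
import numpy as np

def iterate_to_fixed_point(update, V0, tol=1e-6, max_iter=200):
    """Iterate V^{t+1} = F_w(V^t, l) until ||V^{t+1} - V^t|| < tol.

    `update` maps the full state matrix V (one row per node) to the next one;
    if it is a contraction, the sequence converges to the unique fixed point."""
    V = V0
    for t in range(max_iter):
        V_next = update(V)
        if np.linalg.norm(V_next - V) < tol:
            return V_next, t + 1
        V = V_next
    return V, max_iter

# Example with a small-weight update (hence a contraction): 5 nodes, state dim 4.
rng = np.random.default_rng(2)
A = 0.1 * rng.normal(size=(4, 4))
b = rng.normal(size=(5, 4))
V_star, n_iter = iterate_to_fixed_point(lambda V: np.tanh(V @ A + b), np.zeros((5, 4)))
print(n_iter)
```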
22 / 32
What are $f_w$ and $g_w$?
$g_w$ is a fully-connected MLP
$f_w$ is decomposed into
$$v_i = f_w(l(x_i), \ldots) = \sum_{x_u \in \mathcal{N}(x_i)} h_w\left(l(x_i), l(x_i, x_u), v_u, l(x_u)\right)$$
and $h_w$ is trained as a 1-hidden-layer MLP.
Rk: another version is provided in which $h_w$ is obtained as a linear function in
which the intercept and the slope are estimated by MLP.
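This neighbor-wise decomposition is easy to write down explicitly; the sketch below replaces the placeholder $f_w$ of the earlier toy with the sum-over-neighbors form (the dimensions, the single hidden layer with tanh, and the random weights are again my own illustrative choices).

```python
import numpy as np

P, Q, D, H = 3, 2, 4, 8      # label dim, edge-label dim, state dim, hidden units

def h_w(label_i, edge_label, v_u, label_u, W1, W2):
    # One-hidden-layer MLP applied to a single (node, neighbor) pair.
    z = np.concatenate([label_i, edge_label, v_u, label_u])
    return W2 @ np.tanh(W1 @ z)

def f_w(label_i, edge_labels, neighbor_states, neighbor_labels, W1, W2):
    # v_i = sum over x_u in N(x_i) of h_w(l(x_i), l(x_i,x_u), v_u, l(x_u))
    return sum((h_w(label_i, e, v, l, W1, W2)
                for e, v, l in zip(edge_labels, neighbor_states, neighbor_labels)),
               np.zeros(D))

rng = np.random.default_rng(3)
W1 = rng.normal(size=(H, P + Q + D + P))
W2 = rng.normal(size=(D, H))
v_i = f_w(rng.normal(size=P), [rng.normal(size=Q)], [np.zeros(D)],
          [rng.normal(size=P)], W1, W2)
print(v_i.shape)    # (D,)
```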
23 / 32
Training of the weights
The weights of the two MLPs are trained by the minimization of
$$\sum_{i=1}^{n} \left(y_i - g_w(v_i^T)\right)^2$$
but, to ensure that the resulting $F_w$ is a contraction map, the weights of $F_w$ are
penalized during the training:
$$\sum_{i=1}^{n} \left(y_i - g_w(v_i^T)\right)^2 + \beta\, L\left(|\nabla_V F_w|\right)$$
with $L(u) = u - \mu$ for a given $\mu \in\, ]0, 1[$ and $\beta > 0$.
The training is performed by gradient descent where the gradient is obtained
by back-propagation.
BP is simplified using the fact that $(v_i^t)_t$ tends to a fixed point.
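Put schematically, the training criterion looks like the sketch below (a hypothetical PyTorch-style fragment: the node outputs, the estimate of $|\nabla_V F_w|$, and the decision to apply the penalty only when the norm exceeds $\mu$ are placeholders/assumptions of mine, not the paper's exact implementation, which back-propagates through the fixed point more carefully).

```python
import torch

def penalized_loss(outputs, targets, grad_norm, mu=0.9, beta=1e-2):
    """Sum of squared errors plus beta * L(|grad_V F_w|) with L(u) = u - mu.

    The penalty is only applied when the norm exceeds mu (my reading: no
    penalty as long as the contraction condition already holds)."""
    fit = torch.sum((targets - outputs) ** 2)
    penalty = torch.clamp(grad_norm - mu, min=0.0)
    return fit + beta * penalty

# Toy usage with made-up numbers.
outputs = torch.tensor([0.2, 0.8, 0.5])     # g_w(v_i^T) for three nodes
targets = torch.tensor([0.0, 1.0, 0.5])     # y_i
grad_norm = torch.tensor(0.95)              # estimate of |grad_V F_w|
print(penalized_loss(outputs, targets, grad_norm))
```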
24 / 32
Applications
The method is illustrated on different types of problems:
the subgraph matching problem (finding a subgraph matching a target
graph in a large graph) in which the prediction is made at the node level
(does it belong to the subgraph or not?)
recovering the mutagenic compounds among nitroaromatic compounds
(molecules used as intermediate by-products in many industrial
reactions); compounds are described by the molecule graph with
(qualitative and numerical) information attached to the nodes
web page ranking in which the purpose is to predict a Google PageRank-
derived measure from a network of 5,000 web pages
25 / 32
A second approach using a constructive architecture by Micheli, 2009
26 / 32
Overview of the method
is able to deal with undirected and cyclic graphs (but no labels on the
edges)
does not require a positional assumption on the neighbors of a given
node
can be used to make a prediction at the graph and node levels (probably,
though it is made explicit only for the graph level)
Main idea: define an architecture close to a "cascade correlation network" with
some "frozen" neurons that are not updated. The architecture is hierarchical
and adaptive, in the sense that it stops growing when a given accuracy is
achieved.
27 / 32
Neuron equations
Similarly to before, neurons are computed in a recurrent way that depends on
"time". The neuron state at time $t$ for vertex $x_i$ depends on its label and on
the neuron states of the neighboring neurons at all past times:
$$v_i^t = f\left(\sum_{j} w_j^t\, l_j(x_i) + \sum_{t' < t} \hat{w}^{t t'} \sum_{x_u \in \mathcal{N}(x_i)} v_u^{t'}\right)$$
Rk:
a stationarity assumption (the weights do not depend on the node nor on
the edge) is critical to obtain a simple enough formulation
contrary to RNN or to the previous version, the $(v_u^{t'})_{t' < t}$ are not updated: the
layers are trained one at a time and, once the training is finished, the
neuron states are considered "frozen" (which is a way to avoid problems
with cycles)
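A schematic Python rendering of this layer-wise update (my own toy: adjacency lists, scalar states, random weights; in the real method the weights of a layer are trained before the next layer is added, which this sketch does not do):

```python
import numpy as np

def add_layer(t, adj, labels, states, w_t, w_hat_t, f=np.tanh):
    """Compute the new (frozen) states v_i^t for every node i.

    adj[i]    : neighbors of node i (undirected graph, cycles allowed)
    labels[i] : label vector l(x_i)
    states    : previous layers, states[t'][i] = v_i^{t'} (already frozen)
    w_t       : weights on the label, shape (p,)
    w_hat_t   : weights hat{w}^{t t'} for each previous layer t' < t
    """
    new = {}
    for i in adj:
        s = np.dot(w_t, labels[i])                        # sum_j w_j^t l_j(x_i)
        for t_prev in range(t):                           # sum over t' < t ...
            s += w_hat_t[t_prev] * sum(states[t_prev][u]  # ... and over neighbors
                                       for u in adj[i])
        new[i] = f(s)
    return new

# Toy cyclic, undirected graph: a triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = {i: np.array([1.0, float(i)]) for i in adj}
states = []
rng = np.random.default_rng(4)
for t in range(3):               # grow three layers; earlier layers stay frozen
    states.append(add_layer(t, adj, labels, states,
                            rng.normal(size=2), rng.normal(size=max(t, 1))))
print(states[-1])
```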
28 / 32
Combining neuron outputs into a prediction
output of layer $t$: $\phi_w^t(G) = \frac{1}{C} \sum_{i=1}^{n} v_i^t$, where $C$ is a normalization factor
(equal to 1 or to the number of nodes for instance)
output of the network: $\Phi_w(G) = f\left(\sum_t w^t\, \phi_w^t(G)\right)$
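Continuing the same toy (the choice $C =$ number of nodes and the readout weights are mine), the graph-level prediction is just a weighted combination of the per-layer averages:

```python
import numpy as np

def graph_output(states, w_out, f=np.tanh):
    """Phi_w(G) = f(sum_t w^t * phi_w^t(G)), with phi_w^t(G) = (1/C) sum_i v_i^t."""
    phis = [np.mean(list(layer.values())) for layer in states]   # here C = n nodes
    return f(np.dot(w_out, phis))

# Usage with the `states` list built in the previous sketch (three layers):
# print(graph_output(states, w_out=np.array([0.5, -0.3, 0.8])))
```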
29 / 32
Training
Training is also performed by minimization of the squared error but:
no constraint is needed on the weights
back-propagation is not performed through unfolded layers
Examples
QSPR/QSAR task that consists in transforming information on molecular
structure into information on chemical properties. Here: prediction of the
boiling point value
classification of cyclic/acyclic graphs
30 / 32
That's all for now...
... questions?
31 / 32
References
Micheli A (2009) Neural networks for graphs: a contextual constructive approach. IEEE
Transactions on Neural Networks, 20(3): 498-511
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network
model. IEEE Transactions on Neural Networks, 20(1): 61-80
Sperduti A, Starita A (1997) Supervised neural network for the classification of structures.
IEEE Transactions on Neural Networks, 8(3): 714-735
32 / 32