GraphDF: A Discrete Flow Model for Molecular Graph Generation
Youzhi Luo¹ Keqiang Yan¹ Shuiwang Ji¹
Abstract
We consider the problem of molecular graph gen-
eration using deep models. While graphs are dis-
crete, most existing methods use continuous la-
tent variables, resulting in inaccurate modeling of
discrete graph structures. In this work, we pro-
pose GraphDF, a novel discrete latent variable
model for molecular graph generation based on
normalizing flow methods. GraphDF uses invert-
ible modulo shift transforms to map discrete la-
tent variables to graph nodes and edges. We show
that the use of discrete latent variables reduces
computational costs and eliminates the negative
effect of dequantization. Comprehensive exper-
imental results show that GraphDF outperforms
prior methods on random generation, property
optimization, and constrained optimization tasks.
1. Introduction
A fundamental problem in drug discovery (Stokes et al.,
2020; Wang et al., 2020) and chemistry is to design novel
molecules with specific properties. This task remains challenging because the space of molecules is naturally discrete and very large, with an estimated size on the order of $10^{33}$ (Polishchuk et al., 2013). Recent advances in deep generative models have led to significant progress in this field. Notably, many recent studies represent molecular
structures as graphs and propose to generate novel molecular
graphs with advanced deep generative models. From a
conceptual and computational perspective, these methods
first map molecular graphs to vectors in the continuous latent
space. When generating, the generative models project a
randomly picked continuous vector in the latent space back
to the molecular graph space.
To make use of generative models with continuous latent
variables, existing methods (Madhawa et al., 2019; Shi* et al., 2020; Zang & Wang, 2020) convert discrete graph data to continuous data by adding real-valued noise. However, such dequantization prevents models from capturing the original distribution of the discrete graph structures and thus increases the difficulty of model training. This becomes a key limitation that makes it hard to capture the true distribution of graph structures and to generate diverse molecules.

¹Department of Computer Science & Engineering, Texas A&M University, TX, USA. Correspondence to: Shuiwang Ji <sji@tamu.edu>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
In this work, we propose GraphDF, a generative model
using discrete latent variables for molecular graph genera-
tion. GraphDF generates molecular graphs by sequentially
sampling discrete latent variables and mapping them to
new nodes and edges via invertible modulo shift transforms.
Compared with previous methods, our proposed discrete transform eliminates the cost of computing the Jacobian matrix and overcomes the limitations resulting from dequantization.
Experimental results show that GraphDF can outperform
previous state-of-the-art methods over various molecule gen-
eration tasks. The implementation of GraphDF has been
integrated into the DIG (Liu et al., 2021) framework.
2. Background and Related Work
2.1. Molecule Generation
Let $\mathcal{M} = \{M_i\}_{i=1}^m$ be a set of molecules, $S(M) \in \mathbb{R}$ be a function computing a specific property score of the molecule $M$, and $\mathrm{SIM}(M, M') \in [0, 1]$ be a function measuring the similarity between two molecules $M$ and $M'$. The three molecule generation tasks can be defined as below.
• Learning a generation model $p_\theta(\cdot)$ from $\mathcal{M}$, where $p_\theta(M)$ is the probability of sampling the molecule $M$ from the model.
• Learning a property optimization model $p_\theta(\cdot)$ with respect to $S$ by maximizing $\mathbb{E}_{M \sim p_\theta}[S(M)]$.
• Learning a constrained optimization model $p_\theta(\cdot \mid M)$, so as to maximize $\mathbb{E}_{M' \mid M \sim p_\theta}[S(M')]$ while satisfying $\mathrm{SIM}(M, M') > \delta$, where $M$ is a given molecule, $M'$ is the optimized molecule, and $\delta$ is a similarity threshold.
Early attempts for molecule generation (Gómez-Bombarelli
et al., 2018; Kusner et al., 2017; Dai et al., 2018) represented
molecules as SMILES strings (Weininger, 1988) and devel-
oped sequence generation models. These models may not

easily learn the complicated grammatical rules of SMILES and thus often fail to generate syntactically valid sequences. More recent studies represent molecules as graphs,
and significant progress has been made in molecular graph
generation with deep generative models. A variety of molec-
ular graph generation methods have been developed based
on variational auto-encoders (VAE) (Kingma & Welling,
2013). Some of these methods generate molecular graphs
from a tree structure (Jin et al., 2018; Kajino, 2019); others
add a node or an edge sequentially one at a time (Liu et al.,
2018), or generate node and adjacency matrices directly in
one-shot sampling (Simonovsky & Komodakis, 2018; Ma
et al., 2018). For VAE-based models, Bayesian optimization
can be applied in the latent space to search molecules with
optimized property scores (Jin et al., 2018; Kajino, 2019).
Generative adversarial networks (GAN) (Goodfellow et al.,
2014) have also been used in some molecular graph genera-
tors (De Cao & Kipf, 2018; You et al., 2018a). Similar to
SeqGAN (Yu et al., 2017), these models are trained with
reinforcement learning.
In addition to methods based on VAE and GAN, normal-
izing flow models have been used in various generation
tasks (Dinh et al., 2014; 2016; Rezende & Mohamed, 2015).
These models define invertible transformations between la-
tent variables and data samples, thus allowing for exact
likelihood computation. It has been demonstrated that these
models are able to capture the density of molecular graphs
more accurately (Shi* et al., 2020; Zang & Wang, 2020).
Our proposed discrete flow model is based on the normaliz-
ing flow method.
2.2. Flow Models
A flow model is a parametric invertible function $f_\theta: \mathbb{R}^d \to \mathbb{R}^d$, defining an invertible mapping from the latent variable $z \in \mathbb{R}^d$ to the data point $x \in \mathbb{R}^d$, where $x \sim p_X(x)$ and $z \sim p_Z(z)$ are random variables. The latent distribution $p_Z$ is a pre-defined probability distribution, e.g., a Gaussian distribution. The data distribution $p_X$ is unknown, but given an arbitrary data point $x \in \mathbb{R}^d$, we can use the change-of-variable theorem to compute its log-likelihood as

$$\log p_X(x) = \log p_Z(f_\theta^{-1}(x)) + \log \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|, \quad (1)$$

where $\frac{\partial f_\theta^{-1}(x)}{\partial x} \in \mathbb{R}^{d \times d}$ is the Jacobian matrix. To train a flow model $f_\theta$ on a dataset $\mathcal{X} = \{x_i\}_{i=1}^m$, we use Eqn. (1) to compute the exact log-likelihoods of all data points and update the parameter $\theta$ by maximizing the log-likelihoods via gradient descent. To randomly sample a data point from $f_\theta$, a latent variable $z$ is first sampled from the pre-defined latent distribution $p_Z$; then we can obtain a data point $x$ by performing the feedforward transformation $x = f_\theta(z)$. Therefore, the mapping $f_\theta$ should be invertible, and the computation of the Jacobian matrix $\frac{\partial f_\theta^{-1}(x)}{\partial x}$ needs to be tractable in order to enable efficient training and sampling.
A common operation satisfying these two requirements is
the affine coupling layer (Dinh et al., 2016), which has been
used frequently in deep flow models.
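For illustration, a minimal sketch of an affine coupling layer and the log-likelihood computation of Eqn. (1) in PyTorch is given below; the network width, the half-split of dimensions, and the standard-normal prior are illustrative choices of ours, not details of any cited model.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling layer (Dinh et al., 2016): x1 is kept,
    x2 is affinely transformed, giving a triangular Jacobian."""
    def __init__(self, d):
        super().__init__()
        self.d1 = d // 2  # arbitrary split point
        self.net = nn.Sequential(nn.Linear(self.d1, 64), nn.Tanh(),
                                 nn.Linear(64, 2 * (d - self.d1)))

    def inverse(self, x):
        # Map data x to latent z and accumulate log|det dz/dx|.
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        s, t = self.net(x1).chunk(2, dim=1)
        z2 = (x2 - t) * torch.exp(-s)
        return torch.cat([x1, z2], dim=1), -s.sum(dim=1)

def log_likelihood(layer, x, prior):
    # Eqn. (1): log p_X(x) = log p_Z(f^{-1}(x)) + log|det J|.
    z, log_det = layer.inverse(x)
    return prior.log_prob(z).sum(dim=1) + log_det

prior = torch.distributions.Normal(0.0, 1.0)
x = torch.randn(8, 6)                             # a batch of 8 points in R^6
ll = log_likelihood(AffineCoupling(6), x, prior)  # shape (8,)
```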
Recently, flow models have been used in molecule or graph
generation tasks. Liu et al. (2019) propose an invertible message passing neural network, known as GNF, for simple
community graph generation. Following GraphVAE (Si-
monovsky & Komodakis, 2018), several recent studies gen-
erate molecular graphs by generating node and adjacency
matrices in a one-shot fashion via flow models, including
GraphNVP (Madhawa et al., 2019), GRF (Honda et al.,
2019), and MoFlow (Zang & Wang, 2020). GraphAF (Shi*
et al., 2020) generates molecular graphs by adding nodes
and edges sequentially via an autoregressive flow model
(Papamakarios et al., 2017). GraphCNF (Lippe & Gavves,
2021) jointly optimizes a variational inference model and a
flow model for graph generation.
3. A Discrete Flow Model for Graphs
While all current molecular graph generation methods use
continuous latent variables, we argue that such continuous
prior distribution is not suitable to model molecular graph
data, which are naturally discrete. On the other hand, a recent attempt has been made to develop generative models with discrete latent variables (Tran et al., 2019). It is, however, only able to generate 1D sequences, such as text, and cannot be used to generate richly structured objects such as graphs.
In this section, we introduce GraphDF, a novel molecular
graph generative model with discrete latent variables. To the best of our knowledge, GraphDF is the first work to demonstrate that discrete latent variable models have a strong capacity to model the density of complicated molecular graph data.
3.1. An Overview of Sequential Generation
Let $k$ be the total number of node types and $c$ be the total number of edge types. We use $G = (X, A)$ to denote a molecular graph with $n$ nodes, where $X \in \{0,1\}^{n \times k}$ is the node matrix and $A \in \{0,1\}^{n \times n \times c}$ is the adjacency tensor. $X[i, u] = 1$ represents that the $i$-th node has type $u$, and $A[i, j, v] = 1$ represents that there is an edge of type $v$ between the $i$-th node and the $j$-th node. If there is no edge between the $i$-th node and the $j$-th node, then $A[i, j, v] = 0$ for $v = 0, \dots, c-1$.
Similar to MolecularRNN (Popova et al., 2019), we generate
molecular graphs by a sequential decision process. We
start from an empty molecular graph G0. At step i, let
the current sub-graph be Gi−1 which contains i − 1 nodes.

[Figure 1. An illustration of the sequential generation procedure.]
The generative model $f_\theta$ first generates a new node with type $a_i$ based on a sampled latent variable $z_i$, where $a_i \in \{0, \dots, k-1\}$. Afterwards, the edges between the new node and all nodes in $G_{i-1}$ are generated sequentially by $f_\theta$. Specifically, the edge incident to the $j$-th node of $G_{i-1}$ is generated based on the latent variable $z_{ij}$, and its edge type $b_{ij} \in \{0, \dots, c-1, c\}$, where the extra edge type $c$ means no edge. If $b_{ij}$ violates the chemical bond valency rule, it is re-generated by re-sampling a new $z_{ij}$. The above process is an autoregressive function of previously generated elements and repeats until no new edge is added at step $i$, i.e., $b_{ij} = c$ for $j = 1, \dots, i-1$. An example of this process is given in Fig. 1. In summary, generating a molecular graph $G$ is equivalent to generating the sequence representation $S_G = (a_1, a_2, b_{21}, a_3, b_{31}, b_{32}, a_4, \dots)$ by

$$a_i = f_\theta(a_1, \dots, b_{i-1,i-2}; z_i), \quad i > 0,$$
$$b_{ij} = f_\theta(a_1, \dots, b_{i,j-1}; z_{ij}), \quad j = 1, \dots, i-1. \quad (2)$$
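The following sketch summarizes this decision process; `sample_node`, `sample_edge`, and `valency_ok` are hypothetical helpers standing in for the flow transforms of Eqn. (2) and the chemical valency check.

```python
# Schematic sketch of the sequential generation loop of Eqn. (2).
def generate_graph(sample_node, sample_edge, valency_ok, max_nodes, no_edge):
    nodes, edges = [], {}  # node types and {(i, j): edge type}
    for i in range(max_nodes):
        nodes.append(sample_node(nodes, edges))   # a_i from latent z_i
        added_edge = False
        for j in range(i):                        # edges to earlier nodes
            b = sample_edge(nodes, edges, i, j)   # b_ij from latent z_ij
            while not valency_ok(nodes, edges, i, j, b):
                b = sample_edge(nodes, edges, i, j)  # re-sample z_ij
            if b != no_edge:
                edges[(i, j)] = b
                added_edge = True
        if i > 0 and not added_edge:  # stop once a step adds no new edge
            nodes.pop()
            break
    return nodes, edges
```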
3.2. Generation with Discrete Latent Variables
In our method, all latent variables are discrete and sampled
from multinomial distributions. Specifically, $z_i$ is sampled from a multinomial distribution $p_{Z_a}$ with trainable parameters $(\alpha_0, \dots, \alpha_{k-1})$ as

$$p_{Z_a}(z_i = s) = \frac{\exp(\alpha_s)}{\sum_{t=0}^{k-1} \exp(\alpha_t)}. \quad (3)$$
Similarly, $z_{ij}$ is sampled from the multinomial distribution $p_{Z_b}$ with trainable parameters $(\beta_0, \dots, \beta_c)$. We use a discrete flow model to reversibly map discrete latent variables to new nodes and edges.

[Figure 2. An illustration of generation with the GraphDF framework.]

The discrete transform used in the
discrete flow is a modulo shift transform in the form of
$$q(z) = (z + \mu) \bmod t, \quad (4)$$

where $t$ is the number of categories, and $z, \mu \in \{0, \dots, t-1\}$. Formally, the discrete transform for generating new nodes and edges in Eqn. (2) is the composition of $D$ modulo shift modules as

$$a_i = q_i^D \circ \cdots \circ q_i^1(z_i), \quad i > 0,$$
$$b_{ij} = q_{ij}^D \circ \cdots \circ q_{ij}^1(z_{ij}), \quad j = 1, \dots, i-1, \quad (5)$$

where $f \circ g(\cdot)$ is defined as $f(g(\cdot))$, and

$$q_i^d(z) = (z + \mu_i^d) \bmod k, \quad d = 1, \dots, D,$$
$$q_{ij}^d(z) = (z + \mu_{ij}^d) \bmod (c+1), \quad d = 1, \dots, D. \quad (6)$$
Here the shift factors $\mu_i^d$ and $\mu_{ij}^d$ are functions of $(a_1, a_2, b_{21}, \dots, b_{i-1,i-2})$ and $(a_1, a_2, b_{21}, \dots, b_{i,j-1})$, respectively. They both play the role of capturing conditional information from previously generated elements. The calculation of $\mu_i^d$ and $\mu_{ij}^d$ will be described in Sec. 3.4. An illustrative example of generating a simple molecular graph with this framework is given in Fig. 2.
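As a concrete illustration, the sketch below implements the composed modulo shift transform of Eqns. (4)-(6) together with its exact inverse (cf. Eqns. (8)-(9)); the shift factors are plain integers here, whereas in GraphDF each $\mu^d$ is predicted from the current sub-graph (Sec. 3.4).

```python
# Sketch of the invertible modulo shift transform (Eqns. 4-6).
def forward_shift(z, mus, t):
    """Map a discrete latent z to a category via q^D ∘ ... ∘ q^1."""
    for mu in mus:
        z = (z + mu) % t
    return z

def inverse_shift(x, mus, t):
    """Exact inverse: subtract the shift factors in reverse order."""
    for mu in reversed(mus):
        x = (x - mu) % t
    return x

mus = [3, 1, 4]                          # D = 3 hypothetical shift factors
a = forward_shift(2, mus, t=5)           # e.g., k = 5 node types
assert inverse_shift(a, mus, t=5) == 2   # invertibility holds exactly
```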
The use of discrete latent variables is the key factor distin-
guishing our method from other generation methods based

on flow models. Our model has several advantages. The change-of-variable formula of a discrete invertible mapping does not involve the Jacobian term of Eqn. (1). Hence the probabilities of sampled nodes and edges can be directly obtained as

$$p(a_i \mid a_1, a_2, b_{21}, \dots, b_{i-1,i-2}) = p_{Z_a}(z_i),$$
$$p(b_{ij} \mid a_1, a_2, b_{21}, \dots, b_{i,j-1}) = p_{Z_b}(z_{ij}), \quad (7)$$

leading to a dramatic reduction in computational cost.
More importantly, discrete latent variables free the gener-
ative model from the defects of dequantization. Previous
methods (Madhawa et al., 2019; Shi* et al., 2020; Zang
& Wang, 2020) convert categorical representation of graph
nodes and edges to continuous data by adding real-valued
noise to their one-hot encodings. Then continuous flow models map such pseudo-continuous data to latent variables following a Gaussian distribution. We argue that such
a dequantization operation can be problematic. First, the
generative model does not capture the true distribution of
original discrete data, but the distorted distribution after
dequantization. Second, due to the randomness of noise, the
resulting continuous data points can be very different even if
they are obtained by performing dequantization on the same
graph. Hence during training, a single graph may be mapped
to two distant points in the latent space, leading to model
oscillation and failure to converge. However, our GraphDF
maps the discrete graph data to discrete latent space by dis-
crete invertible flow without introducing any distribution
distortion. Hence it can model the underlying distribution
of graph structures more accurately, thereby eliminating the
impact of the two issues caused by dequantization.
Note that our method is fundamentally different from
GraphCNF (Lippe & Gavves, 2021), though dequantization
is avoided in both methods. GraphCNF maps discrete values
to continuous latent space where different categories occupy
non-overlapping regions. Such a discrete-to-continuous
mapping leads to distribution distortions. In the generation process, discrete values are sampled by applying Bayes' rule, and all graph nodes and edges are generated at once. On the other hand, GraphDF uses a fundamentally different approach to model the discrete distribution: it completely discards the continuous latent space and maps discrete values to a discrete latent space by the modulo shift transform, so no distribution distortion is introduced. In addition, compared with GraphCNF, the sequential fashion of GraphDF can capture graph structure information efficiently and guarantees chemical validity through the bond valency check used during generation.
3.3. Inference and Training
The inverse process of generation is inference, which in-
volves mapping from graphs to discrete latent variables.
Given a graph $G = (X, A)$, we compute its sequence representation $S_G$ by re-ordering its nodes and edges according to the breadth-first search (BFS) order. Then we can obtain the discrete latent variables corresponding to each element of this sequence by inverting Eqn. (5) as

$$z_i = o_i^1 \circ \cdots \circ o_i^D(a_i), \quad i > 0,$$
$$z_{ij} = o_{ij}^1 \circ \cdots \circ o_{ij}^D(b_{ij}), \quad j = 1, \dots, i-1, \quad (8)$$

where

$$o_i^d(z) = (z - \mu_i^d) \bmod k, \quad d = 1, \dots, D,$$
$$o_{ij}^d(z) = (z - \mu_{ij}^d) \bmod (c+1), \quad d = 1, \dots, D. \quad (9)$$
Similar to the generation process, the shift factors $\mu_i^d$ and $\mu_{ij}^d$ capture autoregressive information from the elements of $S_G$ that precede $a_i$ and $b_{ij}$, respectively.
We train a GraphDF model on a molecular graph dataset by
maximizing the log-likelihoods of all data samples. Given a molecular graph $G$, we first perform inference on $G$ to compute the latent variables corresponding to its nodes and edges. Then the log-likelihood of $G$ can be computed as

$$\log p(G) = \sum_{i=1}^{n} \left[ \log p_{Z_a}(z_i) + \sum_{j=1}^{i-1} \log p_{Z_b}(z_{ij}) \right]. \quad (10)$$
The parameters of the GraphDF model and of the discrete prior distributions $p_{Z_a}$ and $p_{Z_b}$ are updated by gradient descent to maximize the log-likelihoods of all data samples.
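A toy rendering of this objective is sketched below, assuming the discrete latents have already been recovered by the inverse transform; gradients to the shift-factor networks additionally require the straight-through approximation of Sec. 3.4, which this sketch omits.

```python
import torch
import torch.nn.functional as F

# Schematic negative log-likelihood of Eqn. (10) for one graph.
# alpha and beta are the trainable prior logits of p_Za and p_Zb;
# z_nodes / z_edges are latents recovered by Eqns. (8)-(9).
def neg_log_likelihood(alpha, beta, z_nodes, z_edges):
    log_pa = F.log_softmax(alpha, dim=0)  # Eqn. (3)
    log_pb = F.log_softmax(beta, dim=0)
    return -(log_pa[z_nodes].sum() + log_pb[z_edges].sum())

alpha = torch.zeros(9, requires_grad=True)  # e.g., k = 9 node types
beta = torch.zeros(4, requires_grad=True)   # e.g., c + 1 = 4 edge types
z_nodes = torch.tensor([2, 0, 5])           # toy inferred latents
z_edges = torch.tensor([1, 3, 3])
loss = neg_log_likelihood(alpha, beta, z_nodes, z_edges)
loss.backward()                             # updates the discrete priors
```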
3.4. Invariant and Discrete Conditional Generation
The transformation of the discrete flow in Eqn. (5) is autoregressive and requires capturing conditional information into $\mu_i^d$ and $\mu_{ij}^d$. Tran et al. (2019) use language models, like LSTM (Hochreiter & Schmidhuber, 1997) and Transformer (Vaswani et al., 2017), in discrete flow models to develop sequence generators. However, the same method cannot be used directly to generate $S_G$, for the two reasons below.
• Language models are only aware of the node and edge type information encoded in $S_G$, while the rich structure information of graphs is not captured.
• There is no intrinsic order in an intermediately generated sub-graph, but language models assume an order when dealing with $S_G$. For instance, for $(a_1, a_2) = (1, 0)$ and $(a_1, a_2) = (0, 1)$, the generative model should give the same estimate $p(b_{21} \mid a_1, a_2)$, but language models cannot provide such a guarantee.
To tackle the two issues, we propose to capture conditional information from intermediate sub-graphs by using graph convolutional networks (GCN) (Kipf & Welling, 2017; Gao et al., 2018; Gao & Ji, 2019). Specifically, let the sub-graph generated immediately before adding the node $a_i$ be $G_{i-1} = (X, A)$; we use an $L$-layer relational GCN (R-GCN) (Schlichtkrull et al., 2018) to obtain node embeddings $H^L \in \mathbb{R}^{n \times r}$ from $G_{i-1}$ by performing $L$ message passing operations. The $l$-th message passing is performed by
$$H^l = \sum_{v=1}^{c} \mathrm{ReLU}\left( \tilde{D}_v^{-\frac{1}{2}} \tilde{A}_v \tilde{D}_v^{-\frac{1}{2}} H^{l-1} W_v^l \right), \quad (11)$$
where $H^0 = X$, $\tilde{A}_v = A[:, :, v] + I$, $\tilde{D}_v$ is a diagonal matrix with $\tilde{D}_v[j, j] = \sum_{u=1}^{n} \tilde{A}_v[j, u]$, and $\{W_v^l\}_{v=1}^{c}$ are the trainable weight matrices of the $l$-th layer. Note that the same R-GCN is shared among all modulo shift modules in the discrete flow.
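A minimal dense-tensor sketch of one such relational message passing layer (Eqn. (11)) is shown below; the dense adjacency representation and layer sizes are our own simplifications.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """Sketch of the relational message passing of Eqn. (11)."""
    def __init__(self, c, d_in, d_out):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(d_in, d_out, bias=False)
                                     for _ in range(c))

    def forward(self, H, A):
        # H: (n, d_in) node features; A: (n, n, c) adjacency tensor.
        n = H.size(0)
        out = 0
        for v, lin in enumerate(self.weights):
            A_v = A[:, :, v] + torch.eye(n)       # self-loops: A~_v
            deg = A_v.sum(dim=1).clamp(min=1e-8)  # diagonal of D~_v
            norm = deg.rsqrt()
            A_hat = norm[:, None] * A_v * norm[None, :]
            out = out + torch.relu(A_hat @ lin(H))  # sum over edge types
        return out

H = torch.randn(5, 16)                 # toy sub-graph with 5 nodes
A = torch.rand(5, 5, 3).round()        # c = 3 edge types
out = RGCNLayer(3, 16, 32)(H, A)       # node embeddings, shape (5, 32)
```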
Afterwards, $\mu_i^d$ is calculated by a classification network based on multi-layer perceptrons (MLP) as

$$h = \mathrm{sum}(H^L), \qquad \mu_i^d = \arg\max \mathrm{MLP}_a^d(h), \quad d = 1, \dots, D, \quad (12)$$

where $h \in \mathbb{R}^r$ is the graph embedding obtained by performing sum-pooling on the node embeddings $H^L$, $\mathrm{MLP}_a^d$ projects the input to a $k$-dimensional logits vector, and $\arg\max$ outputs the index position of the maximum logit value.
Since arg max is not differentiable, we use the softmax-
temperature function to approximate the gradient (Jang et al.,
2016) during training as
$$\gamma^d = \mathrm{MLP}_a^d(h), \qquad \mu_i^d \approx \mathrm{softmax}\!\left(\frac{\gamma^d}{\tau}\right), \quad d = 1, \dots, D, \quad (13)$$
where we fix the temperature τ = 0.1 in experiments.
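One common way to realize this is a straight-through estimator: the hard $\arg\max$ is used in the forward pass while gradients flow through the tempered softmax. A minimal sketch under that assumption (not the official implementation):

```python
import torch
import torch.nn.functional as F

def straight_through_shift(logits, tau=0.1):
    """Hard argmax forward; gradient of softmax(logits / tau) backward."""
    soft = F.softmax(logits / tau, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.size(-1)).float()
    return hard + soft - soft.detach()  # value = hard, grad flows via soft

logits = torch.randn(9, requires_grad=True)  # e.g., k = 9 node-type logits
mu_onehot = straight_through_shift(logits)
mu = int(mu_onehot.argmax())                 # integer shift factor mu_i^d
```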
Similarly, to obtain $\mu_{ij}^d$, we use the R-GCN to extract node embeddings $H^L$ and the graph embedding $h$ from $G_{i-1,j}$, which is the sub-graph generated immediately before adding the edge $b_{ij}$. Then we compute $\mu_{ij}^d$ by an MLP as

$$\mu_{ij}^d = \arg\max \mathrm{MLP}_b^d(\mathrm{Con}(h, H_i^L, H_j^L)), \quad d = 1, \dots, D,$$

where $H_i^L$ and $H_j^L$ denote the node embeddings of the $i$-th and $j$-th nodes of $G_{i-1,j}$, respectively, and $\mathrm{Con}(\cdot)$ is the concatenation operation. The softmax approximation is also used here to back-propagate through $\arg\max$.
By using our proposed method, the discrete flow model
can elegantly extract conditional information from graph
structures, which is then incorporated into the mapping from
discrete latent space to newly generated graph nodes and
edges. The conditional information is guaranteed to be
invariant to the order of previously generated elements. We
summarize the detailed generation and training algorithms
of GraphDF in Appendix A and B, respectively.
3.5. Property Optimization with Reinforcement Learning
The above discussions involve molecular graph generation
independent of property scores. However, both property
optimization and constrained optimization require the gen-
erated molecules to have optimal property scores. To this
end, we use reinforcement learning to fine-tune a trained
GraphDF model to optimize the specified property score.
We can formulate the sequential generation process with
GraphDF as a Markov Decision Process (MDP). Formally,
at each step, we treat the currently generated sub-graph as
the state, and adding a new node or edge as the action. The
action probabilities p(ai|Gi−1) and p(bij|Gi−1,j) are thus
determined by the GraphDF model in Eqn. (7). The rewards consist of intermediate rewards and a final reward. The intermediate reward is the penalty for violating the chemical bond valency rule. The final reward incorporates the property score and penalties for containing excessive steric strain or violating the functional group filters of ZINC (Irwin et al., 2012) in the final generated molecule.
Following GCPN (You et al., 2018a), we use Proximal Pol-
icy Optimization (PPO) (Schulman et al., 2017), a powerful
policy optimization algorithm, to fine-tune GraphDF models
in the above reinforcement learning environment. The loss
function for a generated graph $G$ is

$$L(G, \theta) = \sum_{i=1}^{n} \Big[ L_c(r_i, A_i) + \sum_{j=1}^{i-1} L_c(r_{ij}, A_{ij}) \Big],$$
$$L_c(r, A) = \min\{rA,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)A\}, \quad (14)$$

where $r_i = \frac{p(a_i \mid G_{i-1}; \theta)}{p(a_i \mid G_{i-1}; \theta_{\mathrm{old}})}$ and $r_{ij} = \frac{p(b_{ij} \mid G_{i-1,j}; \theta)}{p(b_{ij} \mid G_{i-1,j}; \theta_{\mathrm{old}})}$ are the ratios of action probabilities under the new and old policies, and $A_i$ and $A_{ij}$ are estimated advantage functions, for which we use the accumulated rewards of future steps in our experiments.
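A schematic rendering of this clipped objective, assuming per-action log-probabilities and advantages have been collected (the clip range $\epsilon = 0.2$ is a common PPO default, not a value stated here):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate of Eqn. (14), negated so that minimizing
    the loss maximizes the PPO objective."""
    ratio = torch.exp(logp_new - logp_old)        # r_i and r_ij
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).sum()

# Toy usage: one log-probability and advantage per node/edge action.
logp_new = torch.randn(5, requires_grad=True)
logp_old = torch.randn(5)
adv = torch.randn(5)                # accumulated future rewards
ppo_clip_loss(logp_new, logp_old, adv).backward()
```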
4. Experiments
In this section, we evaluate our GraphDF on three tasks
of molecule generation, as mentioned in Section 2.1. We
show that on most tasks, GraphDF can outperform three
strong baselines, JT-VAE (Jin et al., 2018), GCPN (You
et al., 2018a) and MRNN (Popova et al., 2019), and other
molecular graph generation methods based on flow models
including GraphNVP (Madhawa et al., 2019), GRF (Honda
et al., 2019), GraphAF (Shi* et al., 2020), MoFlow (Zang &
Wang, 2020) and GraphCNF (Lippe & Gavves, 2021).
4.1. Random Generation
Data. For random generation of molecular graphs, we train and evaluate GraphDF on three molecule datasets: ZINC250K (Irwin et al., 2012), QM9 (Ramakrishnan et al., 2014), and MOSES (Polykovskiy et al., 2020).

Table 1. Random generation performance on the ZINC250K dataset. Results with ∗ are taken from Shi* et al. (2020); results of MoFlow are obtained by running its official source code.

Method    | Validity | Validity w/o check | Uniqueness | Novelty | Reconstruction
JT-VAE    | 100%     | n/a                | 100%       | 100%    | 76.7%
GCPN      | 100%     | 20%                | 99.97%     | 100%    | n/a
MRNN      | 100%     | 65%                | 99.89%     | 100%    | n/a
GraphNVP  | 42.6%    | n/a                | 94.8%      | 100%    | 100%
GRF       | 73.4%    | n/a                | 53.7%      | 100%    | 100%
GraphAF   | 100%     | 68%                | 99.1%      | 100%    | 100%
MoFlow    | 100%     | 50.3%              | 99.99%     | 100%    | 100%
GraphCNF  | 96.35%   | n/a                | 99.98%     | 99.98%  | 100%
GraphDF   | 100%     | 89.03%             | 99.16%     | 100%    | 100%
Table 2. Random generation performance on the QM9 dataset. Results of MoFlow are obtained by running its official source code.

Method    | Validity | Validity w/o check | Uniqueness | Novelty | Reconstruction
GraphNVP  | 83.1%    | n/a                | 99.2%      | 58.2%   | 100%
GRF       | 84.5%    | n/a                | 66%        | 58.6%   | 100%
GraphAF   | 100%     | 67%                | 94.15%     | 88.83%  | 100%
MoFlow    | 100%     | 88.96%             | 98.53%     | 96.04%  | 100%
GraphDF   | 100%     | 82.67%             | 97.62%     | 98.1%   | 100%
Table 3. Random generation performance on the MOSES dataset. Results with ∗ are taken from Polykovskiy et al. (2020).

Method    | Validity | Validity w/o check | Uniqueness | Novelty | Reconstruction
JT-VAE    | 100%     | n/a                | 99.96%     | 91.43%  | n/a
GraphCNF  | 95.66%   | n/a                | 99.98%     | 100%    | n/a
GraphAF   | 100%     | 71%                | 99.99%     | 100%    | 100%
GraphDF   | 100%     | 87.58%             | 99.55%     | 100%    | 100%
Table 4. Random generation performance on the COMMUNITY-SMALL and EGO-SMALL datasets. Following GNF (Liu et al., 2019), we evaluate with the MMD metric; the first set of results is obtained using node distribution matching, and the second set is obtained by evaluating 1024 generated graphs without node distribution matching.

                | COMMUNITY-SMALL           | EGO-SMALL
Method          | Degree  Cluster  Orbit    | Degree  Cluster  Orbit
GNF             | 0.20    0.20     0.11     | 0.03    0.11     0.001
GraphAF         | 0.18    0.20     0.02     | 0.03    0.11     0.001
GraphDF         | 0.06    0.12     0.03     | 0.04    0.13     0.010
GNF (1024)      | 0.12    0.15     0.02     | 0.01    0.03     0.0008
GraphAF (1024)  | 0.06    0.10     0.015    | 0.04    0.04     0.008
GraphDF (1024)  | 0.04    0.10     0.018    | 0.07    0.10     0.014
All molecules are preprocessed to the kekulized form by RDKit (Landrum, 2016). In addition, we conduct experiments to show that GraphDF is general and can be applied to graphs other than molecular graphs. Following the experiments of GNF (Liu et al., 2019), we test GraphDF on two
generic graph datasets, COMMUNITY-SMALL and EGO-
SMALL. Detailed information about all datasets used in
random generation experiments is provided in Appendix C.
Setup. We adopt five widely used metrics to evaluate
GraphDF on molecular graph generation. Among all gen-
erated molecules, Validity is the percentage of molecules
that do not violate the chemical valency rule. Since we use re-sampling when the valency check fails, as mentioned in Sec. 3.1, we also report Validity w/o check, which is the validity percentage when this valency correction is disabled. Among
all generated valid molecules, Uniqueness and Novelty are
the percentage of unique molecules and molecules not ap-
pearing in the training data, respectively. Reconstruction is
the percentage of molecules which can be reconstructed
from their latent variables. The five metrics are calcu-
lated from 10,000 generated molecular graphs. We find
that the reported metrics in the MoFlow paper (Zang & Wang, 2020) are not calculated from the same 10,000 generated molecules, so we recalculate the metrics of MoFlow by re-running its official source code for a fair comparison. As for
COMMUNITY-SMALL and EGO-SMALL datasets, we
follow GNF (Liu et al., 2019) to calculate the Maximum
Mean Discrepancy (MMD) (Gretton et al., 2012) between the statistics of degrees, clustering coefficients, and orbit counts of generated graphs and graphs in the dataset. We
calculate the MMD from 1024 generated graphs in the two
cases of using and not using node distribution matching.
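For reference, a toy MMD computation between two sets of graph statistics is sketched below; the Gaussian kernel and its bandwidth are our own simplifications, and the GNF evaluation protocol makes its own kernel choices.

```python
import numpy as np

def gaussian_mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate (Gretton et al., 2012) between two
    1-D samples of graph statistics, with a Gaussian kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

deg_gen = np.random.rand(1024)  # e.g., degree statistics of generated graphs
deg_ref = np.random.rand(1024)  # degree statistics of dataset graphs
print(gaussian_mmd2(deg_gen, deg_ref))
```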
The flow model in GraphDF consists of 12 modulo shift
modules, where the shared R-GCN has 3 message passing
layers. The hidden dimension and output dimension of
node embeddings are 128. For molecular graphs, the input
node embeddings are one-hot embeddings of node types.
For graphs in COMMUNITY-SMALL and EGO-SMALL
datasets, the input node embeddings are one-hot indicator
vectors. After message passing layers, a batch normalization
layer and a sum pooling layer are used to get the graph
embedding. The two MLPs for calculating $\mu_i^d$ and $\mu_{ij}^d$ both have two fully-connected layers, with a hidden dimension of 128 and tanh as the non-linear activation function. A single R-GCN is shared among all 12 modulo shift modules in the discrete flow model, while each modulo shift module has its own MLP to calculate $\mu_i^d$ or $\mu_{ij}^d$. We use the above model architecture configuration for all experiments in this paper.
The Adam (Kingma & Ba, 2015) optimizer is used to train GraphDF models. On each molecule dataset, GraphDF
is trained for 10 epochs, where the batch size is 32 and
the learning rate is 0.001. On COMMUNITY-SMALL and
EGO-SMALL datasets, GraphDF is trained for 1000 epochs,
where the batch size is 16 and the learning rate is 0.001. See
Appendix D for more training and generation details.
Results. Table 1, Table 2 and Table 3 show that GraphDF
can outperform the state-of-the-art methods on most metrics on all three molecule datasets. Thanks to the invertible
property of flow models and valency correction, GraphDF
can always achieve 100% validity and reconstruction rate.
In addition, GraphDF achieves reasonably high uniqueness
and novelty rates on all three datasets, showing that it neither overfits nor suffers from mode collapse. Particularly, GraphDF can
outperform GraphAF by a large margin in terms of validity
w/o check. Since GraphAF generates molecular graphs in
the same sequential fashion but uses continuous flow, we
Table 5. Property optimization performance evaluated by top-3 property scores. The scores of MRNN are taken from Shi* et al. (2020).

                | Penalized logP        | QED
Method          | 1st    2nd    3rd     | 1st    2nd    3rd
ZINC (Dataset)  | 4.52   4.30   4.23    | 0.948  0.948  0.948
JT-VAE          | 5.30   4.93   4.49    | 0.925  0.911  0.910
GCPN            | 7.98   7.85   7.80    | 0.948  0.947  0.946
MRNN            | 8.63   6.08   4.73    | 0.844  0.796  0.736
GraphAF         | 12.23  11.29  11.05   | 0.948  0.948  0.947
MoFlow          | n/a    n/a    n/a     | 0.948  0.948  0.948
GraphDF         | 13.70  13.18  13.17   | 0.948  0.948  0.948
believe that the good performance of GraphDF comes from
the usage of discrete flow. It strongly demonstrates that
discrete latent variables can learn chemical rules and model
the underlying distributions more accurately than continu-
ous latent variables. Besides, results in Table 4 show that
GraphDF can consistently achieve good performance on
generic graph data when compared with GNF and GraphAF.
4.2. Property Optimization
Setup. In the property optimization task, our objective is to generate novel molecules with high property scores. We
use penalized logP and QED (Bickerton et al., 2012) as
the optimization targets. Penalized logP is the logP score
reduced by synthetic accessibility and ring size, and QED
is quantitative estimation of drug-likeness. We calculate
them using the scripts from the official implementation of
GraphAF (Shi* et al., 2020).
We use reinforcement learning to optimize these two targets.
In this process, both intermediate reward and final reward
are considered. The intermediate reward is the penalty for violating the chemical bond valency rule: at a generation step, if the action is to add a new edge to the sub-graph and the new edge violates the chemical bond valency rule, then a negative reward of -1 is assigned to this step. Besides, when a complete molecular graph $G$ is generated, the final reward $R(G)$ is calculated as
$$R(G) = R_p(G) - R_{ss}(G) - R_f(G). \quad (15)$$
$R_{ss}(G)$ is 1 if $G$ is too sterically strained, i.e., the average angle-bend energy of $G$ exceeds 0.82 kcal/mol, and 0 otherwise. $R_f(G)$ is 1 when the functional group filters of ZINC (Irwin et al., 2012) detect problematic functional groups in $G$, and 0 otherwise. $R_p(G)$ is related to the property score of $G$.

[Figure 3. Generated molecules with the best property scores after property optimization. (a) Four molecules with top penalized logP scores (13.70, 13.18, 13.17, 13.09); they are all long carbon chains differing in only one atom, which are typical chemical structures with high penalized logP scores. (b) Four molecules with top QED scores (0.948, 0.948, 0.948, 0.946).]
Denoting by $\mathrm{logP}(\cdot)$ the computation of the penalized logP property and by $\mathrm{QED}(\cdot)$ the computation of the QED property, we use

$$R_p(G) = \exp(\mathrm{logP}(G)/3.0 - 4.0) \quad \text{and} \quad R_p(G) = 2.0 \cdot \mathrm{QED}(G),$$

respectively. The final reward $R(G)$ is distributed to all intermediate steps with a discount factor $\gamma = 0.9$. Specifically, if the total number of steps to generate $G$ is $T$, then the reward assigned to the $t$-th step is $\gamma^{T-t} R(G)$ for $t = 1, \dots, T$.
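A small sketch of this reward shaping, with hypothetical per-step valency-violation flags:

```python
# Sketch of the reward assignment: a -1 intermediate penalty per
# valency violation plus the discounted final reward of Eqn. (15).
def step_rewards(violations, final_reward, gamma=0.9):
    T = len(violations)
    return [(-1.0 if violations[t - 1] else 0.0)
            + gamma ** (T - t) * final_reward
            for t in range(1, T + 1)]

print(step_rewards([False, True, False], final_reward=2.0))
# earlier steps receive a more heavily discounted share of R(G)
```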
[Figure 4. An illustration of two examples in the constrained optimization of penalized logP.]
We first pretrain the GraphDF model on the ZINC250K dataset for 1000 epochs and then fine-tune the model with reinforcement learning for 200 iterations. More details about model training are presented in Appendix D. Following prior literature, we report the top-3 property scores of molecules generated from the model.
Results. We summarize the top-3 property scores of the ZINC250K dataset and all generation methods in Table 5. It shows that GraphDF outperforms all baselines in the penalized logP optimization task and achieves comparable performance in the QED optimization task. Note that a reinforcement learning procedure similar to that described in Sec. 3.5 is used to optimize the property scores in GCPN (You et al., 2018a) and GraphAF (Shi* et al., 2020). Hence, we believe that the
good performance of GraphDF demonstrates its strong ca-
pacity to explore the complicated chemical structure space
and generate diverse molecular graphs. Some generated
molecules with high penalized logP scores and QED scores
are visualized in Figure 3.
4.3. Constrained Optimization
Data. In this task, we aim to modify the input molecu-
lar graph so as to improve its penalized logP score while
the similarity between the input and modified molecules
is higher than a threshold δ. Following JT-VAE (Jin et al.,
2018), GCPN (You et al., 2018a), GraphAF (Shi* et al.,
2020), and MoFlow (Zang & Wang, 2020), we use 800 selected molecules from ZINC250K with low penalized logP scores as the input molecules to be optimized. However, we
find that the set of 800 molecules used in JT-VAE and GCPN
are different from that used in GraphAF and MoFlow, and
the official implementation of GCPN does not provide code
for re-running constrained optimization on new datasets.
Hence, for fair comparison, we train GraphDF and report

Table 6. Constrained optimization performance on the 800 molecules used in JT-VAE and GCPN.

    | JT-VAE                            | GCPN                              | GraphDF
δ   | Improvement  Similarity  Success  | Improvement  Similarity  Success  | Improvement  Similarity  Success
0.0 | 1.91±2.04    0.28±0.15   97.5%    | 4.20±1.28    0.32±0.12   100%     | 5.93±1.97    0.30±0.12   100%
0.2 | 1.68±1.85    0.33±0.13   97.1%    | 4.12±1.19    0.34±0.11   100%     | 5.62±1.65    0.32±0.10   100%
0.4 | 0.84±1.45    0.51±0.10   83.6%    | 2.49±1.30    0.47±0.08   100%     | 4.13±1.41    0.47±0.07   100%
0.6 | 0.21±0.71    0.69±0.06   46.4%    | 0.79±0.63    0.68±0.08   100%     | 1.72±1.15    0.67±0.06   93%
Table 7. Constrained optimization performance on the 800 molecules used in GraphAF and MoFlow.

    | GraphAF                           | MoFlow                            | GraphDF
δ   | Improvement  Similarity  Success  | Improvement  Similarity  Success  | Improvement  Similarity  Success
0.0 | 13.13±6.89   0.29±0.15   100%     | 8.61±5.44    0.30±0.20   98.88%   | 14.15±6.86   0.29±0.13   100%
0.2 | 11.90±6.86   0.33±0.12   100%     | 7.06±5.04    0.43±0.20   96.75%   | 12.77±6.59   0.32±0.11   100%
0.4 | 8.21±6.51    0.49±0.09   99.88%   | 4.71±4.55    0.61±0.18   85.75%   | 9.19±6.43    0.48±0.08   99.63%
0.6 | 4.98±6.49    0.66±0.05   96.88%   | 2.10±2.86    0.79±0.14   58.25%   | 4.51±5.80    0.65±0.05   92.13%
results on both sets of molecules: the one used in JT-VAE and GCPN, and the one used in GraphAF and MoFlow.
Setup. Similar to the property optimization task, GraphDF is pretrained on the ZINC250K dataset for 1000 epochs and then fine-tuned with reinforcement learning. More details
about model training can be found in Appendix D.
We use the reward defined in Eqn. (15), where $R_p(G)$ is defined as

$$R_p(G) = \mathrm{logP}(G) - \mathrm{logP}(G_{in}),$$

where $G_{in}$ is the input molecular graph. The initial state
of a generation process is set to be a sub-graph of the input molecular graph. Specifically, given an input molecular graph $G_{in}$, we first re-order its nodes by a breadth-first search, then randomly remove the last $m$ nodes and the edges incident to them, where $m$ is randomly selected from $\{0, 1, 2, 3, 4, 5\}$. The remaining sub-graph is set as the initial state.
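A sketch of this initial-state construction, using networkx for the BFS ordering (the start node and toy input graph are arbitrary choices of ours):

```python
import random
import networkx as nx

def initial_subgraph(g, start=0, max_remove=5):
    """Re-order nodes by BFS from `start`, then drop the last m nodes
    and the edges incident to them to form the initial state."""
    order = [start] + [v for _, v in nx.bfs_edges(g, start)]
    m = random.randint(0, max_remove)   # m sampled from {0, ..., 5}
    keep = order[:len(order) - m] if m > 0 else order
    return g.subgraph(keep).copy()

g = nx.path_graph(8)        # toy stand-in for an input molecular graph
sub = initial_subgraph(g)
print(sub.number_of_nodes())
```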
We use Tanimoto similarity with Morgan fingerprint (Rogers
& Hahn, 2010) to compute the similarity between the input
and the optimized molecules. We optimize the input 800
molecules under 4 different similarity constraints of 0.0, 0.2,
0.4, and 0.6, respectively. The success rate under each similarity constraint is reported, which is the percentage of the 800 molecules that can be successfully optimized. Over all successfully optimized molecules, we calculate the mean and standard deviation of the largest property improvement and of the similarities between the optimized molecules and their corresponding input molecules.
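For illustration, the similarity computation can be sketched with RDKit as below; the radius-2, 2048-bit Morgan fingerprint is a common default and an assumption on our part.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between Morgan fingerprints of two molecules."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s),
                                                 radius, nBits=n_bits)
           for s in (smiles_a, smiles_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

print(tanimoto("CCO", "CCN"))   # used to check SIM(M, M') > delta
```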
Results. Results of constrained optimization on two
datasets are summarized in Table 6 and Table 7, respec-
tively. Results show that GraphDF can achieve much higher
average property improvement than JT-VAE, GCPN, and
MoFlow under 4 similarity constraints, and outperform
GraphAF under 3 of 4 similarity constraints. In addition,
the average similarities achieved by GraphDF are compara-
ble with all baselines, and unlike JT-VAE or MoFlow, the
success rate of GraphDF does not decrease dramatically as
the similarity threshold increases. The good performance
further demonstrates that GraphDF has a strong ability to search for molecules with high property scores in the chemical space. We
illustrate several optimization examples in Figure 4.
5. Conclusion
We propose GraphDF, the first molecular graph generation
framework using discrete latent variables. GraphDF sequentially maps discrete latent variables to graph nodes and edges
via invertible modulo shift transforms. The use of discrete
latent variables results in more accurate modeling of graph
structures and stronger capacity to explore the chemical
space. More importantly, GraphDF avoids the expensive computation of the Jacobian matrix and eliminates the negative effects of dequantization, which exist in many prior methods. We experimentally demonstrate the advantages of GraphDF and achieve new state-of-the-art results on three molecule generation tasks. The major limitations of GraphDF are its reliance on BFS node ordering during training and its generation speed, which is slower than that of one-shot methods. In the future,
we would like to apply GraphDF to other types of graph
data and extend it to graph translation problems.
Acknowledgements
We thank Chence Shi for his help in providing technical
details on the implementation of reinforcement learning
method. This work was supported in part by National Sci-
ence Foundation grant DBI-2028361.

References
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S.,
and Hopkins, A. L. Quantifying the chemical beauty of
drugs. Nature chemistry, 4(2):90–98, 2012.
Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntax-
directed variational autoencoder for structured data. In
7th International Conference on Learning Representa-
tions, 2018.
De Cao, N. and Kipf, T. MolGAN: An implicit generative
model for small molecular graphs. ICML 2018 workshop
on Theoretical Foundations and Applications of Deep
Generative Models, 2018.
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear
independent components estimation. arXiv preprint
arXiv:1410.8516, 2014.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estima-
tion using real NVP. In 4th International Conference on
Learning Representations, 2016.
Gao, H. and Ji, S. Graph U-Nets. In Chaudhuri, K. and
Salakhutdinov, R. (eds.), Proceedings of the 36th Inter-
national Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pp. 2083–
2092. PMLR, 09–15 Jun 2019.
Gao, H., Wang, Z., and Ji, S. Large-scale learnable graph
convolutional networks. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discov-
ery & Data Mining, KDD ’18, pp. 1416–1424, New York,
NY, USA, 2018. Association for Computing Machinery.
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D.,
Hernández-Lobato, J. M., Sánchez-Lengeling, B., She-
berla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams,
R. P., and Aspuru-Guzik, A. Automatic chemical de-
sign using a data-driven continuous representation of
molecules. ACS central science, 4(2):268–276, 2018.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.
Generative adversarial nets. In Ghahramani, Z., Welling,
M., Cortes, C., Lawrence, N., and Weinberger, K. Q.
(eds.), Advances in Neural Information Processing Sys-
tems, volume 27, pp. 2672–2680. Curran Associates, Inc.,
2014.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B.,
and Smola, A. A kernel two-sample test. The Journal of
Machine Learning Research, 13(1):723–773, 2012.
Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
Honda, S., Akita, H., Ishiguro, K., Nakanishi, T., and Oono,
K. Graph residual flow for molecular graph generation.
arXiv preprint arXiv:1909.13521, 2019.
Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S.,
and Coleman, R. G. ZINC: a free tool to discover chem-
istry for biology. Journal of chemical information and
modeling, 52(7):1757–1768, 2012.
Jang, E., Gu, S., and Poole, B. Categorical reparameteriza-
tion with Gumbel-Softmax. In 4th International Confer-
ence on Learning Representations, 2016.
Jin, W., Barzilay, R., and Jaakkola, T. Junction tree vari-
ational autoencoder for molecular graph generation. In
Dy, J. and Krause, A. (eds.), Proceedings of the 35th In-
ternational Conference on Machine Learning, volume 80
of Proceedings of Machine Learning Research, pp. 2323–
2332, 2018.
Kajino, H. Molecular hypergraph grammar with its appli-
cation to molecular optimization. In Chaudhuri, K. and
Salakhutdinov, R. (eds.), Proceedings of the 36th Inter-
national Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, Long Beach,
California, USA, 2019.
Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
Kingma, D. P. and Welling, M. Auto-Encoding variational
bayes. In 2nd International Conference on Learning
Representations, 2013.
Kipf, T. N. and Welling, M. Semi-supervised classification
with graph convolutional networks. In 5th International
Conference on Learning Representations, 2017.
Kusner, M. J., Paige, B., and Hernández-Lobato, J. M.
Grammar variational autoencoder. In Precup, D. and Teh,
Y. W. (eds.), Proceedings of the 34th International Con-
ference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pp. 1945–1954, 2017.
Landrum, G. RDKit: Open-source cheminformatics. http://www.rdkit.org, 2016. Accessed: 2021-01-16.
Lippe, P. and Gavves, E. Categorical normalizing flows via
continuous transformations. In 9th International Confer-
ence on Learning Representations, 2021.
Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K.
Graph normalizing flows. In Wallach, H., Larochelle, H.,
Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R.
(eds.), Advances in Neural Information Processing Sys-
tems, volume 32, pp. 13578–13588. Curran Associates,
Inc., 2019.

Liu, M., Luo, Y., Wang, L., Xie, Y., Yuan, H., Gui, S.,
Yu, H., Xu, Z., Zhang, J., Liu, Y., et al. DIG: A turnkey
library for diving into graph deep learning research. arXiv
preprint arXiv:2103.12608, 2021.
Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A.
Constrained graph variational autoencoders for molecule
design. In Bengio, S., Wallach, H., Larochelle, H., Grau-
man, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Ad-
vances in Neural Information Processing Systems, vol-
ume 31, pp. 7795–7804. Curran Associates, Inc., 2018.
Ma, T., Chen, J., and Xiao, C. Constrained generation of
semantically valid graphs via regularizing variational au-
toencoders. In Bengio, S., Wallach, H., Larochelle, H.,
Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),
Advances in Neural Information Processing Systems, vol-
ume 31, pp. 7113–7124. Curran Associates, Inc., 2018.
Madhawa, K., Ishiguro, K., Nakago, K., and Abe, M. Graph-
NVP: An invertible flow model for generating molecular
graphs. arXiv preprint arXiv:1905.11600, 2019.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked
autoregressive flow for density estimation. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R.,
Vishwanathan, S., and Garnett, R. (eds.), Advances in
Neural Information Processing Systems, volume 30, pp.
2338–2347. Curran Associates, Inc., 2017.
Polishchuk, P. G., Madzhidov, T. I., and Varnek, A. Es-
timation of the size of drug-like chemical space based
on GDB-17 data. Journal of computer-aided molecular
design, 27(8):675–679, 2013.
Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11:1931, 2020. ISSN 1663-9812. doi: 10.3389/fphar.2020.565644. URL https://www.frontiersin.org/article/10.3389/fphar.2020.565644.
Popova, M., Shvets, M., Oliva, J., and Isayev, O. Molecu-
larRNN: Generating realistic molecular graphs with op-
timized properties. arXiv preprint arXiv:1905.13372,
2019.
Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld,
O. A. Quantum chemistry structures and properties of
134 kilo molecules. Scientific data, 1(1):1–7, 2014.
Rezende, D. and Mohamed, S. Variational inference with
normalizing flows. In Bach, F. and Blei, D. (eds.), Pro-
ceedings of the 32nd International Conference on Ma-
chine Learning, volume 37 of Proceedings of Machine
Learning Research, pp. 1530–1538, Lille, France, 2015.
Rogers, D. and Hahn, M. Extended-connectivity finger-
prints. Journal of chemical information and modeling, 50
(5):742–754, 2010.
Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg,
R., Titov, I., and Welling, M. Modeling relational data
with graph convolutional networks. In Gangemi, A., Nav-
igli, R., Vidal, M.-E., Hitzler, P., Troncy, R., Hollink, L.,
Tordai, A., and Alam, M. (eds.), The Semantic Web, pp.
593–607, Cham, 2018. Springer International Publishing.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B.,
and Eliassi-Rad, T. Collective classification in network
data. AI magazine, 29(3):93–93, 2008.
Shi*, C., Xu*, M., Zhu, Z., Zhang, W., Zhang, M., and
Tang, J. GraphAF: a flow-based autoregressive model
for molecular graph generation. In 8th International
Conference on Learning Representations, 2020.
Simonovsky, M. and Komodakis, N. GraphVAE: Towards
generation of small graphs using variational autoencoders.
In Kurková, V., Manolopoulos, Y., Hammer, B., Iliadis,
L., and Maglogiannis, I. (eds.), Artificial Neural Net-
works and Machine Learning – ICANN 2018, pp. 412–
422, Cham, 2018. Springer International Publishing.
Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz,
A., Donghia, N. M., MacNair, C. R., French, S., Carfrae,
L. A., Bloom-Ackerman, Z., et al. A deep learning ap-
proach to antibiotic discovery. Cell, 180(4):688–702,
2020.
Tran, D., Vafa, K., Agrawal, K., Dinh, L., and Poole, B.
Discrete flows: Invertible generative models of discrete
data. In Wallach, H., Larochelle, H., Beygelzimer, A.,
d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances
in Neural Information Processing Systems, volume 32,
pp. 14719–14728. Curran Associates, Inc., 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. At-
tention is all you need. In Guyon, I., Luxburg, U. V.,
Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.,
and Garnett, R. (eds.), Advances in Neural Information
Processing Systems, volume 30, pp. 5998–6008. Curran
Associates, Inc., 2017.

Wang, Z., Liu, M., Luo, Y., Xu, Z., Xie, Y., Wang, L.,
Cai, L., and Ji, S. Advanced graph and sequence neu-
ral networks for molecular property prediction and drug
discovery. arXiv preprint arXiv:2012.01981, 2020.
Weininger, D. SMILES, a chemical language and informa-
tion system. 1. introduction to methodology and encoding
rules. Journal of chemical information and computer sci-
ences, 28(1):31–36, 1988.
You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph
convolutional policy network for goal-directed molecular
graph generation. In Bengio, S., Wallach, H., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),
Advances in Neural Information Processing Systems, vol-
ume 31, pp. 6410–6421. Curran Associates, Inc., 2018a.
You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec,
J. GraphRNN: Generating realistic graphs with deep
auto-regressive models. In Dy, J. and Krause, A. (eds.),
Proceedings of the 35th International Conference on Ma-
chine Learning, volume 80 of Proceedings of Machine
Learning Research, pp. 5708–5717, Stockholmsmässan,
Stockholm Sweden, 2018b. PMLR.
Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence
generative adversarial nets with policy gradient. In Pro-
ceedings of the AAAI conference on artificial intelligence,
volume 31, 2017.
Zang, C. and Wang, F. MoFlow: an invertible flow model for
generating molecular graphs. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp. 617–626, 2020.