Why Use Diffusion Models for Molecules?
The PDB database is limited:
For robust diffusion model training, millions of diverse data points are needed. Data augmentation enhances:
Data augmentation techniques create a richer dataset, boosting model performance.
Data Augmentation Techniques:
Figure 1: MD Simulation Trajectories
Figure 2: Pharmacophore Modeling
Benchmark on the PoseBusters dataset: Version 1 (428 structures) and Version 2 (308 structures), both released in the PDB after 2021. Performance metric: percentage of ligand pairs with $\mathrm{RMSD} < 2$ Å under pocket alignment.
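The success-rate metric can be sketched as follows. This is a minimal illustration, assuming that each "ligand pair" is a predicted pose and its reference pose, that atoms are listed in the same order in both, and that the structures are already aligned on the pocket; the function names `rmsd` and `success_rate` are hypothetical, and no symmetry correction is applied.

```python
import numpy as np

def rmsd(pred, ref):
    """RMSD in Å between two (n_atoms, 3) coordinate arrays with matching atom order."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Mean over atoms of the squared Euclidean distance, then square root.
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def success_rate(rmsds, threshold=2.0):
    """Percentage of poses whose RMSD is strictly below the threshold (in Å)."""
    rmsds = np.asarray(rmsds, dtype=float)
    return 100.0 * float(np.mean(rmsds < threshold))
```

For example, four poses with RMSDs of 0.5, 1.9, 2.0, and 3.1 Å give a success rate of 50%, since only the first two fall strictly below 2 Å.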
Traditional docking tools are slow, which limits how efficiently virtual screening can be applied.
For a given protein linked to a certain disease, the goal of virtual screening is to select a few small molecules (i.e., ligands) from a library of millions of candidates such that the selected candidates have the highest utility for treating the disease.
Even with the right trade-off objective elicited from experts, exhaustively screening millions of candidates from the virtual screening library is practically infeasible. To address this problem, we can choose to screen ligands that look promising, while avoiding ligands that are highly likely to be bad candidates.
To prioritize high-potential ligands, we use Bayesian Optimization, an approach that balances exploring new candidates and exploiting known promising ones to efficiently find optimal solutions.
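The explore/exploit trade-off can be illustrated with a small pool-based loop. This is a sketch under stated assumptions, not the method used here: the surrogate is a scikit-learn `GaussianProcessRegressor`, the acquisition rule is an upper confidence bound (mean plus `beta` times standard deviation), and `bayes_opt_screen`, `oracle`, and `beta` are hypothetical names standing in for the screening routine, the expensive scoring step (for example docking), and the exploration weight.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayes_opt_screen(features, oracle, n_init=5, n_iter=20, beta=2.0, seed=0):
    """Screen a ligand library by repeatedly scoring the candidate with the highest UCB."""
    rng = np.random.default_rng(seed)
    # Start from a few randomly chosen ligands.
    screened = [int(i) for i in rng.choice(len(features), size=n_init, replace=False)]
    scores = [float(oracle(features[i])) for i in screened]
    for _ in range(n_iter):
        # Refit the surrogate on everything scored so far.
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                      alpha=1e-6, normalize_y=True)
        gp.fit(features[screened], scores)
        mean, std = gp.predict(features, return_std=True)
        # High mean = exploitation, high std = exploration.
        ucb = mean + beta * std
        ucb[screened] = -np.inf  # never re-screen a ligand
        nxt = int(np.argmax(ucb))
        screened.append(nxt)
        scores.append(float(oracle(features[nxt])))
    return screened, scores
```

With `beta = 0` the loop is purely greedy on the predicted score; larger values of `beta` push it toward ligands the surrogate is uncertain about.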
In virtual screening, our goal is to find effective ligands efficiently. Traditional methods can be slow, especially with large, diverse libraries. If the identified ligands are too structurally unusual, they may be difficult or impossible for chemists to synthesize. By using constrained settings, we can focus on ligands with desirable features that are also more likely to be synthetically accessible.
Using constrained settings, we can limit our search to clusters of chemically similar ligands, increasing the speed and accuracy of our screening while reducing computational demands.
Metrics for evaluation
Percent of Best Ligand Found
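One possible reading of this metric, stated here as an assumption rather than the definition used in the evaluation, is the share of the true top-$k$ ligands in the library that the screening procedure actually selected. The function name `percent_best_found` and the parameter `top_k` are hypothetical.

```python
import numpy as np

def percent_best_found(true_utility, screened_idx, top_k=10):
    """Percentage of the true top_k ligands that appear among the screened indices."""
    true_utility = np.asarray(true_utility, dtype=float)
    # Indices of the top_k ligands by ground-truth utility (highest first).
    top = set(np.argsort(-true_utility)[:top_k].tolist())
    found = top.intersection(int(i) for i in screened_idx)
    return 100.0 * len(found) / len(top)
```

For a library with utilities `[0.1, 0.9, 0.5, 0.7, 0.3]`, screening ligands 1 and 4 with `top_k=2` recovers one of the two best ligands (indices 1 and 3), giving 50%.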
In drug discovery, selecting candidate ligands goes beyond targeting high-affinity molecules. Experts use their deep chemical intuition to balance competing properties such as synthesizability, solubility, and potential side effects. This approach ensures ligands are not only effective but also practical and safe for therapeutic use.
Depending on the specific disease and protein, experts have intuition about characteristics of candidate ligands, trading off various objectives such as synthesizability, affinity, solubility, and side effects.
This implicit expert knowledge, encoded as preferences over ligands, is valuable to elicit for effective virtual screening. We can leverage toolkits from the field of machine learning from human preferences to tackle this challenge.
First ligand | Second ligand | Preference $(x_1 \succ x_2)$ |
---|---|---|
[-7.81, 114.38, 0.51] | [-8.12, 116.28, 0.47] | 0 |
[-10.45, 186.17, 0.29] | [-8.12, 116.28, 0.47] | 1 |
[-6.18, 35.32, 0.83] | [-8.12, 116.28, 0.47] | 0 |
Each ligand is represented by a set of features, such as affinity, polar surface area, and QED drug-likeness score.
The latent utility function $f$ can be modeled using various approaches. One popular choice is the Gaussian Process (GP), a non-parametric Bayesian method that defines a distribution over possible functions.
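One common way to connect a latent utility to observed pairwise preferences is a logistic (Bradley-Terry) link. The likelihood actually used is not stated here, so the following is an assumption for illustration:

$$f \sim \mathcal{GP}(0, k), \qquad P(x_1 \succ x_2 \mid f) = \sigma\big(f(x_1) - f(x_2)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Here $k$ is a kernel over ligand features. Under this model, the preference label is more likely to be 1 the more $f(x_1)$ exceeds $f(x_2)$, and two ligands with equal utility are each preferred with probability $1/2$.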
A GP classifier can be learned with a standard machine learning toolbox such as scikit-learn.
For example, when the synthetic oracle is the Ackley function, we obtain 85% train and test accuracy.
Synthetic Ackley function
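A synthetic experiment of this kind might be set up as in the sketch below. Several details are assumptions: each pair is encoded by concatenating the two feature vectors, the label is 1 when the Ackley value of the first point is higher, points are drawn uniformly from $[-2, 2]^2$, and the kernel is an RBF. The helper `make_pairs` is hypothetical, and the accuracy obtained this way is not expected to match the figure quoted above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def ackley(x):
    """Standard Ackley function, evaluated row-wise on an (n, dim) array."""
    x = np.atleast_2d(x)
    a = -20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2, axis=1)))
    b = -np.exp(np.mean(np.cos(2.0 * np.pi * x), axis=1))
    return a + b + 20.0 + np.e

def make_pairs(n_pairs, dim, rng):
    """Sample random pairs and label them with the synthetic oracle."""
    x1 = rng.uniform(-2.0, 2.0, size=(n_pairs, dim))
    x2 = rng.uniform(-2.0, 2.0, size=(n_pairs, dim))
    y = (ackley(x1) > ackley(x2)).astype(int)  # 1 if the first point is preferred
    return np.hstack([x1, x2]), y              # pair encoded by concatenation

rng = np.random.default_rng(0)
X_train, y_train = make_pairs(150, 2, rng)
X_test, y_test = make_pairs(100, 2, rng)

clf = GaussianProcessClassifier(kernel=RBF(length_scale=1.0), random_state=0)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```

Concatenating the pair keeps the classifier general; encoding only the difference $x_1 - x_2$ would be more compact but loses information when the utility is nonlinear.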
Learning chemical intuition is done in a closed loop, where the computer actively interacts with the chemist. Starting with a distribution over functions $f$ conditioned on the current data, $p(f | D)$, our procedure iterates over four steps:
Step 1: Sample two candidate utilities: $f_1 \sim p(f|D), f_2 \sim p(f|D)$
Step 2: Find the best ligand under each utility function:
$$x_1 = \arg\max_{x \in \mathcal{L}} f_1(x), \quad x_2 = \arg\max_{x \in \mathcal{L}} f_2(x)$$
Step 3: Present the two candidate ligands $x_1$ and $x_2$ to the expert to obtain preference $y$.
Step 4: Update the model in the presence of the new data: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(x_1, x_2, y)\}$.
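The four steps can be sketched in code as below. To keep the example short, a simpler model is swapped in for the GP: the utility is assumed linear, $f(x) = w^\top x$, and a draw from $p(f|D)$ is approximated by fitting logistic regression on a bootstrap resample of the feature differences. The names `sample_utility`, `preference_loop`, and `expert` are hypothetical, and `expert` stands in for the chemist's answer (1 if the first ligand is preferred, else 0).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_utility(diffs, labels, n_features, rng):
    """One approximate draw from p(f | D) for a linear utility f(x) = w @ x."""
    if len(labels) > 0:
        idx = rng.integers(0, len(labels), size=len(labels))  # bootstrap resample
        d, y = np.asarray(diffs)[idx], np.asarray(labels)[idx]
        if len(np.unique(y)) == 2:
            model = LogisticRegression(fit_intercept=False).fit(d, y)
            return model.coef_[0]
    # Fall back to a prior draw when there is not enough data to fit.
    return rng.standard_normal(n_features)

def preference_loop(library, expert, n_rounds=30, seed=0):
    """Run the query loop and return the collected difference vectors and labels."""
    rng = np.random.default_rng(seed)
    n_features = library.shape[1]
    diffs, labels = [], []
    for _ in range(n_rounds):
        # Step 1: sample two candidate utilities.
        w1 = sample_utility(diffs, labels, n_features, rng)
        w2 = sample_utility(diffs, labels, n_features, rng)
        # Step 2: best ligand in the library under each utility.
        i1 = int(np.argmax(library @ w1))
        i2 = int(np.argmax(library @ w2))
        if i1 == i2:
            continue  # both samples agree, so the query carries no information
        # Step 3: ask the expert which ligand is preferred.
        y = int(expert(library[i1], library[i2]))
        # Step 4: update the data, storing both orderings of the pair.
        diffs += [library[i1] - library[i2], library[i2] - library[i1]]
        labels += [y, 1 - y]
    return diffs, labels
```

Storing both orderings of each pair guarantees that the full data set always contains both labels, which keeps the logistic fit well defined after the first query.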
Observed accuracy plot for high-dimensional data with objectives: QED, affinity, polar surface area, and molecular weight.
Incorporating chemists' intuition into the virtual screening loop.