Translating data to predictive models

Akos Tarcsay

From data to prediction
Workflow stages: data ingestion, preprocessing (features), modelling (models), review, prediction, model repository

System overview
- Chemaxon descriptor generation
- Chemaxon Standardizer
- Persistence and search: DB layer (PostgreSQL)
- Statistical evaluation
- ML library (SMILE)
- Conformal prediction wrapper
- Service layer: programmatic access, REST interface
- “Comp Chem”: Trainer GUI
- “Med Chem”: Prediction GUI

How to reduce noise?
Preprocessing

Effect of standardization
- Simple descriptors (Mw, fsp3, HBDA, etc.)
- Phys-chem (logD, pKa)
- Molecular graph, fingerprints
Examples: salts and solvates (imipramine pamoate); tautomerism (furan-2-ol)
“Overall and despite our efforts to use open software wherever possible, we find that ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
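The salt-stripping part of standardization can be illustrated with a deliberately crude sketch. It is not the Chemaxon Standardizer: it only splits a SMILES string on "." and keeps the fragment with the most heavy atoms. The atom-counting regular expression and the function names are assumptions made for this example, charges are not neutralized, and tautomer canonicalization (which needs a real cheminformatics toolkit) is not attempted.

```python
import re

# Bracket atoms first, then two-letter halogens, then one-letter
# organic-subset atoms (aliphatic and aromatic). Assumption: this covers
# the SMILES strings of interest; it is not a full SMILES parser.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate number of atoms in one SMILES fragment."""
    return len(_ATOM.findall(fragment))


def strip_salts(smiles: str) -> str:
    """Keep the largest '.'-separated fragment (first one wins on ties)."""
    return max(smiles.split("."), key=heavy_atom_count)


print(strip_salts("CC(=O)[O-].[Na+]"))  # sodium acetate -> CC(=O)[O-]
```

Without such a step, the counter-ion would leak into simple descriptors such as Mw and into fingerprints, which is the kind of noise the slide refers to.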

Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7

SMRT Tautomer example, edge case
Training set: single tautomer random 7k cases
Test set: tautomerization affected 252 cases
Run with and without tautomerization

SMRT results
https:/
Compared: ChemAxon Trainer Engine vs. original publication
15k test compounds, R2 = 0.9

Prediction power?
ChEMBL Benchmark

Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels
ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek
Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
  - at least 30 compounds tested per protein and from at least 2 separate publications
  - assay confidence score of 9
  - ‘single protein’ target type
  - assigned pCHEMBL value
  - no flags on potential duplicate or data validity comment
  - originating from scientific literature
  - data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
  - the median (MED) value was chosen
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0

Application Study on ChEMBL
- Data points in range: 500-4703 (med:776)
- 161 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
- 10-90% test-training set split: Test
- ~160k total training size
- ~18k total test size
- 20 different descriptor configurations
- Random Forest
Split per target: random 90% Train, random 10% Test, last 30 Ext
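The split described above can be sketched as follows. This is a minimal illustration and not the Trainer Engine implementation: the record layout, the `year` key, the seed, and the function name are assumptions made for the example.

```python
import random


def split_target(records, n_ext=30, test_fraction=0.1, seed=0):
    """Split one target's records into (train, test, ext).

    records: list of dicts with a 'year' key (hypothetical field name
    standing in for the ChEMBL document year).
    """
    # Temporal hold-out: the most recent n_ext records form the external set.
    ordered = sorted(records, key=lambda r: r["year"])
    ext = ordered[-n_ext:]
    rest = ordered[:-n_ext]

    # Random 10/90 split of the remaining records into test and training sets.
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_test = round(len(rest) * test_fraction)
    test, train = rest[:n_test], rest[n_test:]
    return train, test, ext


# Synthetic example: 530 records spread over 30 publication years.
records = [{"id": i, "year": 1990 + i % 30} for i in range(530)]
train, test, ext = split_target(records)
print(len(train), len(test), len(ext))
```

The Test set measures interpolation within known chemistry, while the Ext set mimics prospective use, because every Ext compound was published no earlier than the compounds used for training.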

Prediction power?
ChEMBL Regression

Benchmark on 161 ChEMBL targets, best single configuration
[Figure: per-target results; x-axis: 161 ChEMBL targets]

Analysis of the best models per target
            Pearson         R2
          Ext    Test    Ext    Test
Avg      0.672  0.824  0.306  0.679
Median   0.722  0.833  0.385  0.697
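The two metrics in the table answer different questions, which the following sketch makes concrete. It assumes that R2 on the slide means the coefficient of determination (the slides do not define it); the function names and the four data points are made up for illustration.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Linear correlation between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    norm_t = sqrt(sum((t - mean_t) ** 2 for t in y_true))
    norm_p = sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (norm_t * norm_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Predictions that rank compounds perfectly but compress the activity range.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 7.0]
print(pearson_r(y_true, y_pred), r_squared(y_true, y_pred))
```

Pearson correlation rewards correct ordering and trend, while the coefficient of determination also penalizes systematic offsets and range compression. Under this reading, a larger gap between the two on the Ext set than on the Test set would point to such systematic errors on newer compounds.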

Prediction power?
ChEMBL Classification
Confusion matrix: TP, FP, TN, FN

Is it a hard task? Original results random split
https:/
“The best method overall is the DNN_MC with an MCC of 0.57 (±0.07)”

Is it a hard task? Original results temporal split
https:/

Classification case: class balance

Random Forest results
RF (MCC)   Ext    Test
Avg       0.310  0.541
Median    0.282  0.589
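For reference, the Matthews correlation coefficient (MCC) reported here is computed from the four confusion-matrix counts. The sketch below uses the standard formula; returning 0.0 when the denominator vanishes is a common convention and an assumption of this example, and the counts are invented.

```python
from math import sqrt


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: undefined cases (an empty row or column) are scored as 0.
    return numerator / denominator if denominator else 0.0


print(mcc(tp=40, fp=10, tn=40, fn=10))  # balanced case, 80% accuracy
print(mcc(tp=0, fp=0, tn=90, fn=10))    # always predicting "inactive"
```

The second call shows why MCC suits imbalanced activity data: a model that always predicts the majority class reaches 90% accuracy on these counts but scores an MCC of 0.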

Is it a hard task? Base model: MPNN
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are classes defined in the linked Keras example.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)

MCC reverse cumulative histogram
Random split Test case
              RF             MPNN
MCC       Ext    Test    Ext    Test
Avg      0.310  0.541  0.207  0.471
Median   0.282  0.589  0.178  0.493

Binned MCC evaluation random selected Test case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High
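The binning and the reverse cumulative histogram can be sketched in a few lines. The per-target MCC values below are hypothetical, and because the slide uses strict inequalities on both sides, assigning a value of exactly 0.45 or 0.65 to the upper bin is an assumption of this example.

```python
def mcc_bin(value: float) -> str:
    """Map a per-target MCC to the Low / Medium / High bins."""
    if value < 0.45:
        return "Low"
    if value < 0.65:
        return "Medium"
    return "High"


def reverse_cumulative(values, thresholds):
    """Fraction of targets whose MCC is at least each threshold."""
    n = len(values)
    return [sum(v >= t for v in values) / n for t in thresholds]


# Hypothetical per-target MCC values, one per model.
per_target_mcc = [0.20, 0.45, 0.50, 0.70, 0.90]
print([mcc_bin(v) for v in per_target_mcc])
print(reverse_cumulative(per_target_mcc, thresholds=[0.0, 0.5, 0.8]))
```

Read this way, each point on the reverse cumulative curve states what share of the 161 target models reaches at least a given MCC.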

MCC reverse cumulative histogram
Temporal split Ext case

Binned MCC evaluation temporal split Ext case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High

Conformal prediction
Training set split: proper training set (fits the model and the error model) and calibration set (yields the calibration factor ɑ at P(80%))
Prediction: model prediction with an error prediction scaled by ɑ
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168

Conformal prediction
Test: 14233 / 17661 (80.6%) within the error bound
Ext: 3344 / 4890 (68.4%) within the error bound
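One common way to obtain such error bounds is normalized inductive conformal regression, sketched below under stated assumptions: the nonconformity score is the absolute error divided by the error model's estimate, and ɑ is taken as a high quantile of the calibration scores. The function names and the nine calibration compounds are hypothetical, and the conformal wrapper described in the slides may differ in detail.

```python
from math import ceil


def nonconformity(y_true, y_pred, err_pred):
    """Absolute error scaled by the error model's estimate."""
    return [abs(t - p) / e for t, p, e in zip(y_true, y_pred, err_pred)]


def calibration_factor(scores, confidence=0.8):
    """Alpha: the ceil((n + 1) * confidence)-th smallest calibration score."""
    ordered = sorted(scores)
    # round() guards against floating-point noise before taking the ceiling.
    rank = ceil(round((len(ordered) + 1) * confidence, 9))  # 1-based rank
    return ordered[min(rank, len(ordered)) - 1]


def interval(pred, err, alpha):
    """Prediction interval: point prediction +/- alpha * predicted error."""
    return pred - alpha * err, pred + alpha * err


def coverage(scores, alpha):
    """Fraction of cases whose true value falls inside the interval."""
    return sum(s <= alpha for s in scores) / len(scores)


# Hypothetical calibration set of nine compounds.
y_true = [6.1, 5.8, 6.3, 5.6, 6.5, 5.4, 6.7, 5.2, 6.9]
y_pred = [6.0] * 9
err_pred = [0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]

scores = nonconformity(y_true, y_pred, err_pred)
alpha = calibration_factor(scores, confidence=0.8)
print(alpha, interval(6.4, 0.5, alpha), coverage(scores, alpha))

# Coverage figures quoted on the slide, recomputed from the counts.
print(round(100 * 14233 / 17661, 1), round(100 * 3344 / 4890, 1))
```

The coverage guarantee holds when calibration and prediction data come from the same distribution. That fits the randomly split Test set reaching about 80%, and a shortfall on the temporally split Ext set is what one would expect if newer compounds differ from the calibration data.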

Average prediction time per set
Descriptor: ECFP6_1024_MACCS_PHYSCHEM
CPU: Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz
~50 ms/compound, i.e. 72k compounds/h

Drill down to the lowest performance set
TC_key: CHEMBL247 - CHEMBL2373969
BIOACT_PCHEMBL_VALUE: 7.54
TGT_CHEMBL_ID: CHEMBL247
TGT_ORGANISM: Human immunodeficiency virus 1

Feature engineering is the process of using domain knowledge to extract features from raw data.

Combined descriptor set
20 descriptor configurations × 161 targets = 3220 descriptor sets

RF Importance (ECFP4_CHEMTERM:All)
Importance: sum of node impurity decreases for each individual variable over trees.
Importance-based rank is evaluated.
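The importance definition above can be made concrete with a small sketch. It assumes a regression forest with variance as the node impurity and sample-count weighting of each split; the exact weighting used by the ML library is not stated in the slides, and the split records and descriptor names below are invented for illustration.

```python
from collections import defaultdict


def variance(values):
    """Population variance, used here as the node impurity (assumption)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right):
    """Sample-weighted impurity decrease achieved by one split."""
    return (len(parent) * variance(parent)
            - len(left) * variance(left)
            - len(right) * variance(right))


def importance_ranking(splits):
    """Sum impurity decreases per variable over all splits of all trees."""
    totals = defaultdict(float)
    for feature, parent, left, right in splits:
        totals[feature] += impurity_decrease(parent, left, right)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical splits collected from a tiny forest: (variable, parent node
# target values, left child values, right child values).
splits = [
    ("logD", [5.0, 6.0, 7.0, 8.0], [5.0, 6.0], [7.0, 8.0]),
    ("ECFP4_bit_17", [5.0, 6.0], [5.0], [6.0]),
    ("logD", [7.0, 8.0], [7.0], [8.0]),
]
ranked = importance_ranking(splits)
print(ranked)
```

For a classification forest the same bookkeeping applies with Gini impurity or entropy in place of variance.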

RF Importance (ECFP4_CHEMTERM:All)
50% related to protonation or partitioning

https:/
Protonation de-tour: Fentanyl

Protonation de-tour: Fentanyl F-derivative: FF3
https:/
- No difference at pH 6.5
- 10x MOR IC50 difference at pH 7.4
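Why a small pH shift can matter follows from the Henderson-Hasselbalch relation for a monoprotic base. The sketch below uses that relation only; the two pKa values are illustrative placeholders and are not measured values for fentanyl or FF3.

```python
def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic base present in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative pKa values only (assumptions, not data from the slides).
for name, pka in [("higher-pKa amine", 8.5), ("lower-pKa amine", 7.0)]:
    for ph in (6.5, 7.4):
        share = fraction_protonated(pka, ph)
        print(f"{name}: pH {ph} -> {100 * share:.0f}% protonated")
```

With these placeholder values, the higher-pKa amine stays almost fully protonated at both pH values, while the lower-pKa amine loses a large share of its protonated form between pH 6.5 and 7.4. This is the kind of pH-dependent behaviour that protonation descriptors are meant to capture.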

Statistical Assessment of the Modeling of Proteins and Ligands, SAMPL6 challenge: logP
https://chemaxon.com/news/cxn-logp-prediction-sampl-6

SAMPL7 pKa
https://link.springer.com/article/10.1007/s10822-021-00397-3

SAMPL7 logD
https://link.springer.com/article/10.1007/s10822-021-00397-3

Classification use case: Blood-Brain Barrier Penetration

Blood brain barrier penetration model
MoleculeNet, 10.1039/c7sc02664a

PAMPA Permeability

Data preparation
1) Source: PubChem BioAssay AID: 1508612
   a) NCATS Parallel Artificial Membrane Permeability Assay (PAMPA) Profiling
   b) Observed PAMPA at pH 7.4 (× 10^-6 cm/s)
2) Standardize (strip salts, tautomerize)
3) Mw 0-800 -> 2029 cases
4) Permeability 100 cutoff using “Phenotype” field
   0: Low, Medium (646 cases)
   1: High (1383 cases)
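Steps 3 and 4 above can be sketched as a filter followed by a binary labelling. The record keys (`mw`, `phenotype`), the phenotype strings, and the example compounds are assumptions for illustration; the standardization of step 2 is assumed to have been applied already.

```python
def prepare(records, mw_max=800.0):
    """Apply the Mw filter, then derive the binary permeability label.

    records: dicts with hypothetical keys 'mw' and 'phenotype'.
    """
    kept = [r for r in records if 0.0 < r["mw"] <= mw_max]
    # Class 1: High permeability; class 0: Low or Medium permeability.
    return [dict(r, label=1 if r["phenotype"] == "High" else 0) for r in kept]


records = [
    {"id": "cpd-1", "mw": 312.4, "phenotype": "High"},
    {"id": "cpd-2", "mw": 455.1, "phenotype": "Medium"},
    {"id": "cpd-3", "mw": 912.0, "phenotype": "Low"},  # dropped: Mw > 800
    {"id": "cpd-4", "mw": 280.3, "phenotype": "Low"},
]
prepared = prepare(records)
print([(r["id"], r["label"]) for r in prepared])

# Class sizes quoted on the slide add up to the filtered data set size.
print(646 + 1383)
```

The resulting classes are imbalanced (646 vs. 1383 cases), so a metric such as MCC is more informative here than plain accuracy.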

Clustering MACCS similarity matrix tSNE projection

Is it a hard task? Base model: MPNN
Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are classes defined in the linked Keras example.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)

hERG

Data source: Scientific Reports vol 9, 12220 (2019)
Keiji Ogura, Tomohiro Sato, Hitomi Yuki, Teruki Honma
- 9 890 hERG inhibitors (IC50 ≤ 10 μM or ≥50% inhibition at 10 μM)
- 281 329 inactive compounds (IC50 > 10 μM or <50% inhibition at 10 μM) according to their standardized chemical structures
- 204k training data points
- 87k test data points
- Descriptor set: ECFP6_2048_MACCS_PHYSCHEM
Classification model
https:/

hERG Classification Model
https:/
Chemaxon

CONVERTING DATA TO
PROJECT TEAM INSIGHTS

Discovery teams
Design Landscape
Design Hub: series H1, H2, H3, H4
Discovery hub connecting chemical series, data, predictions and chemical project management

Discovery teams
Analysis
Design Hub: series H1, H2, H3, H4
Biological measurements

Discovery teams
Biological data focus
Design Hub: series H1, H2, H3, H4
Trainer GUI: training / analysis (Comp. Chem)
Trainer Engine

Discovery teams
Train & Deploy
Design Hub: services, series H1, H2, H3, H4
Trainer GUI: training / analysis (Comp. Chem)
Trainer Engine: production models, REST API

Discovery teams
Fill the gap
Design Hub: services, series H1, H2, H3, H4
Trainer GUI: training / analysis (Comp. Chem)
Trainer Engine: production models, REST API

Discovery teams
Multi parameter optimization
Design Hub: services, series H1, H2, H3, H4
Trainer GUI: training / analysis (Comp. Chem)
Trainer Engine: production models

- Translate data to reliable models
- Centralize model management
- Connect project team members and resources
- Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine

Take away
- Chemical standardization to reduce noise
- Role of protonation and partitioning descriptors
- Successful model building on a large and diverse set of targets
- ML inference, delivery to medicinal chemists

Interested?
atarcsay@chemaxon.com
