Translating data to predictive models
Akos Tarcsay
Machine learning life-cycle
Trainer Engine
From data to prediction
Data ingestion Preprocessing Modelling
Features Models
Review Prediction
Model repository
System overview
- Chemaxon Standardizer
- Chemaxon descriptor generation
- Persistence and search
- DB layer: PostgreSQL
- ML library (SMILE)
- Statistical evaluation
- Conformal prediction wrapper
- Service layer
- Programmatic access: REST interface
- “Comp Chem”: Trainer GUI
- “Med Chem”: Prediction GUI
How to reduce noise?
Preprocessing
Effect of standardization
- Simple descriptors (Mw, fsp3, HBDA, etc.)
- Phys-chem (logD, pKa)
- Molecular graph, fingerprints
Examples: salts, solvates (imipramine pamoate); tautomerism (furan-2-ol)
“Overall and despite our efforts to use open software wherever possible, we find that
ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7
SMRT Tautomer example, edge case
Training set: single tautomer random 7k cases
Test set: tautomerization affected 252 cases
Run with and without tautomerization
SMRT Tautomer effect
SMRT results
https://www.nature.com/articles/s41467-019-13680-7
ChemAxon Trainer Engine Original publication
15k test compounds, R² = 0.9
Prediction power?
ChEMBL Benchmark
Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
  - at least 30 compounds tested per protein and from at least 2 separate publications
  - assay confidence score of 9
  - ‘single protein’ target type
  - assigned pCHEMBL value
  - no flags on potential duplicate or data validity comment
  - originating from scientific literature
  - data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
  - median (MED) value was chosen when a data point had multiple measurements
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
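The selection criteria above can be read as a record-level filter followed by a per-compound median and a per-target size check. The following is a minimal sketch of that reading, not the benchmark authors' code: all dictionary keys (`confidence_score`, `target_type`, and so on) are hypothetical field names rather than the ChEMBL schema.

```python
from collections import defaultdict
from statistics import median

# Activity comments that exclude a data point (taken from the criteria above).
EXCLUDED_COMMENTS = {"not active", "inactive", "inconclusive", "undetermined"}

def passes_record_filters(rec):
    """Record-level criteria; every key used here is a hypothetical name."""
    return (
        rec["confidence_score"] == 9
        and rec["target_type"] == "single protein"
        and rec["pchembl_value"] is not None
        and not rec["potential_duplicate"]
        and rec["data_validity_comment"] is None
        and rec["source"] == "scientific literature"
        and (rec["activity_comment"] or "").lower() not in EXCLUDED_COMMENTS
    )

def build_benchmark(records, min_compounds=30, min_publications=2):
    """Return {target: {compound: median pChEMBL}} for targets with enough data."""
    values = defaultdict(lambda: defaultdict(list))
    documents = defaultdict(set)
    for rec in filter(passes_record_filters, records):
        values[rec["target"]][rec["compound"]].append(rec["pchembl_value"])
        documents[rec["target"]].add(rec["document"])
    return {
        target: {cpd: median(vals) for cpd, vals in by_cpd.items()}
        for target, by_cpd in values.items()
        if len(by_cpd) >= min_compounds
        and len(documents[target]) >= min_publications
    }
```

Keeping the thresholds as parameters makes it easy to try the filter on a small toy set before running it on a full export.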
Application Study on ChEMBL
- Data points in range: 500-4703 (median: 776)
- 161 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
- 10-90% test-training set split: Test
- ~160k total training size
- ~18k total test size
- 20 different descriptor configurations
- Random Forest
[Figure: data split per target: Rnd 90% Train, Rnd 10% Test, Last 30 Ext]
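The split described above (most recent 30 points per target held out as Ext, the rest split at random 10/90 into Test and Train) could be sketched as follows. This is an illustration under assumptions, not the Trainer Engine implementation: the `year` key and the toy records are hypothetical.

```python
import random

def split_target(records, n_ext=30, test_fraction=0.10, seed=0):
    """Split one target's records into (train, test, ext).

    records: list of dicts with a 'year' key (hypothetical field name).
    The n_ext most recent records (n_ext > 0) form the external set; the
    remainder is shuffled and split at random into test and training sets.
    """
    by_year = sorted(records, key=lambda r: r["year"])
    ext = by_year[-n_ext:]
    rest = by_year[:-n_ext]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(rest)
    n_test = round(len(rest) * test_fraction)
    return rest[n_test:], rest[:n_test], ext

# Toy data: 530 records spread over consecutive years.
records = [{"id": i, "year": 1990 + i // 20} for i in range(530)]
train, test, ext = split_target(records)
print(len(train), len(test), len(ext))  # 450 50 30
```

The Ext set probes prospective performance because every Ext record is at least as recent as any training record, while the random Test set shares the training set's time range.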
Prediction power?
ChEMBL Regression
Benchmark on 161 ChEMBL targets, best single configuration
161 ChEMBL Targets
Analysis of the best models per target
          Pearson           R2
          Ext     Test      Ext     Test
Avg       0.672   0.824     0.306   0.679
Median    0.722   0.833     0.385   0.697
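The table reports Pearson and R2 side by side. Assuming the R2 column is the coefficient of determination (the slide does not define it), the two can differ substantially: Pearson only measures linear association, while R2 also penalizes systematic offsets. A small sketch with made-up numbers:

```python
from math import sqrt

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values: predictions are perfectly correlated but shifted by +1.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [6.0, 7.0, 8.0, 9.0]
print(pearson_r(y_true, y_pred))  # correlation is 1.0
print(r2_score(y_true, y_pred))   # approximately 0.2
```

Under that assumption, a larger gap between the two metrics on the Ext set would be consistent with a systematic shift on newer compounds.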
Prediction power?
ChEMBL Classification
[Figure: confusion matrix with TP, FP, TN, FN]
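The MCC values quoted in the classification slides are computed from the four confusion-matrix counts. A minimal sketch of the standard formula; returning 0 for an empty marginal is a common convention and an assumption here, not something stated on the slides:

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

print(mcc(tp=40, fp=10, tn=40, fn=10))  # 0.6
```

MCC ranges from -1 to 1 and stays informative under class imbalance, which is why it suits per-target comparisons where the active/inactive ratio varies.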
Is it a hard task? Original results random split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
“The best method overall is the DNN_MC with an MCC of 0.57 (±0.07)”
Is it a hard task? Original results temporal split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
Classification case: class balance
Random Forest results
RF MCC    Ext     Test
Avg       0.310   0.541
Median    0.282   0.589
Is it a hard task? Base model: MPNN
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are helper classes defined in the
# Keras MPNN example linked above; they are not part of the Keras API.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
MCC reverse cumulative histogram
Random split Test case
          RF                MPNN
MCC       Ext     Test      Ext     Test
Avg       0.310   0.541     0.207   0.471
Median    0.282   0.589     0.178   0.493
Binned MCC evaluation random selected Test case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High
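The binning above can be expressed as a small function. The slide uses strict inequalities, so values of exactly 0.45 or 0.65 are unassigned there; this sketch assumes they fall into the upper bin. The per-target values are hypothetical.

```python
from collections import Counter

def mcc_bin(value, low=0.45, high=0.65):
    """Map an MCC value to the Low / Medium / High bins."""
    if value < low:
        return "Low"
    if value < high:
        return "Medium"
    return "High"

# Hypothetical per-target MCC values, for illustration only.
per_target_mcc = {"target_a": 0.31, "target_b": 0.52, "target_c": 0.71, "target_d": 0.58}
counts = Counter(mcc_bin(v) for v in per_target_mcc.values())
print(counts["Low"], counts["Medium"], counts["High"])  # 1 2 1
```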
MCC reverse cumulative histogram
Temporal split Ext case
Binned MCC evaluation temporal split Ext case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High
How reliable?
Confidence
Conformal prediction
- The training set is split into a proper training set and a calibration set
- Model and error model are built on the proper training set
- Error prediction on the calibration set gives the calibration factor (α) at P(80%)
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168
Conformal prediction
Test: 14233 / 17661 (80.6%) within the error bound
Ext: 3344 / 4890 (68.4%) within the error bound
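A calibration factor at 80% confidence can be derived as in the sketch below, which follows the usual inductive conformal regression recipe. The slides do not state the exact nonconformity measure; the normalized residual |y - ŷ| / predicted error used here is a common choice and an assumption, and all numbers are made up.

```python
import math

def calibration_factor(y_true, y_pred, err_pred, confidence=0.80):
    """Conformal quantile of normalized residuals |y - y_hat| / predicted_error."""
    scores = sorted(abs(t - p) / e for t, p, e in zip(y_true, y_pred, err_pred))
    # Conformal rank: ceil((n + 1) * confidence), capped at n.
    rank = min(len(scores), math.ceil((len(scores) + 1) * confidence))
    return scores[rank - 1]

def prediction_interval(y_pred, err_pred, alpha):
    """Error bound around one prediction, scaled by the calibration factor."""
    return y_pred - alpha * err_pred, y_pred + alpha * err_pred

# Toy calibration set: observed values, model predictions, and the error
# model's predicted absolute errors (all made-up numbers).
cal_true = [6.1, 7.4, 5.2, 8.0, 6.6, 7.1, 5.9, 6.3, 7.8]
cal_pred = [6.0, 7.0, 5.5, 7.5, 6.5, 7.3, 6.4, 6.2, 7.0]
cal_err = [0.5] * 9

alpha = calibration_factor(cal_true, cal_pred, cal_err)
low, high = prediction_interval(6.8, 0.4, alpha)
print(alpha, low, high)
```

The coverage guarantee holds when new compounds are exchangeable with the calibration set. That fits the Test set reaching the nominal 80%, while a time-shifted Ext set can fall short of it.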
Feasible?
Performance
AVG prediction time per set
Descriptor: ECFP6_1024_MACCS_PHYSCHEM
Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz
~50 ms/cpd
72k cpd/h
Drill down to the lowest performance set
TC_key: CHEMBL247 - CHEMBL2373969
BIOACT_PCHEMBL_VALUE: 7.54
TGT_CHEMBL_ID: CHEMBL247
TGT_ORGANISM: Human immunodeficiency virus 1
Feature engineering is the process of using domain knowledge to extract features from raw data.
Combined descriptor set
∑ 20 × 161 = 3220 descriptor sets
RF Importance (ECFP4_CHEMTERM:All)
Sum of node impurity decreases for each individual variable over trees. Importance based rank is evaluated.
50% related to protonation or partitioning
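The importance measure described above (sum of node impurity decreases per variable over all trees, then ranking) can be sketched independently of any ML library. The input layout, a list of (variable, decrease) pairs per tree, and the descriptor names are hypothetical; a real random forest implementation would supply these values from its fitted trees.

```python
from collections import defaultdict

def impurity_importance(forest):
    """Sum impurity decreases per variable over all trees.

    forest: list of trees; each tree is a list of (variable, decrease)
    pairs, one pair per split node (hypothetical input layout).
    """
    totals = defaultdict(float)
    for tree in forest:
        for variable, decrease in tree:
            totals[variable] += decrease
    return dict(totals)

def rank_by_importance(importance):
    """Variables ordered from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)

# Made-up split records for two trees; descriptor names are illustrative.
forest = [
    [("logD", 0.5), ("ecfp_bit_17", 0.25), ("pKa", 0.25)],
    [("logD", 0.5), ("pKa", 0.5)],
]
importance = impurity_importance(forest)
print(rank_by_importance(importance))  # ['logD', 'pKa', 'ecfp_bit_17']
```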
https://www.nature.com/articles/s41598-019-55886-1
Protonation de-tour: Fentanyl
https://www.nature.com/articles/s41598-019-55886-1
Protonation de-tour: Fentanyl F-derivative: FF3
https://www.nature.com/articles/s41598-019-55886-1
No difference at pH 6.5
10x MOR IC50 difference at pH 7.4
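One way to see why activity can depend on pH is the Henderson-Hasselbalch relation for the protonated fraction of a basic amine. The sketch below assumes a simple monoprotic base; the pKa values are illustrative only and are not taken from the cited paper.

```python
def protonated_fraction(pka, ph):
    """Fraction of a monoprotic base in its protonated (charged) form."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Illustrative pKa values only; they are not taken from the cited paper.
for label, pka in [("higher-pKa amine", 8.5), ("lower-pKa amine", 6.9)]:
    for ph in (6.5, 7.4):
        fraction = protonated_fraction(pka, ph)
        print(f"{label}: pH {ph} -> {fraction:.2f} protonated")
```

With these illustrative numbers the higher-pKa amine stays mostly protonated at both pH values, while the lower-pKa amine loses most of its protonated fraction between pH 6.5 and 7.4. This is the kind of effect that protonation-aware descriptors can capture.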
Statistical Assessment of the Modeling of Proteins and Ligands, SAMPL6 challenge logP
https://chemaxon.com/news/cxn-logp-prediction-sampl-6
SAMPL7 pKa
https://link.springer.com/article/10.1007/s10822-021-00397-3
SAMPL7 logD
https://link.springer.com/article/10.1007/s10822-021-00397-3
Classification use case: Blood-Brain Barrier Penetration
Blood brain barrier penetration model
MoleculeNet, 10.1039/c7sc02664a
Model analysis
Rich prediction results
Classification use case: PAMPA Permeability
Data preparation
1) Source: PubChem BioAssay AID: 1508612
   a) NCATS Parallel Artificial Membrane Permeability Assay (PAMPA) Profiling
   b) Observed PAMPA at pH 7.4 (×10⁻⁶ cm/s)
2) Standardize (strip salts, tautomerize)
3) Mw 0-800 -> 2029 cases
4) Permeability 100 cutoff using “Phenotype” field
   0: Low, Medium (646 cases)
   1: High (1383 cases)
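Steps 3 and 4 of the data preparation could look like the sketch below. It assumes records that have already been standardized (step 2 is not shown) and uses hypothetical keys `smiles`, `mw` and `phenotype`; the example rows are made up and do not come from the assay.

```python
def prepare_pampa(records, max_mw=800.0):
    """Keep records with 0 < Mw <= max_mw and attach a binary class label.

    records: dicts with hypothetical keys 'smiles', 'mw' and 'phenotype'.
    Label 1 = High permeability, 0 = Low or Medium.
    """
    prepared = []
    for rec in records:
        if not 0.0 < rec["mw"] <= max_mw:
            continue  # outside the molecular weight window
        label = 1 if rec["phenotype"].strip().lower() == "high" else 0
        prepared.append({**rec, "label": label})
    return prepared

# Made-up rows for illustration; real rows would come from the assay export.
rows = [
    {"smiles": "CCO", "mw": 46.07, "phenotype": "High"},
    {"smiles": "c1ccccc1", "mw": 78.11, "phenotype": "Medium"},
    {"smiles": "CC(=O)O", "mw": 60.05, "phenotype": "Low"},
    {"smiles": "C" * 70, "mw": 984.0, "phenotype": "High"},
]
data = prepare_pampa(rows)
print([r["label"] for r in data])  # [1, 0, 0]
```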
Clustering MACCS similarity matrix tSNE projection
Is it a hard task? Base model: MPNN
Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are helper classes defined in the
# Keras MPNN example linked above; they are not part of the Keras API.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
Classification use case: hERG
Data source: Scientific Reports vol 9, 12220 (2019)
Keiji Ogura, Tomohiro Sato, Hitomi Yuki, Teruki Honma
- 9 890 hERG inhibitors (IC50 ≤ 10 μM or ≥50% inhibition at 10 μM)
- 281 329 inactive compounds (IC50 > 10 μM or <50% inhibition at 10 μM) according to their standardized chemical structures
- 204k training data points
- 87k test data points
- Descriptor set: ECFP6_2048_MACCS_PHYSCHEM
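The activity definition above maps directly to a labelling rule. In this sketch the function name and arguments are hypothetical, and giving IC50 precedence when both measurements exist is an assumption rather than something stated in the source.

```python
def herg_label(ic50_um=None, inhibition_pct_at_10um=None):
    """Return 1 (inhibitor), 0 (inactive) or None if nothing was measured.

    Inhibitor: IC50 <= 10 uM, or >= 50% inhibition at 10 uM.
    IC50 takes precedence when both values are given (an assumption).
    """
    if ic50_um is not None:
        return 1 if ic50_um <= 10.0 else 0
    if inhibition_pct_at_10um is not None:
        return 1 if inhibition_pct_at_10um >= 50.0 else 0
    return None

print(herg_label(ic50_um=2.5))                   # 1
print(herg_label(inhibition_pct_at_10um=12.0))   # 0

# Class balance implied by the counts on the slide.
n_inhibitors, n_inactive = 9890, 281329
print(f"{n_inhibitors / (n_inhibitors + n_inactive):.1%}")  # 3.4%
```

With only about 3.4% positives the set is strongly imbalanced, which is one reason to prefer a metric such as MCC over plain accuracy.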
Classification model
https://www.nature.com/articles/s41598-019-47536-3
hERG Classification Model
https://www.nature.com/articles/s41598-019-47536-3
Chemaxon
CONVERTING DATA TO PROJECT TEAM INSIGHTS
Machine learning life-cycle
Design Landscape
[Diagram: Discovery teams; Design Hub with Series H1 H2 H3 H4]
Discovery hub connecting chemical series, data, predictions and chemical project management
Analysis
[Diagram: Discovery teams; Design Hub with Series H1 H2 H3 H4; Biological measurements]
Biological data focus
[Diagram: Discovery teams; Design Hub with Series H1 H2 H3 H4; Comp. Chem with Trainer GUI (Training / Analysis) and Trainer Engine]
Train&Deploy
[Diagram: Discovery teams; Production Models; Design Hub with Services and Series H1 H2 H3 H4; Comp. Chem with Trainer GUI (Training / Analysis); Trainer Engine with REST API]
Fill the gap
[Diagram: same components as Train&Deploy]
Multi parameter optimization
[Diagram: Discovery teams; Production Models; Design Hub with Services and Series H1 H2 H3 H4; Comp. Chem with Trainer GUI (Training / Analysis); Trainer Engine]
Local and global models
Translate data to reliable models
Centralize model management
Connect project team members and resources
Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine
Take away
- Chemical standardization to reduce noise
- Role of protonation and partitioning descriptors
- Successful model building on a large and diverse set of targets
- ML inference, delivery to medicinal chemists
Interested?
atarcsay@chemaxon.com
Thank you


Translating data to model ICCS2022_pub.pdf