Translating data to predictive models
Akos Tarcsay
Machine learning life-cycle
Trainer Engine
From data to prediction
Data ingestion Preprocessing Modelling
Features Models
Review Prediction
Model repository
System overview
Chemaxon descriptor generation
Chemaxon Standardizer
Persistence and search
DB layer
PostgreSQL
Statistical evaluation
ML library (SMILE)
Conformal prediction wrapper
Service layer
Programmatic access
REST interface
“Comp Chem”
Trainer GUI
“Med Chem”
Prediction GUI
How to reduce noise?
Preprocessing
Effect of standardization
- Simple descriptors (Mw, fsp3, HBDA, etc.)
Imipramine pamoate Furan-2-ol
- Phys-chem (logD, pKa)
- Molecular graph, Fingerprints
Salts, solvates Tautomerism
“Overall and despite our efforts to use open software wherever possible, we find that
ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7
SMRT Tautomer example, edge case
Training set: single tautomer random 7k cases
Test set: tautomerization affected 252 cases
Run with and without tautomerization
SMRT Tautomer effect
SMRT results
https://www.nature.com/articles/s41467-019-13680-7
ChemAxon Trainer Engine Original publication
15k test compounds, R2 = 0.9
Prediction power?
ChEMBL Benchmark
Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels
ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek
Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
- at least 30 compounds tested per protein and from at least 2 separate publications
- assay confidence score of 9
- ‘single protein’ target type
- assigned pCHEMBL value
- no flags on potential duplicate or data validity comment
- originating from scientific literature
- data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
- MED value was chosen
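The selection criteria above can be sketched as a record-level filter followed by a per-target count check. This is only an illustration: every field name (`target`, `compound`, `document`, `confidence`, `target_type`, `pchembl`, `flagged`, `comment`, `from_literature`) is hypothetical and not the ChEMBL schema, and reading "MED value" as the median of repeated measurements is an assumption.

```python
from collections import defaultdict
from statistics import median

EXCLUDED_COMMENTS = {"not active", "inactive", "inconclusive", "undetermined"}


def keep(record):
    """Record-level criteria from the list above (hypothetical field names)."""
    return (
        record["confidence"] == 9
        and record["target_type"] == "single protein"
        and record["pchembl"] is not None
        and not record["flagged"]
        and record["from_literature"]
        and (record.get("comment") or "").lower() not in EXCLUDED_COMMENTS
    )


def build_benchmark(records, min_compounds=30, min_documents=2):
    """Return {target: {compound: median pChEMBL}} for targets passing the count rules."""
    values = defaultdict(lambda: defaultdict(list))
    documents = defaultdict(set)
    for r in filter(keep, records):
        values[r["target"]][r["compound"]].append(r["pchembl"])
        documents[r["target"]].add(r["document"])
    return {
        target: {cpd: median(vals) for cpd, vals in by_cpd.items()}
        for target, by_cpd in values.items()
        # Assumed reading: at least 30 compounds and at least 2 publications per target.
        if len(by_cpd) >= min_compounds and len(documents[target]) >= min_documents
    }
```

The thresholds are parameters so the same function can be tried on a small sample before it is run on a full export.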
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
Application Study on ChEMBL
- Data points per target: 500-4703 (median: 776)
- 161 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
- Random 10%/90% test/training set split of the remaining points: Test
- ~160k total training size
- ~18k total test size
- 20 different descriptor configurations
- Random Forest
Rnd 90% Train
Last 30 Ext
Rnd 10% Test
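The split described on this slide can be sketched as follows: sort one target's data points by document year, reserve the last 30 as the external set (Ext), then split the rest at random into 10% Test and 90% Train. The function name, the `(year, item)` tuple layout and the fixed seed are assumptions made for the illustration.

```python
import random


def split_by_year(points, n_ext=30, test_fraction=0.10, seed=0):
    """points: list of (document_year, item). Returns (train, test, ext)."""
    if not 0 < n_ext < len(points):
        raise ValueError("n_ext must be between 1 and len(points) - 1")
    ordered = sorted(points, key=lambda p: p[0])
    ext = ordered[-n_ext:]            # most recent points: temporal hold-out
    rest = ordered[:-n_ext]
    rng = random.Random(seed)         # seeded so the split is reproducible
    rng.shuffle(rest)
    n_test = round(len(rest) * test_fraction)
    return rest[n_test:], rest[:n_test], ext


# Hypothetical target with 500 data points spread over 15 years.
points = [(2000 + i % 15, f"cpd{i}") for i in range(500)]
train, test, ext = split_by_year(points)
print(len(train), len(test), len(ext))
```

Because Ext holds only the most recent points, it probes how a model behaves on chemistry published after the training data, which the random Test split does not.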
...
Prediction power?
ChEMBL Regression
Benchmark on 161 ChEMBL targets, best single configuration
161 ChEMBL Targets
Analysis of the best models per target
           Pearson          R2
           Ext     Test     Ext     Test
Avg        0.672   0.824    0.306   0.679
Median     0.722   0.833    0.385   0.697
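The two metrics in the table measure different things, which is why a model can keep a reasonable Pearson correlation on Ext while R2 drops. The sketch below assumes that "R2" means the coefficient of determination (the slide does not define it) and uses made-up numbers.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Linear correlation; insensitive to a constant offset or scale in the predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def r2_score(y_true, y_pred):
    """Coefficient of determination; penalizes any deviation from the measured values."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up example: perfectly ranked predictions, all shifted by one log unit.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [6.0, 7.0, 8.0, 9.0]
print(pearson_r(measured, predicted), r2_score(measured, predicted))
```

A systematic shift leaves the correlation untouched but lowers R2, so a gap between the two columns can point to bias rather than lost ranking ability.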
Prediction power?
ChEMBL Classification
FP
TP
TN
FN
Is it a hard task? Original results random split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
“The best method overall is the DNN_MC with an MCC of 0.57 (±0.07)”
Is it a hard task? Original results temporal split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
Classification case: class balance
Random Forest results
RF MCC     Ext     Test
Avg        0.310   0.541
Median     0.282   0.589
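For reference, the Matthews correlation coefficient (MCC) reported here is computed from the four confusion-matrix counts (TP, TN, FP, FN). A minimal sketch, with the common convention of returning 0 when the denominator vanishes:

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Made-up imbalanced case: 90 actives, 10 inactives, everything predicted active.
accuracy = (90 + 0) / 100
print(accuracy, mcc(tp=90, tn=0, fp=10, fn=0))
```

In this made-up case accuracy is 0.9 while MCC is 0.0, which is why MCC is the more informative score when the class balance varies between targets.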
Is it a hard task? Base model: MPNN
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are helper classes defined in the linked Keras example.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
MCC reverse cumulative histogram
Random split Test case
MCC        RF               MPNN
           Ext     Test     Ext     Test
Avg        0.310   0.541    0.207   0.471
Median     0.282   0.589    0.178   0.493
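A reverse cumulative histogram answers the question "what fraction of targets reach at least this MCC?". The sketch below shows the calculation on a made-up list of per-target MCC values; the function name and the thresholds are assumptions.

```python
def reverse_cumulative(values, thresholds):
    """Fraction of values that are greater than or equal to each threshold."""
    n = len(values)
    return [sum(v >= t for v in values) / n for t in thresholds]


# Made-up per-target MCC values for four targets.
per_target_mcc = [0.1, 0.5, 0.7, 0.9]
print(reverse_cumulative(per_target_mcc, [0.0, 0.5, 0.8, 1.0]))
```

Plotting one such curve per model (for example RF and MPNN) lets the two be compared across all targets rather than only by average or median.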
Binned MCC evaluation random selected Test case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High
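The three bins above can be applied per target as sketched below. The slide does not say which bin the boundary values 0.45 and 0.65 belong to; assigning them to the upper bin is an assumption.

```python
from collections import Counter


def mcc_bin(value):
    """Map an MCC value to Low / Medium / High (boundaries go to the upper bin)."""
    if value < 0.45:
        return "Low"
    if value < 0.65:
        return "Medium"
    return "High"


def bin_counts(mcc_values):
    """Count how many targets fall in each bin, always reporting all three bins."""
    counts = Counter(mcc_bin(v) for v in mcc_values)
    return {label: counts.get(label, 0) for label in ("Low", "Medium", "High")}


# Made-up per-target MCC values.
print(bin_counts([0.2, 0.5, 0.6, 0.9]))
```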
MCC reverse cumulative histogram
Temporal split Ext case
Binned MCC evaluation temporal split Ext case
MCC<0.45: Low
0.45<MCC<0.65: Medium
0.65<MCC: High
How reliable?
Confidence?
Conformal prediction
Training set: split into proper training set and calibration set
Proper training set: model and error model
Calibration set: calibration factor (α) at P(80%)
Output: prediction and error
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168
Test: 14233 / 17661 (80.6%) within the error bound
Ext: 3344 / 4890 (68.4%) within the error bound
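The calibration step can be sketched as a normalized split-conformal procedure: on the calibration set, each absolute residual is divided by the error model's estimate, and the calibration factor is the scaled residual at the chosen confidence level (80% here). Whether Trainer Engine uses exactly this rank rule is an assumption, and all function names are illustrative.

```python
from math import ceil


def calibration_factor(y_true, y_pred, err_pred, confidence=0.80):
    """Scaled residual |y - y_hat| / err_hat at the requested confidence level."""
    scores = sorted(abs(t - p) / e for t, p, e in zip(y_true, y_pred, err_pred))
    # Standard split-conformal rank: the ceil((n + 1) * confidence)-th smallest score.
    rank = min(len(scores), ceil((len(scores) + 1) * confidence))
    return scores[rank - 1]


def interval(y_pred, err_pred, alpha):
    """Prediction interval: point prediction plus/minus the scaled predicted error."""
    half_width = alpha * err_pred
    return y_pred - half_width, y_pred + half_width


def coverage(y_true, y_pred, err_pred, alpha):
    """Fraction of measured values that fall inside their prediction interval."""
    hits = sum(abs(t - p) <= alpha * e for t, p, e in zip(y_true, y_pred, err_pred))
    return hits / len(y_true)


# Made-up calibration data: residuals 1..9 with a unit error estimate.
alpha = calibration_factor(list(range(1, 10)), [0] * 9, [1] * 9)
print(alpha, interval(5.0, 0.5, alpha))
```

The figures above correspond to `coverage` evaluated on the Test and Ext sets: close to the nominal 80% on the random Test split, lower on the temporal Ext split.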
Feasible?
Performance
AVG prediction time per set
Descriptor: ECFP6_1024_MACCS_PHYSCHEM
Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz
~50 ms/ cpd
72k cpd /h
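The two throughput figures are the same number in different units, assuming compounds are processed one after another on a single thread:

```python
def compounds_per_hour(ms_per_compound):
    """Convert a per-compound prediction time in milliseconds to compounds per hour."""
    return 3_600_000 / ms_per_compound


print(compounds_per_hour(50))  # about 50 ms per compound
```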
Drill down to the lowest performance set
TC_key: CHEMBL247 - CHEMBL2373969
BIOACT_PCHEMBL_VALUE: 7.54
TGT_CHEMBL_ID: CHEMBL247
TGT_ORGANISM: Human immunodeficiency virus 1
Feature engineering is the process of using domain knowledge to extract features from raw data.
Combined descriptor set
20 descriptor configurations × 161 targets = 3220 descriptor sets
RF Importance (ECFP4_CHEMTERM:All)
Importance: sum of node impurity decreases for each individual variable over trees.
Importance-based rank is evaluated.
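The importance definition above can be sketched without any ML library: each tree contributes the impurity decrease of every split to the variable used in that split, and the totals are then ranked. The input layout (a forest as a list of trees, each a list of `(feature, impurity_decrease)` pairs) and the feature names are hypothetical.

```python
from collections import defaultdict


def impurity_importance(forest):
    """Sum the node impurity decreases per variable over all trees."""
    totals = defaultdict(float)
    for tree in forest:
        for feature, decrease in tree:
            totals[feature] += decrease
    return dict(totals)


def rank_features(importance):
    """Rank 1 is the variable with the largest summed impurity decrease."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


# Made-up forest of two small trees.
forest = [
    [("logD", 0.5), ("ecfp_12", 0.25)],
    [("logD", 0.25), ("pKa", 0.5)],
]
print(rank_features(impurity_importance(forest)))
```

Working with ranks rather than raw importances makes the values comparable across targets, since raw impurity sums depend on each dataset's size and variance.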
RF Importance (ECFP4_CHEMTERM:All)
50% related to protonation or partitioning
https://www.nature.com/articles/s41598-019-55886-1
Protonation detour: Fentanyl
https://www.nature.com/articles/s41598-019-55886-1
Protonation detour: Fentanyl F-derivative: FF3
https://www.nature.com/articles/s41598-019-55886-1
No difference at pH 6.5
10x MOR IC50 difference at pH 7.4
Statistical Assessment of the Modeling of Proteins and Ligands,
SAMPL6 challenge logP
https://chemaxon.com/news/cxn-logp-prediction-sampl-6
SAMPL7 pKa
https://link.springer.com/article/10.1007/s10822-021-00397-3
SAMPL7 logD
https://link.springer.com/article/10.1007/s10822-021-00397-3
Classification use case:
Blood-Brain Barrier
Penetration
Blood brain barrier penetration model
MoleculeNet, 10.1039/c7sc02664a
Model analysis
Rich prediction results
Classification use case:
PAMPA Permeability
Data preparation
1) Source: PubChem BioAssay AID: 1508612
a) NCATS Parallel Artificial Membrane Permeability Assay (PAMPA) Profiling
b) Observed PAMPA at pH 7.4 (x 10^-6 cm/sec)
2) Standardize (strip salts, tautomerize)
3) Mw 0-800 -> 2029 cases
4) Permeability 100 cutoff using “Phenotype” field
0:Low, Medium (646 cases)
1:High (1383 cases)
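Steps 3 and 4 of the preparation can be sketched as a filter plus a binary label. The row keys (`smiles`, `mw`, `phenotype`) and the phenotype strings are assumptions about how the assay export might look; the standardization in step 2 needs a cheminformatics toolkit and is left out.

```python
def prepare_pampa(rows, max_mw=800.0):
    """Keep rows with 0 < Mw <= max_mw and label them 1 (High) or 0 (Low, Medium)."""
    prepared = []
    for row in rows:
        if not 0 < row["mw"] <= max_mw:
            continue  # outside the molecular weight window
        label = 1 if row["phenotype"].strip().lower() == "high" else 0
        prepared.append((row["smiles"], label))
    return prepared


# Made-up rows for illustration only.
rows = [
    {"smiles": "CCO", "mw": 46.07, "phenotype": "High"},
    {"smiles": "c1ccccc1", "mw": 78.11, "phenotype": "Low"},
    {"smiles": "CCN", "mw": 45.08, "phenotype": "Medium"},
    {"smiles": "C" * 70, "mw": 982.0, "phenotype": "High"},
]
print(prepare_pampa(rows))
```

With the counts above (646 vs 1383 cases), about two thirds of the data are in the High class, so a balance-aware metric such as MCC is a reasonable choice.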
Clustering MACCS similarity matrix tSNE projection
Is it a hard task? Base model: MPNN
Epoch 40/40
Train: AUC: 0.7897 MCC: 0.3921
Validation: AUC: 0.5879 MCC: 0.0740
https://keras.io/examples/graph/mpnn-molecular-graphs/
# AtomFeaturizer and BondFeaturizer are helper classes defined in the linked Keras example.
atom_featurizer = AtomFeaturizer(
    allowable_sets={
        "symbol": {"B", "Br", "C", "Ca", "Cl", "F", "H", "I", "N", "Na", "O", "P", "S"},
        "n_valence": {0, 1, 2, 3, 4, 5, 6},
        "n_hydrogens": {0, 1, 2, 3, 4},
        "hybridization": {"s", "sp", "sp2", "sp3"},
    }
)
bond_featurizer = BondFeaturizer(
    allowable_sets={
        "bond_type": {"single", "double", "triple", "aromatic"},
        "conjugated": {True, False},
    }
)
Classification use case:
hERG
Data source: Scientific Reports vol 9, 12220 (2019)
Keiji Ogura, Tomohiro Sato, Hitomi Yuki, Teruki Honma
- 9 890 hERG inhibitors (IC50 ≤ 10 μM or ≥50% inhibition at 10 μM)
- 281 329 inactive compounds (IC50 > 10 μM or <50% inhibition at 10 μM) according to their standardized chemical structures
- 204k training data points
- 87k test data points
- Descriptor set: ECFP6_2048_MACCS_PHYSCHEM
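The activity criteria above translate into a simple labeling rule. In this sketch the function and parameter names are hypothetical, and using the IC50 value in preference to the single-point inhibition when both are available is an assumption.

```python
def herg_label(ic50_um=None, inhibition_pct_at_10um=None):
    """Return 1 (hERG inhibitor), 0 (inactive), or None if nothing was measured."""
    if ic50_um is not None:
        return 1 if ic50_um <= 10.0 else 0
    if inhibition_pct_at_10um is not None:
        return 1 if inhibition_pct_at_10um >= 50.0 else 0
    return None


print(herg_label(ic50_um=2.5), herg_label(inhibition_pct_at_10um=12.0))
```

With 9 890 inhibitors against 281 329 inactives, only about 3.4% of the compounds are positive, so the class imbalance needs attention when training and scoring the classifier.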
Classification model
https://www.nature.com/articles/s41598-019-47536-3
hERG Classification Model
https://www.nature.com/articles/s41598-019-47536-3
Chemaxon
CONVERTING DATA TO PROJECT TEAM INSIGHTS
Machine learning life-cycle
Discovery teams: Design Landscape
Design Hub: discovery hub connecting chemical series, data, predictions and chemical project management
[Diagram: Design Hub; Series: H1 H2 H3 H4]
Discovery teams: Analysis
[Diagram: Design Hub; Series: H1 H2 H3 H4; Biological measurements]
Discovery teams: Biological data focus
[Diagram: Design Hub; Series: H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine]
Discovery teams: Train&Deploy
[Diagram: Production Models; Design Hub Services; Series: H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine; REST … API]
Discovery teams: Fill the gap
[Diagram: Production Models; Design Hub Services; Series: H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine; REST … API]
Discovery teams: Multi parameter optimization
[Diagram: Production Models; Design Hub Services; Series: H1 H2 H3 H4; Comp. Chem; Trainer GUI (Training / Analysis); Trainer Engine]
Local and global models
Translate data to reliable models
Centralize model management
Connect project team members and resources
Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine
Take away
- Chemical standardization to reduce noise
- Role of protonation and partitioning descriptors
- Successful model building on a large and diverse set of targets
- ML inference, delivery to medicinal chemists
Interested?
atarcsay@chemaxon.com
Thank you
