© 2019 KNIME AG. All Rights Reserved.
Building useful models for
imbalanced datasets (without
resampling)
Greg Landrum
(greg.landrum@knime.com)
COMP Together, UCSF
22 Aug 2019
First things first
• RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• The notebooks I used for this presentation are all on
GitHub:
– Original notebook: https://bit.ly/2UY2u2K
– Using the balanced random forest: https://bit.ly/2tuafSc
– Plotting: https://bit.ly/2GJSeHH
• I have a KNIME workflow that does the same thing. Let
me know if you're interested
• Download links for the datasets are in the blog post
The problem
• Typical datasets for bioactivity prediction tend to
have way more inactives than actives
• This leads to a couple of pathologies:
– Overall accuracy is really not a good metric for how useful
a model is
– Many learning algorithms produce way too many false
negatives
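The first pathology is easy to demonstrate with a couple of lines of scikit-learn. This is a hypothetical 90:10 inactive:active split (not the assay data): a model that never predicts "active" still scores 90% accuracy, while Cohen's kappa correctly reports it as no better than chance.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels: 900 inactives (0), 100 actives (1)
y_true = [0] * 900 + [1] * 100
# A useless model that predicts "inactive" for everything
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))     # high accuracy despite finding no actives
print(cohen_kappa_score(y_true, y_pred))  # 0: no better than chance
```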
Example dataset
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS
for Inhibitors of Tau Fibril Formation, Thioflavin T
Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations
from PubChem)
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try
other fingerprints, learning algorithms, and
hyperparameters in future iterations
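The KNIME settings above map directly onto scikit-learn (the same mapping the later slides use). A minimal sketch, using synthetic imbalanced data as a stand-in for the fingerprint matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2048-bit fingerprint matrix
X, y = make_classification(n_samples=2000, n_features=64,
                           weights=[0.9, 0.1], random_state=0)

# Stratified 80/20 training/holdout split, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 500 trees, max depth 15, min node size 2
cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4,
                             oob_score=True, random_state=0)
cls.fit(X_train, y_train)
print(cls.score(X_test, y_test))
```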
Results CHEMBL1614421: holdout data
Evaluation CHEMBL1614421: holdout data
AUROC=0.75
Taking stock
• Model has:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result; the majority wins.
The predicted class probabilities are often the
means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long
as true actives tend to have a higher predicted probability of being active
than true inactives the AUC will be good.
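That ranking argument is easy to verify: any strictly increasing transform of the predicted probabilities leaves the AUROC unchanged, because only the ordering matters. A small check with synthetic labels and probabilities (not the assay data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(200) < 0.15).astype(int)                  # ~15% actives
p = np.clip(0.3 * y + rng.normal(0.4, 0.2, 200), 0, 1)    # predicted P(active)

# Cubing is strictly increasing on [0, 1], so the ranking
# (and therefore the ROC curve and its AUC) is unchanged
print(roc_auc_score(y, p))
print(roc_auc_score(y, p ** 3))
```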
Handling imbalanced data
• The standard decision rule for a random forest (or
any bag classifier) is that the majority wins1, i.e. the
predicted probability of being active must be >=0.5
for the model to predict "active"
• Shift that threshold to a lower value for models built
on highly imbalanced datasets2
1 This is only strictly true for binary classifiers
2 Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and
QSAR in Environmental Research 17 (2006): 337–52.
Picking a new decision threshold: approach 1
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Try a number of different decision thresholds1 and
pick the one that gives the best kappa
• Once we have the decision threshold, use it to
generate predictions for the test set.
1 Here we use: [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
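The steps above can be sketched in scikit-learn. The data here is a synthetic stand-in for the fingerprints, but the out-of-bag threshold scan is the procedure described:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# Synthetic imbalanced training set standing in for the fingerprints
X, y = make_classification(n_samples=3000, n_features=64,
                           weights=[0.92, 0.08], random_state=0)

cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4,
                             oob_score=True, random_state=0)
cls.fit(X, y)

# Out-of-bag predicted probability of the active class
# for each training point: no extra data split needed
oob_p_active = cls.oob_decision_function_[:, 1]

# Try the thresholds from the slide, keep the one with the best kappa
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
kappas = [cohen_kappa_score(y, (oob_p_active >= t).astype(int))
          for t in thresholds]
best_t = thresholds[int(np.argmax(kappas))]
print(best_t)
```

The chosen `best_t` is then applied, unchanged, when predicting on the test set.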
Results CHEMBL1614421
• Balanced confusion matrix
• Kappa: previously 0.005
Nice! But does it work in general?
Validation experiment
The datasets (all extracted from ChEMBL_24)
• "Serotonin": 6 datasets with >900 Ki values for human
serotonin receptors
– Active: pKi > 9.0; Inactive: pKi < 8.5
– If that doesn't yield at least 50 actives: Active: pKi > 8.0; Inactive: pKi < 7.5
• "DS1": 80 "Dataset 1" sets1
– Active: 100 diverse measured actives ("standard_value<10uM");
Inactive: 2000 random compounds from the same property space
• "PubChem": 8 HTS validation assays with at least 3K
"Potency" values
– Active: "active" in dataset; Inactive: "inactive", "not active", or
"inconclusive" in dataset
• "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
– Active: "active" in dataset; Inactive: "not active" in dataset
1 S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision
making by committee can be a good thing." Journal of chemical information and modeling 53:2829-36 (2013).
Model building and validation
• Fingerprints: 2048 bit MorganFP radius=2
• 80/20 training/test split
• Random forest parameters:
– cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
• Try threshold values of [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
0.35, 0.40, 0.45, 0.50] with out-of-bag predictions and pick
the best based on kappa
• Generate initial kappa value for the test data using threshold
= 0.5
• Generate "balanced" kappa value for the test data with the
optimized threshold
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Picking a new decision threshold: approach 2
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Pick the threshold corresponding to the point on the
ROC curve that’s closest to the upper left corner
• Once we have the decision threshold, use it to
generate predictions for the test set.
Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in
Environmental Research 17 (2006): 337–52.
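A minimal sketch of the corner-distance rule, using synthetic labels and probabilities in place of the forest's out-of-bag predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical OOB probabilities and labels; in the real workflow these
# come from the random forest's out-of-bag predictions on the training set
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.1).astype(int)                   # ~10% actives
p = np.clip(0.25 * y + rng.normal(0.2, 0.1, 1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y, p)

# Distance from each ROC point to the ideal corner (FPR=0, TPR=1);
# the closest point gives the decision threshold
dist = np.hypot(fpr, 1 - tpr)
best_threshold = thresholds[np.argmin(dist)]
print(best_threshold)
```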
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
ChEMBL data, random-split validation
Other evaluation metrics: F1 score
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Compare to balanced random forests
• Resampling strategy that still uses the entire training
set
• Idea: train each tree on a balanced bootstrap
sample of the training data
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data.
https://statistics.berkeley.edu/tech-reports/666 (2004).
How do bag classifiers end up with different models?
Each tree is built
with a different
dataset
Balanced random forests
• Take advantage of the structure of the classifier.
• Learn each tree with a balanced dataset:
– Select a bootstrap sample of the minority class (actives)
– Randomly select, with replacement, the same number of
points from the majority class (inactives)
• Prediction works the same as with a normal random
forest
• Easy to do in scikit-learn using the imbalanced-learn
contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
– cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data.
https://statistics.berkeley.edu/tech-reports/666 (2004).
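The per-tree sampling scheme can also be sketched in plain NumPy (a hypothetical helper for illustration, not imbalanced-learn's implementation):

```python
import numpy as np

def balanced_bootstrap(y, rng):
    """Indices for one tree: bootstrap the actives, then draw the same
    number of inactives with replacement (Chen, Liaw & Breiman, 2004)."""
    active = np.flatnonzero(y == 1)     # minority class
    inactive = np.flatnonzero(y == 0)   # majority class
    boot_active = rng.choice(active, size=active.size, replace=True)
    boot_inactive = rng.choice(inactive, size=active.size, replace=True)
    return np.concatenate([boot_active, boot_inactive])

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)      # hypothetical 95:5 imbalance
idx = balanced_bootstrap(y, rng)
print(len(idx), y[idx].mean())          # 100 samples, half of them active
```

Each tree gets its own balanced sample, so collectively the forest still sees the entire majority class.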
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
What comes next
• Try the same thing with other learning methods like
logistic regression and stochastic gradient boosting
– These are more complicated since they can't do
out-of-bag classification
– We need to add another data split and loop to do
calibration and find the best threshold
• More datasets! I need *your* help with this
– I have a script for you to run that takes sets of compounds
with activity labels and outputs the summary statistics
that I'm using here
Acknowledgements
• Dean Abbott (Abbott Analytics)
• Daria Goldmann (KNIME)
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner