Building useful models for
imbalanced datasets
(without resampling)
Greg Landrum
Feb 2019
T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
First things first
● RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
● The notebooks I used for this presentation are all on GitHub:
○ Original notebook: https://bit.ly/2UY2u2K
○ Using the balanced random forest: https://bit.ly/2tuafSc
○ Plotting: https://bit.ly/2GJSeHH
● I have a KNIME workflow that does the same thing. Let me know if you're
interested
● Download links for the datasets are in the blog post
The problem
● Typical datasets for bioactivity prediction tend to have way more inactives
than actives
● This leads to a couple of pathologies:
○ Overall accuracy is really not a good metric for how useful a model is
○ Many learning algorithms produce way too many false negatives
The accuracy problem
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
                   Predicted inactive   Predicted active
Measured inactive                8681                  4
Measured active                  1102                 22

Overall accuracy: 88.7%
Accuracy is high, but the model is pretty useless for predicting actives.
The accuracy problem
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
                   Predicted inactive   Predicted active
Measured inactive                8681                  4
Measured active                  1102                 22
Overall accuracy: 88.7%
kappa: 0.033

This one has an easy solution: use Cohen's kappa. The low kappa makes the problem obvious (a minimal sketch of the calculation follows):
https://en.wikipedia.org/wiki/Cohen%27s_kappa
https://link.springer.com/article/10.1007/s10822-014-9759-6
https://www.researchgate.net/publication/258240105_TheKappaStatistic_PaulCzodrowski
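A minimal sketch (not from the slides) of the kappa calculation for the confusion matrix above: the observed agreement is compared with the agreement expected by chance from the row and column totals.

```python
import numpy as np

# confusion matrix from the slide: rows = measured, columns = predicted
cm = np.array([[8681,    4],   # measured inactive
               [1102,   22]])  # measured active
n = cm.sum()
po = np.trace(cm) / n                                # observed agreement (= accuracy)
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # agreement expected by chance
kappa = (po - pe) / (1 - pe)
print(f"accuracy: {po:.3f}  kappa: {kappa:.3f}")     # accuracy: 0.887  kappa: 0.033
```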
Is that model actually useless?
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
                   Predicted inactive   Predicted active
Measured inactive                8681                  4
Measured active                  1102                 22
Overall accuracy: 88.7%
kappa: 0.033
AUC: 0.850
The ROC curve shows that the model has actually learned something about the bioactivity. Why aren't we seeing that in the predictions?
Quick diversion on bag classifiers
When making predictions, each tree in the classifier votes on the result; the majority wins.

In scikit-learn the predicted class probabilities are the means of the predicted probabilities from the individual trees.

We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active. Note that the actual predictions are irrelevant for a ROC curve: as long as true actives tend to have a higher predicted probability of being active than true inactives, the AUC will be good. A small illustration of this follows.
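A small illustration (with made-up numbers, not the slide's data): the ROC curve and AUC depend only on the ordering of the predicted probabilities, never on the 0.5 cutoff used by predict().

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])  # 1 = active (hypothetical labels)
probs  = np.array([0.10, 0.35, 0.30, 0.15, 0.45, 0.40, 0.05, 0.20])

# the curve is traced by sweeping a threshold down the sorted probabilities
fpr, tpr, thresholds = roc_curve(y_true, probs)

# at the default 0.5 cutoff nothing here is predicted active,
# yet the ranking is nearly perfect, so the AUC is high (~0.93)
print("AUC:", roc_auc_score(y_true, probs))
```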
Ok, but who cares?
An idea
● The standard decision rule for a random forest (or any bag classifier) is that the majority wins[1], i.e. the predicted probability of being active must be >=0.5 in order for the model to predict "active"
● Shift that threshold to a lower value for models built on highly imbalanced datasets[2]; a tiny helper sketching this follows

[1] This is only strictly true for binary classifiers.
[2] F. Provost, T. Fawcett. "Robust classification for imprecise environments." Machine Learning 42:203-231 (2001).
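A tiny hypothetical helper showing what "shifting the threshold" means in scikit-learn terms: replace predict() and its implicit 0.5 cutoff with a thresholded predict_proba().

```python
def predict_with_threshold(clf, X, t=0.2):
    """Predict active (1) when the predicted probability of the active
    class is >= t; clf is any fitted scikit-learn classifier."""
    # column 1 of predict_proba() is the probability of the positive class
    return (clf.predict_proba(X)[:, 1] >= t).astype(int)
```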
How do we come up with a new decision threshold?
1. Generate a random forest for the dataset using a training set
2. Generate out-of-bag predicted probabilities using the training set
3. Try a number of different decision thresholds[1] and pick the one that gives the best kappa

Once we have the decision threshold, use it to generate predictions for the test set. A minimal sketch of this procedure follows.

[1] Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
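A minimal sketch of steps 1-3, assuming X_train and y_train (a fingerprint matrix and 0/1 activity labels) already exist; the forest parameters are the ones used later in the deck.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4, oob_score=True)
cls.fit(X_train, y_train)

# out-of-bag probability of being active for each training point:
# each point is scored only by the trees that did not see it
oob_probs = cls.oob_decision_function_[:, 1]

thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
kappas = [cohen_kappa_score(y_train, (oob_probs >= t).astype(int))
          for t in thresholds]
best_t = thresholds[int(np.argmax(kappas))]
print("best threshold:", best_t)
```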
Changing the decision threshold
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
Threshold = 0.5    Predicted inactive   Predicted active
Measured inactive                8681                  4
Measured active                  1101                 22

Overall accuracy: 88.7%
kappa: 0.033

Threshold = 0.2    Predicted inactive   Predicted active
Measured inactive                8261                424
Measured active                   595                528

Overall accuracy: 89.6%
kappa: 0.451 (kappa has improved dramatically)
Why does it work?
[Figure: Prob(active) superimposed on the ROC curve]
Why does it work?
The predicted probabilities of being active are clearly carrying significant information about activity.
[Figures: Prob(active) superimposed on the ROC curve; TPR and FPR plotted vs -Prob(active)]
It works!
● By shifting the decision threshold from 0.5 to 0.2 we managed to improve the predictivity of the model, as measured by kappa, from 0.033 to 0.451
● We did not need to retrain/change the model at all.
Yay! We're done!
It works!
● By shifting the decision threshold from 0.5 to 0.2 we managed to improve the predictivity of the model, as measured by kappa, from 0.033 to 0.451
● We did not need to retrain/change the model at all.
Not quite done yet!
● This is just one model. Does this approach work broadly?
● Let's try a broad collection of datasets
The datasets (all extracted from ChEMBL_24)
● "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors
○ Active: pKi > 9.0, Inactive: pKi < 8.5
○ If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5
● "DS1": 80 "Dataset 1" sets.1
○ Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random
compounds from the same property space
● "PubChem": 8 HTS Validation assays with at least 3K "Potency" values
○ Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset
● "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
○ Active: "active" in dataset. Inactive: "not active" in dataset
[1] S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing." Journal of Chemical Information and Modeling 53:2829-2836 (2013).
Model building and validation
● Fingerprints: 2048 bit Morgan fingerprints (radius=2)
● 80/20 training/test split
● Random forest parameters:
○ cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
● Try threshold values of [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] with out-of-bag predictions and pick the best based on kappa
● Generate the initial kappa value for the test data using threshold = 0.5
● Generate the "balanced" kappa value for the test data with the optimized threshold

A hedged end-to-end sketch of this procedure follows.
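A hedged end-to-end sketch of the setup above. Here mols and labels (RDKit molecules plus 0/1 activity labels) are assumed inputs, and best_t comes from the out-of-bag threshold scan shown earlier; everything else follows the slide's parameters.

```python
import numpy as np
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# 2048 bit Morgan fingerprints, radius 2
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048))
                for m in mols])
X_train, X_test, y_train, y_test = train_test_split(fps, labels, test_size=0.2)

cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4, oob_score=True)
cls.fit(X_train, y_train)

test_probs = cls.predict_proba(X_test)[:, 1]
initial  = cohen_kappa_score(y_test, (test_probs >= 0.5).astype(int))
balanced = cohen_kappa_score(y_test, (test_probs >= best_t).astype(int))
print(f"initial kappa: {initial:.3f}  balanced kappa: {balanced:.3f}")
```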
Does shifting the threshold actually help?
[Plot: point coloring is based on AUC for the test set]
Yes, it definitely helps
It works!
● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases, often significantly so
● We did not need to retrain/change the model at all.
Yay! We're done!
It works!
● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases, often significantly so
● We did not need to retrain/change the model at all.
No, wait, there's still more...
● What about other approaches for handling imbalanced datasets?
Quick diversion: How do bag classifiers end up with
different models?
Each tree is built with a different dataset. (A tiny illustration follows.)
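A tiny hypothetical illustration of the bootstrap step behind bagging: each tree draws its own training set by sampling the original rows with replacement, so every tree sees a different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10  # pretend the training set has 10 rows
for tree in range(3):
    # sample n_samples row indices with replacement
    idx = rng.choice(n_samples, size=n_samples, replace=True)
    print(f"tree {tree} trains on rows {sorted(idx)}")
```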
Another approach: Balanced Random Forests[1]
● Take advantage of the structure of the classifier.
● Learn each tree with a balanced dataset:
○ Select a bootstrap sample of the minority class (actives)
○ Randomly select, with replacement, the same number of points from the majority class
(inactives)
● Prediction works the same as with a normal random forest
● Easy to do in scikit-learn using the imbalanced-learn contrib package (a usage sketch follows):
https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
○ cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)

[1] C. Chen, A. Liaw, L. Breiman. "Using random forest to learn imbalanced data." University of California, Berkeley (2004), https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
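A minimal usage sketch, assuming X_train/y_train/X_test from the earlier setup; BalancedRandomForestClassifier is a drop-in replacement for the scikit-learn forest.

```python
from imblearn.ensemble import BalancedRandomForestClassifier

cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15,
                                     min_samples_leaf=2, n_jobs=4,
                                     oob_score=True)
cls.fit(X_train, y_train)    # the per-tree balancing happens inside fit()
preds = cls.predict(X_test)  # prediction works as with a normal forest
```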
Does BRF help?
[Plot: point coloring is based on AUC for the test set]
Yes, it definitely helps
How does BRF compare to shifting the threshold?
[Plot: point coloring is based on AUC for the test set]
Shifting the threshold tends to do better
Does shifting the threshold with BRF models help?
[Plot: point coloring is based on AUC for the test set]
Nope, that doesn't do anything
It works!
● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases, often significantly so
● The approach was generally better than using Balanced Random Forests
● We did not need to retrain/change the model at all.
Yay! We're done!
What comes next
● Try the same thing with other learning methods like logistic regression and stochastic gradient boosting
○ These are more complicated since they can't do out-of-bag classification
○ We need to add another data split and loop to do calibration and find the best threshold (see the sketch after this list)
● More datasets! I need *your* help with this
○ I can put together a script for you to run that takes sets of compounds with activity labels and outputs summary statistics like what I'm using here
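A hedged sketch of that extra split for logistic regression: hold out part of the training data as a calibration set, fit on the rest, and pick the threshold that maximizes kappa on the calibration set. X_train, y_train, and thresholds are assumed from the earlier setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# no out-of-bag predictions here, so hold out a calibration set instead
X_fit, X_cal, y_fit, y_cal = train_test_split(X_train, y_train, test_size=0.2)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_fit, y_fit)

cal_probs = lr.predict_proba(X_cal)[:, 1]
kappas = [cohen_kappa_score(y_cal, (cal_probs >= t).astype(int))
          for t in thresholds]
best_t = thresholds[int(np.argmax(kappas))]
```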
Prelim results: logistic regression
Acknowledgements
● Dean Abbott (Abbott Analytics, @deanabb)
● Daria Goldmann (KNIME)
More info
● RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
● The notebooks I used for this presentation are all on GitHub:
○ Original notebook: https://bit.ly/2UY2u2K
○ Using the balanced random forest: https://bit.ly/2tuafSc
○ Plotting: https://bit.ly/2GJSeHH
● I have a KNIME workflow that does the same thing. Let me know if you're
interested
● Download links for the datasets are in the blog post