© 2019 KNIME AG. All Rights Reserved.
Building useful models for
imbalanced datasets (without
resampling)
Greg Landrum
(greg.landrum@knime.com)
COMP Together, UCSF
22 Aug 2019
First things first
• RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• The notebooks I used for this presentation are all on
GitHub:
– Original notebook: https://bit.ly/2UY2u2K
– Using the balanced random forest: https://bit.ly/2tuafSc
– Plotting: https://bit.ly/2GJSeHH
• I have a KNIME workflow that does the same thing. Let
me know if you're interested
• Download links for the datasets are in the blog post
The problem
• Typical datasets for bioactivity prediction tend to
have way more inactives than actives
• This leads to a couple of pathologies:
– Overall accuracy is really not a good metric for how useful
a model is
– Many learning algorithms produce way too many false
negatives
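The first pathology is easy to demonstrate with a couple of lines of scikit-learn. This is a hypothetical 90:10 inactive:active split (not the assay data): a model that never predicts "active" still scores 90% accuracy, while Cohen's kappa correctly reports it as no better than chance.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels: 900 inactives (0), 100 actives (1)
y_true = [0] * 900 + [1] * 100
# A useless model that predicts "inactive" for everything
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))     # high accuracy despite finding no actives
print(cohen_kappa_score(y_true, y_pred))  # 0: no better than chance
```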
Example dataset
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS
for Inhibitors of Tau Fibril Formation, Thioflavin T
Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations
from PubChem)
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try
other fingerprints, learning algorithms, and
hyperparameters in future iterations
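The KNIME settings above map directly onto scikit-learn (the same mapping the later slides use). A minimal sketch, using synthetic imbalanced data as a stand-in for the fingerprint matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2048-bit fingerprint matrix
X, y = make_classification(n_samples=2000, n_features=64,
                           weights=[0.9, 0.1], random_state=0)

# Stratified 80/20 training/holdout split, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 500 trees, max depth 15, min node size 2
cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4,
                             oob_score=True, random_state=0)
cls.fit(X_train, y_train)
print(cls.score(X_test, y_test))
```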
Results CHEMBL1614421: holdout data
Evaluation CHEMBL1614421: holdout data
AUROC=0.75
Taking stock
• Model has:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result; the majority wins.
The predicted class probabilities are often the
means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long
as true actives tend to have a higher predicted probability of being active
than true inactives the AUC will be good.
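That ranking argument is easy to verify: any strictly increasing transform of the predicted probabilities leaves the AUROC unchanged, because only the ordering matters. A small check with synthetic labels and probabilities (not the assay data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(200) < 0.15).astype(int)                  # ~15% actives
p = np.clip(0.3 * y + rng.normal(0.4, 0.2, 200), 0, 1)    # predicted P(active)

# Cubing is strictly increasing on [0, 1], so the ranking
# (and therefore the ROC curve and its AUC) is unchanged
print(roc_auc_score(y, p))
print(roc_auc_score(y, p ** 3))
```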
Handling imbalanced data
• The standard decision rule for a random forest (or
any bag classifier) is that the majority wins1, i.e. the
predicted probability of being active must be >=0.5
for the model to predict "active"
• Shift that threshold to a lower value for models built
on highly imbalanced datasets2
1 This is only strictly true for binary classifiers
2 Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and
QSAR in Environmental Research 17 (2006): 337–52.
Picking a new decision threshold: approach 1
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Try a number of different decision thresholds1 and
pick the one that gives the best kappa
• Once we have the decision threshold, use it to
generate predictions for the test set.
1 Here we use: [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
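The steps above can be sketched in scikit-learn. The data here is a synthetic stand-in for the fingerprints, but the out-of-bag threshold scan is the procedure described:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# Synthetic imbalanced training set standing in for the fingerprints
X, y = make_classification(n_samples=3000, n_features=64,
                           weights=[0.92, 0.08], random_state=0)

cls = RandomForestClassifier(n_estimators=500, max_depth=15,
                             min_samples_leaf=2, n_jobs=4,
                             oob_score=True, random_state=0)
cls.fit(X, y)

# Out-of-bag predicted probability of the active class
# for each training point: no extra data split needed
oob_p_active = cls.oob_decision_function_[:, 1]

# Try the thresholds from the slide, keep the one with the best kappa
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
kappas = [cohen_kappa_score(y, (oob_p_active >= t).astype(int))
          for t in thresholds]
best_t = thresholds[int(np.argmax(kappas))]
print(best_t)
```

The chosen `best_t` is then applied, unchanged, when predicting on the test set.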
Results CHEMBL1614421
• Balanced confusion matrix
• Kappa: previously 0.005
Nice! But does it work in general?
Validation experiment
The datasets (all extracted from ChEMBL_24)
• "Serotonin": 6 datasets with >900 Ki values for human
serotonin receptors
– Active: pKi > 9.0; Inactive: pKi < 8.5
– If that doesn't yield at least 50 actives: Active: pKi > 8.0; Inactive: pKi < 7.5
• "DS1": 80 "Dataset 1" sets1
– Active: 100 diverse measured actives ("standard_value<10uM");
Inactive: 2000 random compounds from the same property space
• "PubChem": 8 HTS validation assays with at least 3K
"Potency" values
– Active: "active" in dataset; Inactive: "inactive", "not active", or
"inconclusive" in dataset
• "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
– Active: "active" in dataset; Inactive: "not active" in dataset
1 S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision
making by committee can be a good thing." Journal of chemical information and modeling 53:2829-36 (2013).
Model building and validation
• Fingerprints: 2048 bit MorganFP radius=2
• 80/20 training/test split
• Random forest parameters:
– cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
• Try threshold values of [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
0.35, 0.40, 0.45, 0.50] with out-of-bag predictions and pick
the best based on kappa
• Generate initial kappa value for the test data using threshold
= 0.5
• Generate "balanced" kappa value for the test data with the
optimized threshold
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Picking a new decision threshold: approach 2
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Pick the threshold corresponding to the point on the
ROC curve that’s closest to the upper left corner
• Once we have the decision threshold, use it to
generate predictions for the test set.
Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in
Environmental Research 17 (2006): 337–52.
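A minimal sketch of the corner-distance rule, using synthetic labels and probabilities in place of the forest's out-of-bag predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical OOB probabilities and labels; in the real workflow these
# come from the random forest's out-of-bag predictions on the training set
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.1).astype(int)                   # ~10% actives
p = np.clip(0.25 * y + rng.normal(0.2, 0.1, 1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y, p)

# Distance from each ROC point to the ideal corner (FPR=0, TPR=1);
# the closest point gives the decision threshold
dist = np.hypot(fpr, 1 - tpr)
best_threshold = thresholds[np.argmin(dist)]
print(best_threshold)
```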
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
ChEMBL data, random-split validation
Other evaluation metrics: F1 score
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Compare to balanced random forests
• Resampling strategy that still uses the entire training
set
• Idea: train each tree on a balanced bootstrap
sample of the training data
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data.
https://statistics.berkeley.edu/tech-reports/666 (2004).
How do bag classifiers end up with different models?
Each tree is built
with a different
dataset
Balanced random forests
• Take advantage of the structure of the classifier.
• Learn each tree with a balanced dataset:
– Select a bootstrap sample of the minority class (actives)
– Randomly select, with replacement, the same number of
points from the majority class (inactives)
• Prediction works the same as with a normal random
forest
• Easy to do in scikit-learn using the imbalanced-learn
contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
– cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data.
https://statistics.berkeley.edu/tech-reports/666 (2004).
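The per-tree sampling scheme can also be sketched in plain NumPy (a hypothetical helper for illustration, not imbalanced-learn's implementation):

```python
import numpy as np

def balanced_bootstrap(y, rng):
    """Indices for one tree: bootstrap the actives, then draw the same
    number of inactives with replacement (Chen, Liaw & Breiman, 2004)."""
    active = np.flatnonzero(y == 1)     # minority class
    inactive = np.flatnonzero(y == 0)   # majority class
    boot_active = rng.choice(active, size=active.size, replace=True)
    boot_inactive = rng.choice(inactive, size=active.size, replace=True)
    return np.concatenate([boot_active, boot_inactive])

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)      # hypothetical 95:5 imbalance
idx = balanced_bootstrap(y, rng)
print(len(idx), y[idx].mean())          # 100 samples, half of them active
```

Each tree gets its own balanced sample, so collectively the forest still sees the entire majority class.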
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
What comes next
• Try the same thing with other learning methods like
logistic regression and stochastic gradient boosting
– These are more complicated since they can't do
out-of-bag classification
– We need to add another data split and loop to do
calibration and find the best threshold
• More datasets! I need *your* help with this
– I have a script for you to run that takes sets of compounds
with activity labels and outputs the summary statistics
that I'm using here
Acknowledgements
• Dean Abbott (Abbott Analytics)
• Daria Goldmann (KNIME)
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner