Calibrating Probability with Undersampling
for Unbalanced Classification
Andrea Dal Pozzolo, Olivier Caelen,
Reid A. Johnson, and Gianluca Bontempi
8 December 2015
IEEE CIDM 2015
Cape Town, South Africa
INTRODUCTION
In several binary classification problems, the two classes
are not equally represented in the dataset.
In fraud detection, for example, fraudulent transactions are
rare compared to genuine ones (less than 1% [2]).
Many classification algorithms perform poorly with
unbalanced class distributions [5].
A standard solution to unbalanced classification is
rebalancing the classes before training a classifier.
UNDERSAMPLING
Undersampling is a well-known technique used to balance
a dataset.
It consists of downsizing the majority class by removing
observations at random until the dataset is balanced.
Several studies [11, 4] have reported that it improves
classification performance.
Most often, the consequences of undersampling on the
posterior probability of a classifier are ignored.
OBJECTIVE OF THIS STUDY
In this work we:
formalize how undersampling works.
show that undersampling is responsible for a shift in the
posterior probability of a classifier.
study how this shift is linked to class separability.
investigate how this shift produces biased probability
estimates.
show how to obtain and use unbiased (calibrated)
probability for classification.
THE PROBLEM
Let us consider a binary classification task f : R^n → {+, −},
where X ∈ R^n is the input domain and Y ∈ {+, −} the output domain.
+ is the minority and − the majority class.
Given a classifier K and a training set T_N, we are interested
in estimating, for a new sample (x, y), the posterior
probability p(y = +|x).
EFFECT OF UNDERSAMPLING
Suppose that a classifier K is trained on an unbalanced set T_N.
Let s be a binary random variable associated with each sample
(x, y) ∈ T_N: s = 1 if the point is kept by the sampling and
s = 0 otherwise.
Assume that s is independent of the input x given the class
y (class-dependent selection):

$$p(s \mid y, x) = p(s \mid y) \iff p(x \mid y, s) = p(x \mid y)$$
Figure: Undersampling removes majority-class examples at random,
turning the unbalanced dataset into a balanced one. Samples removed
from the unbalanced dataset (s = 0) are shown in red.
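A minimal sketch of this class-dependent selection in Python/NumPy (our own illustration; the function and variable names are not from the paper). All minority samples are kept, and each majority sample is kept with probability β, so the expected class sizes match:

```python
import numpy as np

def undersample(X, y, rng):
    """Bernoulli variant of random undersampling: keep every minority
    (y == 1) sample; keep each majority (y == 0) sample with
    probability beta = N+/N-, i.e. beta = p(s = 1 | -)."""
    beta = np.sum(y == 1) / np.sum(y == 0)
    s = (y == 1) | (rng.random(len(y)) < beta)   # selection variable s
    return X[s], y[s], beta

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))
y = (rng.random(10000) < 0.05).astype(int)       # ~5% minority class
Xb, yb, beta = undersample(X, y, rng)            # roughly balanced now
```

Many implementations instead draw exactly N+ majority samples without replacement; both variants satisfy the class-dependent assumption p(s|y, x) = p(s|y).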
POSTERIOR PROBABILITIES
Let p_s = p(+|x, s = 1) and p = p(+|x). We can write p_s as a
function of p [1]:

$$p_s = \frac{p}{p + \beta(1 - p)} \tag{1}$$

where β = p(s = 1|−). Using (1) we can obtain an expression of
p as a function of p_s:

$$p = \frac{\beta p_s}{\beta p_s - p_s + 1} \tag{2}$$
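As a quick numerical sanity check (a sketch, not from the paper), the two maps are exact inverses of each other, and p_s ≥ p whenever β < 1:

```python
import numpy as np

def ps_from_p(p, beta):
    """Eq. (1): posterior after undersampling, with beta = p(s=1|-)."""
    return p / (p + beta * (1.0 - p))

def p_from_ps(ps, beta):
    """Eq. (2): recover the original posterior from the biased one."""
    return beta * ps / (beta * ps - ps + 1.0)

p = np.linspace(0.0, 1.0, 11)
beta = 0.1                       # strong undersampling of the majority
ps = ps_from_p(p, beta)          # shifted towards the positive class
assert np.all(ps >= p)
assert np.allclose(p_from_ps(ps, beta), p)
```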
WARPING AND CLASS SEPARABILITY
Let ω+ and ω− denote p(x|+) and p(x|−), and let π+ (π+_s) be the
class priors before (after) undersampling. Using Bayes' theorem:

$$p = \frac{\omega^+ \pi^+}{\omega^+ - \delta \pi^-} \tag{3}$$

where δ = ω+ − ω−. Similarly, since ω+ does not change with
undersampling:

$$p_s = \frac{\omega^+ \pi^+_s}{\omega^+ - \delta \pi^-_s} \tag{4}$$

Now we can write p_s − p as:

$$p_s - p = \frac{\omega^+ \pi^+_s}{\omega^+ - \delta \pi^-_s} - \frac{\omega^+ \pi^+}{\omega^+ - \delta \pi^-} \tag{5}$$
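Equation (5) can be evaluated directly to see how the warp depends on class separability (a small sketch in the spirit of the next figure; the parameter values are taken from its caption):

```python
import numpy as np

def warp(omega_pos, delta, pi_pos, pi_pos_s):
    """Eq. (5): ps - p at a point x, given the class-conditional
    densities omega+ and omega- = omega+ - delta, and the positive
    priors before (pi+) and after (pi+_s) undersampling."""
    pi_neg, pi_neg_s = 1.0 - pi_pos, 1.0 - pi_pos_s
    p = omega_pos * pi_pos / (omega_pos - delta * pi_neg)
    ps = omega_pos * pi_pos_s / (omega_pos - delta * pi_neg_s)
    return ps - p

for omega_pos in (0.01, 0.1):
    delta = np.linspace(-omega_pos, omega_pos, 5)
    print(warp(omega_pos, delta, pi_pos=0.1, pi_pos_s=0.5))
```

At δ = ω+ (i.e. ω− = 0, the classes are perfectly separable at x) both posteriors equal 1 and the warp vanishes; it is largest where the class-conditional densities overlap.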
WARPING AND CLASS SEPARABILITY
Figure: p_s − p as a function of δ = ω+ − ω−, for values of
ω+ ∈ {0.01, 0.1}, when π+_s = 0.5 and π+ = 0.1.
WARPING AND CLASS SEPARABILITY II
[Panels: (a) p_s as a function of β; (b) class distribution, i.e. the
counts of the two classes over x, for µ = 3 and µ = 15.]
Figure: Class distribution and posterior probability as a function of β
for two univariate binary classification tasks with normal
class-conditional densities X− ∼ N(0, σ) and X+ ∼ N(µ, σ) (on the left
µ = 3 and on the right µ = 15; in both examples σ = 3). Note that p
corresponds to β = 1 and p_s to β < 1.
ADJUSTING POSTERIOR PROBABILITIES
We propose to correct p_s with p', which is obtained
using (2):

$$p' = \frac{\beta p_s}{\beta p_s - p_s + 1} \tag{6}$$

Eq. (6) is a special case of the framework proposed by Saerens
et al. [8] and Elkan [3] (see the Appendix in the paper).
[Plot: posterior probability over x ∈ [−10, 15] for the three
estimates p_s, p' and p.]
Figure: Posterior probabilities p_s, p' and p for β = N+/N− in the
dataset with overlapping classes (µ = 3).
CLASSIFICATION THRESHOLD
Let r+ and r− be the risks of predicting an instance as positive
and negative, respectively:

$$r^+ = (1 - p) \cdot l_{1,0} + p \cdot l_{1,1}$$
$$r^- = (1 - p) \cdot l_{0,0} + p \cdot l_{0,1}$$

where l_{i,j} is the cost of predicting i when the true class is j and
p = p(y = +|x). A sample is predicted as positive if
r+ ≤ r− [10]:

$$\hat{y} = \begin{cases} + & \text{if } r^+ \le r^- \\ - & \text{if } r^+ > r^- \end{cases} \tag{7}$$

Alternatively, predict as positive when p > τ, with:

$$\tau = \frac{l_{1,0} - l_{0,0}}{l_{1,0} - l_{0,0} + l_{0,1} - l_{1,1}} \tag{8}$$
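A small sketch of this cost-based decision rule (the cost values below are placeholders for illustration, not from the paper):

```python
def bayes_threshold(l10, l00, l01, l11):
    """Eq. (8): probability threshold that minimizes expected cost,
    where l_ij is the cost of predicting i when the true class is j."""
    return (l10 - l00) / (l10 - l00 + l01 - l11)

def predict(p, l10=1.0, l00=0.0, l01=5.0, l11=0.0):
    """Predict '+' when p = p(y=+|x) exceeds the cost threshold."""
    return "+" if p > bayes_threshold(l10, l00, l01, l11) else "-"

print(bayes_threshold(1.0, 0.0, 5.0, 0.0))   # 0.1666...: costly false
                                             # negatives lower the threshold
```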
CORRECTING THE CLASSIFICATION THRESHOLD
When the costs of a FN (l_{0,1}) and a FP (l_{1,0}) are unknown, we can
use the priors. Let l_{1,0} = π+ and l_{0,1} = π− (with zero cost for
correct predictions); from (8) we get:

$$\tau = \frac{l_{1,0}}{l_{1,0} + l_{0,1}} = \frac{\pi^+}{\pi^+ + \pi^-} = \pi^+ \tag{9}$$

since π+ + π− = 1. Then we should use π+ as the threshold with p:

p → τ = π+

Similarly,

p_s → τ_s = π+_s

From Elkan [3]:

$$\frac{\tau'}{1 - \tau'} \cdot \frac{1 - \tau_s}{\tau_s} = \beta \tag{10}$$

Therefore, we obtain:

p' → τ' = π+
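A sketch checking the consistency of these thresholds, assuming fully balanced undersampling (so that τ_s = π+_s = 0.5 and β = N+/N−):

```python
def tau_prime(tau_s, beta):
    """Threshold to use with p', solved from Elkan's relation (10):
    tau'/(1 - tau') * (1 - tau_s)/tau_s = beta."""
    return beta * tau_s / (beta * tau_s - tau_s + 1.0)

n_pos, n_neg = 500, 9500
beta = n_pos / n_neg                    # undersampling rate giving balance
pi_pos = n_pos / (n_pos + n_neg)
assert abs(tau_prime(0.5, beta) - pi_pos) < 1e-12   # tau' = pi+
```

Note that τ' is obtained from τ_s by the same monotone transformation (2)/(6) that maps p_s to p', so thresholding p' at τ' and p_s at τ_s yield identical decisions.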
EXPERIMENTAL SETTINGS
We denote by ˆp_s, ˆp' and ˆp the estimates of p_s, p' and p.
Goal: understand which probability estimate yields the best
ranking (AUC), calibration (Brier Score, BS) and classification
accuracy (G-mean).
We use 10-fold cross-validation (CV) to test our models,
and we repeat the CV 10 times.
We test several classification algorithms: Random
Forest [7], SVM [6], and Logit Boost [9].
We consider real-world unbalanced datasets from the UCI
repository used in [1]; a sketch of the evaluation loop follows.
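A compressed sketch of one CV repetition (our own illustration, not the authors' code: the paper's experiments use the R packages cited in [6, 7, 9], whereas this sketch uses scikit-learn's RandomForestClassifier as a stand-in; y is assumed to be 0/1 with 1 the minority class):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def g_mean(y, y_hat):
    """Geometric mean of sensitivity and specificity."""
    return np.sqrt(np.mean(y_hat[y == 1]) * np.mean(1 - y_hat[y == 0]))

def one_cv_repetition(X, y, seed=0):
    rng = np.random.default_rng(seed)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train, test in cv.split(X, y):
        Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]
        beta = ytr.sum() / (len(ytr) - ytr.sum())           # N+/N-
        keep = (ytr == 1) | (rng.random(len(ytr)) < beta)   # undersample
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(Xtr[keep], ytr[keep])
        ps = clf.predict_proba(Xte)[:, 1]                   # biased hat(p_s)
        p_prime = beta * ps / (beta * ps - ps + 1.0)        # Eq. (6): hat(p')
        tau_s, tau_p = 0.5, ytr.mean()                      # tau_s; tau' = pi+
        yield {"AUC": roc_auc_score(yte, ps),               # identical for p'
               "BS_ps": brier_score_loss(yte, ps),
               "BS_p'": brier_score_loss(yte, p_prime),
               "G_ps": g_mean(yte, (ps > tau_s).astype(int)),
               "G_p'": g_mean(yte, (p_prime > tau_p).astype(int))}
```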
LEARNING FRAMEWORK
[Diagram: the unbalanced dataset is split into 10 CV folds. Each
training split is used twice: as-is to train an unbalanced model
(giving ˆp with threshold τ), and after undersampling to train a
balanced model (giving ˆp_s with threshold τ_s, corrected into ˆp'
with threshold τ').]
Figure: Learning framework for comparing models with and
without undersampling using cross-validation (CV). We use one fold
of the CV as the testing set and the others for training, and iterate
so that every fold is used once for testing.
DATASETS
Table: Datasets from the UCI repository used in [1].
Datasets N N+ N− N+/N
ecoli 336 35 301 0.10
glass 214 17 197 0.08
letter-a 20000 789 19211 0.04
letter-vowel 20000 3878 16122 0.19
ism 11180 260 10920 0.02
letter 20000 789 19211 0.04
oil 937 41 896 0.04
page 5473 560 4913 0.10
pendigits 10992 1142 9850 0.10
PhosS 11411 613 10798 0.05
satimage 6430 625 5805 0.10
segment 2310 330 1980 0.14
boundary 3505 123 3382 0.04
estate 5322 636 4686 0.12
cam 18916 942 17974 0.05
compustat 13657 520 13137 0.04
covtype 38500 2747 35753 0.07
RESULTS
Table: Sum of ranks, and p-values of the paired t-test between the
ranks of ˆp and ˆp_s and between the ranks of ˆp and ˆp', for the
different metrics. An asterisk marks the probability with the best
rank sum (higher for AUC and G-mean, lower for BS).

Metric  Algo  R_ˆp       R_ˆp_s     R_ˆp'      ρ(R_ˆp, R_ˆp_s)  ρ(R_ˆp, R_ˆp')
AUC     LB    22,516     23,572*    23,572*    0.322            0.322
AUC     RF    24,422*    22,619     22,619     0.168            0.168
AUC     SVM   19,595     19,902.5*  19,902.5*  0.873            0.873
G-mean  LB    23,281*    23,189.5   23,189.5   0.944            0.944
G-mean  RF    22,986     23,337*    23,337*    0.770            0.770
G-mean  SVM   19,550     19,925*    19,925*    0.794            0.794
BS      LB    19,809.5*  29,448.5   20,402     0.000            0.510
BS      RF    18,336*    28,747     22,577     0.000            0.062
BS      SVM   17,139*    23,161     19,100     0.001            0.156
RESULTS II
The rank sum is the same for ˆp_s and ˆp' since (6) is
monotone.
Undersampling does not always improve the ranking
(AUC) or classification accuracy (G-mean) of an algorithm.
ˆp is the probability estimate with the best calibration
(lowest rank sum for BS).
ˆp' always has better calibration than ˆp_s, so we should use
ˆp' instead of ˆp_s.
CREDIT CARDS DATASET
Real-world credit card dataset with transactions from September
2013; frauds account for 0.172% of all transactions.
[Plots for the credit-card dataset, one panel per algorithm (LB, RF,
SVM): AUC (roughly 0.900 to 1.000) and Brier Score (roughly 3e−04 to
9e−04) as functions of β ∈ {0.1, 0.2, ..., 0.9, 1}, for the
probabilities p, p' and p_s.]
CONCLUSION
As a result of undersampling, ˆp_s is shifted away from ˆp.
This shift is stronger for overlapping distributions and gets
larger for small values of β.
Using (6), we can remove the shift in ˆp_s and obtain ˆp',
which has better calibration.
ˆp' provides the same ranking quality as ˆp_s.
Results from the UCI and credit card datasets show that, using
ˆp' with τ', we are able to improve calibration without losing
predictive accuracy.
Credit card dataset: http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata
Website: www.ulb.ac.be/di/map/adalpozz
Email: adalpozz@ulb.ac.be
Thank you for your attention.
Research is supported by the Doctiris scholarship
funded by Innoviris, Brussels, Belgium.
BIBLIOGRAPHY
[1] A. Dal Pozzolo, O. Caelen, and G. Bontempi.
When is undersampling effective in unbalanced classification tasks?
In Machine Learning and Knowledge Discovery in Databases. Springer, 2015.
[2] A. Dal Pozzolo, O. Caelen, Y.-A. Le Borgne, S. Waterschoot, and G. Bontempi.
Learned lessons in credit card fraud detection from a practitioner perspective.
Expert Systems with Applications, 41(10):4915–4928, 2014.
[3] C. Elkan.
The foundations of cost-sensitive learning.
In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978, 2001.
[4] A. Estabrooks, T. Jo, and N. Japkowicz.
A multiple resampling method for learning from imbalanced data sets.
Computational Intelligence, 20(1):18–36, 2004.
[5] N. Japkowicz and S. Stephen.
The class imbalance problem: A systematic study.
Intelligent data analysis, 6(5):429–449, 2002.
[6] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis.
kernlab - an S4 package for kernel methods in R.
Journal of Statistical Software, 11(9):1-20, 2004.
[7] A. Liaw and M. Wiener.
Classification and regression by randomForest.
R News, 2(3):18–22, 2002.
[8] M. Saerens, P. Latinne, and C. Decaestecker.
Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure.
Neural computation, 14(1):21–41, 2002.
[9] J. Tuszynski.
caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc.
R package version 1.16, 2013.
[10] V. N. Vapnik.
Statistical Learning Theory.
Wiley, New York, 1998.
[11] G. M. Weiss and F. Provost.
The effect of class distribution on classifier learning: an empirical study.
Technical report, Rutgers University, 2001.