When is undersampling effective in
unbalanced classification tasks?
Andrea Dal Pozzolo, Olivier Caelen,
and Gianluca Bontempi
09/09/2015
ECML-PKDD 2015
Porto, Portugal
INTRODUCTION
In several binary classification problems, the two classes
are not equally represented in the dataset.
In fraud detection, for example, fraudulent transactions are
rare compared to genuine ones (less than 1% [3]).
Many classification algorithms perform poorly with an
unbalanced class distribution [7].
A standard solution to unbalanced classification is to
rebalance the classes before training a classifier.
UNDERSAMPLING
Undersampling is a well-known technique used to balance
a dataset.
It consists of down-sizing the majority class by removing
observations at random until the dataset is balanced (see
the sketch below).
Some works have empirically shown that classifiers
perform better with a balanced dataset [10, 6].
Others show that a balanced training set does not improve
performance [2, 7].
There is not yet a theoretical framework motivating
undersampling.
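For concreteness, a minimal sketch of random undersampling, not from the talk; the helper name and the 0/1 label encoding (1 = minority class) are illustrative assumptions:

```python
import numpy as np

def undersample(X, y, beta=None, seed=0):
    """Randomly drop negatives (label 0). With beta=None, keep as many
    negatives as positives (fully balanced); otherwise keep a fraction
    beta of the negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = len(pos) if beta is None else int(round(beta * len(neg)))
    kept_neg = rng.choice(neg, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([pos, kept_neg]))
    return X[idx], y[idx]
```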
OBJECTIVE OF THIS STUDY
We aim to analyse the role of the two side-effects of
undersampling on the final accuracy:
The warping of the posterior distribution [5, 8].
The increase in variance due to sample removal.
We analyse their impact on the final ranking of the posterior
probabilities.
We show under which condition undersampling is
expected to improve classification accuracy.
THE PROBLEM
Let us consider a binary classification task f : R^n → {+, −}.
X ∈ R^n is the input and Y ∈ {+, −} the output domain.
+ is the minority and − the majority class.
Given a classifier K and a sample (x, y), we are interested
in estimating the posterior probability p(y = +|x).
We want to study the effect of undersampling on the
posterior probability.
THE PROBLEM II
Let (X_s, Y_s) ⊂ (X, Y) be the balanced sample of (X, Y), i.e.
(X_s, Y_s) contains all the positives and a subset of the
negatives in (X, Y).
Let s be a random variable associated to each sample
(x, y) ∈ (X, Y), with s = 1 if the point is in (X_s, Y_s) and
s = 0 otherwise.
Assume that s is independent of the input x given the class
y (class-dependent selection):
p(s|y, x) = p(s|y) ⇔ p(x|y, s) = p(x|y)
Figure: Undersampling: randomly remove majority-class examples.
POSTERIOR PROBABILITIES
p(+|x, s = 1) = p(s = 1|+, x) p(+|x) / [ p(s = 1|+, x) p(+|x) + p(s = 1|−, x) p(−|x) ]   (1)

In undersampling every positive is kept, i.e. p(s = 1|+, x) = 1, and the selection is class-dependent, so p(s = 1|−, x) = p(s = 1|−) = β. We can then write:

ps = p(+|x, s = 1) = p(+|x) / [ p(+|x) + β p(−|x) ] = p / ( p + β(1 − p) )   (2)

where p = p(+|x).

Figure: p and ps at different β.
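The warping in (2) is straightforward to evaluate; a short sketch, where the 1% posterior and the choice of β are illustrative assumptions:

```python
import numpy as np

def warped_posterior(p, beta):
    """Eq. (2): posterior seen by the classifier after undersampling,
    ps = p / (p + beta * (1 - p)), with beta = p(s=1|-)."""
    p = np.asarray(p, dtype=float)
    return p / (p + beta * (1.0 - p))

# A 1% posterior is pushed to 50% when beta equals the positive/negative ratio:
print(warped_posterior(0.01, beta=0.01 / 0.99))  # -> 0.5
```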
WARPING AND CLASS SEPARABILITY
(a) ps as a function of β. (b) Class distribution.
Figure: Class distribution and posterior probability as a function of β
for two univariate binary classification tasks with normal class-conditional
densities X− ∼ N(0, σ) and X+ ∼ N(µ, σ) (µ = 3 on the left, µ = 15 on the
right; σ = 3 in both examples). Note that p corresponds to β = 1 and ps to β < 1.
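The univariate tasks in the figure are easy to reproduce; a sketch under the stated densities, where the 1% class prior and β = 0.1 are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def true_posterior(x, mu, sigma=3.0, prior_pos=0.01):
    """p(+|x) for X- ~ N(0, sigma) and X+ ~ N(mu, sigma)."""
    num = prior_pos * norm.pdf(x, loc=mu, scale=sigma)
    den = num + (1.0 - prior_pos) * norm.pdf(x, loc=0.0, scale=sigma)
    return num / den

x = np.linspace(-10, 20, 200)
p = true_posterior(x, mu=3.0)       # unwarped posterior (beta = 1)
ps = p / (p + 0.1 * (1.0 - p))      # warped posterior for beta = 0.1
```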
RANKING ERROR
Let p̂ (resp. p̂s) denote the estimate of p (resp. ps).
Assume p1 < p2 and let ∆p = p2 − p1 > 0.
Let p̂1 = p1 + ε1 and p̂2 = p2 + ε2, with ε ∼ N(b, ν), where b
and ν are the bias and the variance of the estimator of p.
We have a wrong ranking if p̂1 > p̂2, and its probability is:

P(p̂2 < p̂1) = P(p2 + ε2 < p1 + ε1) = P(ε1 − ε2 > ∆p)

where ε1 − ε2 ∼ N(0, 2ν). Under the hypothesis of normality we have

P(ε1 − ε2 > ∆p) = 1 − Φ( ∆p / √(2ν) )   (3)

where Φ is the cumulative distribution function of the standard
normal distribution.
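Equation (3) can be evaluated directly; a sketch with scipy, where ∆p and ν are assumed inputs:

```python
from scipy.stats import norm

def ranking_error_prob(delta_p, nu):
    """Eq. (3): P(wrong ranking) = 1 - Phi(delta_p / sqrt(2 * nu))."""
    return 1.0 - norm.cdf(delta_p / (2.0 * nu) ** 0.5)

# Close posteriors plus a noisy estimator swap ranks often:
print(ranking_error_prob(delta_p=0.05, nu=0.01))  # ~0.36
```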
RANKING ERROR WITH UNDERSAMPLING
Let p̂s,1 = ps,1 + η1 and p̂s,2 = ps,2 + η2, where η ∼ N(bs, νs).
νs > ν, i.e. the variance is larger given the smaller number of
samples.
ps,1 < ps,2 and ∆ps = ps,2 − ps,1 > 0 because (2) is monotone.
The probability of a ranking error with undersampling is:

P(p̂s,2 < p̂s,1) = P(η1 − η2 > ∆ps)

and

P(η1 − η2 > ∆ps) = 1 − Φ( ∆ps / √(2νs) )   (4)
CONDITION FOR A BETTER RANKING WITH
UNDERSAMPLING
A classifier K has a better ranking with undersampling when

P(ε1 − ε2 > ∆p) > P(η1 − η2 > ∆ps)   (5)

or equivalently, from (3) and (4), when

1 − Φ( ∆p / √(2ν) ) > 1 − Φ( ∆ps / √(2νs) )  ⇔  Φ( ∆p / √(2ν) ) < Φ( ∆ps / √(2νs) )

Since Φ is monotone non-decreasing and νs > ν, and since for nearby
points ∆ps ≈ (dps/dp) ∆p, this holds when

dps/dp > √(νs/ν)   (6)

where dps/dp is the derivative of ps w.r.t. p:

dps/dp = β / ( p + β(1 − p) )²
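Condition (6) can then be checked pointwise; a sketch where the variance ratio νs/ν is an assumed input (in practice it must be estimated, as the next slide stresses):

```python
def undersampling_helps(p, beta, var_ratio):
    """Condition (6): dps/dp > sqrt(nu_s / nu), with
    dps/dp = beta / (p + beta * (1 - p))**2 and var_ratio = nu_s / nu."""
    dps_dp = beta / (p + beta * (1.0 - p)) ** 2
    return dps_dp > var_ratio ** 0.5

# Low-posterior points benefit more from strong undersampling:
print(undersampling_helps(p=0.01, beta=0.05, var_ratio=2.0))  # True
print(undersampling_helps(p=0.60, beta=0.05, var_ratio=2.0))  # False
```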
FACTORS INFLUENCING (6)
Whether inequality (6) holds depends on several terms:
The rate of undersampling β, which impacts the terms ps and νs.
The ratio of the variances νs/ν.
The posterior probability p of the testing point.
Condition (6) is hard to verify: β can be controlled by the
designer, but dps/dp and νs/ν vary over the input space.
This means that (6) does not necessarily hold for all the test
points.
UNIVARIATE SYNTHETIC DATASET
(a) Class-conditional distributions (thin lines) and the posterior
distribution of the minority class (thicker line). (b) dps/dp (solid
lines), √(νs/ν) (dotted lines).
Figure: Non-separable case. On the right we plot both terms of
inequality (6) (solid: left-hand term, dotted: right-hand term) for
β = 0.1 and β = 0.4.
BIVARIATE SYNTHETIC DATASET
(a) Synthetic dataset 1. (b) √(νs/ν) and dps/dp for different β.
Figure: Left: distribution of the testing set (scatter of X1 vs. X2,
colored by class), where the positive samples account for 5% of the
total. Right: plot of the dps/dp percentiles (25th, 50th and 75th)
and of √(νs/ν) (black dashed).
BIVARIATE SYNTHETIC DATASET II
(a) Undersampling with β = 0.053. (b) Undersampling with β = 0.323.
Figure: Regions where undersampling should work. Triangles
indicate the testing samples where condition (6) holds for the
dataset in Figure 5.
BIVARIATE SYNTHETIC DATASET III
Table: Classification task in Figure 5: ranking correlation between
the posterior probability p̂ (p̂s) and p for different values of β.
K (Ks) denotes the Kendall rank correlation without (with)
undersampling. The first (last) five lines refer to samples for which
condition (6) is (not) satisfied.
β K Ks Ks − K %points satisfying (6)
0.053 0.298 0.749 0.451 88.8
0.076 0.303 0.682 0.379 89.7
0.112 0.315 0.619 0.304 91.2
0.176 0.323 0.555 0.232 92.1
0.323 0.341 0.467 0.126 93.7
0.053 0.749 0.776 0.027 88.8
0.076 0.755 0.773 0.018 89.7
0.112 0.762 0.764 0.001 91.2
0.176 0.767 0.761 -0.007 92.1
0.323 0.768 0.748 -0.020 93.7
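K and Ks are Kendall rank correlations between the estimated and true posteriors; with scipy they can be computed as below (the arrays are illustrative placeholders, not the paper's data):

```python
import numpy as np
from scipy.stats import kendalltau

p_true  = np.array([0.01, 0.05, 0.20, 0.60, 0.90])  # true posteriors
p_hat   = np.array([0.02, 0.03, 0.25, 0.55, 0.80])  # estimates, no undersampling
p_hat_s = np.array([0.30, 0.45, 0.70, 0.85, 0.95])  # estimates after undersampling

K,  _ = kendalltau(p_true, p_hat)
Ks, _ = kendalltau(p_true, p_hat_s)
print(Ks - K)  # > 0 means undersampling improved the ranking
```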
REAL DATASETS
Table: Selected datasets from the UCI repository [1]¹
Datasets N N+ N− N+/N
ecoli 336 35 301 0.10
glass 214 17 197 0.08
letter-a 20000 789 19211 0.04
letter-vowel 20000 3878 16122 0.19
ism 11180 260 10920 0.02
letter 20000 789 19211 0.04
oil 937 41 896 0.04
page 5473 560 4913 0.10
pendigits 10992 1142 9850 0.10
PhosS 11411 613 10798 0.05
satimage 6430 625 5805 0.10
segment 2310 330 1980 0.14
boundary 3505 123 3382 0.04
estate 5322 636 4686 0.12
cam 18916 942 17974 0.05
compustat 13657 520 13137 0.04
covtype 38500 2747 35753 0.07
¹ Transformed datasets are available at
http://www.ulb.ac.be/di/map/adalpozz/imbalanced-datasets.zip
Figure: Difference between the Kendall rank correlations of p̂s and p̂
with p, namely Ks and K, for points where condition (6) is satisfied
and points where it is not. Ks and K are computed as the mean of the
correlations over all βs.
Figure: Ratio between the number of samples satisfying condition (6)
and all the instances available in each dataset, averaged over all
the βs.
SUMMARY
Undersampling has two major effects:
it increases the variance of the classifier;
it produces warped posterior probabilities.
Countermeasures:
averaging strategies (e.g. UnderBagging [9]);
calibration of the probabilities to the new priors of the
testing set [8] (see the sketch below).
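Applied to the class-dependent selection of this talk, the prior-correction procedure of [8] amounts to inverting eq. (2); a minimal sketch (hypothetical helper):

```python
def calibrate(ps, beta):
    """Invert eq. (2) to recover the posterior under the original priors:
    p = beta * ps / (beta * ps + 1 - ps)."""
    return beta * ps / (beta * ps + 1.0 - ps)

# Round trip: warping then calibrating returns the original posterior.
p, beta = 0.01, 0.05
ps = p / (p + beta * (1 - p))
print(calibrate(ps, beta))  # -> 0.01
```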
Despite the popularity of undersampling, it is not clear how
these two effects interact and when undersampling leads to
better accuracy in the classification task.
CONCLUSION
When (6) is satisfied, the posterior probability obtained
after undersampling returns a more accurate ordering.
Several factors influence (6) (e.g. β, the variance of the
classifier, class separability).
Practical use of (6) is not straightforward, since it requires
knowledge of p and νs/ν (not easy to estimate).
This result warns against a naive use of undersampling
in unbalanced tasks.
We suggest the adoption of adaptive selection techniques
(e.g. racing [4]) to perform a case-by-case use of
undersampling.
Code: https://github.com/dalpozz/warping
Website: www.ulb.ac.be/di/map/adalpozz
Email: adalpozz@ulb.ac.be
Thank you for your attention
Research is supported by the Doctiris scholarship
funded by Innoviris, Brussels, Belgium.
BIBLIOGRAPHY
[1] A. Asuncion and D. Newman.
UCI Machine Learning Repository, 2007.
[2] G. E. Batista, R. C. Prati, and M. C. Monard.
A study of the behavior of several methods for balancing machine learning training data.
ACM SIGKDD Explorations Newsletter, 6(1):20–29, 2004.
[3] A. Dal Pozzolo, O. Caelen, Y.-A. Le Borgne, S. Waterschoot, and G. Bontempi.
Learned lessons in credit card fraud detection from a practitioner perspective.
Expert Systems with Applications, 41(10):4915–4928, 2014.
[4] A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi.
Racing for unbalanced methods selection.
In Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning.
IDEAL, 2013.
[5] C. Elkan.
The foundations of cost-sensitive learning.
In International joint conference on artificial intelligence, volume 17, pages 973–978. Citeseer, 2001.
[6] A. Estabrooks, T. Jo, and N. Japkowicz.
A multiple resampling method for learning from imbalanced data sets.
Computational Intelligence, 20(1):18–36, 2004.
[7] N. Japkowicz and S. Stephen.
The class imbalance problem: A systematic study.
Intelligent data analysis, 6(5):429–449, 2002.
[8] M. Saerens, P. Latinne, and C. Decaestecker.
Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure.
Neural computation, 14(1):21–41, 2002.
[9] S. Wang, K. Tang, and X. Yao.
Diversity exploration and negative correlation learning on imbalanced data sets.
In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 3259–3266. IEEE, 2009.
[10] G. M. Weiss and F. Provost.
The effect of class distribution on classifier learning: an empirical study.
Technical report, Rutgers University, 2001.