When is undersampling effective in
unbalanced classification tasks?
Andrea Dal Pozzolo, Olivier Caelen,
and Gianluca Bontempi
09/09/2015
ECML-PKDD 2015
Porto, Portugal
INTRODUCTION
In several binary classification problems, the two classes
are not equally represented in the dataset.
In fraud detection, for example, fraudulent transactions are
rare compared to genuine ones (less than 1% [3]).
Many classification algorithms perform poorly with an
unbalanced class distribution [7].
A standard solution to unbalanced classification is to
rebalance the classes before training a classifier.
UNDERSAMPLING
Undersampling is a well-known technique used to balance
a dataset.
It consists of down-sizing the majority class by removing
observations at random until the dataset is balanced (see
the sketch below).
Some works have empirically shown that classifiers
perform better with a balanced dataset [10, 6].
Others show that a balanced training set does not improve
performance [2, 7].
There is not yet a theoretical framework motivating
undersampling.
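For concreteness, a minimal sketch of random undersampling, not from the talk; the helper name and the 0/1 label encoding (1 = minority class) are illustrative assumptions:

```python
import numpy as np

def undersample(X, y, beta=None, seed=0):
    """Randomly drop negatives (label 0). With beta=None, keep as many
    negatives as positives (fully balanced); otherwise keep a fraction
    beta of the negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = len(pos) if beta is None else int(round(beta * len(neg)))
    kept_neg = rng.choice(neg, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([pos, kept_neg]))
    return X[idx], y[idx]
```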
OBJECTIVE OF THIS STUDY
We aim to analyse the role of the two side-effects of
undersampling on the final accuracy:
The warping of the posterior distribution [5, 8].
The increase in variance due to sample removal.
We analyse their impact on the final ranking of the posterior
probabilities.
We show under which condition undersampling is
expected to improve classification accuracy.
THE PROBLEM
Let us consider a binary classification task f : R^n → {+, −}.
X ∈ R^n is the input and Y ∈ {+, −} the output domain.
+ is the minority and − the majority class.
Given a classifier K and a sample (x, y), we are interested
in estimating the posterior probability p(y = +|x).
We want to study the effect of undersampling on the
posterior probability.
THE PROBLEM II
Let (X_s, Y_s) ⊂ (X, Y) be the balanced sample of (X, Y), i.e.
(X_s, Y_s) contains all the positives and a subset of the
negatives in (X, Y).
Let s be a random variable associated to each sample
(x, y) ∈ (X, Y), with s = 1 if the point is in (X_s, Y_s) and
s = 0 otherwise.
Assume that s is independent of the input x given the class
y (class-dependent selection):
p(s|y, x) = p(s|y) ⇔ p(x|y, s) = p(x|y)
Figure: Undersampling: randomly remove majority-class examples.
POSTERIOR PROBABILITIES
p(+|x, s = 1) = p(s = 1|+, x) p(+|x) / [ p(s = 1|+, x) p(+|x) + p(s = 1|−, x) p(−|x) ]   (1)

In undersampling every positive is kept, i.e. p(s = 1|+, x) = 1, and the selection is class-dependent, so p(s = 1|−, x) = p(s = 1|−) = β. We can then write:

ps = p(+|x, s = 1) = p(+|x) / [ p(+|x) + β p(−|x) ] = p / ( p + β(1 − p) )   (2)

where p = p(+|x).

Figure: p and ps at different β.
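The warping in (2) is straightforward to evaluate; a short sketch, where the 1% posterior and the choice of β are illustrative assumptions:

```python
import numpy as np

def warped_posterior(p, beta):
    """Eq. (2): posterior seen by the classifier after undersampling,
    ps = p / (p + beta * (1 - p)), with beta = p(s=1|-)."""
    p = np.asarray(p, dtype=float)
    return p / (p + beta * (1.0 - p))

# A 1% posterior is pushed to 50% when beta equals the positive/negative ratio:
print(warped_posterior(0.01, beta=0.01 / 0.99))  # -> 0.5
```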
WARPING AND CLASS SEPARABILITY
(a) ps as a function of β. (b) Class distribution.
Figure: Class distribution and posterior probability as a function of β
for two univariate binary classification tasks with normal class-conditional
densities X− ∼ N(0, σ) and X+ ∼ N(µ, σ) (µ = 3 on the left, µ = 15 on the
right; σ = 3 in both examples). Note that p corresponds to β = 1 and ps to β < 1.
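The univariate tasks in the figure are easy to reproduce; a sketch under the stated densities, where the 1% class prior and β = 0.1 are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def true_posterior(x, mu, sigma=3.0, prior_pos=0.01):
    """p(+|x) for X- ~ N(0, sigma) and X+ ~ N(mu, sigma)."""
    num = prior_pos * norm.pdf(x, loc=mu, scale=sigma)
    den = num + (1.0 - prior_pos) * norm.pdf(x, loc=0.0, scale=sigma)
    return num / den

x = np.linspace(-10, 20, 200)
p = true_posterior(x, mu=3.0)       # unwarped posterior (beta = 1)
ps = p / (p + 0.1 * (1.0 - p))      # warped posterior for beta = 0.1
```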
RANKING ERROR
Let p̂ (resp. p̂s) denote the estimate of p (resp. ps).
Assume p1 < p2 and let ∆p = p2 − p1 > 0.
Let p̂1 = p1 + ε1 and p̂2 = p2 + ε2, with ε ∼ N(b, ν), where b
and ν are the bias and the variance of the estimator of p.
We have a wrong ranking if p̂1 > p̂2, and its probability is:

P(p̂2 < p̂1) = P(p2 + ε2 < p1 + ε1) = P(ε1 − ε2 > ∆p)

where ε1 − ε2 ∼ N(0, 2ν). Under the hypothesis of normality we have

P(ε1 − ε2 > ∆p) = 1 − Φ( ∆p / √(2ν) )   (3)

where Φ is the cumulative distribution function of the standard
normal distribution.
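Equation (3) can be evaluated directly; a sketch with scipy, where ∆p and ν are assumed inputs:

```python
from scipy.stats import norm

def ranking_error_prob(delta_p, nu):
    """Eq. (3): P(wrong ranking) = 1 - Phi(delta_p / sqrt(2 * nu))."""
    return 1.0 - norm.cdf(delta_p / (2.0 * nu) ** 0.5)

# Close posteriors plus a noisy estimator swap ranks often:
print(ranking_error_prob(delta_p=0.05, nu=0.01))  # ~0.36
```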
RANKING ERROR WITH UNDERSAMPLING
Let p̂s,1 = ps,1 + η1 and p̂s,2 = ps,2 + η2, where η ∼ N(bs, νs).
νs > ν, i.e. the variance is larger given the smaller number of
samples.
ps,1 < ps,2 and ∆ps = ps,2 − ps,1 > 0 because (2) is monotone.
The probability of a ranking error with undersampling is:

P(p̂s,2 < p̂s,1) = P(η1 − η2 > ∆ps)

and

P(η1 − η2 > ∆ps) = 1 − Φ( ∆ps / √(2νs) )   (4)
CONDITION FOR A BETTER RANKING WITH
UNDERSAMPLING
A classifier K has a better ranking with undersampling when

P(ε1 − ε2 > ∆p) > P(η1 − η2 > ∆ps)   (5)

or equivalently, from (3) and (4), when

1 − Φ( ∆p / √(2ν) ) > 1 − Φ( ∆ps / √(2νs) )  ⇔  Φ( ∆p / √(2ν) ) < Φ( ∆ps / √(2νs) )

Since Φ is monotone non-decreasing and νs > ν, and since for nearby
points ∆ps ≈ (dps/dp) ∆p, this holds when

dps/dp > √(νs/ν)   (6)

where dps/dp is the derivative of ps w.r.t. p:

dps/dp = β / ( p + β(1 − p) )²
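Condition (6) can then be checked pointwise; a sketch where the variance ratio νs/ν is an assumed input (in practice it must be estimated, as the next slide stresses):

```python
def undersampling_helps(p, beta, var_ratio):
    """Condition (6): dps/dp > sqrt(nu_s / nu), with
    dps/dp = beta / (p + beta * (1 - p))**2 and var_ratio = nu_s / nu."""
    dps_dp = beta / (p + beta * (1.0 - p)) ** 2
    return dps_dp > var_ratio ** 0.5

# Low-posterior points benefit more from strong undersampling:
print(undersampling_helps(p=0.01, beta=0.05, var_ratio=2.0))  # True
print(undersampling_helps(p=0.60, beta=0.05, var_ratio=2.0))  # False
```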
FACTORS INFLUENCING (6)
Whether inequality (6) holds depends on several terms:
The rate of undersampling β, which impacts the terms ps and νs.
The ratio of the variances νs/ν.
The posterior probability p of the testing point.
Condition (6) is hard to verify: β can be controlled by the
designer, but dps/dp and νs/ν vary over the input space.
This means that (6) does not necessarily hold for all the test
points.
UNIVARIATE SYNTHETIC DATASET
(a) Class-conditional distributions (thin lines) and the posterior
distribution of the minority class (thicker line). (b) dps/dp (solid
lines), √(νs/ν) (dotted lines).
Figure: Non-separable case. On the right we plot both terms of
inequality (6) (solid: left-hand term, dotted: right-hand term) for
β = 0.1 and β = 0.4.
BIVARIATE SYNTHETIC DATASET
(a) Synthetic dataset 1. (b) √(νs/ν) and dps/dp for different β.
Figure: Left: distribution of the testing set (scatter of X1 vs. X2,
colored by class), where the positive samples account for 5% of the
total. Right: plot of the dps/dp percentiles (25th, 50th and 75th)
and of √(νs/ν) (black dashed).
BIVARIATE SYNTHETIC DATASET II
(a) Undersampling with β = 0.053. (b) Undersampling with β = 0.323.
Figure: Regions where undersampling should work. Triangles
indicate the testing samples where condition (6) holds for the
dataset in Figure 5.
BIVARIATE SYNTHETIC DATASET III
Table: Classification task in Figure 5: ranking correlation between
the posterior probability p̂ (p̂s) and p for different values of β.
K (Ks) denotes the Kendall rank correlation without (with)
undersampling. The first (last) five lines refer to samples for which
condition (6) is (not) satisfied.
β K Ks Ks − K %points satisfying (6)
0.053 0.298 0.749 0.451 88.8
0.076 0.303 0.682 0.379 89.7
0.112 0.315 0.619 0.304 91.2
0.176 0.323 0.555 0.232 92.1
0.323 0.341 0.467 0.126 93.7
0.053 0.749 0.776 0.027 88.8
0.076 0.755 0.773 0.018 89.7
0.112 0.762 0.764 0.001 91.2
0.176 0.767 0.761 -0.007 92.1
0.323 0.768 0.748 -0.020 93.7
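K and Ks are Kendall rank correlations between the estimated and true posteriors; with scipy they can be computed as below (the arrays are illustrative placeholders, not the paper's data):

```python
import numpy as np
from scipy.stats import kendalltau

p_true  = np.array([0.01, 0.05, 0.20, 0.60, 0.90])  # true posteriors
p_hat   = np.array([0.02, 0.03, 0.25, 0.55, 0.80])  # estimates, no undersampling
p_hat_s = np.array([0.30, 0.45, 0.70, 0.85, 0.95])  # estimates after undersampling

K,  _ = kendalltau(p_true, p_hat)
Ks, _ = kendalltau(p_true, p_hat_s)
print(Ks - K)  # > 0 means undersampling improved the ranking
```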
REAL DATASETS
Table: Selected datasets from the UCI repository [1]¹
Datasets N N+ N− N+/N
ecoli 336 35 301 0.10
glass 214 17 197 0.08
letter-a 20000 789 19211 0.04
letter-vowel 20000 3878 16122 0.19
ism 11180 260 10920 0.02
letter 20000 789 19211 0.04
oil 937 41 896 0.04
page 5473 560 4913 0.10
pendigits 10992 1142 9850 0.10
PhosS 11411 613 10798 0.05
satimage 6430 625 5805 0.10
segment 2310 330 1980 0.14
boundary 3505 123 3382 0.04
estate 5322 636 4686 0.12
cam 18916 942 17974 0.05
compustat 13657 520 13137 0.04
covtype 38500 2747 35753 0.07
¹ Transformed datasets are available at
http://www.ulb.ac.be/di/map/adalpozz/imbalanced-datasets.zip
Figure: Difference between the Kendall rank correlations of p̂s and p̂
with p, namely Ks and K, for points where condition (6) is satisfied
and points where it is not. Ks and K are computed as the mean of the
correlations over all βs.
Figure: Ratio between the number of samples satisfying condition (6)
and all the instances available in each dataset, averaged over all
the βs.
SUMMARY
Undersampling has two major effects:
it increases the variance of the classifier;
it produces warped posterior probabilities.
Countermeasures:
averaging strategies (e.g. UnderBagging [9]);
calibration of the probabilities to the new priors of the
testing set [8] (see the sketch below).
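Applied to the class-dependent selection of this talk, the prior-correction procedure of [8] amounts to inverting eq. (2); a minimal sketch (hypothetical helper):

```python
def calibrate(ps, beta):
    """Invert eq. (2) to recover the posterior under the original priors:
    p = beta * ps / (beta * ps + 1 - ps)."""
    return beta * ps / (beta * ps + 1.0 - ps)

# Round trip: warping then calibrating returns the original posterior.
p, beta = 0.01, 0.05
ps = p / (p + beta * (1 - p))
print(calibrate(ps, beta))  # -> 0.01
```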
Despite the popularity of undersampling, it is not clear how
these two effects interact and when undersampling leads to
better accuracy in the classification task.
CONCLUSION
When (6) is satisfied, the posterior probability obtained
after undersampling returns a more accurate ordering.
Several factors influence (6) (e.g. β, the variance of the
classifier, class separability).
Practical use of (6) is not straightforward, since it requires
knowledge of p and νs/ν (not easy to estimate).
This result warns against a naive use of undersampling
in unbalanced tasks.
We suggest the adoption of adaptive selection techniques
(e.g. racing [4]) to perform a case-by-case use of
undersampling.
Code: https://github.com/dalpozz/warping
Website: www.ulb.ac.be/di/map/adalpozz
Email: adalpozz@ulb.ac.be
Thank you for your attention
Research is supported by the Doctiris scholarship
funded by Innoviris, Brussels, Belgium.
BIBLIOGRAPHY
[1] A. Asuncion and D. Newman.
UCI Machine Learning Repository, 2007.
[2] G. E. Batista, R. C. Prati, and M. C. Monard.
A study of the behavior of several methods for balancing machine learning training data.
ACM SIGKDD Explorations Newsletter, 6(1):20–29, 2004.
[3] A. Dal Pozzolo, O. Caelen, Y.-A. Le Borgne, S. Waterschoot, and G. Bontempi.
Learned lessons in credit card fraud detection from a practitioner perspective.
Expert Systems with Applications, 41(10):4915–4928, 2014.
[4] A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi.
Racing for unbalanced methods selection.
In Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning.
IDEAL, 2013.
[5] C. Elkan.
The foundations of cost-sensitive learning.
In International joint conference on artificial intelligence, volume 17, pages 973–978. Citeseer, 2001.
[6] A. Estabrooks, T. Jo, and N. Japkowicz.
A multiple resampling method for learning from imbalanced data sets.
Computational Intelligence, 20(1):18–36, 2004.
[7] N. Japkowicz and S. Stephen.
The class imbalance problem: A systematic study.
Intelligent data analysis, 6(5):429–449, 2002.
[8] M. Saerens, P. Latinne, and C. Decaestecker.
Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure.
Neural computation, 14(1):21–41, 2002.
[9] S. Wang, K. Tang, and X. Yao.
Diversity exploration and negative correlation learning on imbalanced data sets.
In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 3259–3266. IEEE, 2009.
[10] G. M. Weiss and F. Provost.
The effect of class distribution on classifier learning: an empirical study.
Technical report, Rutgers University, 2001.