The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

chakkrit.tantithamthavorn@monash.edu  @klainfo  http://chakkrit.com
Communicated by Sunghun Kim
Chakkrit (Kla) Tantithamthavorn, Ahmed Hassan, Kenichi Matsumoto

DEFECT MODELS IN A NUTSHELL
An analytical model trained on historical data to predict and explain future software defects
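[Figure: a defect dataset (FILE, METRICS, and CLASS columns; files A.java to D.java labelled BUG or CLEAN) is used to train analytical models]
Defect models are used to:
• Predict future software defects
• Explain which factors are associated with defect-proneness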
Lewis et al., ICSE’13
Mockus et al., BLTJ’00
Ostrand et al., TSE’05
Kim et al., FSE’15
Zimmermann et al., FSE’09
Nagappan et al., ICSE’06
Caglayan et al., ICSE’15
Tan et al., ICSE’15
Shimagaki et al., ICSE’16

DEFECT DATASETS ARE IMBALANCED!
Defective and clean modules are not equally represented
Traditional classification techniques often fail to accurately identify the minority class (i.e., defective modules)

HOW IMBALANCED ARE DEFECT DATASETS?
A histogram of the defective ratios of the 101 defect datasets
We assess 101 publicly-available defect datasets
• 76 from PROMISE
• 12 from NASA
• 5 from Kim et al.
• 5 from D’Ambros et al.
• 3 from Zimmermann et al.

[Figure: histogram of the defective ratios of the 101 defect datasets; axis labels: Defective Ratio, Percentage]
64% of the defect datasets have a defective ratio below 30%
Only 8% of the defect datasets have a defective ratio between 45% and 55%

Class imbalance is prominent in defect datasets, likely affecting the
performance and interpretation of defect models
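The defective ratio discussed on this slide is the percentage of modules in a dataset that are labelled defective. A minimal sketch, assuming each module carries a "BUG" or "CLEAN" label as in the earlier example. The function names and the use of the 45%-55% band as a test for "roughly balanced" are illustrative assumptions, not the authors' code:

```python
def defective_ratio(labels):
    """Return the percentage of modules labelled "BUG" (the defective ratio)."""
    if not labels:
        raise ValueError("dataset has no modules")
    defective = sum(1 for label in labels if label == "BUG")
    return 100.0 * defective / len(labels)


def is_roughly_balanced(labels, low=45.0, high=55.0):
    """Assumed convention: balanced means a defective ratio within [low, high]."""
    return low <= defective_ratio(labels) <= high


# Hypothetical dataset: 1 defective module out of 4.
labels = ["BUG", "CLEAN", "CLEAN", "CLEAN"]
print(defective_ratio(labels))      # 25.0
print(is_roughly_balanced(labels))  # False
```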

TO MITIGATE THE RISK OF CLASS IMBALANCE
Class rebalancing techniques (i.e., techniques for rebalancing the proportion of defective and clean
modules of the training corpus) are often applied
[Figure: original vs. re-sampled datasets, showing the majority class, the minority class, and the synthetic minority class]
• Over-Sampling Technique
• Under-Sampling Technique
• SMOTE Technique
• ROSE Technique
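Three of the techniques named on this slide can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: rows are assumed to be numeric tuples, the majority class is assumed to be larger than the minority class, the minority class is assumed to have at least two rows, and `smote_like` only captures the interpolation idea behind SMOTE. ROSE is not covered here. All function names are hypothetical.

```python
import random


def under_sample(majority, minority, rng):
    """Randomly keep only as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def over_sample(majority, minority, rng):
    """Randomly duplicate minority rows until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_like(majority, minority, rng, k=2):
    """Add synthetic minority rows on the line segment between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return list(majority), list(minority) + synthetic


# Hypothetical data: 8 clean modules and 3 defective modules with 2 metrics each.
rng = random.Random(0)
clean = [(float(i), float(i)) for i in range(8)]
buggy = [(10.0, 10.0), (11.0, 12.0), (12.0, 11.0)]
for technique in (under_sample, over_sample, smote_like):
    maj, mino = technique(clean, buggy, rng)
    print(technique.__name__, len(maj), len(mino))
```

Only the training data should be re-sampled; the slide defines these techniques as rebalancing the training corpus.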

SHOULD WE REBALANCE OR NOT?
Prior studies arrive at contradictory conclusions, which makes it hard to derive practical guidelines
• Improve the F-measure by 7.8%-22.4% [Kamei et al.]: 4 classification techniques, 2 datasets, 3 measures
• Do not improve the percentage of correctly classified modules (i.e., Accuracy) [Riquelme et al.]: 2 classification techniques, 4 datasets, 2 measures
• Are not harmful when defective ratio > 20% [Mahmood et al.]: a meta-analysis of 42 primary defect prediction studies

Class rebalancing techniques may lead to bias in the learned concepts (i.e., concept drift)
B. Turhan, “On the dataset shift problem in software engineering prediction models,” EMSE’11.
[Figure: World → Data → Model → Knowledge → Decision/Policy Making]

• Data is not representative of the world
• The learned model may be biased
• Different knowledge
• Incorrect action plans

WHAT IS THE IMPACT OF CLASS REBALANCING TECHNIQUES?
TYPES OF ANALYSIS
• PERFORMANCE
• INTERPRETATION

CASE STUDY SETUP
Study            #Classification techniques  #Datasets  Measures
Kamei et al.     4                           2          P, R, and F1
Riquelme et al.  2                           4          AUC
Wang et al.      2                           5          PD, PF, Balance, G-mean, AUC
Tan et al.       7                           7          P, R, and F1
Agrawal et al.   6                           9          P, R, PF, AUC
Bennin et al.    5                           40         P, R, AUC, Balance, G-mean
Our study        7                           101        10 performance measures
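Most measures in the table are computed from the confusion matrix at a fixed classification threshold. A minimal sketch using commonly cited definitions: P is precision, R is recall (also called PD), PF is the false alarm rate, G-mean is taken here as the geometric mean of recall and 1 - PF, and Balance is the normalized distance from the ideal point PF = 0, PD = 1. Definitions of G-mean and Balance vary between papers, so treat these as assumptions rather than the exact formulas used in each study. The function name and the example counts are hypothetical.

```python
import math


def performance_measures(tp, fp, tn, fn):
    """Compute threshold-dependent measures from confusion-matrix counts.

    Assumes tp > 0 and that both classes are present, so that no
    denominator below is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # also called PD (probability of detection)
    pf = fp / (fp + tn)       # probability of false alarm
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * (1 - pf))
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return {
        "precision": precision,
        "recall": recall,
        "pf": pf,
        "f1": f1,
        "g_mean": g_mean,
        "balance": balance,
    }


# Hypothetical counts: 30 true positives, 10 false positives,
# 50 true negatives, 10 false negatives.
for name, value in performance_measures(tp=30, fp=10, tn=50, fn=10).items():
    print(f"{name}: {value:.3f}")
```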

PERFORMANCE
Class rebalancing techniques:
- Have little impact on AUC
- Improve Recall
- Decrease Precision
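Unlike Precision and Recall, AUC is computed from how a model ranks modules rather than from a single classification threshold. A minimal sketch of the pairwise definition, assuming labels are 1 for defective and 0 for clean; the function name and the example scores are hypothetical:

```python
def auc(scores, labels):
    """Probability that a randomly chosen defective module (label 1) receives
    a higher score than a randomly chosen clean module (label 0).
    Ties count as half a win."""
    defective = [s for s, y in zip(scores, labels) if y == 1]
    clean = [s for s, y in zip(scores, labels) if y == 0]
    if not defective or not clean:
        raise ValueError("both classes are required")
    wins = 0.0
    for d in defective:
        for c in clean:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(defective) * len(clean))


# Hypothetical predicted scores: both defective modules outrank both clean ones.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

Because only the ordering of the scores matters, any strictly increasing rescaling of the scores leaves this value unchanged.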

Unfortunately, class rebalancing techniques have a large impact on the model interpretation


WHICH EXPERIMENTAL SETTINGS YIELD THE BEST BENEFITS?
Performance ~ Defective Ratio + Classification Techniques + Class Rebalancing Techniques + Metrics Family + The Risk of Overfitting (Events Per Variable, EPV)
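EPV is commonly computed as the number of events (occurrences of the least frequent class, which in defect datasets is usually the defective class) divided by the number of independent variables (metrics). A minimal sketch under that assumed definition; the function name and the example dataset are hypothetical:

```python
from collections import Counter


def events_per_variable(labels, n_variables):
    """EPV = occurrences of the least frequent class / number of variables."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("both classes must be present")
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    events = min(counts.values())
    return events / n_variables


# Hypothetical dataset: 20 defective and 80 clean modules, 10 metrics.
labels = ["BUG"] * 20 + ["CLEAN"] * 80
print(events_per_variable(labels, n_variables=10))  # 2.0
```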


Logistic regression models with under-sampling, applied to defect datasets with an EPV ratio higher than 40

Neural network is the most sensitive classification technique to class rebalancing techniques, while Naive Bayes is the least sensitive

SMOTETUNED by  
[Agrawal and Menzies, ICSE'18]
Similarly, the SMOTE parameter must
be optimized to improve AUC. Works
best with NNet, GBM, RF, and C5.0
SMOTETUNED still has a large
impact on the model interpretation

TAKE AWAY
For predictions
- Use optimised SMOTE for AUC
- Use under-sampling for Recall
For interpretations
- Don’t apply anything!!!!
Dr. Chakkrit (Kla) Tantithamthavorn
