Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models
Baljinder Ghotra, Ahmed E. Hassan, Shane McIntosh
Quality assurance teams have
limited resources
Personnel Schedules
2
Executing all test suites
takes too long
3
Often release several times
in one day!
Defect models can help QA teams to
allocate limited resources effectively
4
Defect prediction
model
Defect models are trained using historical
data to predict the defect-prone modules
5
[Diagram: historical change records listing the reason for change, the changed modules (a, b, c, and a new module), and the developer responsible]
Defect prediction
model
Defect models are trained using historical
data to predict the defect-prone modules
6
[Diagram: the model labels modules a and b as low risk and module c as high risk]
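The idea of training on historical data and then flagging risky modules can be sketched in a few lines. This is a minimal illustration, not the models used in the study: it assumes a plain logistic regression fitted by gradient descent, and the two features (scaled churn and scaled developer count), the data values, and the 0.5 risk threshold are all hypothetical.

```python
import math

def predict_proba(weights, bias, x):
    """Logistic model: probability that a module with features x is defective."""
    z = bias + sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical history: [scaled churn, scaled developer count] per module,
# with 1 meaning a defect was later reported in that module.
history = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.7], [0.9, 0.9], [0.7, 0.8]]
was_defective = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(history, was_defective)

# Hypothetical new changes to modules a, b, and c.
new_modules = {"a": [0.15, 0.2], "b": [0.25, 0.2], "c": [0.85, 0.8]}
risk = {name: "high" if predict_proba(weights, bias, x) >= 0.5 else "low"
        for name, x in new_modules.items()}
print(risk)
```

A QA team could then spend its limited review and testing effort on the modules labelled high risk first.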
Defect models are trained using
various techniques
7
Simple
techniques
Advanced
techniques
Decision
Trees
Logistic
Regression
+
Logistic
Model Trees
(LMT)
Most classification techniques produce
models that achieve similar performance?
8
Decision Trees Logistic Model Trees
(LMT)
+
The performance of 17 of the 22
studied techniques is
indistinguishable
Benchmarking classification
models for software defect
prediction
S. Lessmann, B. Baesens,
C. Mues, S. Pietsch
[TSE 2008]
Limitations of the prior work
9
Overlapping
statistical ranks
Noisy
data
Limited
scope
Do most techniques produce models
with similar performance, when we use:
10
Non-overlapping
statistical ranks
Clean
data
Expanded
scope
Overlapping
statistical ranks
Noisy
data
Limited
scope
Do most techniques produce models
with similar performance, when we use:
11
Non-overlapping
statistical ranks
Expanded
scope
Clean
data
Our approach to study the impact of
classification techniques on defect models
13
Train and test models using different techniques
Performance scores for each technique (a, b, …, z)
Rank techniques using statistical clustering
Rank 1: z, …
Rank 2: a, b, …
Rank 3: …
Repeat 100 times
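The "performance scores for each technique" step can be illustrated with AUC, the measure plotted on the next slide. The sketch below is an assumption-laden stand-in rather than the study's pipeline: the two "techniques" are fixed, hypothetical scoring functions (no training step), the module data is invented, and each of the 100 repetitions simply scores a bootstrap sample.

```python
import random

def auc(scores, labels):
    """Probability that a random defective module outscores a random clean one.

    Ties count as half a win (Mann-Whitney formulation of AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one module of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(techniques, modules, repetitions=100, seed=0):
    """Collect one AUC value per technique per repetition on bootstrap samples."""
    rng = random.Random(seed)
    results = {name: [] for name in techniques}
    for _ in range(repetitions):
        while True:
            sample = [rng.choice(modules) for _ in modules]
            labels = [m["defective"] for m in sample]
            if len(set(labels)) == 2:  # AUC is undefined without both classes
                break
        for name, score in techniques.items():
            results[name].append(auc([score(m) for m in sample], labels))
    return results

# Hypothetical modules: lines of code and whether a defect was reported.
modules = [
    {"loc": 50, "defective": 0}, {"loc": 80, "defective": 0},
    {"loc": 120, "defective": 0}, {"loc": 200, "defective": 1},
    {"loc": 260, "defective": 0}, {"loc": 310, "defective": 1},
    {"loc": 400, "defective": 1}, {"loc": 520, "defective": 0},
    {"loc": 700, "defective": 1}, {"loc": 900, "defective": 1},
]

# Hypothetical stand-ins for trained models: each maps a module to a risk score.
techniques = {
    "size_only": lambda m: m["loc"],  # bigger modules are riskier
    "constant": lambda m: 0.0,        # no information at all
}

results = evaluate(techniques, modules)
mean_auc = {name: sum(v) / len(v) for name, v in results.items()}
print(mean_auc)
```

The per-technique lists in `results` are the kind of input a statistical clustering step would then turn into ranks.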
Unfortunately, some projects yield
poorer results than others
14
[Figure: AUC (0.5 to 0.9) per project: CM1, JM1, KC1, KC3, KC4, MW1, PC1, PC2, PC3, PC4]
Performance values rarely overlap!
Non-overlapping ranks using a
double Scott-Knott test
15
Project 1, Project 2, ..., Project M
Input per project: 10 mean AUC values for each technique (Technique 1, Technique 2, ..., Technique N)
Scott-Knott test (1st run), applied to each project separately

Project 1
Rank 1: T3, T7, T8
Rank 2: T2, T10
Rank 3: T1, T4, T6
Rank 4: T5, T9

Project 2
Rank 1: T2, T5, T7
Rank 2: T1, T10
Rank 3: T3, T4, T6
Rank 4: T8, T9

Project M
...
Non-overlapping ranks using a
double Scott-Knott test
16
Scott-Knott test (1st run), 10x
Rank 1: T2, T5, T7
Rank 2: T1, T10
Rank 3: T3, T4, T6
Rank 4: T8, T9

Scott-Knott test (1st run), 10x
Rank 1: T3, T7, T8
Rank 2: T2, T10
Rank 3: T1, T4, T6
Rank 4: T5, T9

Scott-Knott test (2nd run)
Rank 1: T2, T5
Rank 2: T1, T7, T10
Rank 3: T3, T4, T6
Rank 4: T8, T9
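The two-stage ranking can be sketched as follows. This is a simplified illustration only. The partition step (choose the split of the sorted means that maximises the between-group sum of squares) follows the Scott-Knott idea, but the statistical significance test is replaced here by a hypothetical fixed `min_gap` threshold on the difference between group means, so the output is not a real Scott-Knott result. It also assumes the second run takes the per-project ranks from the first run as input, as the diagram suggests. All AUC values are invented.

```python
from statistics import mean

def best_split(sorted_means):
    """Index of the two-way split that maximises the between-group sum of squares."""
    grand = mean(sorted_means)
    best_i, best_b = None, -1.0
    for i in range(1, len(sorted_means)):
        left, right = sorted_means[:i], sorted_means[i:]
        b = (len(left) * (mean(left) - grand) ** 2
             + len(right) * (mean(right) - grand) ** 2)
        if b > best_b:
            best_i, best_b = i, b
    return best_i

def cluster(items, min_gap):
    """Recursively partition (name, mean) pairs that are already sorted best-first."""
    if len(items) < 2:
        return [items]
    i = best_split([v for _, v in items])
    left, right = items[:i], items[i:]
    # ASSUMPTION: stand-in for the significance test of the real procedure.
    if abs(mean(v for _, v in left) - mean(v for _, v in right)) < min_gap:
        return [items]
    return cluster(left, min_gap) + cluster(right, min_gap)

def rank(scores, min_gap, higher_is_better=True):
    """Map each technique to a rank (1 is best) based on its mean score."""
    items = sorted(((name, mean(vals)) for name, vals in scores.items()),
                   key=lambda t: t[1], reverse=higher_is_better)
    groups = cluster(items, min_gap)
    return {name: r for r, group in enumerate(groups, start=1) for name, _ in group}

# Hypothetical AUC values per project and technique.
project_auc = {
    "P1": {"T1": [0.80, 0.82], "T2": [0.79, 0.81], "T3": [0.60, 0.62]},
    "P2": {"T1": [0.70, 0.72], "T2": [0.71, 0.69], "T3": [0.55, 0.57]},
}

# First run: one ranking per project, computed on AUC values.
first_run = {p: rank(aucs, min_gap=0.05) for p, aucs in project_auc.items()}

# Second run: cluster the per-project ranks (lower rank is better).
ranks_per_technique = {t: [first_run[p][t] for p in project_auc]
                       for t in project_auc["P1"]}
final_ranks = rank(ranks_per_technique, min_gap=0.5, higher_is_better=False)
print(final_ranks)
```

Feeding ranks rather than raw AUC values into the second run means a project with uniformly poor AUC values does not drag down every technique's standing.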
17
Non-overlapping test:
Most techniques have similar performance
Rank 1: Ad+NB, EM, RBFs, …
Rank 2: Rsub+SMO, J48, …
Similar to the prior work, techniques are grouped into 2 distinct ranks
Do most techniques produce models
with similar performance, when we use:
18
Non-overlapping
statistical ranks
Expanded
scope
Clean
data
Yes, techniques
are grouped into
2 distinct ranks
Clean NASA dataset:
Cleaning criteria of prior work
20
Data Quality: Some Comments on the
NASA Software Defect Datasets
M. Shepperd, Q. Song, Z. Sun, C. Mair
[TSE 2013]
Identical cases
Missing values
Constraint violations
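The three cleaning criteria can be illustrated with a small filter. This is a sketch under stated assumptions, not the cleaning procedure of the cited work: the field names and the single integrity rule (comment lines cannot exceed total lines) are hypothetical examples of what a constraint might look like.

```python
def clean(rows, constraints):
    """Drop identical cases, rows with missing values, and constraint violations."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # identical case: an exact copy was already kept
        if any(value is None for value in row.values()):
            continue  # missing value
        if not all(check(row) for check in constraints):
            continue  # constraint violation
        seen.add(key)
        kept.append(row)
    return kept

# Hypothetical integrity rule: comment lines cannot exceed total lines.
constraints = [lambda r: r["loc_comment"] <= r["loc_total"]]

# Hypothetical module records.
rows = [
    {"loc_total": 100, "loc_comment": 10, "defective": 0},
    {"loc_total": 100, "loc_comment": 10, "defective": 0},   # identical case
    {"loc_total": 50, "loc_comment": None, "defective": 1},  # missing value
    {"loc_total": 20, "loc_comment": 30, "defective": 0},    # violates the rule
    {"loc_total": 200, "loc_comment": 40, "defective": 1},
]

cleaned = clean(rows, constraints)
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

Missing values are checked before the constraints so that a rule never has to compare against `None`.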
Clean NASA dataset:
Many distinct ranks of techniques
21
Rank 1: LMT, SL, …
Rank 2: KNN, RBFs, …
Rank 3: J48, K-means, …
Rank 4: SMO, Ridor, …
Unlike the prior work, techniques are grouped into 4 distinct ranks
Top performers are LMT and logistic regression
Do most techniques produce models
with similar performance, when we use:
22
Non-overlapping
statistical ranks
Expanded
scope
Clean
data
Yes, techniques
are grouped into
2 distinct ranks
No, unlike the prior work, techniques are grouped into 4 distinct ranks
Another dataset:
The PROMISE corpus
24
Another dataset:
Four significant ranks of techniques
25
Rank 1: LMT, SL, …
Rank 2: KNN, RBFs, …
Rank 3: J48, K-means, …
Rank 4: SMO, Ridor, …
Unlike the prior work, techniques are grouped into 4 distinct ranks
Top performers are LMT and logistic regression
Do most techniques produce models
with similar performance, when we use:
26
Non-overlapping
statistical ranks
Expanded
scope
Clean
data
No, similar to the
clean data study,
techniques are
grouped into 4
distinct ranks
Yes, techniques
are grouped into
2 distinct ranks
No, unlike the prior work, techniques are grouped into 4 distinct ranks
Classification technique
matters!
27
Decision Trees Logistic Model Trees
(LMT)
+
Low-cost suggestion:
Experiment with the available techniques
28
6,618 packages
are available
on CRAN
148 packages are available in package explorer
shanemcintosh@acm.org
