Performance Analysis of Machine Learning Approaches in Software Complexity Prediction, by Sayed Mohsin Reza, at the TCCE 2020 Conference

2nd International Conference on Trends in Computational and Cognitive Engineering (TCCE)
Performance Analysis of Machine Learning Approaches in Software Complexity Prediction
Sayed Reza¹, Mahfujur Rahman², Hasnat Parvez³, Omar Badreddin¹, and Shamim Al Mamun³
¹ University of Texas, ² Daffodil International University, and ³ Jahangirnagar University
Paper ID - 410

Introduction
• Software complexity is an undesirable characteristic of software
• Increasing complexity reduces maintainability and sustainability
• Class-level complexity
• Method-level complexity
• Complexity can be affected by many factors related to code structure, object-oriented properties, and source code metrics
• Machine learning techniques can automate complexity detection, replacing manual inspection and hand-written code rules for detecting class complexity

Research Objectives
• Use machine learning techniques to build complexity classifiers
  • Machine learning is used to replace the manual process and hand-written code rules for detecting class complexity
• Compare the performance of the ML classifiers
• Report the best technique based on performance metrics

Motivation
• Early detection of software complexity enables better software maintenance
• Effective software maintenance leads to better quality over time
• High-quality software, in turn, helps to:
  • Enhance future software maintainability
  • Ensure sustainable software over time
  • Minimize software development effort over time
  • Reduce software development costs

Research Questions & Study Design
• RQ1: How are source code metrics correlated with the quality attribute class complexity?
  • This question reveals the relationships between complexity and source code metrics
• RQ2: How accurately can machine learning approaches predict class complexity from source code metrics?
  • This question aims to determine the accuracy of machine learning approaches in class-level complexity detection
Figure: Study Design. Dataset Collection → Dataset Preparation → Correlation Analysis (RQ1) → Training → Performance Evaluation (RQ2) → Report Best Technique

Dataset Collection
• A dataset for complexity prediction needs a diverse set of repositories
• We searched code repositories using the ModelMine tool [1] with the following criteria:
  • a repository whose primary language is Java
  • a minimum of 5000 commits (proxy for maintenance)
  • at least 100 active contributors
  • a minimum of 3000 stars and 500 forks (proxy for popularity)
• In total, 10 repositories containing 38,778 classes were selected
[1] Sayed Mohsin Reza, Omar Badreddin, and Khandoker Rahad. ModelMine: A tool to facilitate mining models from open-source repositories. In 2020 ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems (MODELS). ACM, 2020.
Figure: Class distribution among repositories
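
The selection criteria on this slide can be expressed as a simple filter. The following is a minimal sketch, assuming the repository metadata has already been retrieved; the `Repo` record, its field names, and the sample values are hypothetical and do not reflect ModelMine's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical metadata record; field names are assumptions for illustration.
    name: str
    language: str
    commits: int
    contributors: int
    stars: int
    forks: int


def meets_criteria(repo: Repo) -> bool:
    """Apply the selection criteria listed on the slide."""
    return (
        repo.language == "Java"
        and repo.commits >= 5000       # proxy for maintenance
        and repo.contributors >= 100   # active contributors
        and repo.stars >= 3000         # proxy for popularity
        and repo.forks >= 500
    )


# Made-up sample records, not the repositories used in the study.
candidates = [
    Repo("example/alpha", "Java", 12000, 350, 9000, 2100),
    Repo("example/beta", "Java", 4200, 80, 15000, 3000),
    Repo("example/gamma", "Python", 20000, 900, 40000, 8000),
]
selected = [r.name for r in candidates if meets_criteria(r)]
print(selected)
```

Only the first sample record passes: the second falls short on commits and contributors, and the third is not a Java repository.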

Dataset Collection (Continued)
• Input variables: 18 unique source code metrics, extracted from each class in the code repositories using a static analysis tool
• Target variable: the current complexity of each class in the code repositories, extracted using the CODEMR tool [2]
• The variables are then joined on the class name to create a dataset for the complexity classifier
[2] Asma Shaheen, Usman Qamar, Aiman Nazir, Raheela Bibi, Munazza Ansar, and Iqra Zafar. OOCQM: Object oriented code quality meter. In International Conference on Computational Science/Intelligence & Applied Informatics, pages 149–163. Springer, 2019.
Table: Source Code Metrics
… … …
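
Combining the input and target variables by class name amounts to an inner join. The sketch below assumes each tool's output has been loaded into a dictionary keyed by class name; the dictionaries, metric values, and the "low"/"high" labels are made up for illustration and are not the study's data.

```python
# Hypothetical per-class outputs of the two tools (values are made up).
metrics_by_class = {
    "com.example.Foo": {"WMC": 12, "RFC": 30, "CBO": 4},
    "com.example.Bar": {"WMC": 55, "RFC": 120, "CBO": 19},
    "com.example.Baz": {"WMC": 7, "RFC": 15, "CBO": 2},
}
complexity_by_class = {
    "com.example.Foo": "low",
    "com.example.Bar": "high",
    # com.example.Baz has no label and is dropped by the join.
}


def join_on_class_name(metrics, labels):
    """Inner join: keep only classes present in both sources."""
    rows = []
    for class_name in sorted(metrics.keys() & labels.keys()):
        row = {"class": class_name, **metrics[class_name]}
        row["complexity"] = labels[class_name]
        rows.append(row)
    return rows


dataset = join_on_class_name(metrics_by_class, complexity_by_class)
for row in dataset:
    print(row)
```

The example keys on fully qualified names so that classes sharing a simple name in different packages do not collide; the slides do not state which form of the class name was used.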

Dataset Preparation
• Remove duplicate observations
• Identify outliers and remove the biased data points
• Perform exploratory data analysis and visualize the input and target variables
• Split the data into training (80%) and testing (20%) sets
Figure: Relationship of some input variables with the target variable
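
The preparation steps can be sketched as three small functions. This is an illustration under stated assumptions: the slides do not say which outlier rule was used, so the interquartile-range (IQR) rule here is a stand-in, and the helper names, seed, and sample rows are hypothetical.

```python
import random
import statistics


def drop_duplicates(rows):
    """Remove exact duplicate observations, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_outliers(rows, column, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] for one column (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[column] <= high]


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then hold out the first test_fraction of rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows: WMC values 1..10, one duplicate, one extreme value.
rows = [{"WMC": v, "complexity": "low"} for v in range(1, 11)]
rows += [{"WMC": 5, "complexity": "low"}, {"WMC": 500, "complexity": "high"}]

clean = drop_outliers(drop_duplicates(rows), "WMC")
train, test = split_train_test(clean)
print(len(rows), len(clean), len(train), len(test))
```

Fixing the shuffle seed keeps the 80/20 split reproducible between runs, which matters when several classifiers are compared on the same held-out data.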

Correlation Results
• RQ1: How are source code metrics correlated with the quality attribute class complexity?
• The Pearson correlation results reveal the impact of source code metrics on complexity.
• The source code metrics DIT, SRFC, RFC, WMC, CMLOC, and CBO *** have a moderately high impact on complexity
Figure: Correlation between source code metrics and complexity
*** DIT = Depth of Inheritance Tree, RFC = Response for a Class, CMLOC = Class-Method Lines of Code, CBO = Coupling Between Objects
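
The Pearson coefficient used for RQ1 is the covariance of two variables divided by the product of their standard deviations. A minimal sketch follows; the metric values are made up, and encoding complexity as an ordinal number is an assumption for illustration, since the slides do not describe how the target was encoded.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values; complexity is encoded as an ordinal number for illustration.
wmc = [3, 8, 15, 22, 40, 61]
complexity = [0, 0, 1, 1, 2, 3]
r = pearson(wmc, complexity)
print(f"r(WMC, complexity) = {r:.3f}")
```

Values of r near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. A high coefficient shows association between a metric and complexity, not that the metric causes it.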

Training & Testing
• In training, we chose five machine learning techniques to classify complexity:
1. Naive Bayes (NB)
2. Logistic Regression (LR)
3. Decision Tree (DT)
4. Random Forest (RF)
5. AdaBoost (AB)
• These are well-known classifiers in machine learning and have been used in several similar studies [3,4]
• We perform 10-fold cross-validation to reduce the variability of the performance results
[3] Istehad Chowdhury and Mohammad Zulkernine. Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. Journal of Systems Architecture, 57(3):294–313, 2011.
[4] Yun Zhang, David Lo, Xin Xia, Bowen Xu, Jianling Sun, and Shanping Li. Combining software metrics and text features for vulnerable file prediction. In 2015 20th International Conference on Engineering of Complex Computer Systems (ICECCS), pages 40–49. IEEE, 2015.
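
The 10-fold cross-validation step can be illustrated without any ML library. In this sketch the data is shuffled once, split into ten folds, and each fold serves as the test set exactly once. The majority-class "model" is only a placeholder standing in for the five classifiers, and the labels are made up.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def majority_class_accuracy(labels, train_idx, test_idx):
    """Placeholder model: predict the most common training label."""
    train_labels = [labels[j] for j in train_idx]
    prediction = max(set(train_labels), key=train_labels.count)
    return sum(labels[j] == prediction for j in test_idx) / len(test_idx)


# Made-up labels: 70 low-complexity and 30 high-complexity classes.
labels = ["low"] * 70 + ["high"] * 30
scores = [
    majority_class_accuracy(labels, tr, te)
    for tr, te in k_fold_indices(len(labels), k=10)
]
print(f"mean accuracy = {statistics.mean(scores):.2f}")
print(f"std deviation = {statistics.stdev(scores):.2f}")
```

Reporting the mean over the ten folds, together with the spread, is what makes the result less sensitive to any single train/test split.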

Performance Evaluation
• RQ2: How accurately can machine learning approaches predict class complexity from source code metrics?
• The Decision Tree and Random Forest classifiers have the highest accuracy and precision of all the classifiers.
• Random Forest has the highest recall and F1 score
• Is that enough to declare the best technique?
Figure: Relative performance of ML classifiers
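
The four metrics compared on this slide all derive from the counts of true/false positives and negatives. The sketch below assumes a binary setting with "high" complexity as the positive class; the slides do not state how many complexity levels were used, and the predictions are made up.

```python
def classification_metrics(y_true, y_pred, positive="high"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up predictions for ten classes.
y_true = ["high", "high", "high", "high", "low", "low", "low", "low", "low", "low"]
y_pred = ["high", "high", "high", "low", "high", "low", "low", "low", "low", "low"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts every correct prediction equally, while precision and recall focus on the positive class, which is why a classifier can lead on one metric and trail on another.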

Performance Evaluation (Continued)
• We focus on the false negative (FN) rate to reduce the risk of missed detections
  • Higher FN rate -> many highly complex classes are predicted as Low [Very Risky Model]
  • Lower FN rate -> few highly complex classes are predicted as Low [Less Risky Model]
• Random Forest (RF) still shows a lower FN rate than the other classifiers
• We attribute this to RF's use of bootstrap resampling and its focus on the most significant features, which improves prediction.
Figure: Relative FN rate of ML classifiers
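
The FN rate is the share of truly high-complexity classes that a model labels as low: FN / (FN + TP). The sketch below compares two hypothetical classifiers on made-up predictions; treating "high" as the positive class is an assumption.

```python
def false_negative_rate(y_true, y_pred, positive="high"):
    """FN / (FN + TP): share of truly high-complexity classes that were missed."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        return 0.0
    missed = sum(p != positive for p in positives)
    return missed / len(positives)


# Made-up predictions from two hypothetical classifiers on the same classes.
y_true  = ["high", "high", "high", "high", "low", "low", "low", "low"]
model_a = ["high", "high", "high", "low", "low", "low", "high", "low"]
model_b = ["high", "low", "low", "low", "low", "low", "low", "low"]

fnr_a = false_negative_rate(y_true, model_a)
fnr_b = false_negative_rate(y_true, model_b)
print(f"model A: FN rate = {fnr_a:.2f}")
print(f"model B: FN rate = {fnr_b:.2f}")
```

Model B never raises a false alarm on a low-complexity class, yet it misses three of the four high-complexity classes, which is the risky behavior the FN rate is meant to expose.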

Conclusion
• Problem in quality management: it is necessary to take proper action before classes become more complex
• Research Objective & Results
  • We compare the performance of machine learning techniques in predicting class complexity
  • Our results show that the Random Forest model performs better than the other models
  • We also identify the source code metrics that have the most impact on class complexity
• Industrial Usage: automatic ML-based prediction of code quality will allow quality managers and practitioners to take preventive action against highly complex classes
• Long-term Outcome: ensure sustainable software, minimize software development effort, and reduce software development costs over time
If you have any questions, email me at sreza3@miners.utep.edu
