Performance Comparison of Binary Machine
Learning Classifiers in Identifying Code
Comment Types: An Exploratory Study
Amila Indika, Peter Y. Washington, Anthony Peruma
Overview
As part of the NLBSE tool competition, we compare
the performance of 19 binary machine learning
classifiers for code comment categories that
belong to three different programming languages.
Introduction
• There has been significant interest in leveraging Artificial Intelligence (AI) and Machine Learning (ML) techniques to enhance various aspects of software engineering
• Traditional software engineering activities often rely on manual
and rule-based approaches
• which can be time-consuming, error-prone, and limited in handling
complex software systems
• Prior work has shown the effectiveness of AI/ML techniques in automating many software engineering activities
• Code generation and refactoring recommendations, defect detection,
code comprehension and documentation, etc.
Code Comment Classification
• Code comments are written in natural language and can provide
information that may not be immediately obvious from the
source code alone.
• Grouping code comments into related categories automatically
can help developers locate pertinent information faster
• Prior work by Rani et al. and the NLBSE competition baseline utilized Random Forest as the optimal classifier
• However, there are many other types of classifiers, each with their own
set of hyperparameters
• Can we further optimize the Random Forest Classifier?
• Are there other ML models that are better than Random Forest?
Goal, Impact & Contribution
A comparison of the performance of different types of
binary machine learning classifiers to understand the
extent to which they can identify code comment types
Help the research community understand the strengths
and shortcomings of using such models for this particular
task and discover avenues for future research in this area
A publicly available set of binary machine
learning models for code comment classification
Experiment Design
Experiment Design
• Source Dataset - a dataset of 6,738 comment sentences from 20
open-source projects implemented using Java, Python, and
Pharo produced by Rani et al.
• Text Preprocessing - transform the comment sentences to a standard and convenient format (i.e., text normalization); see the sketch below
• Removal of whitespace; expansion of contractions; removal of non-alpha characters, single-character words, and stopwords; conversion to lowercase; stemming; replacement of digits and empty strings with a token
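A minimal sketch of these normalization steps, assuming NLTK and the contractions package are available; the exact ordering and the placeholder tokens ("digit", "empty") are assumptions for illustration, not taken from the study:

```python
import re
import contractions                      # pip install contractions
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def normalize_comment(sentence: str) -> str:
    text = contractions.fix(sentence)            # expand contractions
    text = text.lower()                          # convert to lowercase
    text = re.sub(r"\d+", " digit ", text)       # replace digits with a token (assumed token name)
    text = re.sub(r"[^a-z ]", " ", text)         # remove non-alpha characters
    tokens = [t for t in text.split()            # splitting also collapses whitespace
              if len(t) > 1 and t not in STOP_WORDS]  # drop single-character words and stopwords
    tokens = [STEMMER.stem(t) for t in tokens]   # stemming
    return " ".join(tokens) if tokens else "empty"   # empty result -> placeholder token
```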
Experiment Design
• Filter Category – for each programming language dataset, we
filter the comments for each category
• Java & Pharo – 7 categories
• Python – 5 categories
• Train/Test Split - The source dataset contains a column indicating
whether a comment sentence belongs to the training or test set,
which is used to create the train/test datasets
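A minimal sketch of how the per-category binary datasets and the train/test split could be derived from the source data; the file name, column names (comment_sentence, category, partition), and indicator values are assumptions for illustration, as the actual dataset schema may differ:

```python
import pandas as pd

# Load one programming-language dataset (file name and columns assumed for illustration).
df = pd.read_csv("java.csv")

def binary_split(df: pd.DataFrame, category: str):
    """Build a one-vs-rest binary dataset for a single comment category,
    using the dataset's own train/test indicator column for the split."""
    labels = (df["category"] == category).astype(int)
    train_mask = df["partition"] == "train"          # assumed indicator values
    X_train = df.loc[train_mask, "comment_sentence"]
    y_train = labels[train_mask]
    X_test = df.loc[~train_mask, "comment_sentence"]
    y_test = labels[~train_mask]
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = binary_split(df, "summary")  # example category
```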
Experiment Design
• Feature Extraction – we only utilize the comment sentence text
as input to the classification model; the Term Frequency-Inverse
Document Frequency approach is used to convert raw comment
text into numerical values
• Oversampling – we utilized random oversampling to generate
new samples for under-represented classes in the training
dataset; the test dataset was not oversampled
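A minimal sketch of the feature extraction and oversampling steps, assuming scikit-learn and imbalanced-learn (X_train, y_train, and X_test carry over from the split sketch above); only the training split is resampled:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import RandomOverSampler

# Convert raw comment text into TF-IDF features.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)   # fit on the training text only
X_test_tfidf = vectorizer.transform(X_test)         # reuse the fitted vocabulary

# Randomly duplicate minority-class samples in the training set;
# the test set is left untouched.
ros = RandomOverSampler(random_state=42)
X_train_bal, y_train_bal = ros.fit_resample(X_train_tfidf, y_train)
```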
Experiment Design
• Model Training & Tuning – we evaluate 8 common machine learning classification
algorithms; 10-fold cross-validation to search over hyperparameter values
• Naive Bayes (NB): Multinomial NB and Bernoulli NB
• Support Vector Machines: Linear Support Vector Classifier
• Trees: Decision Tree and Random Forest
• Nearest Neighbors: K-Nearest Neighbors
• Linear Model: Logistic Regression
• Neural Network: Multi-Layer Perceptron
• Optimized Model – the model that performs the best when built on the training data using hyperparameter tuning
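A minimal sketch of the tuning step for one of the eight algorithms, using grid search with 10-fold cross-validation; the hyperparameter grid shown is illustrative, not the grid used in the study:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative grid for Logistic Regression; X_train_bal and y_train_bal
# carry over from the oversampling sketch above.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",   # optimize for F1 on the positive (category) class
    cv=10,          # 10-fold cross-validation over the training data
)
search.fit(X_train_bal, y_train_bal)
best_model = search.best_estimator_   # the optimized model for this algorithm
```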
Experiment Design
• Test Prediction – The best model for each algorithm is used to
predict values for test data that the model has not seen before
• Model Performance Scoring – The precision, recall, and F1 scores
are calculated for each model and category. A total of 190
instances of precision, recall, and F1 scores are calculated for the
evaluation.
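A minimal sketch of the evaluation step using scikit-learn's metrics (best_model, X_test_tfidf, and y_test carry over from the sketches above):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Predict on the held-out test split, which was never oversampled or used for tuning.
y_pred = best_model.predict(X_test_tfidf)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")
```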
Results
Competition Ranking Scores
Model | Avg. F1 | % Outperformed Categories | Ranking Score
LogisticRegression | 0.5465 | 1.0000 | 0.6599
LinearSVC | 0.5474 | 0.9474 | 0.6474
RandomForestClassifier | 0.5366 | 0.8947 | 0.6261
DecisionTreeClassifier | 0.4931 | 0.9474 | 0.6067
MultinomialNB | 0.5249 | 0.8421 | 0.6042
MLPClassifier | 0.5227 | 0.8421 | 0.6026
BernoulliNB | 0.5225 | 0.8421 | 0.6024
KNeighborsClassifier | 0.5033 | 0.8421 | 0.5880
• All 19 Logistic Regression models outperformed the baseline scores
• Our Random Forest Classifier outperforms the baseline RF model
• This may be due to more exhaustive hyperparameter tuning
• Alternative preprocessing activities
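The ranking scores in the table are consistent with a 0.75/0.25 weighted combination of the average F1 and the fraction of categories outperformed (e.g., 0.75 × 0.5465 + 0.25 × 1.0000 ≈ 0.6599); a small sketch of that inferred computation, which should be checked against the official NLBSE competition formula:

```python
def ranking_score(avg_f1: float, pct_outperformed: float) -> float:
    """Inferred from the table values: a weighted combination of the average F1
    and the fraction of baseline categories outperformed (assumed weights)."""
    return 0.75 * avg_f1 + 0.25 * pct_outperformed

print(round(ranking_score(0.5465, 1.0000), 4))   # 0.6599 (LogisticRegression)
print(round(ranking_score(0.5474, 0.9474), 4))   # 0.6474 (LinearSVC)
```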
Summary
• We examined the effectiveness of eight machine learning
models to classify code comments
• Workflow steps include: text preprocessing, oversampling, and
hyperparameter tuning using grid search with 10-fold cross validation
• All models achieve a higher average F1-Score than the baseline
• Logistic Regression was the only classifier that outperformed all of the baseline classifiers
• Linear SVC and Decision Tree outperformed 18 of 19 baseline classifiers
Thank You!
Anthony Peruma
https://www.peruma.me