Predicting Test Results
without Execution
Andre Hora
DCC/UFMG
andrehora@dcc.ufmg.br
FSE 2024
Ideas, Visions and Reflections
Motivation & Problem
Software testing is a key practice in modern software development
Developers rely on tests for multiple reasons: avoid regressions, provide fast
feedback, ensure sustainable software evolution, etc.
Over time, as software systems grow, test suites may
become complex, making it challenging to run the
tests frequently (and locally)
CPython testing documentation
“There could be platform-specific code that simply
will not execute for you, errors in the output, etc”
Ray testing documentation
“The full suite of tests is too large to run on a
single machine”
pip testing documentation
“Running pip’s entire test suite requires supported
version control tools to be installed”
It would be valuable to predict test results without actually
executing the test suites, bypassing any challenge that
may arise when running the tests
Large Language Models for Software Engineering
Large Language Models (LLMs) have been adopted in multiple software
engineering tasks [4, 6, 11, 13, 16], mainly related to code generation
However, it is not clear whether LLMs understand code execution
● A recent study performed by Microsoft evaluated the capability of LLMs in
understanding code execution by exploring code coverage prediction tasks
● GPT-4 achieved the highest performance: ~24% in the best-tested scenario
So far, it is unclear whether LLMs can be used to
predict test results, and, potentially, overcome the issues
of running real-world tests
Proposed Work
To shed some light on this problem, we explore the capability of LLMs to predict
test results without execution
We evaluate the performance of GPT-4 in predicting the execution of 200 test
cases of the Python Standard Library
Study Design
1. Selecting Test Cases
2. Creating Prompts and Assessing Answers
3. Evaluation: Precision, Recall, and Accuracy
4. Research Questions:
a. RQ1: All test cases
b. RQ2: Test case complexity
c. RQ3: Test suite
Study Design: Selecting Test Cases
1. Five Python Standard Library modules (ast, calendar, csv, gzip, and string)
2. Two tests per library (total of 10 unique tests)
3. For each test, 20 manually modified tests (10 passing and 10 failing tests)
4. Total of 200 tests (100 passing tests + 100 failing tests)
Figure: an original test shown next to a passing test version and a failing test version
Passing test version: valid input "foo{0}{0}-{1}" and valid output "foobarbar-6"
Failing test version: invalid output " foo ", i.e., with extra blank spaces
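The slide images are not available in this transcript, so the following is only a hedged sketch of what a passing and a failing variant could look like for the string module. The function names and the exact assertions are assumptions for illustration; only the input "foo{0}{0}-{1}", the output "foobarbar-6", and the padded output " foo " come from the slide annotations. Executing each variant gives the ground-truth label that a prediction is later compared against.

```python
import string


def passing_variant():
    """Hypothetical passing variant: valid input and valid expected output."""
    fmt = string.Formatter()
    # {0} is replaced by "bar" twice and {1} by 6.
    assert fmt.format("foo{0}{0}-{1}", "bar", 6) == "foobarbar-6"


def failing_variant():
    """Hypothetical failing variant: expected output has extra blank spaces."""
    fmt = string.Formatter()
    # The formatter returns "foo", which differs from " foo ".
    assert fmt.format("foo") == " foo "


def ground_truth(test_function):
    """Run a variant and return its real result: 'pass' or 'fail'."""
    try:
        test_function()
    except AssertionError:
        return "fail"
    return "pass"


labels = {
    "passing_variant": ground_truth(passing_variant),
    "failing_variant": ground_truth(failing_variant),
}
```

In this sketch the variants are plain functions rather than unittest methods, which keeps the labeling helper short; the original tests in the standard library are written with unittest.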
Study Design: Creating Prompts and Assessing Answers
1. GPT-4: model with the best results in code coverage prediction [16]
2. Create a prompt for each test case and submit them to GPT-4
3. Read the prompt answers to assess the test result prediction
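The slides do not show the prompt wording, so the steps above can only be sketched under assumptions. In the sketch below, the template text, the "VERDICT:" convention, and both function names are hypothetical. According to the slides, the answers were read to assess the prediction, so the small parser is just one possible way to record a verdict, not the method used in the study. No model is called here.

```python
# Hypothetical prompt wording; the actual prompt is not given on the slides.
PROMPT_TEMPLATE = (
    "Consider the following Python test case.\n\n"
    "{test_source}\n\n"
    "Without executing it, will this test pass or fail? "
    "Explain your reasoning and end with 'VERDICT: PASS' or 'VERDICT: FAIL'."
)


def build_prompt(test_source):
    """Create one prompt for one test case (step 2)."""
    return PROMPT_TEMPLATE.format(test_source=test_source.strip())


def parse_verdict(answer):
    """Return 'pass', 'fail', or None when the answer has no explicit verdict.

    A None result means the answer must be read by a person (step 3).
    """
    for line in reversed(answer.strip().splitlines()):
        normalized = line.strip().upper()
        if normalized.startswith("VERDICT:"):
            verdict = normalized.split(":", 1)[1].strip()
            if verdict in ("PASS", "FAIL"):
                return verdict.lower()
    return None


example_prompt = build_prompt('assert fmt.format("foo") == " foo "')
```

Returning None instead of guessing matters given the observation that an answer can contain a correct analysis and still end with an incorrect conclusion.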
Study Design: Evaluation: Precision, Recall, and Accuracy
We evaluate the performance of test result prediction tasks by computing
precision, recall, and accuracy
● True Positive (TP): correctly predicted failing test case
● False Positive (FP): incorrectly predicted failing test case (wrong alert)
● True Negative (TN): correctly predicted passing test case
● False Negative (FN): incorrectly predicted passing test case (missing alert)
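The definitions above treat a failing test as the positive class. A minimal sketch of how the three metrics follow from them is shown below; the function names and the five-test toy data are illustrative assumptions, not data from the study.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN with 'fail' as the positive class."""
    tp = fp = tn = fn = 0
    for real, guess in zip(actual, predicted):
        if guess == "fail" and real == "fail":
            tp += 1  # correctly predicted failing test
        elif guess == "fail" and real == "pass":
            fp += 1  # wrong alert
        elif guess == "pass" and real == "pass":
            tn += 1  # correctly predicted passing test
        else:
            fn += 1  # missing alert
    return tp, fp, tn, fn


def precision(tp, fp):
    """Share of predicted failures that are real failures."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of real failures that were predicted as failures."""
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Toy data for illustration only.
actual = ["fail", "fail", "pass", "pass", "fail"]
predicted = ["fail", "pass", "pass", "fail", "fail"]
tp, fp, tn, fn = confusion_counts(actual, predicted)  # (2, 1, 1, 1)
```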
Study Design: Research Questions
● What is the performance of GPT-4 in predicting test results?
● RQ1: All test cases (200 tests)
● RQ2: Test case complexity (100 simple vs. 100 complex)
● RQ3: Test suite (ast, calendar, csv, gzip, and string)
Results
What is the performance of GPT-4 in predicting test results?
RQ1: GPT-4 has a precision of 88.8% and a recall of
71% in test result prediction. False negatives (missing alerts)
are more problematic than false positives (wrong alerts).
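The reported percentages can be turned back into approximate counts, given that the study uses 100 failing and 100 passing tests. The sketch below is an inference from the rounded figures, not a table taken from the slides, and the variable names are illustrative.

```python
FAILING_TESTS = 100       # from the study design
PASSING_TESTS = 100       # from the study design
REPORTED_RECALL = 0.71    # from RQ1
REPORTED_PRECISION = 0.888  # from RQ1

# recall = TP / (TP + FN), and TP + FN is the number of failing tests
tp = round(REPORTED_RECALL * FAILING_TESTS)   # 71
fn = FAILING_TESTS - tp                       # 29 missing alerts

# precision = TP / (TP + FP), so FP = TP / precision - TP
fp = round(tp / REPORTED_PRECISION - tp)      # 9 wrong alerts
tn = PASSING_TESTS - fp                       # 91

implied_accuracy = (tp + tn) / (tp + fp + tn + fn)  # 0.81
```

Under this reconstruction there are roughly three times as many missing alerts (29) as wrong alerts (9), which is consistent with the statement that false negatives are the larger problem.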
RQ2: GPT-4 presented better precision and recall
when predicting simpler tests than complex ones.
Results are still far from 100% even for simple tests.
RQ3: GPT-4 presented differences among the
analyzed test suites, with the precision ranging from
77.8% to 94.7% and recall between 60% and 90%.
Discussion and Observations
● Correct analysis but incorrect conclusions
○ Correct explanation for a passing or failing test, but the final verdict was incorrect
● Reliance on comments rather than on test code
○ Comments may be wrong or outdated
● Explanations based on “general knowledge” (rather than on code)
○ In some cases, GPT-4 complemented its rationale with explanations based on “general
knowledge” instead of relying solely on the source code
Summary
We evaluate the performance of GPT-4 in predicting the execution of 200 test
cases of the Python Standard Library
RQ1: GPT-4 presented a precision of 88.8% and a recall of 71% in test result prediction.
False negatives (missing alerts) are more problematic than false positives (wrong alerts)
RQ2: GPT-4 presented better precision and recall when predicting simpler tests
than complex ones. However, results are still far from 100%, even for simple tests
RQ3: GPT-4 presented large differences in precision and recall among the
analyzed test suites