Revisiting Test Smells in Automatically
Generated Tests: Limitations, Pitfalls,
and Opportunities
A. Panichella, S. Panichella, G. Fraser,
A. A. Sawant, and V. J. Hellendoorn
1
Related Work
2
[Grano et al., JSS 2019]
Test case generation tools: JTExpert
Test smell detection tool from previous work [EMSE 2015]: GPD (Grano, Palomba, Di Nucci)
Related Work
3
[Grano et al., JSS 2019]
Main Results
81%: GPD precision in detecting test smells (100% recall)
88% of the JUnit test suites generated by EvoSuite contain test smells
"The tests [by EvoSuite] are scented since the beginning as crossover and mutation operations […] do not change the structure of the tests"
Threats To Validity
Warnings raised by GPD were not manually validated
EvoSuite was misconfigured:
- Old search algorithm
- Tests and assertions were not minimized
Mutation and crossover do alter the test structure by adding/removing statements [Arcuri and Fraser, TSE 2012]
Time To Revisit These Results
4
Our Study
• RQ1: How widespread are test smells in
automatically generated tests?
• RQ2: How accurate are automated tools in
detecting test smells in automatically generated
tests?
• RQ3: How well do test smells reflect real
problems in test suites?
5
Manually analysing generated tests rather than relying on detection tools
Assessing smell detection accuracy based on the manual oracle
Manual Analysis
6
100 Java classes from SF110 (the same classes used by Grano et al.)
100 Generated Test Suites
Validator 1 · Validator 2 · Validator 3 · Validator 4
Cross-validated Oracle
RQ1: Distributions of Test Smells
7
[Bar chart: % of smelly test suites for Eager Test, Assertion Roulette, Indirect Testing, Sensitive Equality, Mystery Guest, and Resource Optimism (x-axis: 0 to 100%). Two series: our results, based on a manually validated dataset, vs. the results by Grano et al., based on automated tool warnings.]
RQ2: Accuracy of Smell Detection Tools
8
Large False Positive Rate for Assertion Roulette and Eager Tests (GPD)
TABLE IV: Detection performance of different automated test smell detection tools for test cases generated by EVOSUITE.
FPR denotes the False Positive Rate and FNR is the False Negative Rate. The best values are highlighted in grey colour.
Test smell          | Tool used by Grano et al. [6]       | TSDETECT calibrated by Spadini et al. [2]
                    | FPR   FNR   Prec.  Recall  F-meas.  | FPR   FNR   Prec.  Recall  F-meas.
Assertion Roulette  | 0.72  0.00  0.22   1.00    0.36     | 0.05  0.50  0.67   0.50    0.57
Eager Test          | 0.53  0.05  0.33   0.95    0.49     | 0.05  0.45  0.73   0.55    0.63
Mystery Guest       | 0.12  —     —      —       —        | 0.03  —     —      —       —
Sensitive Equality  | 0.00  0.67  1.00   0.33    0.50     | 0.00  0.67  1.00   0.33    0.50
Resource Optimism   | 0.02  —     —      —       —        | 0.02  —     —      —       —
Indirect Testing    | 0.00  1.00  —      0.00    —        | —     —     —      —       —
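For readers unpacking the table: FPR = FP/(FP+TN), FNR = FN/(FN+TP) = 1 − Recall, and the F-measure is the harmonic mean of precision and recall. A minimal sketch (hypothetical helper, not code from the paper) reproduces the table's F-measure column from its precision and recall columns:

```java
// Hypothetical helper, not from the paper: relates the table's columns.
// FPR = FP / (FP + TN), FNR = FN / (FN + TP) = 1 - Recall,
// F-measure = harmonic mean of precision and recall.
public class Metrics {
    public static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // GPD on Assertion Roulette: precision 0.22, recall 1.00 -> ~0.36
        System.out.printf("%.2f%n", fMeasure(0.22, 1.00));
        // TSDETECT on Eager Test: precision 0.73, recall 0.55 -> ~0.63
        System.out.printf("%.2f%n", fMeasure(0.73, 0.55));
    }
}
```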
@Test(timeout = 4000)
public void test07() throws Throwable {
  ScriptOrFnScope s0 = new ScriptOrFnScope((-806), (ScriptOrFnScope) null);
  ScriptOrFnScope s1 = new ScriptOrFnScope((-330), s0);
  s1.preventMunging();
  s1.munge();
  assertNotSame(s0, s1);
}
Fig. 2: Example of false positive for the tool used by Grano
et al. for Eager Test
@Test(timeout = 4000)
public void test00() throws Throwable {
  Show show0 = new Show();
  File file0 = MockFile.createTempFile("...");
  // …
Mystery Guest and Resource Optimism. For these two types of smells, both detection tools raise several warnings. However, they are all false positives by definition, as our gold standard does not contain any instances of such smells. The detection tools both annotate test methods that contain specific strings or objects, such as "File", "FileOutputStream", "DB", "HttpClient", as smelly; however, EVOSUITE separates the test code from environmental dependencies (e.g., external files) in a fully automated fashion through bytecode instrumentation [43]. In particular, it uses two mechanisms: (1) mocking, and (2) customized test runners. For one, classes that access the filesystem (e.g., java.io.File) […]
GPD: large False Negative Rate for Sensitive Equality and Indirect Testing
TsDetector: low False Positive Rate, but large False Negative Rate for most of the test smells
Limitations of Test Smell Detection Tools
9
According to GPD warnings:
• 12% of the JUnit test suites generated by EvoSuite contain Mystery Guest
• 2% of the JUnit test suites generated by EvoSuite contain Resource Optimism
EvoSuite does not use external resources or files thanks to:
• Sandbox and scaffolding
• Automated mock generation
• The use of a customized JUnit runner
FALSE
POSITIVES
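The false-positive mechanism can be illustrated with a toy detector. This is a hypothetical sketch, not GPD's or tsDetect's actual implementation: it flags any test whose source mentions file- or resource-related names, roughly the heuristic described in the paper excerpt above, and therefore also flags EvoSuite tests that only ever touch mocked files.

```java
// Hypothetical sketch (NOT the actual GPD/tsDetect code): a naive
// Mystery Guest / Resource Optimism detector based on string matching.
public class NaiveResourceDetector {
    private static final String[] TRIGGERS =
        { "File", "FileOutputStream", "DB", "HttpClient" };

    public static boolean isSmelly(String testSource) {
        for (String trigger : TRIGGERS) {
            if (testSource.contains(trigger)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // EvoSuite replaces java.io.File with an in-memory mock via bytecode
        // instrumentation, so this test never touches the real filesystem,
        // yet the string-based heuristic still flags it as smelly.
        String mockedTest = "File file0 = MockFile.createTempFile(\"tmp\");";
        System.out.println(isSmelly(mockedTest)); // flagged: a false positive
    }
}
```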
Limitations of Test Smell Detection Tools
10
GPD and TsDetector fail to detect instances of Sensitive Equality
@Test(timeout = 4000)
public void test62() throws Throwable {
  SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
  String string0 = substringLabeler_Match0.toString();
  assertEquals("Substring: [Atts: ]", string0);
}

public void test62() throws Throwable {
  SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
  assertEquals("Substring: [Atts: ]", substringLabeler_Match0.toString());
}
Top: test generated by EvoSuite but not detected by the two tools.
Bottom: this variant would be detected.
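One plausible explanation for the miss, sketched as a toy detector (hypothetical code, not the tools' actual implementation): if the heuristic only looks for toString() inside the assertion statement itself, the intermediate variable that EvoSuite introduces hides the smell.

```java
// Hypothetical sketch (NOT the actual GPD/tsDetect code): a detector that
// only flags toString() calls appearing directly inside assertEquals(...)
// misses the variable-indirection pattern that EvoSuite generates.
public class NaiveSensitiveEqualityDetector {
    public static boolean isSmelly(String testSource) {
        for (String line : testSource.split("\n")) {
            if (line.contains("assertEquals") && line.contains(".toString()")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // EvoSuite's form: toString() result stored in a variable first.
        String generated = "String string0 = match0.toString();\n"
                         + "assertEquals(\"Substring: [Atts: ]\", string0);";
        // Semantically equivalent inlined form.
        String inlined = "assertEquals(\"Substring: [Atts: ]\", match0.toString());";
        System.out.println(isSmelly(generated)); // missed: a false negative
        System.out.println(isSmelly(inlined));   // detected
    }
}
```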
Discussion
• In the paper we further discuss the limitations of the test smell detection tools (GPD and TsDetector) with more examples
• Our results disagree with the conclusions by Grano et al.
• Only 32% (not ~80%) of generated tests contain test smells
• Researchers should avoid self-assessing their test smell detection tools
• The involvement of human participants (preferably in industrial contexts) is critical for improving the accuracy of detection tools
11
Revisiting Test Smells in Automatically
Generated Tests: Limitations, Pitfalls,
and Opportunities
A. Panichella, S. Panichella, G. Fraser,
A. A. Sawant, and V. J. Hellendoorn
12
