Revisiting Test Smells in Automatically
Generated Tests: Limitations, Pitfalls,
and Opportunities
A. Panichella, S. Panichella, G. Fraser,
A. A. Sawant, and V. J. Hellendoorn
1
Related Work
2
[Grano et al., JSS 2019]
Test case generation tools: JTExpert
Test smell detection tool from previous work [EMSE 2015]: GPD (Grano, Palomba, Di Nucci)
Related Work
3
[Grano et al., JSS 2019]
Main Results
81%: GPD precision in detecting test smells (100% recall)
88% of the JUnit test suites generated by EvoSuite contain test smells
"The tests [by EvoSuite] are scented since the beginning as crossover and mutation operations […] do not change the structure of the tests"
Threats To Validity
Warnings raised by GPD were not manually validated
EvoSuite was misconfigured:
- Old search algorithm
- Tests and assertions were not minimized
Mutation and crossover do alter the test structure by adding/removing statements [Arcuri and Fraser, TSE 2012]
Time To Revisit These Results
4
Our Study
• RQ1: How widespread are test smells in
automatically generated tests?
• RQ2: How accurate are automated tools in
detecting test smells in automatically generated
tests?
• RQ3: How well do test smells reflect real
problems in test suites?
5
Manually analysing generated tests rather than relying on detection tools
Assessing smell detection accuracy based on the manual oracle
Manual Analysis
6
100 Java classes from SF110 (the same classes used by Grano et al.)
100 Generated Test Suites
Validator 1 · Validator 2 · Validator 3 · Validator 4
Cross-validated Oracle
RQ1: Distributions of Test Smells
7
[Bar chart: % of smelly test suites for Eager Test, Assertion Roulette, Indirect Testing, Sensitive Equality, Mystery Guest, and Resource Optimism (x-axis: 0 to 100%). Two series: our results, based on a manually validated dataset, vs. the results by Grano et al., based on automated tool warnings.]
RQ2: Accuracy of Smell Detection Tools
8
Large False Positive Rate for Assertion Roulette and Eager Tests (GPD)
TABLE IV: Detection performance of different automated test smell detection tools for test cases generated by EVOSUITE.
FPR denotes the False Positive Rate and FNR is the False Negative Rate. The best values are highlighted in grey colour.
Test smell          | Tool used by Grano et al. [6]       | TSDETECT calibrated by Spadini et al. [2]
                    | FPR   FNR   Prec.  Recall  F-meas.  | FPR   FNR   Prec.  Recall  F-meas.
Assertion Roulette  | 0.72  0.00  0.22   1.00    0.36     | 0.05  0.50  0.67   0.50    0.57
Eager Test          | 0.53  0.05  0.33   0.95    0.49     | 0.05  0.45  0.73   0.55    0.63
Mystery Guest       | 0.12  —     —      —       —        | 0.03  —     —      —       —
Sensitive Equality  | 0.00  0.67  1.00   0.33    0.50     | 0.00  0.67  1.00   0.33    0.50
Resource Optimism   | 0.02  —     —      —       —        | 0.02  —     —      —       —
Indirect Testing    | 0.00  1.00  —      0.00    —        | —     —     —      —       —
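For readers unpacking the table: FPR = FP/(FP+TN), FNR = FN/(FN+TP) = 1 − Recall, and the F-measure is the harmonic mean of precision and recall. A minimal sketch (hypothetical helper, not code from the paper) reproduces the table's F-measure column from its precision and recall columns:

```java
// Hypothetical helper, not from the paper: relates the table's columns.
// FPR = FP / (FP + TN), FNR = FN / (FN + TP) = 1 - Recall,
// F-measure = harmonic mean of precision and recall.
public class Metrics {
    public static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // GPD on Assertion Roulette: precision 0.22, recall 1.00 -> ~0.36
        System.out.printf("%.2f%n", fMeasure(0.22, 1.00));
        // TSDETECT on Eager Test: precision 0.73, recall 0.55 -> ~0.63
        System.out.printf("%.2f%n", fMeasure(0.73, 0.55));
    }
}
```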
@Test(timeout = 4000)
public void test07() throws Throwable {
  ScriptOrFnScope s0 = new ScriptOrFnScope((-806), (ScriptOrFnScope) null);
  ScriptOrFnScope s1 = new ScriptOrFnScope((-330), s0);
  s1.preventMunging();
  s1.munge();
  assertNotSame(s0, s1);
}
Fig. 2: Example of false positive for the tool used by Grano
et al. for Eager Test
@Test(timeout = 4000)
public void test00() throws Throwable {
  Show show0 = new Show();
  File file0 = MockFile.createTempFile("...");
  // …
Mystery Guest and Resource Optimism. For these two types of smells, both detection tools raise several warnings. However, they are all false positives by definition, as our gold standard does not contain any instances of such smells. The detection tools both annotate test methods that contain specific strings or objects, such as "File", "FileOutputStream", "DB", "HttpClient", as smelly; however, EVOSUITE separates the test code from environmental dependencies (e.g., external files) in a fully automated fashion through bytecode instrumentation [43]. In particular, it uses two mechanisms: (1) mocking, and (2) customized test runners. For one, classes that access the filesystem (e.g., java.io.File) […]
GPD: large False Negative Rate for Sensitive Equality and Indirect Testing
TsDetector: low False Positive Rate, but large False Negative Rate for most of the test smells
Limitations of Test Smell Detection Tools
9
According to GPD warnings:
• 12% of the JUnit test suites generated by EvoSuite contain Mystery Guest
• 2% of the JUnit test suites generated by EvoSuite contain Resource Optimism
EvoSuite does not use external resources or files thanks to:
• Sandbox and scaffolding
• Automated mock generation
• The use of a customized JUnit runner
FALSE
POSITIVES
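The false-positive mechanism can be illustrated with a toy detector. This is a hypothetical sketch, not GPD's or tsDetect's actual implementation: it flags any test whose source mentions file- or resource-related names, roughly the heuristic described in the paper excerpt above, and therefore also flags EvoSuite tests that only ever touch mocked files.

```java
// Hypothetical sketch (NOT the actual GPD/tsDetect code): a naive
// Mystery Guest / Resource Optimism detector based on string matching.
public class NaiveResourceDetector {
    private static final String[] TRIGGERS =
        { "File", "FileOutputStream", "DB", "HttpClient" };

    public static boolean isSmelly(String testSource) {
        for (String trigger : TRIGGERS) {
            if (testSource.contains(trigger)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // EvoSuite replaces java.io.File with an in-memory mock via bytecode
        // instrumentation, so this test never touches the real filesystem,
        // yet the string-based heuristic still flags it as smelly.
        String mockedTest = "File file0 = MockFile.createTempFile(\"tmp\");";
        System.out.println(isSmelly(mockedTest)); // flagged: a false positive
    }
}
```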
Limitations of Test Smell Detection Tools
10
GPD and TsDetector fail to detect instances of Sensitive Equality
@Test(timeout = 4000)
public void test62() throws Throwable {
  SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
  String string0 = substringLabeler_Match0.toString();
  assertEquals("Substring: [Atts: ]", string0);
}

public void test62() throws Throwable {
  SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
  assertEquals("Substring: [Atts: ]", substringLabeler_Match0.toString());
}
Top: test generated by EvoSuite but not detected by the two tools.
Bottom: this variant would be detected.
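One plausible explanation for the miss, sketched as a toy detector (hypothetical code, not the tools' actual implementation): if the heuristic only looks for toString() inside the assertion statement itself, the intermediate variable that EvoSuite introduces hides the smell.

```java
// Hypothetical sketch (NOT the actual GPD/tsDetect code): a detector that
// only flags toString() calls appearing directly inside assertEquals(...)
// misses the variable-indirection pattern that EvoSuite generates.
public class NaiveSensitiveEqualityDetector {
    public static boolean isSmelly(String testSource) {
        for (String line : testSource.split("\n")) {
            if (line.contains("assertEquals") && line.contains(".toString()")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // EvoSuite's form: toString() result stored in a variable first.
        String generated = "String string0 = match0.toString();\n"
                         + "assertEquals(\"Substring: [Atts: ]\", string0);";
        // Semantically equivalent inlined form.
        String inlined = "assertEquals(\"Substring: [Atts: ]\", match0.toString());";
        System.out.println(isSmelly(generated)); // missed: a false negative
        System.out.println(isSmelly(inlined));   // detected
    }
}
```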
Discussion
• In the paper we further discuss the limitations of the test smell detection tools (GPD and TsDetector) with more examples
• Our results disagree with the conclusions by Grano et al.
• Only 32% (not ~80%) of generated tests contain test smells
• Researchers should avoid self-assessing their test smell detection tools
• The involvement of human participants (preferably in industrial contexts) is critical for improving the accuracy of detection tools
11
Revisiting Test Smells in Automatically
Generated Tests: Limitations, Pitfalls,
and Opportunities
A. Panichella, S. Panichella, G. Fraser,
A. A. Sawant, and V. J. Hellendoorn
12
