Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is Significantly More Executed (FSE 2024)

Andre Hora
DCC/UFMG
andrehora@dcc.ufmg.br
FSE 2024
Ideas, Visions and Reflections

Motivation & Problem
Having a good test suite is fundamental to ensuring software quality and
sustainable software evolution
Developers should focus on testing both the expected and unexpected behaviors
of the program to catch more bugs and protect against regressions
● Expected behavior: the normal execution, simpler to test
● Unexpected behavior: the abnormal execution, harder to test
In practice, it is well known that developers are more likely to test expected behaviors than unexpected ones

However, existing research is mostly restricted to controlled experiments, like case
studies with students and developers
- Students are likely to (naively) test the “happy cases” [7]
- Expert developers may test the “sad cases” [25]
We still lack empirical evidence extracted from
real-world software systems and their test suites

Example: code from the email module of the Python Standard Library

Three possible behaviors at runtime:
1. Entering both the for and if blocks
2. Entering the for block but not the if block
3. Not entering the for block
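
To make the three behaviors concrete, here is a minimal sketch. The function `first_negative` is hypothetical and is not the code from the email module shown on the slide; it only shares the same for/if shape, so each call below exercises one of the three paths.

```python
def first_negative(values):
    """Hypothetical method: return the first negative number, or None."""
    for value in values:      # for block
        if value < 0:         # if block
            return value
    return None


# Path 1: enters both the for and if blocks
path_1 = first_negative([3, -1, 2])

# Path 2: enters the for block, never enters the if block
path_2 = first_negative([3, 1, 2])

# Path 3: does not enter the for block (empty input)
path_3 = first_negative([])
```

A test suite that mostly passes non-empty lists containing a negative number would concentrate its calls on Path 1 and rarely reach Path 3.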

At this point, it is unclear which behaviors are the most and least frequently tested by developers
Can you guess?

Interesting: the large discrepancy between the execution frequencies of the different paths
Path 1 concentrates most of the calls (70.9%)
Path 3 receives only 4.4% of the calls

Open Question
Are tested paths of real software likely to concentrate calls or do
calls tend to be more distributed among the tested paths?
● Provide insights for developers to improve existing test suites
● Support the creation of novel testing tools to better understand test suites
● Reveal novel empirical data for researchers to quantify the difference between the execution frequency of distinct paths in real-world software

Proposed Work
We propose an empirical study to assess the tested paths quantitatively
We monitor the execution of 14K tests from 25 real-world Python systems,
assessing 11K tested paths from 2,357 methods

Study Design
1. Detecting the tested paths
2. Selecting software systems
3. Research questions

Study Design: Detecting the Tested Paths
1. Collecting executed lines of code
We execute an instrumented version of the test suite that monitors the tests and collects data from the execution trace
2. Detecting the tested paths
A tested path represents a set of input values that make the method execute the same lines of code
3. Ranking the tested paths
For each method with one or more tested paths, we sort its paths in descending order of path frequency
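
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the instrumentation used in the study: it uses the standard `sys.settrace` hook, and the names `record_paths` and `first_negative` are hypothetical. Each call is mapped to the set of lines it executes, calls with the same set are counted as one tested path, and the paths are then ranked by frequency.

```python
import sys
from collections import Counter


def record_paths(func, inputs):
    """Call func on each argument tuple and count calls per executed-line set."""
    code = func.__code__
    paths = Counter()
    for args in inputs:
        executed = set()

        def tracer(frame, event, arg):
            # Step 1: only collect lines from the monitored function's frame
            if frame.f_code is not code:
                return None
            if event == "line":
                # Store line numbers relative to the function definition
                executed.add(frame.f_lineno - code.co_firstlineno)
            return tracer

        previous = sys.gettrace()
        sys.settrace(tracer)
        try:
            func(*args)
        finally:
            sys.settrace(previous)
        # Step 2: calls that execute the same lines belong to the same path
        paths[frozenset(executed)] += 1
    return paths


def first_negative(values):
    """Hypothetical method under test."""
    for value in values:
        if value < 0:
            return value
    return None


# Hypothetical calls standing in for what a test suite would trigger
calls = [([3, -1],), ([-5],), ([-2, 4],), ([1, 2],), ([],)]

# Step 3: rank the tested paths in descending order of frequency
ranked = record_paths(first_negative, calls).most_common()
for rank, (lines, count) in enumerate(ranked, start=1):
    print(f"path {rank}: lines {sorted(lines)} executed by {count} call(s)")
```

In this sketch, three of the five calls follow the same path, so one path already dominates even for a tiny input set.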

Study Design: Selecting Software Systems
25 Python systems
2,357 methods
14,177 tests
11,425 tested paths

Study Design: Research Questions
RQ1: Frequency of the most tested paths (top 1 vs. top 2)
RQ2: Frequency of the least tested paths (top 1 vs. top 3+)
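
One plausible way to compute the two comparisons for a single method is sketched below. The slides do not give the exact metric definitions, so treating "top 3+" as the sum of the third and lower-ranked paths is an assumption, as are the function names and the call counts, which are invented for illustration.

```python
def top1_vs_top2(counts):
    """Ratio of calls between the most tested path and the second one."""
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2:
        return None  # the method has a single tested path
    return ranked[0] / ranked[1]


def top1_vs_top3plus(counts):
    """Ratio of calls between the most tested path and all paths ranked 3rd or lower."""
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 3:
        return None  # the method has no top 3+ paths
    return ranked[0] / sum(ranked[2:])


def top3plus_share(counts):
    """Fraction of all calls received by the paths ranked 3rd or lower."""
    ranked = sorted(counts, reverse=True)
    return sum(ranked[2:]) / sum(ranked)


# Hypothetical call counts for the tested paths of one method
counts = [70, 20, 6, 4]
ratio_rq1 = top1_vs_top2(counts)        # 70 / 20
ratio_rq2 = top1_vs_top3plus(counts)    # 70 / (6 + 4)
share_rq2 = top3plus_share(counts)      # (6 + 4) / 100
```

Aggregating such per-method values over all methods (for example with a median) would give overall figures of the kind reported in the findings; the aggregation used in the study is not stated on the slides.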

RQ1: Frequency of the Most Tested Paths
Top 1 vs. Top 2
Finding 1: Overall, one tested path tends to receive most of the calls. Top 1 receives 4x more calls than Top 2.
Finding 2: In methods with two tested paths, one path tends to receive close to 5x more calls than the second one.
Finding 3: Even methods with four or more tested paths have one path that receives the majority of the calls.

RQ2: Frequency of the Least Tested Paths
Top 1 vs. Top 3+
Finding 4: The top 3+ tested paths receive a
minority of the calls, ranging from 4% to 24%.
Overall, the most tested path of a method has
6.5x more calls than the top 3+.

Summary
We presented an empirical study to assess the tested paths quantitatively
We monitored the execution of over 14K tests and 11K tested paths
Overall, we found that one tested path is prevalent and receives most of the calls,
while others are significantly less executed
Possible applications:
● Provide insights for developers to improve existing test suites
● Support the creation of novel testing tools
● Reveal novel empirical data for researchers