5. How well did we do it?
Passed!
6. Benchmarks
Measure execution time (wall-clock time)
Often done by averaging benchmark results, aiming to reduce contextual variance (a small sketch follows this list)
Some work proposes using other kinds of metrics, like energy consumption or memory usage
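As an illustration, a minimal Pharo sketch of this kind of wall-clock measurement; the workload (100 factorial) and the run/repetition counts are arbitrary choices, not taken from the paper:

  | runs |
  runs := (1 to: 10) collect: [ :i |
      Time millisecondsToRun: [ 1000 timesRepeat: [ 100 factorial ] ] ].
  (runs sum / runs size) asFloat   "average wall-clock time (ms) over 10 runs"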
13. Problem: Assessing Benchmark Quality
A lack of systematic methodologies to assess benchmark effectiveness
What does benchmark quality mean?
How do we measure a benchmark's effectiveness at detecting performance bugs?
How do we introduce performance issues into a target program in order to detect them?
14. Mutation Testing Benchmark Methodology - Proposal
Assumption: A benchmark is good if it detects performance bugs
15. Mutation Testing?
Mutation testing measures test quality in terms of its capability to detect bugs
It introduces simulated bugs (mutants) and assesses whether the tests catch them (a minimal sketch follows below)
If the test fails → the mutant was killed (detected)
If not → the mutant survived (undetected)
(Diagram: a bug is introduced into the program and the tests are re-run; the question is whether any test breaks)
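A minimal Playground-style sketch of the idea (hypothetical, not the paper's tooling): a max-of-two-numbers block, a mutant of it, and a test that kills the mutant:

  | original mutant test |
  original := [ :a :b | a >= b ifTrue: [ a ] ifFalse: [ b ] ].
  mutant := [ :a :b | a <= b ifTrue: [ a ] ifFalse: [ b ] ].   "simulated bug: >= flipped to <="
  test := [ :impl | (impl value: 3 value: 2) = 3 ].
  test value: original.   "true  -> the test passes on the original program"
  test value: mutant      "false -> the test fails, so the mutant is killed"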
17. Mutation Strategy
Introduce a controlled performance-bug mutant into the original program
(Diagram: a mutant operator turns the original Program into a Mutated Program, which becomes the benchmark's target)
19. RQ1 - What is a Performance Bug?
20. RQ1 - Performance Bug
A perturbation of program execution
(e.g. latency, locality issues)
Excessive consumption of time or space by design
(e.g. long iterations, bad implementation decisions)
A non-optimal data structure used for a problem
(e.g. using an array instead of a dictionary to index a dataset, as sketched below)
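As a concrete sketch of the last point (hypothetical data, not from the case study): indexing a dataset with a linear scan over an Array versus a Dictionary lookup:

  | data array dict key |
  data := (1 to: 10000) collect: [ :i | i -> ('value-' , i printString) ].
  array := data asArray.
  dict := Dictionary new.
  data do: [ :assoc | dict at: assoc key put: assoc value ].
  key := 9999.
  Time millisecondsToRun: [ 1000 timesRepeat: [
      (array detect: [ :assoc | assoc key = key ]) value ] ].   "linear scan: O(n) per lookup"
  Time millisecondsToRun: [ 1000 timesRepeat: [ dict at: key ] ]   "hashed lookup: O(1) per lookup"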
21. RQ2 - How do we assess a Benchmark?
22. RQ2 - The Benchmark Oracle
Let’s define benchmark quality as “how sensitive the benchmark is to detecting a mutant”
What is a benchmark’s sensitivity?
Where is the threshold, and what do we compare it against?
23. Experimental Mutants: Sleep Statements
Why? - They represent latency
How? - Three mutant operators: 10, 100, and 500 milliseconds
Where? - At the beginning of every statement block (illustrated below)
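As an illustration (simplified to a single block, whereas the actual operators insert the sleep at the start of every statement block of the target method), the 10 ms operator amounts to something like:

  | original mutant10 |
  original := [ :pattern :subject | pattern matches: subject ].
  mutant10 := [ :pattern :subject |
      (Delay forMilliseconds: 10) wait.   "injected latency = the performance-bug mutant"
      pattern matches: subject ]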
24. Experimental Oracle
Baseline: average + stdev of 30 iterations, to reduce external noise
Metric: execution time
Mutant detection: a mutant is killed if its execution time > baseline average + stdev (sketched below)
(Plot: baseline median with standard-deviation bands; mutants above the threshold are marked killed, those within it survivor)
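A minimal sketch of this oracle, assuming one benchmark block bench built from a generated regex (the pattern is taken from the case-study slide; the code is illustrative, not the paper's implementation):

  | bench times mean stdev mutatedTime killed |
  bench := [ '(bb(((b)b+)))+b+|b+' matches: 'bbbbb' ].
  times := (1 to: 30) collect: [ :i | Time millisecondsToRun: bench ].
  mean := (times sum / times size) asFloat.
  stdev := ((times inject: 0 into: [ :acc :t | acc + (t - mean) squared ]) / times size) sqrt.
  mutatedTime := Time millisecondsToRun: bench.   "in the real setup, measured on the mutated program"
  killed := mutatedTime > (mean + stdev)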
25. Case Study: Regular Expressions in Pharo
Why regexes? → They are well known
100 regexes are generated via grammar-based fuzzing (MCTS)
We benchmark the regex matches: method with the generated regexes (sketched below), e.g.:
‘b+a(b)*c+(b+c*b|b?(b+b)c?(b)*ab+a?aa)?a*’ matches: ‘bac’
‘(bb(((b)b+)))+b+|b+’ matches: ‘bbbbb’
We assess the quality of benchmarks at finding performance bugs in the matches: method
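A sketch of what one benchmark run over the generated pairs above could look like (the repetition count is an arbitrary choice):

  | benchmarks |
  benchmarks := {
      [ 'b+a(b)*c+(b+c*b|b?(b+b)c?(b)*ab+a?aa)?a*' matches: 'bac' ].
      [ '(bb(((b)b+)))+b+|b+' matches: 'bbbbb' ] }.
  benchmarks do: [ :bench |
      Transcript show: (Time millisecondsToRun: [ 100 timesRepeat: bench ]) printString; cr ]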
26. Results
We introduce 62 mutants per mutant operator (3 operators)
We execute every benchmark once per mutant (186 times)
27. Average Benchmark Behavior
On average, the mutation score per benchmark is 51.48%
(Bar chart: mutation score (%) per benchmark)
28. High Score Benchmarks
11% of benchmarks have a mutation score > 60%
(Bar chart: mutation score (%) per benchmark, highlighting those above 60%)
33. Conclusion
Federico Lochbaum, Guillermo Polito - IWST 2025
A systematic methodology to evaluate benchmarks' effectiveness
We introduce artificial performance bugs by extending mutation testing
An instance of the framework applied in a real setting, showing results