by Federico Lochbaum
EVREF Team
Evaluating Benchmark Quality: a Mutation-Testing-Based Methodology
IWST 2025

Federico Lochbaum, Guillermo Polito
Test Cases
Let’s suppose we have a broken house...
Hi! I’m a house!
Test Cases
And we want to repair it...
Hi! I am the repair program!
I am a test case
How well did we do it?
Passed!
Benchmarks
Measure execution time (wall-clock time)
Results are often averaged over several runs to reduce contextual variance
Some work proposes other kinds of metrics, such as energy consumption or memory usage
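A minimal sketch of such a benchmark, written in Python for illustration (the talk itself targets Pharo). The names `run_benchmark`, `summarize` and `sort_workload`, as well as the default of 10 iterations, are assumptions made for this example and do not come from the slides.

```python
import statistics
import time


def run_benchmark(workload, iterations=10):
    """Run `workload` repeatedly and return its wall-clock times in milliseconds."""
    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times_ms.append((time.perf_counter() - start) * 1000)
    return times_ms


def summarize(times_ms):
    """Average the runs to dampen run-to-run variance; also report the spread."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)


def sort_workload():
    # Hypothetical workload, used only to have something to measure.
    sorted(range(10_000, 0, -1))
```

Calling `summarize(run_benchmark(sort_workload))` returns the average and the standard deviation of the measured times. A single run would give one number with no indication of how noisy it is.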

Benchmarks
I am the benchmark
How fast did we do it?
450 ms!
Test Cases vs Benchmarks
Test Cases
Execute a series of steps to validate the program’s behavior
Check correctness (Pass/Fail)
Self-validating
One execution is enough
Results are architecture-independent
Benchmarks
Stress the program to assess performance
Measure performance metrics (elapsed time / CPU)
Not self-validating
Require multiple runs to cope with noise
Results are architecture-dependent
Expensive to run
How fast did we do it?
450 ms!
Is this measurement good enough?
How do we know that a benchmark is “good”?
450 ms!
312 ms!
Problem: Assessing Benchmark Quality
There is a lack of systematic methodologies to assess benchmark effectiveness
What does benchmark quality mean?
How can we measure a benchmark’s effectiveness at detecting performance bugs?
How can performance issues be introduced into a target program so that they can be detected?

Mutation Testing Benchmark Methodology - Proposal
Assumption: A benchmark is good if it detects performance bugs
Mutation Testing?
Mutation testing measures test quality in terms of its capability to detect bugs
It introduces simulated bugs (mutants) and assesses whether the tests catch them
If the test fails → the mutant was killed (detected)
If not → the mutant survived (undetected)
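The killed/survived distinction can be illustrated with a small Python sketch. The program, the mutant and both tests below are hypothetical examples made up for this illustration and do not come from the talk.

```python
def is_adult(age):
    """Original program under test."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the relational operator >= was replaced by >."""
    return age > 18


def weak_test(implementation):
    """A test that never exercises the boundary value."""
    return implementation(30) is True and implementation(5) is False


def boundary_test(implementation):
    """A test that checks the boundary value."""
    return implementation(18) is True


def mutant_status(test, mutant):
    """A mutant is killed if the test fails on it, and survives otherwise."""
    return "survived" if test(mutant) else "killed"
```

Both tests pass on the original program, but only `boundary_test` kills the mutant. The surviving mutant reveals that `weak_test` is less capable of detecting this kind of bug.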

[Diagram: a bug is introduced into the program and the tests are run again: did a test break?]
Adapting Mutation Testing for Performance
It introduces performance bugs (mutants) and assesses whether the benchmark catches them
An oracle determines whether or not the benchmark kills the mutant
Mutation Strategy
Introduce a controlled performance bug (mutant) into the original program
[Diagram: Program (the benchmark’s target) → mutant operator → Mutated Program]
Adapting Mutation Testing for Performance
RQ1: Introduce performance perturbation
RQ2: A performance oracle determines whether the bug had a performance impact
RQ1 - What is a Performance Bug?
RQ1 - Performance Bug
Perturbation on program execution (e.g. latency, locality issues)
Excessive consumption of time or space by design (e.g. long iterations, bad implementation decisions)
Non-optimal data structure used for a problem (e.g. using an array instead of a dictionary to index a dataset)
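The third category can be made concrete with a small Python sketch. The dataset and the helper names `find_in_list` and `find_in_dict` are hypothetical; the point is only that both lookups return the same value while doing very different amounts of work.

```python
def find_in_list(records, key):
    """Linear scan over (key, value) pairs; returns (value, comparisons made)."""
    comparisons = 0
    for candidate, value in records:
        comparisons += 1
        if candidate == key:
            return value, comparisons
    return None, comparisons


def find_in_dict(index, key):
    """Hash-based lookup: no scan over the dataset is needed."""
    return index.get(key)


# Hypothetical dataset of 1000 keyed records, stored both ways.
records = [(i, f"item-{i}") for i in range(1000)]
index = dict(records)
```

Looking up the last key with `find_in_list(records, 999)` takes 1000 comparisons, whereas `find_in_dict(index, 999)` finds the same value without scanning. The program stays functionally correct, which is why a test case would not notice the difference but a benchmark could.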

RQ2 - How do we assess a Benchmark?
RQ2 - The Benchmark Oracle
Let’s define benchmark quality as “how sensitive the benchmark is at detecting mutants”
What is benchmark sensitivity?
Where is the threshold, and what do we compare it against?
Experimental Mutants: Sleep Statements
Why? - Represents latency
How? - Three mutant operators: 10, 100, and 500 milliseconds
Where? - At the beginning of every statement block
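A rough sketch of what such a mutant operator could look like, written with Python's `ast` module rather than the Pharo tooling used in the talk. The functions `statement_blocks` and `sleep_mutants` are hypothetical, and skipping the top-level module body is an assumption of this sketch. Each mutant contains exactly one inserted sleep.

```python
import ast

DELAYS_MS = (10, 100, 500)  # the three mutant operators named on the slide


def statement_blocks(tree):
    """Return every non-empty statement block nested inside the module."""
    blocks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Module):
            continue  # assumption: the top-level module body is not mutated
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list) and block and isinstance(block[0], ast.stmt):
                blocks.append(block)
    return blocks


def sleep_mutants(source, delay_ms):
    """Return one mutated source per block, with a sleep inserted at its start."""
    mutants = []
    for position in range(len(statement_blocks(ast.parse(source)))):
        tree = ast.parse(source)  # fresh tree, so each mutant holds a single sleep
        sleep_stmt = ast.parse(f"__import__('time').sleep({delay_ms / 1000})").body[0]
        statement_blocks(tree)[position].insert(0, sleep_stmt)
        mutants.append(ast.unparse(ast.fix_missing_locations(tree)))
    return mutants
```

For a function whose body contains a loop with a conditional inside, `sleep_mutants(source, 10)` yields three mutants: one sleeping at the start of the function, one at the start of each loop iteration, and one inside the conditional. The sketch requires Python 3.9 or later for `ast.unparse`.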
Experimental Oracle
Baseline: average + stdev of 30 iterations, to reduce external noise
Metric: execution time
Mutant detection: a mutant is killed if its execution time > baseline average + stdev
[Figure: baseline distribution with median, stdev and 2 stdev marks; mutants labelled killed or survivor]
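The oracle described on this slide can be sketched in a few lines of Python. The function names `build_baseline` and `is_killed` are hypothetical; the threshold (baseline average plus one standard deviation) is the one stated above.

```python
import statistics


def build_baseline(times_ms):
    """Summarize the baseline runs (30 on the slide) as (average, stdev)."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)


def is_killed(mutant_time_ms, baseline):
    """A mutant is killed when its time exceeds baseline average + stdev."""
    average, stdev = baseline
    return mutant_time_ms > average + stdev
```

With baseline times of 100, 110, 90, 105 and 95 ms, the average is 100 ms and the sample standard deviation is about 7.9 ms. A mutant measured at 120 ms is then killed, while one measured at 105 ms survives.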
Case Study: Regular Expressions in Pharo
Why regexes? → They are well known
100 regexes are generated via grammar-based fuzzing (MCTS)
We benchmark the regex matches: method with the generated regexes
‘b+a(b)*c+(b+c*b|b?(b+b)c?(b)*ab+a?aa)?a*’ matches: ‘bac’
‘(bb(((b)b+)))+b+|b+’ matches: ‘bbbbb’
We assess the quality of the benchmarks at finding performance bugs in the matches: method
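To give an idea of what grammar-based generation of such regexes involves, here is a simplified Python sketch. It picks grammar rules uniformly at random, which is not the MCTS-guided fuzzing used in the talk, and the grammar, the alphabet and all function names are assumptions of this example. The `matches` helper assumes that `matches:` checks the whole string, which is consistent with the two examples on the slide.

```python
import random
import re

ALPHABET = "abc"


def random_regex(rng, depth=0, max_depth=4):
    """Expand a small regex grammar by picking production rules at random."""
    if depth >= max_depth:
        return rng.choice(ALPHABET)
    rule = rng.choice(("char", "concat", "alternation", "quantified"))
    if rule == "char":
        return rng.choice(ALPHABET)
    left = random_regex(rng, depth + 1, max_depth)
    if rule == "quantified":
        # Quantifiers are applied to a group so the result is always well-formed.
        return f"({left}){rng.choice('+*?')}"
    right = random_regex(rng, depth + 1, max_depth)
    return left + right if rule == "concat" else f"{left}|{right}"


def generate_regexes(count, seed=0):
    """Generate `count` regexes reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [random_regex(rng) for _ in range(count)]


def matches(pattern, text):
    """Whole-string match, standing in for the matches: method."""
    return re.fullmatch(pattern, text) is not None
```

Each generated regex, paired with an input string, then becomes one benchmark of the matching method.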

Results
We introduce 62 mutants per mutant operator (3 operators)
We execute every benchmark once per mutant (186 times)
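The arithmetic behind these numbers, and the mutation score that the result slides report, can be sketched as follows. The function names are hypothetical. The score is assumed to be the usual mutation-testing ratio of killed mutants to all mutants, expressed as a percentage.

```python
MUTANTS_PER_OPERATOR = 62
OPERATORS_MS = (10, 100, 500)  # the three sleep durations
TOTAL_MUTANTS = MUTANTS_PER_OPERATOR * len(OPERATORS_MS)  # 62 * 3 = 186


def mutation_score(killed_flags):
    """Percentage of mutants that one benchmark killed."""
    return 100.0 * sum(killed_flags) / len(killed_flags)


def average_score(scores):
    """Average mutation score over all benchmarks."""
    return sum(scores) / len(scores)
```

A benchmark that kills 93 of the 186 mutants, for example, has a mutation score of 50%.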
Average Benchmark Behavior
On average, the mutation score per benchmark is 51.48%
[Figure: mutation score (%) per benchmark]
High Score Benchmarks
11% of benchmarks have a score > 60%
[Figure: mutation score (%) per benchmark]
Performance Perturbation Sensitivity
Some benchmarks are more sensitive than others
[Figure: mutation score (%) per benchmark]
Baseline Characterization
The average relative stdev is 42.48%
Baseline Characterization
13% have high variance: they cannot detect small perturbations
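A short Python sketch of why a noisy baseline hides small perturbations. The relative stdev is the standard deviation expressed as a percentage of the mean. The `can_detect` helper rests on a simplifying assumption made for this example only: the mutant runs in exactly the baseline average plus the injected delay.

```python
import statistics


def relative_stdev(times_ms):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(times_ms) / statistics.mean(times_ms)


def can_detect(extra_ms, times_ms):
    """True if average + extra_ms exceeds the threshold average + stdev."""
    return extra_ms > statistics.stdev(times_ms)
```

For a stable baseline such as 100, 101, 99, 100, 100 ms, the relative stdev is below 1% and a 10 ms delay crosses the threshold. For a noisy baseline such as 100, 160, 40, 130, 70 ms, the relative stdev is about 47% and the same 10 ms delay goes unnoticed.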
Future Work
Improve the benchmark selection, filtering those with high variance
Study techniques to minimize external noise
Experiment with different oracle thresholds
Experiment with different performance mutant operators
Study alternative metrics to reduce the number of executions needed to have a stable measure
Conclusion
Systematic methodology to evaluate benchmarks’ effectiveness
Introduce artificial performance bugs by extending mutation testing
An instance of the framework in a real setting, with results