Deep Software Variability and
Frictionless Reproducibility
Mathieu Acher @acherm
Deep Software Variability and Frictionless Reproducibility
Abstract: The ability to recreate computational results with minimal effort and actionable metrics provides a solid
foundation for scientific research and software development. When people can replicate an analysis at the touch of a
button using open-source software, open data, and methods to assess and compare proposals, it significantly eases
verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully
achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data
sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input
data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential
for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence
of how the complex variability interactions across these layers affect qualitative and quantitative software properties,
thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability
spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform,
random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction
methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software
science community to develop new methods and tools to manage variability and foster reproducibility in software
systems.
Invited talk, 5 June 2024 @ GDRGPL
Special thanks to* Aaron Randrianaina,
Jean-Marc Jézéquel, Benoit Combemale, Luc
Lesoil, Arnaud Gotlieb, Helge Spieker, Quentin
Mazouni, Paul Temple, Gauthier Le Bartz Lyan,
Xhevahire Tërnava, Olivier Barais, and the
whole DiverSE and RIPOST teams
*random order, incomplete
Frictionless Reproducibility and (Deep) Software (Variability)
Problem: Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
SOFTWARE VARIANTS
ARE EATING THE WORLD
5
Science is changing:
Computation-based research
6
Computational science
depends on software and its engineering
7
design of mathematical model
mining and analysis of data
executions of large simulations
problem solving
executable paper
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
Computational science
depends on software and its engineering
8
Dealing with software collapse: software stops working eventually
Konrad Hinsen 2019
Configuration failures represent one of the most common types of
software failures Sayagh et al. TSE 2018
multi-million line of code base
multi-dependencies
multi-systems
multi-layer
multi-version
multi-person
multi-variant
“Insanity is doing the same thing over and over again
and expecting different results”
9
http://throwgrammarfromthetrain.blogspot.com/2010/10/definition-of-insanity.html
Reproducibility
10
“Authors provide all the necessary data and the computer
codes to run the analysis again, re-creating the results.”
(Claerbout/Donoho/Peng definition)
“The actual scholarship is the complete software development environment and the
complete set of instructions which generated the figures.” (~executable paper)
Reproducibility and Replicability
11
Reproducible: Authors provide all the necessary data and the computer
codes to run the analysis again, re-creating the results.
Replication: A study that arrives at the same scientific findings as another
study, collecting new data (possibly with different methods) and
completing new analyses. “Terminologies for Reproducible
Research”, Lorena A. Barba, 2018
Reproducibility and Replicability
13
Methods Reproducibility: A method is reproducible if reusing the original code leads to the same
results.
Results Reproducibility: A result is reproducible if a reimplementation of the method generates
statistically similar values.
Inferential Reproducibility: A finding or a conclusion is reproducible if one can draw it from a
different experimental setup.
“Unreproducible Research is Reproducible”, Bouthillier et al., ICML 2019
Reproducible science
14
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
Reproducible science
15
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
https://github.com/emsejournal/openscience https://rescience.github.io/
https://reproducible-research.inria.fr/
Reproducible science
16
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity
of Software Supply Chains” IEEE Software 2022
https://arxiv.org/pdf/2104.06020
(best paper award IEEE Software for year 2022)
“The build process of a software product is reproducible if,
after designating a specific version of its source code and all
of its build dependencies, every build produces bit-for-bit
identical artifacts, no matter the environment in which the
build is performed.”
Frictionless reproducibility
18
https://arxiv.org/abs/2310.00865
https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2
“Computation-driven research really has changed in the last 10 years, driven by three principles of
data science, which, after longstanding partial efforts, are finally available in mature form for daily
practice, as frictionless open services offering data sharing, code sharing, and competitive
challenges.”
[FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
“We are entering an era of frictionless research exchange, in which research algorithmically builds
on the digital artifacts created by earlier research, and any good ideas that are found get spread
rapidly, everywhere. The collective behavior induced by frictionless research exchange is the
emergent superpower driving many events that are so striking today.”
Frictionless reproducibility
19
[FR-1: Data] “Datafication of everything, with a culture of research data sharing.”
[FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly
re-execute the same complete workflow by different researchers.”
[FR-3: Challenges] “a shared public dataset, a prescribed and quantified task
performance metric, a set of enrolled competitors seeking to outperform each other on
the task, and a public leaderboard.”
Frictionless reproducibility
20
[FR-1: Data] “Datafication of everything, with a culture of research data sharing.”
[FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete
workflow by different researchers.”
[FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled
competitors seeking to outperform each other on the task, and a public leaderboard.”
frictionless reproducibility = [FR-1] + [FR-2] + [FR-3]
Frictionless reproducibility
21
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and
original piece
On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure
progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research
problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in
the first place”.
Think about the absence of [FR-3]. The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon
and Evelyne Viegas - "AI Competitions and the Science Behind Contests")
● Many success stories (mainly in empirical machine learning): speech processing, biometric
recognition, facial recognition, protein structure prediction problem (CASP), etc.
● More and more leaderboards (e.g., https://evalplus.github.io/leaderboard.html,
https://robustbench.github.io/) or competitions (e.g., the SAT competition)
● Many platforms, services, and events supporting the shift (e.g., Kaggle)
Frictionless reproducibility
22
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece
On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any).
[FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of
study”. [FR-3] is “the competitive element that attracted our attention in the first place”. The performance measurement
crystallized a specific project’s contribution, boiling down an entire research contribution essentially to a single number,
which can be reproduced. Think about the absence of [FR-3]
The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the
Science Behind Contests")
● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial
recognition, protein structure prediction problem (CASP), etc.
● More and more leaderboards (e.g., https://evalplus.github.io/leaderboard.html, https://robustbench.github.io/) or
competitions (e.g., the SAT competition)
● Many platforms, services, and events supporting the shift (e.g., Kaggle)
Frictionless reproducibility
23
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important but debatable piece
On the other hand, we know that the power of a simple scoring function is dangerous (e.g., Goodhart's law)
“What if the metric is wrong? What if the subtleties of a complex problem are not amenable to representation by a single
scalar? What happens when metrics for locally optimal solutions are apparent, but ones for globally optimal solutions are
not? What happens when the community is not (yet) mature enough to rally around a consensus-scoring function? I think
it is important to recognize that finding an appropriate scoring function, let alone an objectively best one, is an ongoing
task and might evolve as FR-1 and FR-2 provide a deeper understanding of the problem space.”
Overcoming Potential Obstacles as We Strive for Frictionless Reproducibility by Adam D. Schuyler (2024)
Are we frictionless?
Reading a paper in 2024 is sometimes like in 1970:
● Where is the source code? (e.g., implementation of the solution, scripts to
compute metrics)
● Where is the data? (e.g., to test the solution)
● Contacting authors?
○ no response?
○ code not consistent with the PDF
○ …
● It does not work on my machine; results are completely different…
There are lots of socio-technical frictions… even when you have the code and data!
=> When people can replicate an analysis at the touch of a button using open-source software, open
data, and methods to assess and compare proposals, it significantly eases verification of results,
engagement with a diverse range of contributors, and progress
Frictionless reproducibility (an example)
Reproducible science… with frictions
26
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Despite the availability of data and code, several studies report that the
same data analyzed with different software can lead to different results.
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
Can a coupled ESM simulation be restarted from a different machine without causing
climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable”
case (see below) and one replicable case.
Can a coupled ESM simulation be restarted from a different machine
without causing climate-changing modifications in the results?
A study involving eight institutions and seven different supercomputers in Europe is
currently ongoing with EC-Earth. It aims to do the following:
● evaluate different computational environments that are used in collaboration
to produce CMIP6 experiments (can we safely create large ensembles
composed of subsets that emanate from different partners of the
consortium?);
● detect if the same CMIP6 configuration is replicable among platforms of the
EC-Earth consortium (that is, can we safely exchange restarts with EC-Earth
partners in order to initialize simulations and to avoid long spin-ups?); and
● systematically evaluate the impact of different compilation flag options (that
is, what is the highest acceptable level of optimization that will not break the
replicability of EC-Earth for a given environment?).
Should software version numbers determine science?
Significant differences were revealed between
FreeSurfer version v5.0.0 and the two earlier versions.
[...] About a factor two smaller differences were detected
between Macintosh and Hewlett-Packard workstations
and between OSX 10.5 and OSX 10.6. The observed
differences are similar in magnitude as effect sizes
reported in accuracy evaluations and neurodegenerative
studies.
see also Krefting, D., Scheel, M., Freing, A., Specovius, S., Paul, F., and
Brandt, A. (2011). “Reliability of quantitative neuroimage analysis using
freesurfer in distributed environments,” in MICCAI Workshop on
High-Performance and Distributed Computing for Medical Imaging.
“Neuroimaging pipelines are known to generate different results
depending on the computing platform where they are compiled and
executed.”
Reproducibility of neuroimaging
analyses across operating systems,
Glatard et al., Front. Neuroinform., 24
April 2015
The implementation of mathematical functions manipulating single-precision floating-point
numbers in libmath has evolved during the last years, leading to numerical differences in
computational results. While these differences have little or no impact on simple analysis
pipelines such as brain extraction and cortical tissue classification, their accumulation
creates important differences in longer pipelines such as the subcortical tissue
classification, RSfMRI analysis, and cortical thickness extraction.
“Neuroimaging pipelines are known to generate different results
depending on the computing platform where they are compiled and
executed.”
Statically building programs improves reproducibility across OSes, but small
differences may still remain when dynamic libraries are loaded by static
executables[...]. When static builds are not an option, software heterogeneity might
be addressed using virtual machines. However, such solutions are only
workarounds: differences may still arise between static executables built on
different OSes, or between dynamic executables executed in different VMs.
Reproducibility of neuroimaging
analyses across operating systems,
Glatard et al., Front. Neuroinform., 24
April 2015
Reproducible science as a
(deep) software variability problem
34
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Despite the availability of data and code, several studies report that the
same data analyzed with different software can lead to different results.
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
35
Despite the availability of data and
code, several studies report that the
same data analyzed with different
software can lead to different results
Many layers (operating system,
third-party libraries, versions, workloads,
compile-time options and flags, etc.)
themselves subject to variability can
alter the results.
Reproducible science and deep
software variability: a threat and
opportunity for scientific knowledge!
hardware variability
operating system variability
compiler variability
build variability
hypervisor variability
software application variability
version variability
input data variability
container variability
deep software variability
How often (x+y)+z == x+(y+z) ?
https://github.com/FAMILIAR-project/reproducibility-associativity/
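Below is a minimal sketch (my own illustration, not the repository's script) of the kind of experiment behind this question: draw random triples of floats and count how often associativity holds.

```python
import random

def associativity_holds(x: float, y: float, z: float) -> bool:
    # Floating-point addition is not associative in general:
    # the rounding after each operation depends on the grouping.
    return (x + y) + z == x + (y + z)

random.seed(42)  # fix the seed so the experiment itself is reproducible
trials = [tuple(random.uniform(0, 1) for _ in range(3)) for _ in range(100_000)]
holds = sum(associativity_holds(x, y, z) for x, y, z in trials)
print(f"associativity holds for {holds / len(trials):.1%} of {len(trials)} random triples")
```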
Frictionless Reproducibility and (Deep) Software (Variability)
Problem (cont’d): Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
15,000+ options
thousands of compiler
flags and compile-time
options
dozens of
preferences
100+ command-line
parameters
1000+ feature toggles
38
hardware variability
deep software variability
Non-functional properties
execution
time
energy
consumption
accuracy
security
15,000+ options
thousands of compiler flags
and compile-time options
dozens of preferences
100+ command-line parameters
1000+ feature toggles
39
hardware variability
deep software variability
System under Study (reproducible and replicable)
Variability
Output (scientific result; most of the time quantitative information)
input data
performance metric
Deep Software Variability and Frictionless Reproducibility
Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using
two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
We demonstrate that effects of parameter, hardware, and software variation are
detectable, complex, and interacting. However, we find most of the effects of
parameter variation are caused by a small subset of parameters. Notably, the
entrainment coefficient in clouds is associated with 30% of the variation seen in
climate sensitivity, although both low and high values can give high climate
sensitivity. We demonstrate that the effect of hardware and software is small relative
to the effect of parameter variation and, over the wide range of systems tested, may
be treated as equivalent to that caused by changes in initial conditions.
57,067 climate model runs. These runs sample parameter space for 10 parameters
with between two and four levels of each, covering 12,487 parameter combinations
(24% of possible combinations) and a range of initial conditions
Joelle Pineau “Building Reproducible, Reusable, and Robust Machine Learning Software” ICSE’19 keynote “[...] results
can be brittle to even minor perturbations in the domain or experimental procedure”
What is the magnitude of the effect
hyperparameter settings can have on baseline
performance?
How does the choice of network architecture for
the policy and value function approximation affect
performance?
How can the reward scale affect results?
Can random seeds drastically alter performance?
How do the environment properties affect
variability in reported RL algorithm performance?
Are commonly used baseline implementations
comparable?
“Completing a full replication study of our previously published findings on bluff-body
aerodynamics was harder than we thought. Despite the fact that we have good
reproducible-research practices, sharing our code and data openly.”
Data analysis workflows in many scientific domains have become increasingly complex and flexible (=
subject to variability). Here we assess the effect of this flexibility on the results of functional magnetic
resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9
ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams
chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of
hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of
the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology.
Notably, a meta-analytical approach that aggregated information across teams yielded a significant
consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an
overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the
dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions,
and identify factors that may be related to variability in the analysis of functional magnetic resonance
imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and
demonstrate the need for performing and reporting multiple analyses of the same data. Potential
approaches that could be used to mitigate issues related to analytical variability are discussed.
Can Machine Learning Pipelines Be Better
Configured? Wang et al. FSE’2023
“A pipeline is subject to misconfiguration if
it exhibits significantly inconsistent performance upon changes in
the versions of its configured libraries or the combination of these
libraries. We refer to such performance inconsistency as a pipeline
configuration (PLC) issue.”
Deep software variability: Are layers/features
orthogonal or are there interactions?
Luc Lesoil, Mathieu Acher, Arnaud Blouin, Jean-Marc Jézéquel:
Deep Software Variability: Towards Handling Cross-Layer Configuration.
Configuration is hard: numerous options, informal knowledge
REAL WORLD Example (x264)
[Figure: the same x264 encodings measured across the deep variability stack — Hardware: Dell Latitude 7400 vs Raspberry Pi 4 Model B; Operating System: version 10.4 vs 20.04; Software: x264 --mbtree vs --no-mbtree; Input Data: a “vertical” vs an “animation” video. Duration (s) ranges from 6 to 359 and Size (MB) from 21 to 34 across the combinations; the slide highlights factors of ≈×16 and ≈×12 between settings, showing large cross-layer effects on duration and size.]
[Diagram: deep variability — Hardware (Age, # Cores, GPU), Software (Variant, Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.), with effects on Bugs and Performance ↗/↘]
L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel,
“Deep Software Variability: Towards
Handling Cross-Layer Configuration” in VaMoS 2021
The “best”/default software
variant might be a bad one.
Influential software options
and their interactions vary.
Performance prediction
models and variability
knowledge may not
generalize
Let’s go deep with input data!
Intuition: video encoder behavior (and thus runtime configurations) hugely depends
on the input video (different compression ratio, encoding size/type etc.)
Is the best software configuration still the best?
Are influential options always influential?
Does the configuration knowledge generalize?
?
YouTube User General Content dataset: 1397 videos
Measurements of 201 soft. configurations (with same hardware,
compiler, version, etc.): encoding time, bitrate, etc.
configurations’ measurements over input_1
configurations’ measurements over input_42
Inputs = …
Generalization/transfer:
what’s the relationship between
perf_pred_1 and
perf_pred_42?
● with perf_pred_i
a performance model
capable of predicting
performance of any
configuration on input_i
● linear relationship?
○ eg Pearson/Spearman
linear correlation
● influential
features/options:
same?
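As an illustration of the correlation-based comparison described above, here is a minimal sketch using synthetic (hypothetical) measurements of the same configurations on two inputs; it assumes SciPy is available.

```python
import random
from scipy.stats import spearmanr

random.seed(0)
n_configs = 201  # the same 201 x264 configurations measured on two inputs

# Hypothetical measurements: a shared configuration effect plus an
# input-specific component; its weight controls input sensitivity.
config_effect = [random.gauss(0, 1) for _ in range(n_configs)]
perf_input_1 = [c + 0.1 * random.gauss(0, 1) for c in config_effect]
perf_input_42 = [c + 0.8 * random.gauss(0, 1) for c in config_effect]

rho, _ = spearmanr(perf_input_1, perf_input_42)
print(f"Spearman rank correlation between inputs: {rho:.2f}")
# A high rho means configuration knowledge (rankings, influential options)
# transfers from input_1 to input_42; a low or negative rho means it does not.
```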
Do x264 software performances
stay consistent across inputs?
●Encoding time: very strong correlations
○ low input sensitivity
●FPS: very strong correlations
○ low input sensitivity
●CPU usage : moderate correlation, a few negative correlations
○ medium input sensitivity
●Bitrate: medium-low correlation, many negative correlations
○ High input sensitivity
●Encoding size: medium-low correlation, many negative correlations
○ High input sensitivity
?
1397 videos x 201 software
configurations
Are there some configuration options
more sensitive to input videos? (bitrate)
Practical impacts for users, developers,
scientists, and self-adaptive systems
Threats to variability knowledge: predicting, tuning, or understanding configurable systems without being
aware of inputs can be inaccurate and… pointless
Opportunities: for some performance properties (P) and subject systems, some stability is observed and
performance remains consistent!
L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel “The Interaction between
Inputs and Configurations fed to Software Systems: an Empirical Study”
https://arxiv.org/abs/2112.07279
[Diagram: deep variability — Hardware (Age, # Cores, GPU), Software (Variant, Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.), with effects on Bugs and Performance ↗/↘]
Sometimes, variability is consistent/stable and knowledge transfer is immediate.
But there are also interactions among variability layers and variability knowledge may not generalize.
[Diagram: deep variability layers — Hardware (Age, # Cores, GPU), Software (Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.)]
Does deep software variability affect previous scientific,
software-based studies? (a graphical template)
List all details… and questions:
what if we run the experiments on different:
OS? version/commit? PARAMETERS? INPUT? SOFTWARE VARIANT?
Frictionless Reproducibility and (Deep) Software (Variability)
Problem: Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
Deep variability problem (statement)
Fundamentally, we have a huge multi-dimensional variant space (e.g., 10^6000)
run (source_code) => result
run (hardware, operating_system, build_environment, input_data, source_code, …) =>
results
Fixing variability once and for all, in all dimensions/layers, is the obvious solution…
But it is either impossible (e.g., the age of the processor can have an impact on execution
time)...
Or not desirable:
● non-robust results
● generalization/transferability of the results/findings
● it kills innovation
64
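To give a feel for the size of the space, here is a small sketch with purely hypothetical per-layer counts; the point is only that the cross-product explodes far beyond what can be measured exhaustively.

```python
import math

# Hypothetical (and conservative) counts of variation points per layer.
layers = {
    "hardware": 100,                 # CPU models, ages, # cores, ...
    "operating_system": 50,          # distributions and versions
    "compiler": 20,                  # compilers and versions
    "compile_time_options": 2**100,  # e.g., 100 boolean flags
    "runtime_options": 2**50,
    "input_data": 1_000,
    "versions": 200,
}

space = math.prod(layers.values())
print(f"variant space ≈ 10^{math.log10(space):.0f}")  # far beyond exhaustive exploration
```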
Replicability is the holy grail!
Exploring various configurations:
● Make more robust scientific findings
● Define and assess the validity envelope
● Enable exploration and optimization
● Innovation and new hypotheses, insights, knowledge
⇒ We propose to embrace deep variability for the sake of
replicability
65
Embrace deep variability!
Explicit modeling of the variability points
and their relationships, such as:
1. Get insights into the variability “factors” and
their possible interactions
2. Capture and document configurations for
the sake of reproducibility
3. Explore diverse configurations to replicate,
and hence optimize, validate, increase the
robustness, or provide better resilience
Our Vision
ACM REP 2024
⇒ We aim to address the complexities associated
with reproducibility and replicability in modern
software systems and environments, facilitating a
more comprehensive and nuanced perspective on
these critical “factors”.
66
Solution #1: Variability model
● Abstractions are definitely needed to…
○ reason about logical constraints and interactions
○ integrate domain knowledge
○ synthesize domain knowledge
○ automate and guide the exploration of variants
○ scope and prioritize experiments
● Language and formalism: feature model (widely applicable!)
○ translation to logics
○ reasoning with SAT/CP/SMT solvers
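As a toy illustration of the translation to logic, the sketch below encodes a small, hypothetical feature model as propositional constraints and enumerates its valid configurations by brute force; a SAT/CP/SMT solver would replace the enumeration at realistic scales.

```python
from itertools import product

# Hypothetical feature model of a video encoder build:
# root "encoder" is mandatory; "mbtree" is optional;
# exactly one of "x86_64" / "arm" (alternative group);
# "asm" (assembly optimizations) requires "x86_64".
features = ["encoder", "mbtree", "x86_64", "arm", "asm"]

def valid(selection):
    f = dict(zip(features, selection))
    return (
        f["encoder"]                      # root is always selected
        and (f["x86_64"] != f["arm"])     # exactly one architecture
        and (not f["asm"] or f["x86_64"]) # asm => x86_64
    )

configs = [c for c in product([False, True], repeat=len(features)) if valid(c)]
print(f"{len(configs)} valid configurations out of {2**len(features)} combinations")
```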
Solution #1: Variability model
● Abstractions are definitely needed…
● Yes, but how to obtain a feature model?
○ modelling
○ reverse engineering (out of command-line parameters, source code, logs, configurations, etc.)
○ learning (next slide!)
○ modeling+reverse engineering+learning (HDR)
[Pipeline: Whole Population of Configurations → Training Sample → Performance Measurements → Prediction Model → Performance Prediction]
J. Alves Pereira, H. Martin, M. Acher, J.-M. Jézéquel, G. Botterweck and A. Ventresque
“Learning Software Configuration Spaces: A Systematic Literature Review” JSS, 2021
Solution #2: sampling and learning
(regression, classification)
69
x264 --me dia --ref 5 … -o output_1.x264
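A minimal sketch of the sampling-and-learning loop, using synthetic (hypothetical) configurations and measurements and a random forest regressor from scikit-learn; it illustrates the idea, not the setup of the cited studies.

```python
import random
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

random.seed(1)

# Hypothetical configuration space: 20 boolean options of a configurable system.
def sample_configuration():
    return [random.randint(0, 1) for _ in range(20)]

# Hypothetical "measurement": individual option effects plus one interaction (options 3 and 7).
def measure(cfg):
    return 10 + 5 * cfg[3] + 3 * cfg[7] + 4 * cfg[3] * cfg[7] + random.gauss(0, 0.5)

configs = [sample_configuration() for _ in range(500)]  # sampling strategy (here: uniform random)
perfs = [measure(c) for c in configs]                   # costly measurements

X_train, X_test, y_train, y_test = train_test_split(configs, perfs, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(f"R² on unseen configurations: {model.score(X_test, y_test):.2f}")
```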
15,000+ options
thousands of compiler flags
and compile-time options
dozens of preferences
100+ command-line parameters
1000+ feature toggles
71
hardware variability
deep software variability
System under Study (reproducible)
Variability
Output (binary)
input data
“The build process of a software product is reproducible if, after designating a specific version of its
source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no
matter the environment in which the build is performed.”
Lamb and Zacchiroli “Reproducible Builds: Increasing the
Integrity of Software Supply Chains” IEEE Software 2022
15,000+ compile-time options
72
deep software variability
System under Study
Variability
Output (binary)
“The build process of a software product is reproducible if, after designating a
specific version of its source code and all of its build dependencies, every
build produces bit-for-bit identical artifacts, no matter the environment in
which the build is performed.” Lamb and Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software Supply Chains” IEEE Software 2022
make defconfig # configuration
make # build the kernel (binary) out of config
make # should be the same, right?
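A small sketch of how one can check the “should be the same, right?” question in practice: hash the artifacts of two builds and compare them bit-for-bit. The artifact paths are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    # Bit-for-bit comparison of build artifacts via their SHA-256 digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical paths: the same kernel image produced by two builds of the
# same configuration, possibly in two different build environments.
first = Path("build-1/arch/x86/boot/bzImage")
second = Path("build-2/arch/x86/boot/bzImage")

if digest(first) == digest(second):
    print("reproducible: bit-for-bit identical artifacts")
else:
    print("non-reproducible: artifacts differ (timestamps? build path? options?)")
```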
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
#1 take-away message: look at every variability layer when you want
bit-for-bit reproducibility; don’t ignore compile-time options!
“The build process of a software
product is reproducible if, after
designating a specific version and
a specific variant of its source
code and all of its build
dependencies, every build produces
bit-for-bit identical artifacts, no
matter the environment in which the
build is performed.” Lamb and
Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software
Supply Chains” IEEE Software 2022
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
#2 take-away message: interactions across variability layers exist (e.g., a
compile-time option with the build path) and may hamper reproducibility
“The build process of a software
product is reproducible if, after
designating a specific version and
a specific variant of its source
code and all of its build
dependencies, every build produces
bit-for-bit identical artifacts, no
matter the environment in which the
build is performed.” Lamb and
Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software
Supply Chains” IEEE Software 2022
● Linux as a subject software system (not as an OS interacting with other layers)
● Targeted non-functional, quantitative property: binary size
○ interest for maintainers/users of the Linux kernel (embedded systems, cloud, etc.)
○ challenging to predict (cross-cutting options, interplay with compilers/build
systems, etc.)
● Dataset: version 4.13.3 (September 2017), x86_64 arch,
measurements of 95K+ random configurations
○ paranoid about deep variability since 2017, Docker to control the build
environment and scale
○ diversity of binary sizes: from 7 MB to 1.9 GB
○ 6% MAPE errors: quite good, though costly…
2
76
H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants
and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021
4.13 version (Sep 2017): 6%. What about evolution? Can we reuse the 4.13 Linux prediction
model? No, accuracy quickly decreases: 4.15 (5 months later): 20%; 5.7 (3 years later): 35%
3
77
Solution #3 Transfer learning (reuse of knowledge)
● Mission Impossible: Saving variability knowledge and
prediction model 4.13 (15K hours of computation)
● Heterogeneous transfer learning: the feature space is
different
● TEAMS: transfer evolution-aware model shifting
5
78
H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants
and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021
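The sketch below illustrates the general idea of evolution-aware model shifting (not the actual TEAMS implementation): reuse the old model’s predictions and learn, from a few measurements on the new version, a simple shifting function. All numbers and functions are hypothetical.

```python
import random
from sklearn.linear_model import LinearRegression

random.seed(2)

# Hypothetical setting: a model trained on version A predicts binary size;
# version B shifts and rescales these sizes (new defaults, new options).
def predict_with_old_model(cfg):
    return 50 + 30 * cfg[0] + 20 * cfg[1] + 10 * cfg[2]          # stand-in for the 4.13 model

def measure_on_new_version(cfg):
    return 1.3 * predict_with_old_model(cfg) + 15 + random.gauss(0, 1)  # unknown shift

# Only a handful of costly measurements on the new version are needed to learn
# the shift, instead of re-measuring tens of thousands of configurations.
sample = [[random.randint(0, 1) for _ in range(3)] for _ in range(30)]
old_preds = [[predict_with_old_model(c)] for c in sample]
new_sizes = [measure_on_new_version(c) for c in sample]

shift = LinearRegression().fit(old_preds, new_sizes)
test = [1, 0, 1]
print(f"shifted prediction for the new version: {shift.predict([[predict_with_old_model(test)]])[0]:.1f}")
```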
3
78
Luc Lesoil, Helge Spieker, Arnaud Gotlieb, Mathieu Acher, Paul Temple, Arnaud Blouin, Jean-Marc Jézéquel:
Learning input-aware performance models of configurable systems: An empirical evaluation. J. Syst. Softw. 208: 111883 (2024)
Solution #3 Transfer learning (cont’d)
Is there an interplay between compile-time and
runtime options?
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
Solution #4: Leverage stability
across variability layers!
First good news: Worth tuning software at compile-time!
Second good news: For all the execution time distributions of x264 and all the input videos, the worst
correlation is greater than 0.97. If the compile-time options change the scale of the distribution, they do not
change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options).
It has three practical implications:
1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear
transformation among distributions. Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows
one to immediately find the best configuration at run time. We can remove one dimension!
3. Measuring at lower cost: do not use a default compile-time configuration, use the least costly one since
it will generalize!
Did we recommend using two binaries? YES, one for measuring, another for reaching optimal
performance!
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
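A toy sketch of the recipe implied above, under the assumption (reported for x264) that compile-time options rescale execution times without changing the ranking of run-time configurations: tune on the cheap-to-measure binary, then deploy the winning configuration on the optimized binary. All values are synthetic.

```python
import random

random.seed(3)

runtime_configs = [f"cfg_{i}" for i in range(50)]
base_time = {c: random.uniform(10, 60) for c in runtime_configs}

# Hypothetical model of the SPLC'21 observation: compile-time options change
# the *scale* of execution times but preserve the *ranking* of run-time configs.
def measure(cfg, binary):
    scale = {"cheap-to-measure": 1.0, "optimized": 0.9}[binary]
    return scale * base_time[cfg] + random.gauss(0, 0.1)

# 1) Tune on the cheap binary...
best = min(runtime_configs, key=lambda c: measure(c, "cheap-to-measure"))
# 2) ...then deploy that configuration on the optimized binary.
print(f"best run-time configuration found on the cheap binary: {best}")
print(f"its execution time on the optimized binary: {measure(best, 'optimized'):.1f}s")
```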
Key results (for x264)
First good news: Worth tuning software at compile-time!
Second good news: For all the execution time distributions of x264 and all the input videos, the worst
correlation is greater than 0.97. If the compile-time options change the scale of the distribution, they do not
change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options).
It has three practical implications:
1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear
transformation among distributions. Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows
one to immediately find the best configuration at run time. We can remove one dimension!
3. Measuring at lower cost: do not use a default compile-time configuration, use the least costly one since
it will generalize!
Did we recommend using two binaries? YES, one for measuring, another for reaching optimal
performance!
interplay between
compile-time and runtime
options and even input!
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
What is your move?
What is your prompt?
Deep Software Variability and Frictionless Reproducibility
Solution #5: Strategic exploration with
modelling and learning
Solution #6 Identification of root causes of variability
(testing and verification)
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/ChatGPT-C_Variations_with_%23ifdef.md
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/approx.c
Solution #7: LLMs to support
exploration of variants space
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/ChatGPT-C_Variations_with_%23ifdef.md
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/approx_eval.py
Retrieve the result of S. Boldo et al.
M. Acher, J. Galindo, J.M Jézéquel, “On Programming Variability with Large
Language Model-based Assistant”, SPLC’2023
▸ Some solutions
▸ abstractions/models
▸ learning and sampling
▸ reuse of configuration knowledge
▸ leveraging stability
▸ systematic exploration
▸ identification of root causes
▸ LLMs to support exploration of variants’ space
▸ incremental build of configuration space (Randrianaina et al. ICSE’22)
▸ debloating variability (Ternava et al. SAC’23)
▸ feature subset selection (Martin et al. SPLC’23)
▸ Essentially, we want to reduce the dimensionality of the problem
as well as the computational and human cost to foster
verification of results and innovation
▸ Frictionless reproducibility: code+data+metrics
▸ Deep variability is a problem (frictions!)
▸ evidence in many scientific domains
▸ Deep variability is a solution (exploration!)
▸ fixing variability once and for all is not a solution
▸ Replicability is the holy grail!
▸ explore variants for robustness, validation, optimization and knowledge finding
93
Backup slides (disclaimer: don’t try to understand
everything ;))
What can we do? (robustness)
Robustness (trustworthiness) of scientific results to sources of variability
I have shown many examples of sources of variations and non-robust results…
Robustness should be rigorously defined (hint: it’s not the definition as given in computer
science)
How to verify the effect of sources of variations on the robustness of given conclusions?
● actionable metrics?
● methodology? (eg when to stop?)
● variability can actually be leveraged to augment confidence
96
deep
software
variability
different
methods
different
assumptions
different analyses
different data
97
Deep software variability is…
a threat for reproducible research
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
an opportunity for replication
“A study that arrives at the same scientific findings as another study,
collecting new data (possibly with different methods) and completing new
analyses.”
“A study that refutes some scientific findings of another study, through the
collection of new data (possibly with different methods) and completion of
new analyses.”
robustifying and augmenting
scientific knowledge
Reproducible Science as a Testing Problem
#1 Test Generation Problem (input)
inputs: computing environment, parameters of an algorithm, versions of
a library or tool, choice of a programming language
#2 Oracle Problem (output)
we usually ignore the outcome! (open problems; open questions; new
knowledge)
System under Study (replicable)
Input
Output (scientific result)
Reproduction vs replication http://rescience.github.io/faq/
“Reproduction of a computational study means running the same computation on the same input data, and then checking if the
results are the same, or at least “close enough” when it comes to numerical approximations. Reproduction can be considered as
software testing at the level of a complete study.”
We don’t “test” in one run, in one computing environment, with one kind of input data, etc.
“Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions
but varying the technical details. For computational work, this would mean using different software, running a simulation from
different initial conditions, etc. The idea is to change something that everyone believes shouldn’t matter, and see if the scientific
conclusions are affected or not.”
It is the most interesting direction, basically for synthesizing new scientific knowledge!
In both cases, there is the need to
harness the combinatorial explosion
of deep software variability
99
Reproducible Science and Software Engineering
@acherm
aka Deep Software Variability for Replicability in Computational Science
Deep Questions?
Deep Software Variability and Frictionless Reproducibility
Transferring Performance Prediction Models Across Different Hardware Platforms
Valov et al. ICPE 2017
“Linear model provides a good approximation of
transformation between performance distributions
of a system deployed in different hardware
environments”
what about
variability of
input data?
compile-time options?
version?
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Jamshidi et al. ASE 2017
mixing deep variability: hard to assess the specific
influence of each layer
very few hardware, version, and input data… but lots
of runtime configurations (variants)
Let’s go deep with input data!
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Jamshidi et al. ASE 2017
Threats to variability knowledge for performance property bitrate
● optimal configuration is specific to an input; a good configuration can be a bad one
● some options’ values have an opposite effect depending on the input
● effectiveness of sampling strategies (random, 2-wise, etc.) is input specific (somehow
confirming Pereira et al. ICPE 2020)
● predicting, tuning, or understanding configurable systems
without being aware of inputs can be inaccurate and… pointless
Practical impacts for users, developers,
scientists, and self-adaptive systems
Computational science
depends on software and its engineering
106
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
multi-million line of code base
multi-dependencies
multi-systems
multi-layer
multi-version
multi-person
multi-variant
x264 video encoder (compilation/build)
compile-time
options
What can we do? (#1 studies)
Empirical studies about deep software variability
● more subject systems
● more variability layers, including interactions
● more quantitative (e.g., performance) properties
with challenges for gathering measurements data:
● how to scale experiments? Variant space is huge!
● how to fix/isolate some layers? (eg hardware)
● how to measure in a reliable way?
Expected outcomes:
● significance of deep software variability in the wild
● identification of stable layers: sources of variability that should not affect the conclusion and that can
be eliminated/forgotten
● identification/quantification of sensitive layers and interactions that matter
● variability knowledge
What can we do? (#2 cost)
Reducing the cost of exploring the variability spaces
Many directions here (references at the end of the slides):
● learning
○ many algorithms/techniques with tradeoffs interpretability/accuracy
○ transfer learning (instead of learning from scratch)
● sampling strategies
○ uniform random sampling? t-wise? distance-based? …
○ sample of hardware? input data?
● incremental build of configurations
● white-box approaches
● …
Key results (for x264)
Worth tuning software at compile-time: gain about 10 % of execution time with the
tuning of compile-time options (compared to the default compile-time configuration).
The improvements can be larger for some inputs and some runtime configurations.
Stability of variability knowledge: For all the execution time distributions of x264
and all the input videos, the worst correlation is greater than 0.97. If the compile-time
options change the scale of the distribution, they do not change the rankings of
run-time configurations (i.e., they do not truly interact with the run-time options).
Reuse of configuration knowledge:
● Linear transformation among distributions
● Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
Embrace deep variability!
Explicit modeling of the variability points
and their relationships, such as:
1. Get insights into the variability “factors” and
their possible interactions
2. Capture and document configurations for
the sake of reproducibility
3. Explore diverse configurations to replicate,
and hence optimize, validate, increase the
robustness, or provide better resilience
Our Vision
ACM REP 2024
⇒ We aim to address the complexities associated
with reproducibility and replicability in modern
software systems and environments, facilitating a
more comprehensive and nuanced perspective on these
critical “factors”.
111
https://hal.science/hal-04582287
Deep Software Variability and Frictionless Reproducibility
exec (software) = exec_repro (software)
or
exec(software) ~= exec(software_repro)
(difference: exec_repro is another execution environment… and so somehow differs, or not, from exec; or we consider that the software differs…)
(exec: execution? what’s the outcome then? in fact execution can be replaced by “build”... which is another kind of execution)
exec (software) ?= exec_repro (software)
software ~= software_repro
exec (software, hardware)
exec (software, hardware, compiler, input_data, operating_system, bios, container, hypervisor, dependencies_versions)
exec (v1, v2, …, vN) ~= exec_repro (v1’, v2’, …, vN’)
for i in [1, n], v_{i} ~= v_{i}’ (or not!)
~= is specific to a domain, to a usage, etc.
~= can be over the N layers or over N’ layers (N’ < N)
~= can be specific to some pairs elements (eg we know that with this hardware, the exec time is multiplied by 2)
for instance, we know the ~= between a software configuration with any hardware (but if the compiler changes, then the ~= should be “tuned” accordingly)
also ~= can be defined between a configuration set and an hardware set (eg performance distribution)
Exact same results? No
Frictionless reproducibility (annotated bibliography; grey literature)
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless
Reproducibility, B Recht
interesting historical perspective on research in neural networks (NeurIPs 87 titles are shockingly
still relevant); really love some parts about random experiments, science as a “massively parallel
genetic algorithm” or the discussions on the difficulty of using ML/DL software (completely
aligned with my terrible experience of Weka GUI in ~2006)
https://www.argmin.net/p/the-department-of-frictionless-reproducibilty
https://statmodeling.stat.columbia.edu/2023/10/13/frictionless-reproducibility-methods-as-proto-algorithms-division-of-labor-as-a-characteristic-of-statistical-methods-statistics-as-the-science-of-defaults-statisticians-well-prepared-to-think-abo/
Progress and frictionless reproducibility
Inspired by Thomas Kuhn (1962), we can think of the scientific and engineering process as a massively parallel genetic algorithm. If
we want to improve upon the systems we currently have, we might try a small perturbation to see if we get an improvement. If we
can find a small change that improves some desired outcome, we could change our systems to reflect this improvement. If we
continually search for these improvements and work hard to demonstrate their value, we may head in a better direction over time.
For scientific endeavors, we could perhaps gauge ‘better’ or ‘worse’ by performing random experiments—not randomized
experiments per se, but random experiments in the sense of trying potentially surprising improvements. If our small tweak results in
better outcomes, we can attempt to convince a journal editor or conference program committee to publish it. And this
communication gives everyone else a new starting point for their own random experimentation.
A single investigator can only make so much progress by random searching alone, but random search is pleasantly parallelizable.
Competing scientists can independently try their own random ideas and publish their results. Sometimes an individual result is so
promising that the herd of experimenters all flock around the good idea, hoping to strike gold on a nearby improvement and bring
home bragging rights. To some, this looks like an inefficient mess. To others, it looks like science.
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless
Reproducibility, B Recht
Data sharing and frictions
“Data set benchmarking and competitive testing took over machine learning in the late 1980s. Email and
file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In
1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically
testing machine learning methods. Aha was motivated by service to the community, but he also wanted to
show his nearest-neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms.
He formatted his data sets using the ‘attribute-value’ representation that a rival researcher, Ross Quinlan
(1986), had used. And, so, the UC Irvine Machine Learning Repository was born.”
“Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and
smaller storage footprints. But computing technology alone was not sufficient to drive progress. Friendly
competition with Quinlan inspired Aha to build the UCI repository. And more explicit competitions were
also crucial components of the success.”
The Mechanics of Frictionless Reproducibility, B Recht, 2024
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1
https://twitter.com/StasBekman/status/1749480373283905611
Deep Software Variability and Frictionless Reproducibility
https://github.com/FAMILIAR-project/reproducibility-associativity/

Deep Software Variability and Frictionless Reproducibility

  • 10. Reproducibility 10 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” (Claerbout/Donoho/Peng definition) “The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” (~executable paper)
  • 11. Reproducibility and Replicability 11 Reproducible: Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results. Replication: A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses. “Terminologies for Reproducible Research”, Lorena A. Barba, 2018
  • 13. Reproducibility and Replicability 13 Methods Reproducibility: A method is reproducible if reusing the original code leads to the same results. Results Reproducibility: A result is reproducible if a reimplementation of the method generates statistically similar values. Inferential Reproducibility: A finding or a conclusion is reproducible if one can draw it from a different experimental setup. “Unreproducible Research is Reproducible”, Bouthillier et al., ICML 2019
  • 14. Reproducible science 14 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc.
  • 15. Reproducible science 15 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc. https://github.com/emsejournal/openscience https://rescience.github.io/ https://reproducible-research.inria.fr/
  • 16. Reproducible science 16 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc.
  • 17. Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022 https://arxiv.org/pdf/2104.06020 (best paper award IEEE Software for year 2022) “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.”
  • 18. Frictionless reproducibility 18 https://arxiv.org/abs/2310.00865 https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2 “Computation-driven research really has changed in the last 10 years, driven by three principles of data science, which, after longstanding partial efforts, are finally available in mature form for daily practice, as frictionless open services offering data sharing, code sharing, and competitive challenges.” [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] “We are entering an era of frictionless research exchange, in which research algorithmically builds on the digital artifacts created by earlier research, and any good ideas that are found get spread rapidly, everywhere. The collective behavior induced by frictionless research exchange is the emergent superpower driving many events that are so striking today.”
  • 19. Frictionless reproducibility 19 [FR-1: Data] “Datafication of everything, with a culture of research data sharing.” [FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete workflow by different researchers.” [FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard.” performance metric
  • 20. Frictionless reproducibility 20 [FR-1: Data] “Datafication of everything, with a culture of research data sharing.” [FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete workflow by different researchers.” [FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard.” frictionless reproducibility = [FR-1] + [FR-2] + [FR-3] performance metric
  • 21. Frictionless reproducibility 21 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in the first place”. Think about the absence of [FR-3]. The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the Science Behind Contests") ● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial recognition, protein structure prediction problem (CASP), etc. ● More and more leaderboard (eg https://evalplus.github.io/leaderboard.html https://robustbench.github.io/) or competition (eg SAT competition) ● Many platforms, services, and events supporting the shift (eg Kaggle)
  • 22. Frictionless reproducibility 22 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in the first place”. The performance measurement crystallized a specific project’s contribution, boiling down an entire research contribution essentially to a single number, which can be reproduced. Think about the absence of [FR-3] The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the Science Behind Contests") ● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial recognition, protein structure prediction problem (CASP), etc. ● More and more leaderboard (eg https://evalplus.github.io/leaderboard.html https://robustbench.github.io/) or competition (eg SAT competition) ● Many platforms, services, and events supporting the shift (eg Kaggle)
  • 23. Frictionless reproducibility 23 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important but discussable piece On the other hand, we know that the power of a simple scoring function is dangerous (e.g., Goodhart's law) “What if the metric is wrong? What if the subtleties of a complex problem are not amenable to representation by a single scalar? What happens when metrics for locally optimal solutions are apparent, but ones for globally optimal solutions are not? What happens when the community is not (yet) mature enough to rally around a consensus-scoring function? I think it is important to recognize that finding an appropriate scoring function, let alone an objectively best one, is an ongoing task and might evolve as FR-1 and FR-2 provide a deeper understanding of the problem space.” Overcoming Potential Obstacles as We Strive for Frictionless Reproducibility by Adam D. Schuyler (2024) performance metric
  • 24. Are we frictionless? Reading a paper in 2024 is sometimes like in 1970: ● Where is the source code? (eg implementation of the solution, scripts to compute metrics) ● Where is the data? (eg to test the solution) ● Contacting authors? ○ no response? ○ code not consistent with the PDF ○ … ● It does not work on my machine; results are completely different… There are lots of socio-technical frictions… even when you have the code and data! => When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress
  • 26. Reproducible science… with frictions 26 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses
  • 27. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
  • 30. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? A study involving eight institutions and seven different supercomputers in Europe is currently ongoing with EC-Earth. This ongoing study aims to do the following: ● evaluate different computational environments that are used in collaboration to produce CMIP6 experiments (can we safely create large ensembles composed of subsets that emanate from different partners of the consortium?); ● detect if the same CMIP6 configuration is replicable among platforms of the EC-Earth consortium (that is, can we safely exchange restarts with EC-Earth partners in order to initialize simulations and to avoid long spin-ups?); and ● systematically evaluate the impact of different compilation flag options (that is, what is the highest acceptable level of optimization that will not break the replicability of EC-Earth for a given environment?).
  • 31. Should software version numbers determine science? Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6. The observed differences are similar in magnitude as effect sizes reported in accuracy evaluations and neurodegenerative studies. see also Krefting, D., Scheel, M., Freing, A., Specovius, S., Paul, F., and Brandt, A. (2011). “Reliability of quantitative neuroimage analysis using freesurfer in distributed environments,” in MICCAI Workshop on High-Performance and Distributed Computing for Medical Imaging.
  • 32. “Neuroimaging pipelines are known to generate different results depending on the computing platform where they are compiled and executed.” Reproducibility of neuroimaging analyses across operating systems, Glatard et al., Front. Neuroinform., 24 April 2015 The implementation of mathematical functions manipulating single-precision floating-point numbers in libmath has evolved during the last years, leading to numerical differences in computational results. While these differences have little or no impact on simple analysis pipelines such as brain extraction and cortical tissue classification, their accumulation creates important differences in longer pipelines such as the subcortical tissue classification, RSfMRI analysis, and cortical thickness extraction.
  • 33. “Neuroimaging pipelines are known to generate different results depending on the computing platform where they are compiled and executed.” Statically building programs improves reproducibility across OSes, but small differences may still remain when dynamic libraries are loaded by static executables[...]. When static builds are not an option, software heterogeneity might be addressed using virtual machines. However, such solutions are only workarounds: differences may still arise between static executables built on different OSes, or between dynamic executables executed in different VMs. Reproducibility of neuroimaging analyses across operating systems, Glatard et al., Front. Neuroinform., 24 April 2015
  • 34. Reproducible science as a (deep) software variability problem 34 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses
  • 35. 35 Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. Many layers (operating system, third-party libraries, versions, workloads, compile-time options and flags, etc.) themselves subject to variability can alter the results. Reproducible science and deep software variability: a threat and opportunity for scientific knowledge! hardware variability operating system variability compiler variability build variability hypervisor variability software application variability version variability input data variability container variability deep software variability
  • 36. How often (x+y)+z == x+(y+z) ? https://github.com/FAMILIAR-project/reproducibility-associativity/
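The question above can be made concrete in a few lines. Below is a minimal sketch (not the script from the linked repository) that estimates how often floating-point addition is associative for randomly drawn operands; the operand distribution and the number of trials are arbitrary assumptions, and the linked repository goes much further by varying languages, compilers, and environments.

```python
# Minimal sketch: estimate how often (x + y) + z == x + (y + z) holds for floats.
# The operand distribution (standard normal) and the trial count are arbitrary choices.
import random

def associativity_ratio(trials: int = 1_000_000, seed: int = 42) -> float:
    rng = random.Random(seed)
    equal = 0
    for _ in range(trials):
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        if (x + y) + z == x + (y + z):
            equal += 1
    return equal / trials

if __name__ == "__main__":
    print(f"(x+y)+z == x+(y+z) in {associativity_ratio():.1%} of trials")
```

The exact ratio depends on the operand distribution, the floating-point precision, and potentially the language, compiler, and hardware, which is precisely the point of the exercise.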
  • 37. Frictionless Reproducibility and (Deep) Software (Variability) Problem (cont’d): Variability and Frictions Solution: Variability and Exploration Discussions AGENDA
  • 38. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 38 hardware variability deep software variability Non-functional properties execution time energy consumption accuracy security
  • 39. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 39 hardware variability deep software variability System under Study (reproducible and replicable) Variability Output (scientific result; most of the time quantitative information) input data performance metric
  • 41. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
  • 42. We demonstrate that effects of parameter, hardware, and software variation are detectable, complex, and interacting. However, we find most of the effects of parameter variation are caused by a small subset of parameters. Notably, the entrainment coefficient in clouds is associated with 30% of the variation seen in climate sensitivity, although both low and high values can give high climate sensitivity. We demonstrate that the effect of hardware and software is small relative to the effect of parameter variation and, over the wide range of systems tested, may be treated as equivalent to that caused by changes in initial conditions. 57,067 climate model runs. These runs sample parameter space for 10 parameters with between two and four levels of each, covering 12,487 parameter combinations (24% of possible combinations) and a range of initial conditions
  • 43. Joelle Pineau “Building Reproducible, Reusable, and Robust Machine Learning Software” ICSE’19 keynote “[...] results can be brittle to even minor perturbations in the domain or experimental procedure” What is the magnitude of the effect hyperparameter settings can have on baseline performance? How does the choice of network architecture for the policy and value function approximation affect performance? How can the reward scale affect results? Can random seeds drastically alter performance? How do the environment properties affect variability in reported RL algorithm performance? Are commonly used baseline implementations comparable?
  • 44. “Completing a full replication study of our previously published findings on bluff-body aerodynamics was harder than we thought. Despite the fact that we have good reproducible-research practices, sharing our code and data openly.”
  • 45. Data analysis workflows in many scientific domains have become increasingly complex and flexible (= subject to variability). Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
  • 46. Can Machine Learning Pipelines Be Better Configured? Wang et al. FSE’2023 “A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue.”
  • 47. Deep software variability: Are layers/features orthogonal or are there interactions? Luc Lesoil, Mathieu Acher, Arnaud Blouin, Jean-Marc Jézéquel: Deep Software Variability: Towards Handling Cross-Layer Configuration.
  • 48. Configuration is hard: numerous options, informal knowledge ?????
  • 49. Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1 REAL WORLD Example (x264)
  • 50. REAL WORLD Example (x264) Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1
  • 51. Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1 ≈*16 ≈*12 REAL WORLD Example (x264)
  • 52. Age # Cores GPU SOFTWARE Variant Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Bug Perf. ↗ Perf. ↘ deep variability L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel, “Deep Software Variability: Towards Handling Cross-Layer Configuration” in VaMoS 2021 The “best”/default software variant might be a bad one. Influential software options and their interactions vary. Performance prediction models and variability knowledge may not generalize
  • 53. Let’s go deep with input data! Intuition: video encoder behavior (and thus runtime configurations) hugely depends on the input video (different compression ratio, encoding size/type etc.) Is the best software configuration still the best? Are influential options always influential? Does the configuration knowledge generalize? ? YouTube User General Content dataset: 1397 videos Measurements of 201 soft. configurations (with same hardware, compiler, version, etc.): encoding time, bitrate, etc.
  • 54. configurations’ measurements over input_1 configurations’ measurements over input_42 Inputs = …
  • 55. configurations’ measurements over input_1 configurations’ measurements over input_42 Inputs = … Generalization/transfer: what’s the relationship between perf_pred_1 and perf_pred_42? ● with perf_pred_i a performance model capable of predicting performance of any configuration on input_i ● linear relationship? ○ eg Pearson/Spearman linear correlation ● influential features/options: same?
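A rough sketch of the correlation analysis suggested above, assuming we have the performance of the same configurations measured on two inputs (the arrays below are fabricated; in the study each entry would be, e.g., the bitrate of one x264 configuration on one video):

```python
# Minimal sketch: do configuration rankings transfer from one input to another?
# The measurement vectors are synthetic placeholders for real measurements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
perf_input_1 = rng.uniform(5, 50, size=201)                      # 201 configurations on input_1
perf_input_42 = 0.8 * perf_input_1 + rng.normal(0, 5, size=201)  # same configurations on input_42

rho, pvalue = spearmanr(perf_input_1, perf_input_42)
print(f"Spearman rho = {rho:.2f} (p = {pvalue:.2g})")
# rho close to 1: the ranking of configurations transfers across inputs;
# low or negative rho: configuration knowledge does not generalize.
```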
  • 56. Let’s go deep with input data! Intuition: video encoder behavior (and thus runtime configurations) hugely depends on the input video (different compression ratio, encoding size/type etc.) Is the best software configuration still the best? Are influential options always influential? Does the configuration knowledge generalize? ? YouTube User General Content dataset: 1397 videos Measurements of 201 soft. configurations (with same hardware, compiler, version, etc.): encoding time, bitrate, etc.
  • 57. Do x264 software performances stay consistent across inputs? ●Encoding time: very strong correlations ○ low input sensitivity ●FPS: very strong correlations ○ low input sensitivity ●CPU usage : moderate correlation, a few negative correlations ○ medium input sensitivity ●Bitrate: medium-low correlation, many negative correlations ○ High input sensitivity ●Encoding size: medium-low correlation, many negative correlations ○ High input sensitivity ? 1397 videos x 201 software configurations
  • 58. Are there some configuration options more sensitive to input videos? (bitrate)
  • 60. Practical impacts for users, developers, scientists, and self-adaptive systems Threats to variability knowledge: predicting, tuning, or understanding configurable systems without being aware of inputs can be inaccurate and… pointless Opportunities: for some performance properties (P) and subject systems, some stability is observed and performance remains consistent! L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel “The Interaction between Inputs and Configurations fed to Software Systems: an Empirical Study” https://arxiv.org/abs/2112.07279
  • 61. Age # Cores GPU SOFTWARE Variant Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Bug Perf. ↗ Perf. ↘ deep variability Sometimes, variability is consistent/stable and knowledge transfer is immediate. But there are also interactions among variability layers and variability knowledge may not generalize
  • 62. Age # Cores GPU Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Does deep software variability affect previous scientific, software-based studies? (a graphical template) List all details… and questions: what iF we run the experiments on different: OS? version/commit? PARAMETERS? INPUT? SOFTWARE Variant
  • 63. Frictionless Reproducibility and (Deep) Software (Variability) Problem: Variability and Frictions Solution: Variability and Exploration Discussions AGENDA
  • 64. Deep variability problem (statement) Fundamentally, we have a huge multi-dimensional variant space (e.g., 10^6000) run (source_code) => result run (hardware, operating_system, build_environment, input_data, source_code, …) => results Fixing variability once and for all, in all dimensions/layers, is the obvious solution… But it is either impossible (e.g., the age of a processor can have an impact on execution time)... Or not desirable: ● non-robust results ● no generalization/transferability of the results/findings ● it kills innovation 64
  • 65. Replicability is the holy grail! Exploring various configurations: ● Make scientific findings more robust ● Define and assess the validity envelope ● Enable exploration and optimization ● Foster innovation and new hypotheses, insights, knowledge ⇒ We propose to embrace deep variability for the sake of replicability 65
  • 66. Embrace deep variability! Explicit modeling of the variability points and their relationships, so as to: 1. Get insights into the variability “factors” and their possible interactions 2. Capture and document configurations for the sake of reproducibility 3. Explore diverse configurations to replicate, and hence optimize, validate, increase the robustness, or provide better resilience Our Vision ACM REP 2024 ⇒ We aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical “factors”. 66
  • 67. Solution #1: Variability model ● Abstractions are definitely needed to… ○ reason about logical constraints and interactions ○ integrate domain knowledge ○ synthesize domain knowledge ○ automate and guide the exploration of variants ○ scope and prioritize experiments ● Language and formalism: feature model (widely applicable!) ○ translation to logics ○ reasoning with SAT/CP/SMT solvers ᵩ ⋃ ⋂ |
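As a toy illustration of the “translation to logics” point above, the sketch below encodes a tiny, made-up feature model as Boolean constraints and enumerates its valid configurations by brute force; a real toolchain would translate the model to CNF and delegate to a SAT/CP/SMT solver. The feature names echo x264 options, but the cross-tree exclusion is invented for the example.

```python
# Minimal sketch: a tiny, hypothetical feature model as propositional constraints.
# Brute-force enumeration is only viable for toy examples; real feature models
# (e.g., the Linux kernel's 15,000+ options) require SAT/CP/SMT solvers.
from itertools import product

FEATURES = ["x264", "mbtree", "cabac", "static_build"]

def is_valid(cfg):
    return (
        cfg["x264"]                                      # root feature is mandatory
        and (not cfg["mbtree"] or cfg["x264"])           # child implies parent
        and (not cfg["cabac"] or cfg["x264"])            # child implies parent
        and not (cfg["mbtree"] and cfg["static_build"])  # invented cross-tree exclusion
    )

valid = []
for values in product([False, True], repeat=len(FEATURES)):
    cfg = dict(zip(FEATURES, values))
    if is_valid(cfg):
        valid.append(cfg)

print(f"{len(valid)} valid configurations out of {2 ** len(FEATURES)} candidates")
```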
  • 68. Solution #1: Variability model ● Abstractions are definitely needed… ● Yes, but how to obtain a feature model? ○ modelling ○ reverse engineering (out of command-line parameters, source code, logs, configurations, etc.) ○ learning (next slide!) ○ modeling+reverse engineering+learning (HDR)
  • 69. Whole Population of Configurations Performance Prediction Training Sample Performance Measurements Prediction Model J. Alves Pereira, H. Martin, M. Acher, J.-M. Jézéquel, G. Botterweck and A. Ventresque “Learning Software Configuration Spaces: A Systematic Literature Review” JSS, 2021 Solution #2: sampling and learning (regression, classification) 69
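A minimal sketch of this sampling-and-learning loop, using scikit-learn and a synthetic performance function in place of real measurements (the oracle, the random-forest choice, and the sample sizes are illustrative assumptions, not the setup of the cited review):

```python
# Minimal sketch of "sampling and learning" over a configuration space.
# The performance oracle is synthetic; in practice each label requires an
# actual build/run of the system, which is what makes sampling necessary.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_options, n_samples = 20, 500

X = rng.integers(0, 2, size=(n_samples, n_options))   # random Boolean configurations

# synthetic performance: a few influential options, one interaction, plus noise
y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + 4 * X[:, 2] * X[:, 3] + rng.normal(0, 0.5, n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

mape = np.mean(np.abs(model.predict(X_test) - y_test) / np.abs(y_test)) * 100
print(f"MAPE on unmeasured configurations: {mape:.1f}%")
```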
  • 70. x264 --me dia --ref 5 … -o output_1.x264
  • 71. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 71 hardware variability deep software variability System under Study (reproducible) Variability Output (binary) input data “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 72. 15,000+ compile-time options 72 deep software variability System under Study Variability Output (binary) “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022 make defconfig # configuration make # build the kernel (binary) out of config make # should be the same, right?
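In practice, answering the “should be the same, right?” question boils down to comparing cryptographic hashes of the artifacts produced by two builds of the same configuration. The sketch below assumes two kernel images already exist at hypothetical paths; tools such as diffoscope go further by explaining where the bytes differ.

```python
# Minimal sketch: are two builds of the same configuration bit-for-bit identical?
# The artifact paths are hypothetical placeholders.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

build_a = Path("build_a/vmlinux")  # first `make defconfig && make`
build_b = Path("build_b/vmlinux")  # second build, same config, possibly another environment

if sha256(build_a) == sha256(build_b):
    print("bit-for-bit reproducible (for this configuration and environment)")
else:
    print("non-reproducible build: the artifacts differ")
```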
  • 73. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
  • 74. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/ #1 take away message: look at every variability layer when you want a bit-to-bit reproducibility; don’t ignore compile-time options! “The build process of a software product is reproducible if, after designating a specific version and a specific variant of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 75. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/ #2 take away message: interactions across variability layers exist (eg compile-time option with build path) and may hamper reproducibility “The build process of a software product is reproducible if, after designating a specific version and a specific variant of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 76. ● Linux as a subject software system (not as an OS interacting with other layers) ● Targeted non-functional, quantitative property: binary size ○ of interest to maintainers/users of the Linux kernel (embedded systems, cloud, etc.) ○ challenging to predict (cross-cutting options, interplay with compilers/build systems, etc.) ● Dataset: version 4.13.3 (September 2017), x86_64 arch, measurements of 95K+ random configurations ○ paranoiac about deep variability since 2017, Docker to control the build environment and scale ○ diversity of binary sizes: from 7 MB to 1.9 GB ○ 6% MAPE error: quite good, though costly… 2 76 H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants and versions: The case of Linux kernel size” Transactions on Software Engineering (TSE), 2021
  • 77. Version 4.13 (September 2017): 6% error. What about evolution? Can we reuse the 4.13 Linux prediction model? No, the prediction error quickly increases: 4.15 (5 months later): 20%; 5.7 (3 years later): 35% 3 77
  • 78. Solution #3 Transfer learning (reuse of knowledge) ● Mission Impossible: Saving variability knowledge and prediction model 4.13 (15K hours of computation) ● Heterogeneous transfer learning: the feature space is different ● TEAMS: transfer evolution-aware model shifting 5 78 H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021 3 78
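The intuition behind model shifting can be sketched as follows: keep the expensive model learned on version 4.13, measure only a small sample of configurations on the new version, and learn a cheap correction from old predictions to new measurements. This is a simplified, synthetic sketch of the general idea, not the TEAMS algorithm of the TSE paper.

```python
# Minimal sketch of transfer by "model shifting": reuse an old prediction model and
# learn a cheap linear correction from a few measurements on the new version.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(2000, 30))                             # configurations
size_old = 50 + 20 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, len(X))
size_new = 1.3 * size_old + 15 + rng.normal(0, 1, len(X))           # the new version shifts sizes

old_model = RandomForestRegressor(random_state=0).fit(X, size_old)  # expensive, learned once

idx = rng.choice(len(X), size=50, replace=False)                    # cheap: 50 new measurements
shift = LinearRegression().fit(old_model.predict(X[idx]).reshape(-1, 1), size_new[idx])

pred_new = shift.predict(old_model.predict(X).reshape(-1, 1))
mape = np.mean(np.abs(pred_new - size_new) / size_new) * 100
print(f"MAPE on the new version with only 50 new measurements: {mape:.1f}%")
```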
  • 79. Luc Lesoil, Helge Spieker, Arnaud Gotlieb, Mathieu Acher, Paul Temple, Arnaud Blouin, Jean-Marc Jézéquel: Learning input-aware performance models of configurable systems: An empirical evaluation. J. Syst. Softw. 208: 111883 (2024) Solution #3 Transfer learning (con’t)
  • 80. Is there an interplay between compile-time and runtime options? L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 81. Solution #4: Leverage stability across variability layers! First good news: it is worth tuning software at compile time! Second good news: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). This has three practical implications: 1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear transformation among distributions. Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. 2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows one to immediately find the best configuration at run time. We can remove one dimension! 3. Measuring at lower cost: do not use a default compile-time configuration; use the least costly one, since it will generalize! Did we just recommend using two binaries? YES, one for measuring, another for reaching optimal performance! L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 82. Key results (for x264) First good news: it is worth tuning software at compile time! Second good news: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). This has three practical implications: 1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear transformation among distributions. Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. 2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows one to immediately find the best configuration at run time. We can remove one dimension! 3. Measuring at lower cost: do not use a default compile-time configuration; use the least costly one, since it will generalize! Did we just recommend using two binaries? YES, one for measuring, another for reaching optimal performance! There is an interplay between compile-time and run-time options, and even the input! L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 84. What is your prompt?
  • 88. Solution #5: Strategic exploration with modelling and learning
  • 89. Solution #6 Identification of root causes of variability (testing and verification)
  • 92. Retrieve the result of S. Boldo et al. M. Acher, J. Galindo, J.M Jézéquel, “On Programming Variability with Large Language Model-based Assistant”, SPLC’2023
  • 93. ▸ Some solutions ▸ abstractions/models ▸ learning and sampling ▸ reuse of configuration knowledge ▸ leveraging stability ▸ systematic exploration ▸ identification of root causes ▸ LLMs to support exploration of variants’ space ▸ incremental build of configuration space (Randrianaina et al. ICSE’22) ▸ debloating variability (Ternava et al. SAC’23) ▸ feature subset selection (Martin et al. SPLC’23) ▸ Essentially, we want to reduce the dimensionality of the problem as well as the computational and human cost to foster verification of results and innovation ▸ Frictionless reproducibility: code+data+metrics ▸ Deep variability is a problem (frictions!) ▸ evidence in many scientific domains ▸ Deep variability is a solution (exploration!) ▸ fixing variability once and for all is not ▸ Replicability is the holy grail! ▸ explore variants for robustness, validation, optimization and knowledge finding 93
  • 94. Backup slides (disclaimer: don’t try to understand everything ;))
  • 95. What can we do? (robustness) Robustness (trustworthiness) of scientific results to sources of variability I have shown many examples of sources of variations and non-robust results… Robustness should be rigorously defined (hint: it’s not the definition as given in computer science) How to verify the effect of sources of variations on the robustness of given conclusions? ● actionable metrics? ● methodology? (eg when to stop?) ● variability can actually be leveraged to augment confidence
  • 97. 97 Deep software variability is… a threat for reproducible research “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” an opportunity for replication “A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.” “A study that refutes some scientific findings of another study, through the collection of new data (possibly with different methods) and completion of new analyses.” robustifying and augmenting scientific knowledge
  • 98. Reproducible Science as a Testing Problem #1 Test Generation Problem (input) inputs: computing environment, parameters of an algorithm, versions of a library or tool, choice of a programming language #2 Oracle Problem (output) we usually ignore the outcome! (open problems; open questions; new knowledge) System under Study (replicable) Input Output (scientific result)
  • 99. Reproduction vs replication http://rescience.github.io/faq/ “Reproduction of a computational study means running the same computation on the same input data, and then checking if the results are the same, or at least “close enough” when it comes to numerical approximations. Reproduction can be considered as software testing at the level of a complete study.” We don’t “test” in one run, in one computing environment, with one kind of input data, etc. “Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions but varying the technical details. For computational work, this would mean using different software, running a simulation from different initial conditions, etc. The idea is to change something that everyone believes shouldn’t matter, and see if the scientific conclusions are affected or not.” It is the most interesting direction, basically for synthesizing new scientific knowledge! In both cases, there is the need to harness the combinatorial explosion of deep software variability 99
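The “close enough” check mentioned above is essentially a numerical oracle. A minimal sketch, assuming the study's headline result can be summarized as an array of numbers shipped alongside the code (the analysis, the file handling, and the tolerances are illustrative assumptions):

```python
# Minimal sketch of a reproduction "oracle": rerun the analysis and compare the
# outcome to the published reference values within an explicit numerical tolerance.
import numpy as np

def run_analysis() -> np.ndarray:
    # stand-in for the actual computational study (the seed pins one source of variability)
    rng = np.random.default_rng(123)
    return np.sort(rng.normal(0, 1, 1000))[:5]

reference = run_analysis()     # in practice: np.load("published_results.npy")
reproduced = run_analysis()    # rerun, ideally on another machine / OS / library stack

if np.allclose(reproduced, reference, rtol=1e-6, atol=1e-9):
    print("reproduced within tolerance")
else:
    print("difference beyond tolerance:", np.max(np.abs(reproduced - reference)))
```

Deciding which tolerance is acceptable, and across which variability layers the check must hold, is exactly the open question raised on this slide.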
  • 100. Reproducible Science and Software Engineering @acherm aka Deep Software Variability for Replicability in Computational Science Deep Questions?
  • 102. Transferring Performance Prediction Models Across Different Hardware Platforms Valov et al. ICPE 2017 “Linear model provides a good approximation of transformation between performance distributions of a system deployed in different hardware environments” what about variability of input data? compile-time options? version?
  • 103. Transfer Learning for Software Performance Analysis: An Exploratory Analysis Jamshidi et al. ASE 2017
  • 104. mixing deep variability: hard to assess the specific influence of each layer very few hardware, version, and input data… but lots of runtime configurations (variants) Let’s go deep with input data! Transfer Learning for Software Performance Analysis: An Exploratory Analysis Jamshidi et al. ASE 2017
  • 105. Threats to variability knowledge for performance property bitrate ● optimal configuration is specific to an input; a good configuration can be a bad one ● some options’ values have an opposite effect depending on the input ● effectiveness of sampling strategies (random, 2-wise, etc.) is input specific (somehow confirming Pereira et al. ICPE 2020) ● predicting, tuning, or understanding configurable systems without being aware of inputs can be inaccurate and… pointless Practical impacts for users, developers, scientists, and self-adaptive systems
  • 106. Computational science depends on software and its engineering 106 from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses multi-million line of code base multi-dependencies multi-systems multi-layer multi-version multi-person multi-variant
  • 107. x264 video encoder (compilation/build) compile-time options
  • 108. What can we do? (#1 studies) Empirical studies about deep software variability ● more subject systems ● more variability layers, including interactions ● more quantitative (e.g., performance) properties with challenges for gathering measurements data: ● how to scale experiments? Variant space is huge! ● how to fix/isolate some layers? (eg hardware) ● how to measure in a reliable way? Expected outcomes: ● significance of deep software variability in the wild ● identification of stable layers: sources of variability that should not affect the conclusion and that can be eliminated/forgotten ● identification/quantification of sensitive layers and interactions that matter ● variability knowledge
  • 109. What can we do? (#2 cost) Reducing the cost of exploring the variability spaces Many directions here (references at the end of the slides): ● learning ○ many algorithms/techniques with tradeoffs interpretability/accuracy ○ transfer learning (instead of learning from scratch) ● sampling strategies ○ uniform random sampling? t-wise? distance-based? … ○ sample of hardware? input data? ● incremental build of configurations ● white-box approaches ● …
  • 110. Key results (for x264) Worth tuning software at compile time: tuning compile-time options gains about 10% of execution time (compared to the default compile-time configuration). The improvements can be larger for some inputs and some runtime configurations. Stability of variability knowledge: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). Reuse of configuration knowledge: ● Linear transformation among distributions ● Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 111. Embrace deep variability! Explicit modeling of the variability points and their relationships, so as to: 1. Get insights into the variability “factors” and their possible interactions 2. Capture and document configurations for the sake of reproducibility 3. Explore diverse configurations to replicate, and hence optimize, validate, increase the robustness, or provide better resilience Our Vision ACM REP 2024 ⇒ We aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical “factors”. 111 https://hal.science/hal-04582287
  • 113. exec(software) = exec_repro(software), or exec(software) ~= exec(software_repro)
  (difference: exec_repro is another execution environment… and so it may or may not differ from exec; or we consider that the software itself differs…)
  (exec: execution? what is the outcome then? in fact, “execution” can be replaced by “build”... which is another kind of execution)
  exec(software) ?= exec_repro(software), with software ~= software_repro
  exec(software, hardware)
  exec(software, hardware, compiler, input_data, operating_system, bios, container, hypervisor, dependencies_versions)
  exec(v1, v2, …, vN) ~= exec_repro(v1’, v2’, …, vN’) with, for i in [1, N], v_i ~= v_i’ (or not!)
  ~= is specific to a domain, to a usage, etc.
  ~= can be over the N layers or over N’ layers (N’ < N)
  ~= can be specific to some pairs of elements (e.g., we know that with this hardware, the execution time is multiplied by 2)
  for instance, we know the ~= between a software configuration and any hardware (but if the compiler changes, then the ~= should be “tuned” accordingly)
  also, ~= can be defined between a configuration set and a hardware set (e.g., a performance distribution)
  • 115. Frictionless reproducibility (annotated bibliography; grey literature) https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless Reproducibility, B Recht interesting historical perspective on research in neural networks (NeurIPs 87 titles are shockingly still relevant); really love some parts about random experiments, science as a “massively parallel genetic algorithm” or the discussions on the difficulty of using ML/DL software (completely aligned with my terrible experience of Weka GUI in ~2006) https://www.argmin.net/p/the-department-of-frictionless-reproducibilty https://statmodeling.stat.columbia.edu/2023/10/13/frictionless-reproducibility-methods-as-proto-algorithms-division-of-labor-as-a-characteristic-of-statistical-methods-statistics-as-the-science-of-defaults-statisticians-well-prepared-to-think-abo/
  • 116. Progress and frictionless reproducibility Inspired by Thomas Kuhn (1962), we can think of the scientific and engineering process as a massively parallel genetic algorithm. If we want to improve upon the systems we currently have, we might try a small perturbation to see if we get an improvement. If we can find a small change that improves some desired outcome, we could change our systems to reflect this improvement. If we continually search for these improvements and work hard to demonstrate their value, we may head in a better direction over time. For scientific endeavors, we could perhaps gauge ‘better’ or ‘worse’ by performing random experiments—not randomized experiments per se, but random experiments in the sense of trying potentially surprising improvements. If our small tweak results in better outcomes, we can attempt to convince a journal editor or conference program committee to publish it. And this communication gives everyone else a new starting point for their own random experimentation. A single investigator can only make so much progress by random searching alone, but random search is pleasantly parallelizable. Competing scientists can independently try their own random ideas and publish their results. Sometimes an individual result is so promising that the herd of experimenters all flock around the good idea, hoping to strike gold on a nearby improvement and bring home bragging rights. To some, this looks like an inefficient mess. To others, it looks like science. https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless Reproducibility, B Recht
  • 117. Data sharing and frictions “Data set benchmarking and competitive testing took over machine learning in the late 1980s. Email and file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In 1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically testing machine learning methods. Aha was motivated by service to the community, but he also wanted to show his nearest-neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms. He formatted his data sets using the ‘attribute-value’ representation that a rival researcher, Ross Quinlan (1986), had used. And, so, the UC Irvine Machine Learning Repository was born.” “Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and smaller storage footprints. But computing technology alone was not sufficient to drive progress. Friendly competition with Quinlan inspired Aha to build the UCI repository. And more explicit competitions were also crucial components of the success.” The Mechanics of Frictionless Reproducibility, B Recht, 2024 https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1