Deep Software Variability and
Frictionless Reproducibility
Mathieu Acher @acherm
Deep Software Variability and Frictionless Reproducibility
Abstract: The ability to recreate computational results with minimal effort and actionable metrics provides a solid
foundation for scientific research and software development. When people can replicate an analysis at the touch of a
button using open-source software, open data, and methods to assess and compare proposals, it significantly eases
verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully
achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data
sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input
data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential
for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence
of how the complex variability interactions across these layers affect qualitative and quantitative software properties,
thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability
spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform,
random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction
methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software
science community to develop new methods and tools to manage variability and foster reproducibility in software
systems.
Invited talk, 5 June 2024 @ GDRGPL
Special thanks to* Aaron Randrianaina,
Jean-Marc Jézéquel, Benoit Combemale, Luc
Lesoil, Arnaud Gotlieb, Helge Spieker, Quentin
Mazouni, Paul Temple, Gauthier Le Bartz Lyan,
Xhevahire Tërnava, Olivier Barais, and the
whole DiverSE and RIPOST teams
*random order, incomplete
Frictionless Reproducibility and (Deep) Software (Variability)
Problem: Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
SOFTWARE VARIANTS
ARE EATING THE WORLD
5
Science is changing:
Computation-based research
6
Computational science
depends on software and its engineering
7
design of mathematical model
mining and analysis of data
executions of large simulations
problem solving
executable paper
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
Computational science
depends on software and its engineering
8
Dealing with software collapse: software stops working eventually
Konrad Hinsen 2019
Configuration failures represent one of the most common types of
software failures Sayagh et al. TSE 2018
multi-million line of code base
multi-dependencies
multi-systems
multi-layer
multi-version
multi-person
multi-variant
“Insanity is doing the same thing over and over again
and expecting different results”
9
http://throwgrammarfromthetrain.blogspot.com/2010/10/definition-of-insanity.html
Reproducibility
10
“Authors provide all the necessary data and the computer
codes to run the analysis again, re-creating the results.”
(Claerbout/Donoho/Peng definition)
“The actual scholarship is the complete software development environment and the
complete set of instructions which generated the figures.” (~executable paper)
Reproducibility and Replicability
11
Reproducible: Authors provide all the necessary data and the computer
codes to run the analysis again, re-creating the results.
Replication: A study that arrives at the same scientific findings as another
study, collecting new data (possibly with different methods) and
completing new analyses. “Terminologies for Reproducible
Research”, Lorena A. Barba, 2018
Reproducibility and Replicability
13
Methods Reproducibility: A method is reproducible if reusing the original code leads to the same
results.
Results Reproducibility: A result is reproducible if a reimplementation of the method generates
statistically similar values.
Inferential Reproducibility: A finding or a conclusion is reproducible if one can draw it from a
different experimental setup.
“Unreproducible Research is Reproducible”, Bouthillier et al., ICML 2019
Reproducible science
14
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
Reproducible science
15
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
https://github.com/emsejournal/openscience https://rescience.github.io/
https://reproducible-research.inria.fr/
Reproducible science
16
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Socio-technical issues: open science, open source software, multi-disciplinary
collaboration, incentives/rewards, initiatives, etc.
with many challenges related to data acquisition, knowledge organization/sharing, etc.
Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity
of Software Supply Chains” IEEE Software 2022
https://arxiv.org/pdf/2104.06020
(best paper award IEEE Software for year 2022)
“The build process of a software product is reproducible if,
after designating a specific version of its source code and all
of its build dependencies, every build produces bit-for-bit
identical artifacts, no matter the environment in which the
build is performed.”
Frictionless reproducibility
18
https://arxiv.org/abs/2310.00865
https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2
“Computation-driven research really has changed in the last 10 years, driven by three principles of
data science, which, after longstanding partial efforts, are finally available in mature form for daily
practice, as frictionless open services offering data sharing, code sharing, and competitive
challenges.”
[FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
“We are entering an era of frictionless research exchange, in which research algorithmically builds
on the digital artifacts created by earlier research, and any good ideas that are found get spread
rapidly, everywhere. The collective behavior induced by frictionless research exchange is the
emergent superpower driving many events that are so striking today.”
Frictionless reproducibility
19
[FR-1: Data] “Datafication of everything, with a culture of research data sharing.”
[FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly
re-execute the same complete workflow by different researchers.”
[FR-3: Challenges] “a shared public dataset, a prescribed and quantified task
performance metric, a set of enrolled competitors seeking to outperform each other on
the task, and a public leaderboard.”
Frictionless reproducibility
20
[FR-1: Data] “Datafication of everything, with a culture of research data sharing.”
[FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete
workflow by different researchers.”
[FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled
competitors seeking to outperform each other on the task, and a public leaderboard.”
frictionless reproducibility = [FR-1] + [FR-2] + [FR-3]
Frictionless reproducibility
21
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and
original piece
On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure
progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research
problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in
the first place”.
Think about the absence of [FR-3]. The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon
and Evelyne Viegas - "AI Competitions and the Science Behind Contests")
● Many success stories (mainly in empirical machine learning): speech processing, biometric
recognition, facial recognition, protein structure prediction problem (CASP), etc.
● More and more leaderboards (e.g., https://evalplus.github.io/leaderboard.html,
https://robustbench.github.io/) or competitions (e.g., the SAT competition)
● Many platforms, services, and events supporting the shift (e.g., Kaggle)
Frictionless reproducibility
22
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece
On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any).
[FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of
study”. [FR-3] is “the competitive element that attracted our attention in the first place”. The performance measurement
crystallized a specific project’s contribution, boiling down an entire research contribution essentially to a single number,
which can be reproduced. Think about the absence of [FR-3]
The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the
Science Behind Contests")
● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial
recognition, protein structure prediction problem (CASP), etc.
● More and more leaderboards (e.g., https://evalplus.github.io/leaderboard.html, https://robustbench.github.io/) or
competitions (e.g., the SAT competition)
● Many platforms, services, and events supporting the shift (e.g., Kaggle)
Frictionless reproducibility
23
frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges]
[FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important but debatable piece
On the other hand, we know that the power of a simple scoring function is dangerous (e.g., Goodhart's law)
“What if the metric is wrong? What if the subtleties of a complex problem are not amenable to representation by a single
scalar? What happens when metrics for locally optimal solutions are apparent, but ones for globally optimal solutions are
not? What happens when the community is not (yet) mature enough to rally around a consensus-scoring function? I think
it is important to recognize that finding an appropriate scoring function, let alone an objectively best one, is an ongoing
task and might evolve as FR-1 and FR-2 provide a deeper understanding of the problem space.”
Overcoming Potential Obstacles as We Strive for Frictionless Reproducibility by Adam D. Schuyler (2024)
Are we frictionless?
Reading a paper in 2024 is sometimes like in 1970:
● Where is the source code? (e.g., implementation of the solution, scripts to
compute metrics)
● Where is the data? (e.g., to test the solution)
● Contacting authors?
○ no response?
○ code not consistent with the PDF
○ …
● It does not work on my machine; results are completely different…
There are lots of socio-technical frictions… even when you have the code and data!
=> When people can replicate an analysis at the touch of a button using open-source software, open
data, and methods to assess and compare proposals, it significantly eases verification of results,
engagement with a diverse range of contributors, and progress
Frictionless reproducibility (an example)
Reproducible science… with frictions
26
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Despite the availability of data and code, several studies report that the
same data analyzed with different software can lead to different results.
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
Can a coupled ESM simulation be restarted from a different machine without causing
climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable”
case (see below) and one replicable case.
Can a coupled ESM simulation be restarted from a different machine
without causing climate-changing modifications in the results?
A study involving eight institutions and seven different supercomputers in Europe is
currently ongoing with EC-Earth. It aims to do the following:
● evaluate different computational environments that are used in collaboration
to produce CMIP6 experiments (can we safely create large ensembles
composed of subsets that emanate from different partners of the
consortium?);
● detect if the same CMIP6 configuration is replicable among platforms of the
EC-Earth consortium (that is, can we safely exchange restarts with EC-Earth
partners in order to initialize simulations and to avoid long spin-ups?); and
● systematically evaluate the impact of different compilation flag options (that
is, what is the highest acceptable level of optimization that will not break the
replicability of EC-Earth for a given environment?).
Should software version numbers determine science?
Significant differences were revealed between
FreeSurfer version v5.0.0 and the two earlier versions.
[...] About a factor two smaller differences were detected
between Macintosh and Hewlett-Packard workstations
and between OSX 10.5 and OSX 10.6. The observed
differences are similar in magnitude as effect sizes
reported in accuracy evaluations and neurodegenerative
studies.
see also Krefting, D., Scheel, M., Freing, A., Specovius, S., Paul, F., and
Brandt, A. (2011). “Reliability of quantitative neuroimage analysis using
freesurfer in distributed environments,” in MICCAI Workshop on
High-Performance and Distributed Computing for Medical Imaging.
“Neuroimaging pipelines are known to generate different results
depending on the computing platform where they are compiled and
executed.”
Reproducibility of neuroimaging
analyses across operating systems,
Glatard et al., Front. Neuroinform., 24
April 2015
The implementation of mathematical functions manipulating single-precision floating-point
numbers in libmath has evolved during the last years, leading to numerical differences in
computational results. While these differences have little or no impact on simple analysis
pipelines such as brain extraction and cortical tissue classification, their accumulation
creates important differences in longer pipelines such as the subcortical tissue
classification, RSfMRI analysis, and cortical thickness extraction.
“Neuroimaging pipelines are known to generate different results
depending on the computing platform where they are compiled and
executed.”
Statically building programs improves reproducibility across OSes, but small
differences may still remain when dynamic libraries are loaded by static
executables[...]. When static builds are not an option, software heterogeneity might
be addressed using virtual machines. However, such solutions are only
workarounds: differences may still arise between static executables built on
different OSes, or between dynamic executables executed in different VMs.
Reproducibility of neuroimaging
analyses across operating systems,
Glatard et al., Front. Neuroinform., 24
April 2015
Reproducible science as a
(deep) software variability problem
34
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
Despite the availability of data and code, several studies report that the
same data analyzed with different software can lead to different results.
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
35
Despite the availability of data and
code, several studies report that the
same data analyzed with different
software can lead to different results
Many layers (operating system,
third-party libraries, versions, workloads,
compile-time options and flags, etc.)
themselves subject to variability can
alter the results.
Reproducible science and deep
software variability: a threat and
opportunity for scientific knowledge!
hardware variability
operating system variability
compiler variability
build variability
hypervisor variability
software application variability
version variability
input data variability
container variability
deep software variability
How often (x+y)+z == x+(y+z) ?
https://github.com/FAMILIAR-project/reproducibility-associativity/
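Below is a minimal sketch (my own illustration, not the repository's script) of the kind of experiment behind this question: draw random triples of floats and count how often associativity holds.

```python
import random

def associativity_holds(x: float, y: float, z: float) -> bool:
    # Floating-point addition is not associative in general:
    # the rounding after each operation depends on the grouping.
    return (x + y) + z == x + (y + z)

random.seed(42)  # fix the seed so the experiment itself is reproducible
trials = [tuple(random.uniform(0, 1) for _ in range(3)) for _ in range(100_000)]
holds = sum(associativity_holds(x, y, z) for x, y, z in trials)
print(f"associativity holds for {holds / len(trials):.1%} of {len(trials)} random triples")
```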
Frictionless Reproducibility and (Deep) Software (Variability)
Problem (cont’d): Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
15,000+ options
thousands of compiler
flags and compile-time
options
dozens of
preferences
100+ command-line
parameters
1000+ feature toggles
38
hardware variability
deep software variability
Non-functional properties
execution
time
energy
consumption
accuracy
security
15,000+ options
thousands of compiler flags
and compile-time options
dozens of preferences
100+ command-line parameters
1000+ feature toggles
39
hardware variability
deep software variability
System under Study (reproducible and replicable)
Variability
Output (scientific result; most of the time quantitative information)
input data
performance metric
Deep Software Variability and Frictionless Reproducibility
Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using
two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
We demonstrate that effects of parameter, hardware, and software variation are
detectable, complex, and interacting. However, we find most of the effects of
parameter variation are caused by a small subset of parameters. Notably, the
entrainment coefficient in clouds is associated with 30% of the variation seen in
climate sensitivity, although both low and high values can give high climate
sensitivity. We demonstrate that the effect of hardware and software is small relative
to the effect of parameter variation and, over the wide range of systems tested, may
be treated as equivalent to that caused by changes in initial conditions.
57,067 climate model runs. These runs sample parameter space for 10 parameters
with between two and four levels of each, covering 12,487 parameter combinations
(24% of possible combinations) and a range of initial conditions
Joelle Pineau “Building Reproducible, Reusable, and Robust Machine Learning Software” ICSE’19 keynote “[...] results
can be brittle to even minor perturbations in the domain or experimental procedure”
What is the magnitude of the effect
hyperparameter settings can have on baseline
performance?
How does the choice of network architecture for
the policy and value function approximation affect
performance?
How can the reward scale affect results?
Can random seeds drastically alter performance?
How do the environment properties affect
variability in reported RL algorithm performance?
Are commonly used baseline implementations
comparable?
“Completing a full replication study of our previously published findings on bluff-body
aerodynamics was harder than we thought. Despite the fact that we have good
reproducible-research practices, sharing our code and data openly.”
Data analysis workflows in many scientific domains have become increasingly complex and flexible (=
subject to variability). Here we assess the effect of this flexibility on the results of functional magnetic
resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9
ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams
chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of
hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of
the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology.
Notably, a meta-analytical approach that aggregated information across teams yielded a significant
consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an
overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the
dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions,
and identify factors that may be related to variability in the analysis of functional magnetic resonance
imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and
demonstrate the need for performing and reporting multiple analyses of the same data. Potential
approaches that could be used to mitigate issues related to analytical variability are discussed.
Can Machine Learning Pipelines Be Better
Configured? Wang et al. FSE’2023
“A pipeline is subject to misconfiguration if
it exhibits significantly inconsistent performance upon changes in
the versions of its configured libraries or the combination of these
libraries. We refer to such performance inconsistency as a pipeline
configuration (PLC) issue.”
Deep software variability: Are layers/features
orthogonal or are there interactions?
Luc Lesoil, Mathieu Acher, Arnaud Blouin, Jean-Marc Jézéquel:
Deep Software Variability: Towards Handling Cross-Layer Configuration.
Configuration is hard: numerous options, informal knowledge
REAL WORLD Example (x264)
[Figure: the same x264 encodings measured across the deep variability stack — Hardware: Dell Latitude 7400 vs Raspberry Pi 4 Model B; Operating System: version 10.4 vs 20.04; Software: x264 --mbtree vs --no-mbtree; Input Data: a “vertical” vs an “animation” video. Duration (s) ranges from 6 to 359 and Size (MB) from 21 to 34 across the combinations; the slide highlights factors of ≈×16 and ≈×12 between settings, showing large cross-layer effects on duration and size.]
[Diagram: deep variability — Hardware (Age, # Cores, GPU), Software (Variant, Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.), with effects on Bugs and Performance ↗/↘]
L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel,
“Deep Software Variability: Towards
Handling Cross-Layer Configuration” in VaMoS 2021
The “best”/default software
variant might be a bad one.
Influential software options
and their interactions vary.
Performance prediction
models and variability
knowledge may not
generalize
Let’s go deep with input data!
Intuition: video encoder behavior (and thus runtime configurations) hugely depends
on the input video (different compression ratio, encoding size/type etc.)
Is the best software configuration still the best?
Are influential options always influential?
Does the configuration knowledge generalize?
?
YouTube User General Content dataset: 1397 videos
Measurements of 201 soft. configurations (with same hardware,
compiler, version, etc.): encoding time, bitrate, etc.
configurations’ measurements over input_1
configurations’ measurements over input_42
Inputs = …
Generalization/transfer:
what’s the relationship between
perf_pred_1 and
perf_pred_42?
● with perf_pred_i
a performance model
capable of predicting
performance of any
configuration on input_i
● linear relationship?
○ eg Pearson/Spearman
linear correlation
● influential
features/options:
same?
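As an illustration of the correlation-based comparison described above, here is a minimal sketch using synthetic (hypothetical) measurements of the same configurations on two inputs; it assumes SciPy is available.

```python
import random
from scipy.stats import spearmanr

random.seed(0)
n_configs = 201  # the same 201 x264 configurations measured on two inputs

# Hypothetical measurements: a shared configuration effect plus an
# input-specific component; its weight controls input sensitivity.
config_effect = [random.gauss(0, 1) for _ in range(n_configs)]
perf_input_1 = [c + 0.1 * random.gauss(0, 1) for c in config_effect]
perf_input_42 = [c + 0.8 * random.gauss(0, 1) for c in config_effect]

rho, _ = spearmanr(perf_input_1, perf_input_42)
print(f"Spearman rank correlation between inputs: {rho:.2f}")
# A high rho means configuration knowledge (rankings, influential options)
# transfers from input_1 to input_42; a low or negative rho means it does not.
```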
Do x264 software performances
stay consistent across inputs?
●Encoding time: very strong correlations
○ low input sensitivity
●FPS: very strong correlations
○ low input sensitivity
●CPU usage : moderate correlation, a few negative correlations
○ medium input sensitivity
●Bitrate: medium-low correlation, many negative correlations
○ High input sensitivity
●Encoding size: medium-low correlation, many negative correlations
○ High input sensitivity
?
1397 videos x 201 software
configurations
Are there some configuration options
more sensitive to input videos? (bitrate)
Practical impacts for users, developers,
scientists, and self-adaptive systems
Threats to variability knowledge: predicting, tuning, or understanding configurable systems without being
aware of inputs can be inaccurate and… pointless
Opportunities: for some performance properties (P) and subject systems, some stability is observed and
performance remains consistent!
L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel “The Interaction between
Inputs and Configurations fed to Software Systems: an Empirical Study”
https://arxiv.org/abs/2112.07279
[Diagram: deep variability — Hardware (Age, # Cores, GPU), Software (Variant, Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.), with effects on Bugs and Performance ↗/↘]
Sometimes, variability is consistent/stable and knowledge transfer is immediate.
But there are also interactions among variability layers and variability knowledge may not generalize.
[Diagram: deep variability layers — Hardware (Age, # Cores, GPU), Software (Compil., Version), Operating System (Version, Option, Distrib.), Input Data (Size, Length, Res.)]
Does deep software variability affect previous scientific,
software-based studies? (a graphical template)
List all details… and questions:
what if we run the experiments on different:
OS? version/commit? PARAMETERS? INPUT? SOFTWARE VARIANT?
Frictionless Reproducibility and (Deep) Software (Variability)
Problem: Variability and Frictions
Solution: Variability and Exploration
Discussions
AGENDA
Deep variability problem (statement)
Fundamentally, we have a huge multi-dimensional variant space (e.g., 10^6000)
run (source_code) => result
run (hardware, operating_system, build_environment, input_data, source_code, …) =>
results
Fixing variability once and for all, in all dimensions/layers, is the obvious solution…
But it is either impossible (e.g., the age of the processor can have an impact on execution
time)...
Or not desirable:
● non-robust results
● generalization/transferability of the results/findings
● it kills innovation
64
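To give a feel for the size of the space, here is a small sketch with purely hypothetical per-layer counts; the point is only that the cross-product explodes far beyond what can be measured exhaustively.

```python
import math

# Hypothetical (and conservative) counts of variation points per layer.
layers = {
    "hardware": 100,                 # CPU models, ages, # cores, ...
    "operating_system": 50,          # distributions and versions
    "compiler": 20,                  # compilers and versions
    "compile_time_options": 2**100,  # e.g., 100 boolean flags
    "runtime_options": 2**50,
    "input_data": 1_000,
    "versions": 200,
}

space = math.prod(layers.values())
print(f"variant space ≈ 10^{math.log10(space):.0f}")  # far beyond exhaustive exploration
```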
Replicability is the holy grail!
Exploring various configurations:
● Make more robust scientific findings
● Define and assess the validity envelope
● Enable exploration and optimization
● Innovation and new hypotheses, insights, knowledge
⇒ We propose to embrace deep variability for the sake of
replicability
65
Embrace deep variability!
Explicit modeling of the variability points
and their relationships, such as:
1. Get insights into the variability “factors” and
their possible interactions
2. Capture and document configurations for
the sake of reproducibility
3. Explore diverse configurations to replicate,
and hence optimize, validate, increase the
robustness, or provide better resilience
Our Vision
ACM REP 2024
⇒ We aim to address the complexities associated
with reproducibility and replicability in modern
software systems and environments, facilitating a
more comprehensive and nuanced perspective on
these critical “factors”.
66
Solution #1: Variability model
● Abstractions are definitely needed to…
○ reason about logical constraints and interactions
○ integrate domain knowledge
○ synthesize domain knowledge
○ automate and guide the exploration of variants
○ scope and prioritize experiments
● Language and formalism: feature model (widely applicable!)
○ translation to logics
○ reasoning with SAT/CP/SMT solvers
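As a toy illustration of the translation to logic, the sketch below encodes a small, hypothetical feature model as propositional constraints and enumerates its valid configurations by brute force; a SAT/CP/SMT solver would replace the enumeration at realistic scales.

```python
from itertools import product

# Hypothetical feature model of a video encoder build:
# root "encoder" is mandatory; "mbtree" is optional;
# exactly one of "x86_64" / "arm" (alternative group);
# "asm" (assembly optimizations) requires "x86_64".
features = ["encoder", "mbtree", "x86_64", "arm", "asm"]

def valid(selection):
    f = dict(zip(features, selection))
    return (
        f["encoder"]                      # root is always selected
        and (f["x86_64"] != f["arm"])     # exactly one architecture
        and (not f["asm"] or f["x86_64"]) # asm => x86_64
    )

configs = [c for c in product([False, True], repeat=len(features)) if valid(c)]
print(f"{len(configs)} valid configurations out of {2**len(features)} combinations")
```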
Solution #1: Variability model
● Abstractions are definitely needed…
● Yes, but how to obtain a feature model?
○ modelling
○ reverse engineering (out of command-line parameters, source code, logs, configurations, etc.)
○ learning (next slide!)
○ modeling+reverse engineering+learning (HDR)
[Pipeline: Whole Population of Configurations → Training Sample → Performance Measurements → Prediction Model → Performance Prediction]
J. Alves Pereira, H. Martin, M. Acher, J.-M. Jézéquel, G. Botterweck and A. Ventresque
“Learning Software Configuration Spaces: A Systematic Literature Review” JSS, 2021
Solution #2: sampling and learning
(regression, classification)
69
x264 --me dia --ref 5 … -o output_1.x264
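A minimal sketch of the sampling-and-learning loop, using synthetic (hypothetical) configurations and measurements and a random forest regressor from scikit-learn; it illustrates the idea, not the setup of the cited studies.

```python
import random
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

random.seed(1)

# Hypothetical configuration space: 20 boolean options of a configurable system.
def sample_configuration():
    return [random.randint(0, 1) for _ in range(20)]

# Hypothetical "measurement": individual option effects plus one interaction (options 3 and 7).
def measure(cfg):
    return 10 + 5 * cfg[3] + 3 * cfg[7] + 4 * cfg[3] * cfg[7] + random.gauss(0, 0.5)

configs = [sample_configuration() for _ in range(500)]  # sampling strategy (here: uniform random)
perfs = [measure(c) for c in configs]                   # costly measurements

X_train, X_test, y_train, y_test = train_test_split(configs, perfs, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(f"R² on unseen configurations: {model.score(X_test, y_test):.2f}")
```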
15,000+ options
thousands of compiler flags
and compile-time options
dozens of preferences
100+ command-line parameters
1000+ feature toggles
71
hardware variability
deep software variability
System under Study (reproducible)
Variability
Output (binary)
input data
“The build process of a software product is reproducible if, after designating a specific version of its
source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no
matter the environment in which the build is performed.”
Lamb and Zacchiroli “Reproducible Builds: Increasing the
Integrity of Software Supply Chains” IEEE Software 2022
15,000+ compile-time options
72
deep software variability
System under Study
Variability
Output (binary)
“The build process of a software product is reproducible if, after designating a
specific version of its source code and all of its build dependencies, every
build produces bit-for-bit identical artifacts, no matter the environment in
which the build is performed.” Lamb and Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software Supply Chains” IEEE Software 2022
make defconfig # configuration
make # build the kernel (binary) out of config
make # should be the same, right?
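A small sketch of how one can check the “should be the same, right?” question in practice: hash the artifacts of two builds and compare them bit-for-bit. The artifact paths are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    # Bit-for-bit comparison of build artifacts via their SHA-256 digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical paths: the same kernel image produced by two builds of the
# same configuration, possibly in two different build environments.
first = Path("build-1/arch/x86/boot/bzImage")
second = Path("build-2/arch/x86/boot/bzImage")

if digest(first) == digest(second):
    print("reproducible: bit-for-bit identical artifacts")
else:
    print("non-reproducible: artifacts differ (timestamps? build path? options?)")
```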
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
#1 take-away message: look at every variability layer when you want
bit-for-bit reproducibility; don’t ignore compile-time options!
“The build process of a software
product is reproducible if, after
designating a specific version and
a specific variant of its source
code and all of its build
dependencies, every build produces
bit-for-bit identical artifacts, no
matter the environment in which the
build is performed.” Lamb and
Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software
Supply Chains” IEEE Software 2022
Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable
Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024
also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
#2 take-away message: interactions across variability layers exist (e.g., a
compile-time option with the build path) and may hamper reproducibility
“The build process of a software
product is reproducible if, after
designating a specific version and
a specific variant of its source
code and all of its build
dependencies, every build produces
bit-for-bit identical artifacts, no
matter the environment in which the
build is performed.” Lamb and
Zacchiroli “Reproducible Builds:
Increasing the Integrity of Software
Supply Chains” IEEE Software 2022
● Linux as a subject software system (not as an OS interacting with other layers)
● Targeted non-functional, quantitative property: binary size
○ interest for maintainers/users of the Linux kernel (embedded systems, cloud, etc.)
○ challenging to predict (cross-cutting options, interplay with compilers/build
systems, etc.)
● Dataset: version 4.13.3 (September 2017), x86_64 arch,
measurements of 95K+ random configurations
○ paranoid about deep variability since 2017, Docker to control the build
environment and scale
○ diversity of binary sizes: from 7 MB to 1.9 GB
○ 6% MAPE errors: quite good, though costly…
2
76
H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants
and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021
4.13 version (Sep 2017): 6%. What about evolution? Can we reuse the 4.13 Linux prediction
model? No, accuracy quickly decreases: 4.15 (5 months later): 20%; 5.7 (3 years later): 35%
3
77
Solution #3 Transfer learning (reuse of knowledge)
● Mission Impossible: Saving variability knowledge and
prediction model 4.13 (15K hours of computation)
● Heterogeneous transfer learning: the feature space is
different
● TEAMS: transfer evolution-aware model shifting
5
78
H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants
and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021
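The sketch below illustrates the general idea of evolution-aware model shifting (not the actual TEAMS implementation): reuse the old model’s predictions and learn, from a few measurements on the new version, a simple shifting function. All numbers and functions are hypothetical.

```python
import random
from sklearn.linear_model import LinearRegression

random.seed(2)

# Hypothetical setting: a model trained on version A predicts binary size;
# version B shifts and rescales these sizes (new defaults, new options).
def predict_with_old_model(cfg):
    return 50 + 30 * cfg[0] + 20 * cfg[1] + 10 * cfg[2]          # stand-in for the 4.13 model

def measure_on_new_version(cfg):
    return 1.3 * predict_with_old_model(cfg) + 15 + random.gauss(0, 1)  # unknown shift

# Only a handful of costly measurements on the new version are needed to learn
# the shift, instead of re-measuring tens of thousands of configurations.
sample = [[random.randint(0, 1) for _ in range(3)] for _ in range(30)]
old_preds = [[predict_with_old_model(c)] for c in sample]
new_sizes = [measure_on_new_version(c) for c in sample]

shift = LinearRegression().fit(old_preds, new_sizes)
test = [1, 0, 1]
print(f"shifted prediction for the new version: {shift.predict([[predict_with_old_model(test)]])[0]:.1f}")
```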
3
78
Luc Lesoil, Helge Spieker, Arnaud Gotlieb, Mathieu Acher, Paul Temple, Arnaud Blouin, Jean-Marc Jézéquel:
Learning input-aware performance models of configurable systems: An empirical evaluation. J. Syst. Softw. 208: 111883 (2024)
Solution #3 Transfer learning (cont’d)
Is there an interplay between compile-time and
runtime options?
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
Solution #4: Leverage stability
across variability layers!
First good news: Worth tuning software at compile-time!
Second good news: For all the execution time distributions of x264 and all the input videos, the worst
correlation is greater than 0.97. If the compile-time options change the scale of the distribution, they do not
change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options).
It has three practical implications:
1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear
transformation among distributions. Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows
one to immediately find the best configuration at run time. We can remove one dimension!
3. Measuring at lower cost: do not use a default compile-time configuration, use the least costly one since
it will generalize!
Did we recommend using two binaries? YES, one for measuring, another for reaching optimal
performance!
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
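A toy sketch of the recipe implied above, under the assumption (reported for x264) that compile-time options rescale execution times without changing the ranking of run-time configurations: tune on the cheap-to-measure binary, then deploy the winning configuration on the optimized binary. All values are synthetic.

```python
import random

random.seed(3)

runtime_configs = [f"cfg_{i}" for i in range(50)]
base_time = {c: random.uniform(10, 60) for c in runtime_configs}

# Hypothetical model of the SPLC'21 observation: compile-time options change
# the *scale* of execution times but preserve the *ranking* of run-time configs.
def measure(cfg, binary):
    scale = {"cheap-to-measure": 1.0, "optimized": 0.9}[binary]
    return scale * base_time[cfg] + random.gauss(0, 0.1)

# 1) Tune on the cheap binary...
best = min(runtime_configs, key=lambda c: measure(c, "cheap-to-measure"))
# 2) ...then deploy that configuration on the optimized binary.
print(f"best run-time configuration found on the cheap binary: {best}")
print(f"its execution time on the optimized binary: {measure(best, 'optimized'):.1f}s")
```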
Key results (for x264)
First good news: Worth tuning software at compile-time!
Second good news: For all the execution time distributions of x264 and all the input videos, the worst
correlation is greater than 0.97. If the compile-time options change the scale of the distribution, they do not
change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options).
It has three practical implications:
1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear
transformation among distributions. Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows
one to immediately find the best configuration at run time. We can remove one dimension!
3. Measuring at lower cost: do not use a default compile-time configuration, use the least costly one since
it will generalize!
Did we recommend using two binaries? YES, one for measuring, another for reaching optimal
performance!
interplay between
compile-time and runtime
options and even input!
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
What is your move?
What is your prompt?
Deep Software Variability and Frictionless Reproducibility
Solution #5: Strategic exploration with
modelling and learning
Solution #6 Identification of root causes of variability
(testing and verification)
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/ChatGPT-C_Variations_with_%23ifdef.md
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/approx.c
Solution #7: LLMs to support
exploration of variants space
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/ChatGPT-C_Variations_with_%23ifdef.md
https://github.com/acherm/progvary-withgpt/blob/main/varyfloatinC/approx_eval.py
Retrieve the result of S. Boldo et al.
M. Acher, J. Galindo, J.M Jézéquel, “On Programming Variability with Large
Language Model-based Assistant”, SPLC’2023
▸ Some solutions
▸ abstractions/models
▸ learning and sampling
▸ reuse of configuration knowledge
▸ leveraging stability
▸ systematic exploration
▸ identification of root causes
▸ LLMs to support exploration of variants’ space
▸ incremental build of configuration space (Randrianaina et al. ICSE’22)
▸ debloating variability (Ternava et al. SAC’23)
▸ feature subset selection (Martin et al. SPLC’23)
▸ Essentially, we want to reduce the dimensionality of the problem
as well as the computational and human cost to foster
verification of results and innovation
▸ Frictionless reproducibility: code+data+metrics
▸ Deep variability is a problem (frictions!)
▸ evidence in many scientific domains
▸ Deep variability is a solution (exploration!)
▸ fixing variability once and for all is not a solution
▸ Replicability is the holy grail!
▸ explore variants for robustness, validation, optimization and knowledge finding
93
Backup slides (disclaimer: don’t try to understand
everything ;))
What can we do? (robustness)
Robustness (trustworthiness) of scientific results to sources of variability
I have shown many examples of sources of variations and non-robust results…
Robustness should be rigorously defined (hint: it’s not the definition as given in computer
science)
How to verify the effect of sources of variations on the robustness of given conclusions?
● actionable metrics?
● methodology? (eg when to stop?)
● variability can actually be leveraged to augment confidence
96
deep
software
variability
different
methods
different
assumptions
different analyses
different data
97
Deep software variability is…
a threat for reproducible research
“Authors provide all the necessary data and the computer codes to run the
analysis again, re-creating the results.”
an opportunity for replication
“A study that arrives at the same scientific findings as another study,
collecting new data (possibly with different methods) and completing new
analyses.”
“A study that refutes some scientific findings of another study, through the
collection of new data (possibly with different methods) and completion of
new analyses.”
robustifying and augmenting
scientific knowledge
Reproducible Science as a Testing Problem
#1 Test Generation Problem (input)
inputs: computing environment, parameters of an algorithm, versions of
a library or tool, choice of a programming language
#2 Oracle Problem (output)
we usually ignore the outcome! (open problems; open questions; new
knowledge)
System under Study (replicable)
Input
Output (scientific result)
Reproduction vs replication http://rescience.github.io/faq/
“Reproduction of a computational study means running the same computation on the same input data, and then checking if the
results are the same, or at least “close enough” when it comes to numerical approximations. Reproduction can be considered as
software testing at the level of a complete study.”
We don’t “test” in one run, in one computing environment, with one kind of input data, etc.
“Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions
but varying the technical details. For computational work, this would mean using different software, running a simulation from
different initial conditions, etc. The idea is to change something that everyone believes shouldn’t matter, and see if the scientific
conclusions are affected or not.”
It is the most interesting direction, basically for synthesizing new scientific knowledge!
In both cases, there is the need to
harness the combinatorial explosion
of deep software variability
99
Reproducible Science and Software Engineering
@acherm
aka Deep Software Variability for Replicability in Computational Science
Deep Questions?
Deep Software Variability and Frictionless Reproducibility
Transferring Performance Prediction Models Across Different Hardware Platforms
Valov et al. ICPE 2017
“Linear model provides a good approximation of
transformation between performance distributions
of a system deployed in different hardware
environments”
what about
variability of
input data?
compile-time options?
version?
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Jamshidi et al. ASE 2017
mixing deep variability: hard to assess the specific
influence of each layer
very few hardware, version, and input data… but lots
of runtime configurations (variants)
Let’s go deep with input data!
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
Jamshidi et al. ASE 2017
Threats to variability knowledge for performance property bitrate
● optimal configuration is specific to an input; a good configuration can be a bad one
● some options’ values have an opposite effect depending on the input
● effectiveness of sampling strategies (random, 2-wise, etc.) is input specific (somehow
confirming Pereira et al. ICPE 2020)
● predicting, tuning, or understanding configurable systems
without being aware of inputs can be inaccurate and… pointless
Practical impacts for users, developers,
scientists, and self-adaptive systems
Computational science
depends on software and its engineering
106
from a set of scripts to automate the deployment to… a
comprehensive system containing several features that
help researchers explore various hypotheses
multi-million line of code base
multi-dependencies
multi-systems
multi-layer
multi-version
multi-person
multi-variant
x264 video encoder (compilation/build)
compile-time
options
What can we do? (#1 studies)
Empirical studies about deep software variability
● more subject systems
● more variability layers, including interactions
● more quantitative (e.g., performance) properties
with challenges for gathering measurements data:
● how to scale experiments? Variant space is huge!
● how to fix/isolate some layers? (eg hardware)
● how to measure in a reliable way?
Expected outcomes:
● significance of deep software variability in the wild
● identification of stable layers: sources of variability that should not affect the conclusion and that can
be eliminated/forgotten
● identification/quantification of sensitive layers and interactions that matter
● variability knowledge
What can we do? (#2 cost)
Reducing the cost of exploring the variability spaces
Many directions here (references at the end of the slides):
● learning
○ many algorithms/techniques with tradeoffs interpretability/accuracy
○ transfer learning (instead of learning from scratch)
● sampling strategies
○ uniform random sampling? t-wise? distance-based? …
○ sample of hardware? input data?
● incremental build of configurations
● white-box approaches
● …
Key results (for x264)
Worth tuning software at compile-time: gain about 10 % of execution time with the
tuning of compile-time options (compared to the default compile-time configuration).
The improvements can be larger for some inputs and some runtime configurations.
Stability of variability knowledge: For all the execution time distributions of x264
and all the input videos, the worst correlation is greater than 0.97. If the compile-time
options change the scale of the distribution, they do not change the rankings of
run-time configurations (i.e., they do not truly interact with the run-time options).
Reuse of configuration knowledge:
● Linear transformation among distributions
● Users can also trust the documentation of run-time options, which stays
consistent whatever the compile-time configuration is.
L. Lesoil, M. Acher, X. Tërnava, A. Blouin and
J.-M. Jézéquel “The Interplay of Compile-
time and Run-time Options for Performance
Prediction” in SPLC ’21
Embrace deep variability!
Explicit modeling of the variability points
and their relationships, such as:
1. Get insights into the variability “factors” and
their possible interactions
2. Capture and document configurations for
the sake of reproducibility
3. Explore diverse configurations to replicate,
and hence optimize, validate, increase the
robustness, or provide better resilience
Our Vision
ACM REP 2024
⇒ We aim to address the complexities associated
with reproducibility and replicability in modern
software systems and environments, facilitating a
more comprehensive and nuanced perspective on these
critical “factors”.
111
https://hal.science/hal-04582287
Deep Software Variability and Frictionless Reproducibility
exec (software) = exec_repro (software)
or
exec(software) ~= exec(software_repro)
(difference: exec_repro is another execution environment… and so somehow differs, or not, from exec; or we consider that the software differs…)
(exec: execution? what’s the outcome then? in fact execution can be replaced by “build”... which is another kind of execution)
exec (software) ?= exec_repro (software)
software ~= software_repro
exec (software, hardware)
exec (software, hardware, compiler, input_data, operating_system, bios, container, hypervisor, dependencies_versions)
exec (v1, v2, …, vN) ~= exec_repro (v1’, v2’, …, vN’)
for i in [1, n], v_{i} ~= v_{i}’ (or not!)
~= is specific to a domain, to a usage, etc.
~= can be over the N layers or over N’ layers (N’ < N)
~= can be specific to some pairs elements (eg we know that with this hardware, the exec time is multiplied by 2)
for instance, we know the ~= between a software configuration with any hardware (but if the compiler changes, then the ~= should be “tuned” accordingly)
also ~= can be defined between a configuration set and an hardware set (eg performance distribution)
Exact same results? No
Frictionless reproducibility (annotated bibliography; grey literature)
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless
Reproducibility, B Recht
interesting historical perspective on research in neural networks (NeurIPs 87 titles are shockingly
still relevant); really love some parts about random experiments, science as a “massively parallel
genetic algorithm” or the discussions on the difficulty of using ML/DL software (completely
aligned with my terrible experience of Weka GUI in ~2006)
https://www.argmin.net/p/the-department-of-frictionless-reproducibilty
https://statmodeling.stat.columbia.edu/2023/10/13/frictionless-reproducibility-methods-as-proto-algorithms-division-of-labor-as-a-characteristic-of-statistical-methods-statistics-as-the-science-of-defaults-statisticians-well-prepared-to-think-abo/
Progress and frictionless reproducibility
Inspired by Thomas Kuhn (1962), we can think of the scientific and engineering process as a massively parallel genetic algorithm. If
we want to improve upon the systems we currently have, we might try a small perturbation to see if we get an improvement. If we
can find a small change that improves some desired outcome, we could change our systems to reflect this improvement. If we
continually search for these improvements and work hard to demonstrate their value, we may head in a better direction over time.
For scientific endeavors, we could perhaps gauge ‘better’ or ‘worse’ by performing random experiments—not randomized
experiments per se, but random experiments in the sense of trying potentially surprising improvements. If our small tweak results in
better outcomes, we can attempt to convince a journal editor or conference program committee to publish it. And this
communication gives everyone else a new starting point for their own random experimentation.
A single investigator can only make so much progress by random searching alone, but random search is pleasantly parallelizable.
Competing scientists can independently try their own random ideas and publish their results. Sometimes an individual result is so
promising that the herd of experimenters all flock around the good idea, hoping to strike gold on a nearby improvement and bring
home bragging rights. To some, this looks like an inefficient mess. To others, it looks like science.
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless
Reproducibility, B Recht
Data sharing and frictions
“Data set benchmarking and competitive testing took over machine learning in the late 1980s. Email and
file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In
1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically
testing machine learning methods. Aha was motivated by service to the community, but he also wanted to
show his nearest-neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms.
He formatted his data sets using the ‘attribute-value’ representation that a rival researcher, Ross Quinlan
(1986), had used. And, so, the UC Irvine Machine Learning Repository was born.”
“Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and
smaller storage footprints. But computing technology alone was not sufficient to drive progress. Friendly
competition with Quinlan inspired Aha to build the UCI repository. And more explicit competitions were
also crucial components of the success.”
The Mechanics of Frictionless Reproducibility, B Recht, 2024
https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1
https://twitter.com/StasBekman/status/1749480373283905611
Deep Software Variability and Frictionless Reproducibility
https://github.com/FAMILIAR-project/reproducibility-associativity/

Deep Software Variability and Frictionless Reproducibility

  • 10. Reproducibility 10 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” (Claerbout/Donoho/Peng definition) “The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” (~executable paper)
  • 11. Reproducibility and Replicability 11 Reproducible: Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results. Replication: A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses. “Terminologies for Reproducible Research”, Lorena A. Barba, 2018
  • 13. Reproducibility and Replicability 13 Methods Reproducibility: A method is reproducible if reusing the original code leads to the same results. Results Reproducibility: A result is reproducible if a reimplementation of the method generates statistically similar values. Inferential Reproducibility: A finding or a conclusion is reproducible if one can draw it from a different experimental setup. “Unreproducible Research is Reproducible”, Bouthillier et al., ICML 2019
  • 14. Reproducible science 14 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc.
  • 15. Reproducible science 15 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc. https://github.com/emsejournal/openscience https://rescience.github.io/ https://reproducible-research.inria.fr/
  • 16. Reproducible science 16 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Socio-technical issues: open science, open source software, multi-disciplinary collaboration, incentives/rewards, initiatives, etc. with many challenges related to data acquisition, knowledge organization/sharing, etc.
  • 17. Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022 https://arxiv.org/pdf/2104.06020 (best paper award IEEE Software for year 2022) “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.”
  • 18. Frictionless reproducibility 18 https://arxiv.org/abs/2310.00865 https://hdsr.mitpress.mit.edu/pub/g9mau4m0/release/2 “Computation-driven research really has changed in the last 10 years, driven by three principles of data science, which, after longstanding partial efforts, are finally available in mature form for daily practice, as frictionless open services offering data sharing, code sharing, and competitive challenges.” [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] “We are entering an era of frictionless research exchange, in which research algorithmically builds on the digital artifacts created by earlier research, and any good ideas that are found get spread rapidly, everywhere. The collective behavior induced by frictionless research exchange is the emergent superpower driving many events that are so striking today.”
  • 19. Frictionless reproducibility 19 [FR-1: Data] “Datafication of everything, with a culture of research data sharing.” [FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete workflow by different researchers.” [FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard.” performance metric
  • 20. Frictionless reproducibility 20 [FR-1: Data] “Datafication of everything, with a culture of research data sharing.” [FR-2: Re-execution (code)]: “Research code sharing including the ability to exactly re-execute the same complete workflow by different researchers.” [FR-3: Challenges] “a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard.” frictionless reproducibility = [FR-1] + [FR-2] + [FR-3] performance metric
  • 21. Frictionless reproducibility 21 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in the first place”. Think about the absence of [FR-3]. The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the Science Behind Contests") ● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial recognition, protein structure prediction problem (CASP), etc. ● More and more leaderboard (eg https://evalplus.github.io/leaderboard.html https://robustbench.github.io/) or competition (eg SAT competition) ● Many platforms, services, and events supporting the shift (eg Kaggle)
  • 22. Frictionless reproducibility 22 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important and original piece On the one hand, [FR-3] is a way to objectively assess a contribution, compare solutions, and measure progress (if any). [FR-3] sounds legit to provide a “task definition that formalized a specific research problem and made it an object of study”. [FR-3] is “the competitive element that attracted our attention in the first place”. The performance measurement crystallized a specific project’s contribution, boiling down an entire research contribution essentially to a single number, which can be reproduced. Think about the absence of [FR-3] The “challenge paradigm” is a big ongoing shift (see Isabelle Guyon and Evelyne Viegas - "AI Competitions and the Science Behind Contests") ● Many success stories (mainly in empirical machine learning): speech processing, biometric recognition, facial recognition, protein structure prediction problem (CASP), etc. ● More and more leaderboard (eg https://evalplus.github.io/leaderboard.html https://robustbench.github.io/) or competition (eg SAT competition) ● Many platforms, services, and events supporting the shift (eg Kaggle)
  • 23. Frictionless reproducibility 23 frictionless reproducibility = [FR-1: Data] + [FR-2: Re-execution] + [FR-3: Challenges] [FR-1] and [FR-2] are quite “standard” but do not come without frictions – more soon! [FR-3] is an important but discussable piece On the other hand, we know that the power of a simple scoring function is dangerous (e.g., Goodhart's law) “What if the metric is wrong? What if the subtleties of a complex problem are not amenable to representation by a single scalar? What happens when metrics for locally optimal solutions are apparent, but ones for globally optimal solutions are not? What happens when the community is not (yet) mature enough to rally around a consensus-scoring function? I think it is important to recognize that finding an appropriate scoring function, let alone an objectively best one, is an ongoing task and might evolve as FR-1 and FR-2 provide a deeper understanding of the problem space.” Overcoming Potential Obstacles as We Strive for Frictionless Reproducibility by Adam D. Schuyler (2024) performance metric
  • 24. Are we frictionless? Reading a paper in 2024 is sometimes like in 1970: ● Where is the source code? (eg implementation of the solution, scripts to compute metrics) ● Where is the data? (eg to test the solution) ● Contacting authors? ○ no response? ○ code not consistent with the PDF ○ … ● It does not work on my machine; results are completely different… There are lots of socio-technical frictions… even when you have the code and data! => When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress
  • 26. Reproducible science… with frictions 26 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses
  • 27. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
  • 30. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? A study involving eight institutions and seven different supercomputers in Europe is currently ongoing with EC-Earth. This ongoing study aims to do the following: ● evaluate different computational environments that are used in collaboration to produce CMIP6 experiments (can we safely create large ensembles composed of subsets that emanate from different partners of the consortium?); ● detect if the same CMIP6 configuration is replicable among platforms of the EC-Earth consortium (that is, can we safely exchange restarts with EC-Earth partners in order to initialize simulations and to avoid long spin-ups?); and ● systematically evaluate the impact of different compilation flag options (that is, what is the highest acceptable level of optimization that will not break the replicability of EC-Earth for a given environment?).
  • 31. Should software version numbers determine science? Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6. The observed differences are similar in magnitude as effect sizes reported in accuracy evaluations and neurodegenerative studies. see also Krefting, D., Scheel, M., Freing, A., Specovius, S., Paul, F., and Brandt, A. (2011). “Reliability of quantitative neuroimage analysis using freesurfer in distributed environments,” in MICCAI Workshop on High-Performance and Distributed Computing for Medical Imaging.
  • 32. “Neuroimaging pipelines are known to generate different results depending on the computing platform where they are compiled and executed.” Reproducibility of neuroimaging analyses across operating systems, Glatard et al., Front. Neuroinform., 24 April 2015 The implementation of mathematical functions manipulating single-precision floating-point numbers in libmath has evolved during the last years, leading to numerical differences in computational results. While these differences have little or no impact on simple analysis pipelines such as brain extraction and cortical tissue classification, their accumulation creates important differences in longer pipelines such as the subcortical tissue classification, RSfMRI analysis, and cortical thickness extraction.
  • 33. “Neuroimaging pipelines are known to generate different results depending on the computing platform where they are compiled and executed.” Statically building programs improves reproducibility across OSes, but small differences may still remain when dynamic libraries are loaded by static executables[...]. When static builds are not an option, software heterogeneity might be addressed using virtual machines. However, such solutions are only workarounds: differences may still arise between static executables built on different OSes, or between dynamic executables executed in different VMs. Reproducibility of neuroimaging analyses across operating systems, Glatard et al., Front. Neuroinform., 24 April 2015
  • 34. Reproducible science as a (deep) software variability problem 34 “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses
  • 35. 35 Despite the availability of data and code, several studies report that the same data analyzed with different software can lead to different results. Many layers (operating system, third-party libraries, versions, workloads, compile-time options and flags, etc.) themselves subject to variability can alter the results. Reproducible science and deep software variability: a threat and opportunity for scientific knowledge! hardware variability operating system variability compiler variability build variability hypervisor variability software application variability version variability input data variability container variability deep software variability
  • 36. How often (x+y)+z == x+(y+z) ? https://github.com/FAMILIAR-project/reproducibility-associativity/
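The question above can be made concrete in a few lines. Below is a minimal sketch (not the script from the linked repository) that estimates how often floating-point addition is associative for randomly drawn operands; the operand distribution and the number of trials are arbitrary assumptions, and the linked repository goes much further by varying languages, compilers, and environments.

```python
# Minimal sketch: estimate how often (x + y) + z == x + (y + z) holds for floats.
# The operand distribution (standard normal) and the trial count are arbitrary choices.
import random

def associativity_ratio(trials: int = 1_000_000, seed: int = 42) -> float:
    rng = random.Random(seed)
    equal = 0
    for _ in range(trials):
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        if (x + y) + z == x + (y + z):
            equal += 1
    return equal / trials

if __name__ == "__main__":
    print(f"(x+y)+z == x+(y+z) in {associativity_ratio():.1%} of trials")
```

The exact ratio depends on the operand distribution, the floating-point precision, and potentially the language, compiler, and hardware, which is precisely the point of the exercise.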
  • 37. Frictionless Reproducibility and (Deep) Software (Variability) Problem (cont’d): Variability and Frictions Solution: Variability and Exploration Discussions AGENDA
  • 38. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 38 hardware variability deep software variability Non-functional properties execution time energy consumption accuracy security
  • 39. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 39 hardware variability deep software variability System under Study (reproducible and replicable) Variability Output (scientific result; most of the time quantitative information) input data performance metric
  • 41. Can a coupled ESM simulation be restarted from a different machine without causing climate-changing modifications in the results? Using two versions of EC-Earth: one “non-replicable” case (see below) and one replicable case.
  • 42. We demonstrate that effects of parameter, hardware, and software variation are detectable, complex, and interacting. However, we find most of the effects of parameter variation are caused by a small subset of parameters. Notably, the entrainment coefficient in clouds is associated with 30% of the variation seen in climate sensitivity, although both low and high values can give high climate sensitivity. We demonstrate that the effect of hardware and software is small relative to the effect of parameter variation and, over the wide range of systems tested, may be treated as equivalent to that caused by changes in initial conditions. 57,067 climate model runs. These runs sample parameter space for 10 parameters with between two and four levels of each, covering 12,487 parameter combinations (24% of possible combinations) and a range of initial conditions
  • 43. Joelle Pineau “Building Reproducible, Reusable, and Robust Machine Learning Software” ICSE’19 keynote “[...] results can be brittle to even minor perturbations in the domain or experimental procedure” What is the magnitude of the effect hyperparameter settings can have on baseline performance? How does the choice of network architecture for the policy and value function approximation affect performance? How can the reward scale affect results? Can random seeds drastically alter performance? How do the environment properties affect variability in reported RL algorithm performance? Are commonly used baseline implementations comparable?
  • 44. “Completing a full replication study of our previously published findings on bluff-body aerodynamics was harder than we thought. Despite the fact that we have good reproducible-research practices, sharing our code and data openly.”
  • 45. Data analysis workflows in many scientific domains have become increasingly complex and flexible (= subject to variability). Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
  • 46. Can Machine Learning Pipelines Be Better Configured? Wang et al. FSE’2023 “A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue.”
  • 47. Deep software variability: Are layers/features orthogonal or are there interactions? Luc Lesoil, Mathieu Acher, Arnaud Blouin, Jean-Marc Jézéquel: Deep Software Variability: Towards Handling Cross-Layer Configuration.
  • 48. Configuration is hard: numerous options, informal knowledge ?????
  • 49. Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1 REAL WORLD Example (x264)
  • 50. REAL WORLD Example (x264) Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1
  • 51. Hardware Operating System Software Input Data 10.4 x264 --mbtree ... x264 --no-mbtree ... x264 --no-mbtree ... x264 --mbtree ... 20.04 Dell latitude 7400 Raspberry Pi 4 model B vertical animation vertical animation vertical animation vertical animation Duration (s) 22 25 73 72 6 6 351 359 Size (MB) 28 34 33 21 33 21 28 34 A B 2 1 2 1 ≈*16 ≈*12 REAL WORLD Example (x264)
  • 52. Age # Cores GPU SOFTWARE Variant Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Bug Perf. ↗ Perf. ↘ deep variability L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel, “Deep Software Variability: Towards Handling Cross-Layer Configuration” in VaMoS 2021 The “best”/default software variant might be a bad one. Influential software options and their interactions vary. Performance prediction models and variability knowledge may not generalize
  • 53. Let’s go deep with input data! Intuition: video encoder behavior (and thus runtime configurations) hugely depends on the input video (different compression ratio, encoding size/type etc.) Is the best software configuration still the best? Are influential options always influential? Does the configuration knowledge generalize? ? YouTube User General Content dataset: 1397 videos Measurements of 201 soft. configurations (with same hardware, compiler, version, etc.): encoding time, bitrate, etc.
  • 54. configurations’ measurements over input_1 configurations’ measurements over input_42 Inputs = …
  • 55. configurations’ measurements over input_1 configurations’ measurements over input_42 Inputs = … Generalization/transfer: what’s the relationship between perf_pred_1 and perf_pred_42? ● with perf_pred_i a performance model capable of predicting performance of any configuration on input_i ● linear relationship? ○ eg Pearson/Spearman linear correlation ● influential features/options: same?
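A rough sketch of the correlation analysis suggested above, assuming we have the performance of the same configurations measured on two inputs (the arrays below are fabricated; in the study each entry would be, e.g., the bitrate of one x264 configuration on one video):

```python
# Minimal sketch: do configuration rankings transfer from one input to another?
# The measurement vectors are synthetic placeholders for real measurements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
perf_input_1 = rng.uniform(5, 50, size=201)                      # 201 configurations on input_1
perf_input_42 = 0.8 * perf_input_1 + rng.normal(0, 5, size=201)  # same configurations on input_42

rho, pvalue = spearmanr(perf_input_1, perf_input_42)
print(f"Spearman rho = {rho:.2f} (p = {pvalue:.2g})")
# rho close to 1: the ranking of configurations transfers across inputs;
# low or negative rho: configuration knowledge does not generalize.
```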
  • 56. Let’s go deep with input data! Intuition: video encoder behavior (and thus runtime configurations) hugely depends on the input video (different compression ratio, encoding size/type etc.) Is the best software configuration still the best? Are influential options always influential? Does the configuration knowledge generalize? ? YouTube User General Content dataset: 1397 videos Measurements of 201 soft. configurations (with same hardware, compiler, version, etc.): encoding time, bitrate, etc.
  • 57. Do x264 software performances stay consistent across inputs? ●Encoding time: very strong correlations ○ low input sensitivity ●FPS: very strong correlations ○ low input sensitivity ●CPU usage : moderate correlation, a few negative correlations ○ medium input sensitivity ●Bitrate: medium-low correlation, many negative correlations ○ High input sensitivity ●Encoding size: medium-low correlation, many negative correlations ○ High input sensitivity ? 1397 videos x 201 software configurations
  • 58. Are there some configuration options more sensitive to input videos? (bitrate)
  • 60. Practical impacts for users, developers, scientists, and self-adaptive systems Threats to variability knowledge: predicting, tuning, or understanding configurable systems without being aware of inputs can be inaccurate and… pointless Opportunities: for some performance properties (P) and subject systems, some stability is observed and performance remains consistent! L. Lesoil, M. Acher, A. Blouin and J.-M. Jézéquel “The Interaction between Inputs and Configurations fed to Software Systems: an Empirical Study” https://arxiv.org/abs/2112.07279
  • 61. Age # Cores GPU SOFTWARE Variant Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Bug Perf. ↗ Perf. ↘ deep variability Sometimes, variability is consistent/stable and knowledge transfer is immediate. But there are also interactions among variability layers and variability knowledge may not generalize
  • 62. Age # Cores GPU Compil. Version Version Option Distrib. Size Length Res. Hardware Operating System Software Input Data Does deep software variability affect previous scientific, software-based studies? (a graphical template) List all details… and questions: what iF we run the experiments on different: OS? version/commit? PARAMETERS? INPUT? SOFTWARE Variant
  • 63. Frictionless Reproducibility and (Deep) Software (Variability) Problem: Variability and Frictions Solution: Variability and Exploration Discussions AGENDA
  • 64. Deep variability problem (statement) Fundamentally, we have a huge multi-dimensional variant space (e.g., 10^6000) run (source_code) => result run (hardware, operating_system, build_environment, input_data, source_code, …) => results Fixing variability once and for all, in all dimensions/layers, is the obvious solution… But it is either impossible (e.g., the age of a processor can have an impact on execution time)... Or not desirable: ● non-robust results ● no generalization/transferability of the results/findings ● it kills innovation 64
  • 65. Replicability is the holy grail! Exploring various configurations: ● Make scientific findings more robust ● Define and assess the validity envelope ● Enable exploration and optimization ● Foster innovation and new hypotheses, insights, knowledge ⇒ We propose to embrace deep variability for the sake of replicability 65
  • 66. Embrace deep variability! Explicit modeling of the variability points and their relationships, so as to: 1. Get insights into the variability “factors” and their possible interactions 2. Capture and document configurations for the sake of reproducibility 3. Explore diverse configurations to replicate, and hence optimize, validate, increase the robustness, or provide better resilience Our Vision ACM REP 2024 ⇒ We aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical “factors”. 66
  • 67. Solution #1: Variability model ● Abstractions are definitely needed to… ○ reason about logical constraints and interactions ○ integrate domain knowledge ○ synthesize domain knowledge ○ automate and guide the exploration of variants ○ scope and prioritize experiments ● Language and formalism: feature model (widely applicable!) ○ translation to logics ○ reasoning with SAT/CP/SMT solvers ᵩ ⋃ ⋂ |
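As a toy illustration of the “translation to logics” point above, the sketch below encodes a tiny, made-up feature model as Boolean constraints and enumerates its valid configurations by brute force; a real toolchain would translate the model to CNF and delegate to a SAT/CP/SMT solver. The feature names echo x264 options, but the cross-tree exclusion is invented for the example.

```python
# Minimal sketch: a tiny, hypothetical feature model as propositional constraints.
# Brute-force enumeration is only viable for toy examples; real feature models
# (e.g., the Linux kernel's 15,000+ options) require SAT/CP/SMT solvers.
from itertools import product

FEATURES = ["x264", "mbtree", "cabac", "static_build"]

def is_valid(cfg):
    return (
        cfg["x264"]                                      # root feature is mandatory
        and (not cfg["mbtree"] or cfg["x264"])           # child implies parent
        and (not cfg["cabac"] or cfg["x264"])            # child implies parent
        and not (cfg["mbtree"] and cfg["static_build"])  # invented cross-tree exclusion
    )

valid = []
for values in product([False, True], repeat=len(FEATURES)):
    cfg = dict(zip(FEATURES, values))
    if is_valid(cfg):
        valid.append(cfg)

print(f"{len(valid)} valid configurations out of {2 ** len(FEATURES)} candidates")
```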
  • 68. Solution #1: Variability model ● Abstractions are definitely needed… ● Yes, but how to obtain a feature model? ○ modelling ○ reverse engineering (out of command-line parameters, source code, logs, configurations, etc.) ○ learning (next slide!) ○ modeling+reverse engineering+learning (HDR)
  • 69. Whole Population of Configurations Performance Prediction Training Sample Performance Measurements Prediction Model J. Alves Pereira, H. Martin, M. Acher, J.-M. Jézéquel, G. Botterweck and A. Ventresque “Learning Software Configuration Spaces: A Systematic Literature Review” JSS, 2021 Solution #2: sampling and learning (regression, classification) 69
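A minimal sketch of this sampling-and-learning loop, using scikit-learn and a synthetic performance function in place of real measurements (the oracle, the random-forest choice, and the sample sizes are illustrative assumptions, not the setup of the cited review):

```python
# Minimal sketch of "sampling and learning" over a configuration space.
# The performance oracle is synthetic; in practice each label requires an
# actual build/run of the system, which is what makes sampling necessary.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_options, n_samples = 20, 500

X = rng.integers(0, 2, size=(n_samples, n_options))   # random Boolean configurations

# synthetic performance: a few influential options, one interaction, plus noise
y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + 4 * X[:, 2] * X[:, 3] + rng.normal(0, 0.5, n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

mape = np.mean(np.abs(model.predict(X_test) - y_test) / np.abs(y_test)) * 100
print(f"MAPE on unmeasured configurations: {mape:.1f}%")
```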
  • 70. x264 --me dia --ref 5 … -o output_1.x264
  • 71. 15,000+ options thousands of compiler flags and compile-time options dozens of preferences 100+ command-line parameters 1000+ feature toggles 71 hardware variability deep software variability System under Study (reproducible) Variability Output (binary) input data “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 72. 15,000+ compile-time options 72 deep software variability System under Study Variability Output (binary) “The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022 make defconfig # configuration make # build the kernel (binary) out of config make # should be the same, right?
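In practice, answering the “should be the same, right?” question boils down to comparing cryptographic hashes of the artifacts produced by two builds of the same configuration. The sketch below assumes two kernel images already exist at hypothetical paths; tools such as diffoscope go further by explaining where the bytes differ.

```python
# Minimal sketch: are two builds of the same configuration bit-for-bit identical?
# The artifact paths are hypothetical placeholders.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

build_a = Path("build_a/vmlinux")  # first `make defconfig && make`
build_b = Path("build_b/vmlinux")  # second build, same config, possibly another environment

if sha256(build_a) == sha256(build_b):
    print("bit-for-bit reproducible (for this configuration and environment)")
else:
    print("non-reproducible build: the artifacts differ")
```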
  • 73. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/
  • 74. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/ #1 take away message: look at every variability layer when you want a bit-to-bit reproducibility; don’t ignore compile-time options! “The build process of a software product is reproducible if, after designating a specific version and a specific variant of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 75. Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems Randrianaina, Khelladi, Zendra, Acher MSR’2024 also at FOSDEM 2024 https://fosdem.org/2024/schedule/event/fosdem-2024-2848-documenting-and-fixing-non-reproducible-builds-due-to-configuration-options/ #2 take away message: interactions across variability layers exist (eg compile-time option with build path) and may hamper reproducibility “The build process of a software product is reproducible if, after designating a specific version and a specific variant of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed.” Lamb and Zacchiroli “Reproducible Builds: Increasing the Integrity of Software Supply Chains” IEEE Software 2022
  • 76. ● Linux as a subject software system (not as an OS interacting with other layers) ● Targeted non-functional, quantitative property: binary size ○ of interest to maintainers/users of the Linux kernel (embedded systems, cloud, etc.) ○ challenging to predict (cross-cutting options, interplay with compilers/build systems, etc.) ● Dataset: version 4.13.3 (September 2017), x86_64 arch, measurements of 95K+ random configurations ○ paranoiac about deep variability since 2017, Docker to control the build environment and scale ○ diversity of binary sizes: from 7 MB to 1.9 GB ○ 6% MAPE error: quite good, though costly… 2 76 H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants and versions: The case of Linux kernel size” Transactions on Software Engineering (TSE), 2021
  • 77. Version 4.13 (September 2017): 6% error. What about evolution? Can we reuse the 4.13 Linux prediction model? No, the prediction error quickly increases: 4.15 (5 months later): 20%; 5.7 (3 years later): 35% 3 77
  • 78. Solution #3 Transfer learning (reuse of knowledge) ● Mission Impossible: Saving variability knowledge and prediction model 4.13 (15K hours of computation) ● Heterogeneous transfer learning: the feature space is different ● TEAMS: transfer evolution-aware model shifting 5 78 H. Martin, M. Acher, J. A. Pereira, L. Lesoil, J. Jézéquel and D. E. Khelladi, “Transfer learning across variants and versions: The case of linux kernel size” Transactions on Software Engineering (TSE), 2021 3 78
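The intuition behind model shifting can be sketched as follows: keep the expensive model learned on version 4.13, measure only a small sample of configurations on the new version, and learn a cheap correction from old predictions to new measurements. This is a simplified, synthetic sketch of the general idea, not the TEAMS algorithm of the TSE paper.

```python
# Minimal sketch of transfer by "model shifting": reuse an old prediction model and
# learn a cheap linear correction from a few measurements on the new version.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(2000, 30))                             # configurations
size_old = 50 + 20 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, len(X))
size_new = 1.3 * size_old + 15 + rng.normal(0, 1, len(X))           # the new version shifts sizes

old_model = RandomForestRegressor(random_state=0).fit(X, size_old)  # expensive, learned once

idx = rng.choice(len(X), size=50, replace=False)                    # cheap: 50 new measurements
shift = LinearRegression().fit(old_model.predict(X[idx]).reshape(-1, 1), size_new[idx])

pred_new = shift.predict(old_model.predict(X).reshape(-1, 1))
mape = np.mean(np.abs(pred_new - size_new) / size_new) * 100
print(f"MAPE on the new version with only 50 new measurements: {mape:.1f}%")
```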
  • 79. Luc Lesoil, Helge Spieker, Arnaud Gotlieb, Mathieu Acher, Paul Temple, Arnaud Blouin, Jean-Marc Jézéquel: Learning input-aware performance models of configurable systems: An empirical evaluation. J. Syst. Softw. 208: 111883 (2024) Solution #3 Transfer learning (con’t)
  • 80. Is there an interplay between compile-time and runtime options? L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 81. Solution #4: Leverage stability across variability layers! First good news: it is worth tuning software at compile time! Second good news: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). This has three practical implications: 1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear transformation among distributions. Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. 2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows one to immediately find the best configuration at run time. We can remove one dimension! 3. Measuring at lower cost: do not use a default compile-time configuration; use the least costly one, since it will generalize! Did we just recommend using two binaries? YES, one for measuring, another for reaching optimal performance! L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 82. Key results (for x264) First good news: it is worth tuning software at compile time! Second good news: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). This has three practical implications: 1. Reuse of configuration knowledge: transfer learning of prediction models boils down to applying a linear transformation among distributions. Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. 2. Tuning at lower cost: finding the best compile-time configuration among all the possible ones allows one to immediately find the best configuration at run time. We can remove one dimension! 3. Measuring at lower cost: do not use a default compile-time configuration; use the least costly one, since it will generalize! Did we just recommend using two binaries? YES, one for measuring, another for reaching optimal performance! There is an interplay between compile-time and run-time options, and even the input! L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 84. What is your prompt?
  • 88. Solution #5: Strategic exploration with modelling and learning
  • 89. Solution #6 Identification of root causes of variability (testing and verification)
  • 92. Retrieve the result of S. Boldo et al. M. Acher, J. Galindo, J.M Jézéquel, “On Programming Variability with Large Language Model-based Assistant”, SPLC’2023
  • 93. ▸ Some solutions ▸ abstractions/models ▸ learning and sampling ▸ reuse of configuration knowledge ▸ leveraging stability ▸ systematic exploration ▸ identification of root causes ▸ LLMs to support exploration of variants’ space ▸ incremental build of configuration space (Randrianaina et al. ICSE’22) ▸ debloating variability (Ternava et al. SAC’23) ▸ feature subset selection (Martin et al. SPLC’23) ▸ Essentially, we want to reduce the dimensionality of the problem as well as the computational and human cost to foster verification of results and innovation ▸ Frictionless reproducibility: code+data+metrics ▸ Deep variability is a problem (frictions!) ▸ evidence in many scientific domains ▸ Deep variability is a solution (exploration!) ▸ fixing variability once and for all is not ▸ Replicability is the holy grail! ▸ explore variants for robustness, validation, optimization and knowledge finding 93
  • 94. Backup slides (disclaimer: don’t try to understand everything ;))
  • 95. What can we do? (robustness) Robustness (trustworthiness) of scientific results to sources of variability I have shown many examples of sources of variations and non-robust results… Robustness should be rigorously defined (hint: it’s not the definition as given in computer science) How to verify the effect of sources of variations on the robustness of given conclusions? ● actionable metrics? ● methodology? (eg when to stop?) ● variability can actually be leveraged to augment confidence
  • 97. 97 Deep software variability is… a threat for reproducible research “Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results.” an opportunity for replication “A study that arrives at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses.” “A study that refutes some scientific findings of another study, through the collection of new data (possibly with different methods) and completion of new analyses.” robustifying and augmenting scientific knowledge
  • 98. Reproducible Science as a Testing Problem #1 Test Generation Problem (input) inputs: computing environment, parameters of an algorithm, versions of a library or tool, choice of a programming language #2 Oracle Problem (output) we usually ignore the outcome! (open problems; open questions; new knowledge) System under Study (replicable) Input Output (scientific result)
  • 99. Reproduction vs replication http://rescience.github.io/faq/ “Reproduction of a computational study means running the same computation on the same input data, and then checking if the results are the same, or at least “close enough” when it comes to numerical approximations. Reproduction can be considered as software testing at the level of a complete study.” We don’t “test” in one run, in one computing environment, with one kind of input data, etc. “Replication of a scientific study (computational or other) means repeating a published protocol, respecting its spirit and intentions but varying the technical details. For computational work, this would mean using different software, running a simulation from different initial conditions, etc. The idea is to change something that everyone believes shouldn’t matter, and see if the scientific conclusions are affected or not.” It is the most interesting direction, basically for synthesizing new scientific knowledge! In both cases, there is the need to harness the combinatorial explosion of deep software variability 99
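The “close enough” check mentioned above is essentially a numerical oracle. A minimal sketch, assuming the study's headline result can be summarized as an array of numbers shipped alongside the code (the analysis, the file handling, and the tolerances are illustrative assumptions):

```python
# Minimal sketch of a reproduction "oracle": rerun the analysis and compare the
# outcome to the published reference values within an explicit numerical tolerance.
import numpy as np

def run_analysis() -> np.ndarray:
    # stand-in for the actual computational study (the seed pins one source of variability)
    rng = np.random.default_rng(123)
    return np.sort(rng.normal(0, 1, 1000))[:5]

reference = run_analysis()     # in practice: np.load("published_results.npy")
reproduced = run_analysis()    # rerun, ideally on another machine / OS / library stack

if np.allclose(reproduced, reference, rtol=1e-6, atol=1e-9):
    print("reproduced within tolerance")
else:
    print("difference beyond tolerance:", np.max(np.abs(reproduced - reference)))
```

Deciding which tolerance is acceptable, and across which variability layers the check must hold, is exactly the open question raised on this slide.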
  • 100. Reproducible Science and Software Engineering @acherm aka Deep Software Variability for Replicability in Computational Science Deep Questions?
  • 102. Transferring Performance Prediction Models Across Different Hardware Platforms Valov et al. ICPE 2017 “Linear model provides a good approximation of transformation between performance distributions of a system deployed in different hardware environments” what about variability of input data? compile-time options? version?
  • 103. Transfer Learning for Software Performance Analysis: An Exploratory Analysis Jamshidi et al. ASE 2017
  • 104. mixing deep variability: hard to assess the specific influence of each layer very few hardware, version, and input data… but lots of runtime configurations (variants) Let’s go deep with input data! Transfer Learning for Software Performance Analysis: An Exploratory Analysis Jamshidi et al. ASE 2017
  • 105. Threats to variability knowledge for performance property bitrate ● optimal configuration is specific to an input; a good configuration can be a bad one ● some options’ values have an opposite effect depending on the input ● effectiveness of sampling strategies (random, 2-wise, etc.) is input specific (somehow confirming Pereira et al. ICPE 2020) ● predicting, tuning, or understanding configurable systems without being aware of inputs can be inaccurate and… pointless Practical impacts for users, developers, scientists, and self-adaptive systems
  • 106. Computational science depends on software and its engineering 106 from a set of scripts to automate the deployment to… a comprehensive system containing several features that help researchers exploring various hypotheses multi-million line of code base multi-dependencies multi-systems multi-layer multi-version multi-person multi-variant
  • 107. x264 video encoder (compilation/build) compile-time options
  • 108. What can we do? (#1 studies) Empirical studies about deep software variability ● more subject systems ● more variability layers, including interactions ● more quantitative (e.g., performance) properties with challenges for gathering measurements data: ● how to scale experiments? Variant space is huge! ● how to fix/isolate some layers? (eg hardware) ● how to measure in a reliable way? Expected outcomes: ● significance of deep software variability in the wild ● identification of stable layers: sources of variability that should not affect the conclusion and that can be eliminated/forgotten ● identification/quantification of sensitive layers and interactions that matter ● variability knowledge
  • 109. What can we do? (#2 cost) Reducing the cost of exploring the variability spaces Many directions here (references at the end of the slides): ● learning ○ many algorithms/techniques with tradeoffs interpretability/accuracy ○ transfer learning (instead of learning from scratch) ● sampling strategies ○ uniform random sampling? t-wise? distance-based? … ○ sample of hardware? input data? ● incremental build of configurations ● white-box approaches ● …
  • 110. Key results (for x264) Worth tuning software at compile time: tuning compile-time options gains about 10% of execution time (compared to the default compile-time configuration). The improvements can be larger for some inputs and some runtime configurations. Stability of variability knowledge: for all the execution time distributions of x264 and all the input videos, the worst correlation is greater than 0.97. While the compile-time options may change the scale of the distribution, they do not change the rankings of run-time configurations (i.e., they do not truly interact with the run-time options). Reuse of configuration knowledge: ● Linear transformation among distributions ● Users can also trust the documentation of run-time options, which stays consistent whatever the compile-time configuration is. L. Lesoil, M. Acher, X. Tërnava, A. Blouin and J.-M. Jézéquel “The Interplay of Compile-time and Run-time Options for Performance Prediction” in SPLC ’21
  • 111. Embrace deep variability! Explicit modeling of the variability points and their relationships, so as to: 1. Get insights into the variability “factors” and their possible interactions 2. Capture and document configurations for the sake of reproducibility 3. Explore diverse configurations to replicate, and hence optimize, validate, increase the robustness, or provide better resilience Our Vision ACM REP 2024 ⇒ We aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical “factors”. 111 https://hal.science/hal-04582287
  • 113. exec(software) = exec_repro(software), or exec(software) ~= exec(software_repro)
  (difference: exec_repro is another execution environment… and so it may or may not differ from exec; or we consider that the software itself differs…)
  (exec: execution? what is the outcome then? in fact, “execution” can be replaced by “build”... which is another kind of execution)
  exec(software) ?= exec_repro(software), with software ~= software_repro
  exec(software, hardware)
  exec(software, hardware, compiler, input_data, operating_system, bios, container, hypervisor, dependencies_versions)
  exec(v1, v2, …, vN) ~= exec_repro(v1’, v2’, …, vN’) with, for i in [1, N], v_i ~= v_i’ (or not!)
  ~= is specific to a domain, to a usage, etc.
  ~= can be over the N layers or over N’ layers (N’ < N)
  ~= can be specific to some pairs of elements (e.g., we know that with this hardware, the execution time is multiplied by 2)
  for instance, we know the ~= between a software configuration and any hardware (but if the compiler changes, then the ~= should be “tuned” accordingly)
  also, ~= can be defined between a configuration set and a hardware set (e.g., a performance distribution)
  • 115. Frictionless reproducibility (annotated bibliography; grey literature) https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless Reproducibility, B Recht interesting historical perspective on research in neural networks (NeurIPs 87 titles are shockingly still relevant); really love some parts about random experiments, science as a “massively parallel genetic algorithm” or the discussions on the difficulty of using ML/DL software (completely aligned with my terrible experience of Weka GUI in ~2006) https://www.argmin.net/p/the-department-of-frictionless-reproducibilty https://statmodeling.stat.columbia.edu/2023/10/13/frictionless-reproducibility-methods-as-proto-algorithms-division-of-labor-as-a-characteristic-of-statistical-methods-statistics-as-the-science-of-defaults-statisticians-well-prepared-to-think-abo/
  • 116. Progress and frictionless reproducibility Inspired by Thomas Kuhn (1962), we can think of the scientific and engineering process as a massively parallel genetic algorithm. If we want to improve upon the systems we currently have, we might try a small perturbation to see if we get an improvement. If we can find a small change that improves some desired outcome, we could change our systems to reflect this improvement. If we continually search for these improvements and work hard to demonstrate their value, we may head in a better direction over time. For scientific endeavors, we could perhaps gauge ‘better’ or ‘worse’ by performing random experiments—not randomized experiments per se, but random experiments in the sense of trying potentially surprising improvements. If our small tweak results in better outcomes, we can attempt to convince a journal editor or conference program committee to publish it. And this communication gives everyone else a new starting point for their own random experimentation. A single investigator can only make so much progress by random searching alone, but random search is pleasantly parallelizable. Competing scientists can independently try their own random ideas and publish their results. Sometimes an individual result is so promising that the herd of experimenters all flock around the good idea, hoping to strike gold on a nearby improvement and bring home bragging rights. To some, this looks like an inefficient mess. To others, it looks like science. https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1 The Mechanics of Frictionless Reproducibility, B Recht
  • 117. Data sharing and frictions “Data set benchmarking and competitive testing took over machine learning in the late 1980s. Email and file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In 1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically testing machine learning methods. Aha was motivated by service to the community, but he also wanted to show his nearest-neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms. He formatted his data sets using the ‘attribute-value’ representation that a rival researcher, Ross Quinlan (1986), had used. And, so, the UC Irvine Machine Learning Repository was born.” “Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and smaller storage footprints. But computing technology alone was not sufficient to drive progress. Friendly competition with Quinlan inspired Aha to build the UCI repository. And more explicit competitions were also crucial components of the success.” The Mechanics of Frictionless Reproducibility, B Recht, 2024 https://hdsr.mitpress.mit.edu/pub/8dqgwqiu/release/1