An Implementation of Preregistration
We think that the incentive structure for fuzzing research is broken;
so we would like to introduce preregistration to fix this.

Preregistration

Stage 1
• Establish significance.
• Motivate the problem.
• Establish novelty.
• Discuss hypothesis for solution.
• Discuss related work.
• Establish soundness.
• Experimental design.
• Research questions & claims.
• Benchmarks & baselines.

Outcomes of Stage 1:
• In-principle Accepted! Go to Stage 2.
• Major / Minor Revision. Back to Stage 1.
• Rejected.

Stage 2
• Establish conformity.
• Execute the agreed experimental protocol.
• Explain small deviations from the protocol.
• Investigate unexpected results.
• Establish reproducibility.
• Submit evidence towards the key claims in the paper.

Outcomes of Stage 2:
• Accept.
• Major / Minor Revision: Explain deviations / unexpected results. Improve artifact / reproducibility.
• Reject: Severe deviations from the experimental protocol.
Why Preregistration
• To get your fuzzing paper published, you need strong positive results.
• We believe this unhealthy focus is a substantial inhibitor of scientific progress.
• Duplicated Efforts: Important investigations are never published.
  • A hypothesis or approach may be perfectly reasonable and scientifically appealing;
    if the hypothesis proves to be invalid or the approach ineffective, other groups will never know.
• Overclaims: Incentive to overclaim the benefits of an approach.
  • Difficult to reproduce the results; misinforms future investigations by the community.
  • Authors are uncomfortable sharing their research prototypes.
    In 2020, only 35 of 60 fuzzing papers we surveyed published code with the paper.
Why Preregistration
• Sound fuzzer evaluation imposes a high barrier to entry for newcomers.
1. A well-designed experimental methodology.
2. Substantial computational resources.
• Huge variance due to randomness.
• Repeat 20x, 24 hrs, X fuzzers, Y programs.
• Statistical significance, effect size (see the sketch further below).
• CPU centuries.
On the Reliability of Coverage-Based Fuzzer Benchmarking
Marcel Böhme
MPI-SP, Germany
Monash University, Australia
László Szekeres
Google, USA
Jonathan Metzman
Google, USA
ABSTRACT
Given a program where none of our fuzzers finds any bugs, how do
we know which fuzzer is better? In practice, we often look to code
coverage as a proxy measure of fuzzer effectiveness and consider
the fuzzer which achieves more coverage as the better one.
Indeed, evaluating 10 fuzzers for 23 hours on 24 programs, we
find that a fuzzer that covers more code also finds more bugs. There
is a very strong correlation between the coverage achieved and the
number of bugs found by a fuzzer. Hence, it might seem reasonable
to compare fuzzers in terms of coverage achieved, and from that
derive empirical claims about a fuzzer’s superiority at finding bugs.
Curiously enough, however, we find no strong agreement on
which fuzzer is superior if we compared multiple fuzzers in terms
of coverage achieved instead of the number of bugs found. The
fuzzer best at achieving coverage, may not be best at finding bugs.
ACM Reference Format:
Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the Reliability
of Coverage-Based Fuzzer Benchmarking. In 44th International Conference
on Software Engineering (ICSE ’22), May 21–29, 2022, Pittsburgh, PA, USA.
ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3510003.3510230
1 INTRODUCTION
In the recent decade, fuzzing has found widespread interest. In
industry, we have large continuous fuzzing platforms employing
100k+ machines for automatic bug finding [23, 24, 46]. In academia,
in 2020 alone, almost 50 fuzzing papers were published in the top
conferences for Security and Software Engineering [62].
Imagine, we have several fuzzers available to test our program.
Hopefully, none of them finds any bugs. If indeed they don’t, we
might have some confidence in the correctness of the program.
Then again, even a perfectly non-functional fuzzer would find no
bugs in our program. So, how do we know which fuzzer has the
highest “potential” of finding bugs? A widely used proxy measure
of fuzzer effectiveness is the code coverage that is achieved. After
all, a fuzzer cannot find bugs in code that it does not cover.
Indeed, in our experiments we identify a very strong positive
correlation between the coverage achieved and the number of bugs
found by a fuzzer. Correlation assesses the strength of the association
between two random variables or measures. We conduct our
empirical investigation on 10 fuzzers × 24 C programs × 20 fuzzing
campaigns of 23 hours (≈ 13 CPU years). We use three measures of
coverage and two measures of bug finding, and our results suggest:
As the fuzzer covers more code, it also discovers more bugs.
[Figure 1: Scatterplot of the ranks of 10 fuzzers applied to 24 programs for (a) 1 hour
and (b) 23 hours, when ranking each fuzzer in terms of the avg. number of branches
covered (x-axis) and in terms of the avg. number of bugs found (y-axis).
(a) 1 hour fuzzing campaigns (d = 0.38); (b) 1 day fuzzing campaigns (d = 0.49).]
Hence, it might seem reasonable to conjecture that the fuzzer
which is better in terms of code coverage is also better in terms
of bug finding—but is this really true? In Figure 1, we show the
ranking of these fuzzers across all programs in terms of the average
coverage achieved and the average number of bugs found in each
benchmark. The ranks are visibly different. To be sure, we also
conducted a pair-wise comparison between any two fuzzers where
the difference in coverage and the difference in bug finding are
statistically significant. The results are similar.
We identify no strong agreement on the superiority or ranking
of a fuzzer when compared in terms of mean coverage versus mean
bug finding. Inter-rater agreement assesses the degree to which
two raters, here both types of benchmarking, agree on the superiority
or ranking of a fuzzer when evaluated on multiple programs.
Indeed, two measures of the same construct are likely to exhibit a
high degree of correlation but can at the same time disagree
substantially [41, 55]. We evaluate the agreement on fuzzer superiority
when comparing any two fuzzers where the differences in terms of
coverage and bug finding are statistically significant. We evaluate
the agreement on fuzzer ranking when comparing all the fuzzers.
Concretely, our results suggest a moderate agreement. For fuzzer
pairs, where the differences in terms of coverage and bug finding
are statistically significant, the results disagree for 10% to 15% of
programs. Only when measuring the agreement between branch
coverage and the number of bugs found and when we require the
differences to be statistically significant at p ≤ 0.0001 for coverage
and bug finding, do we find a strong agreement. However, statistical
significance at p ≤ 0.0001 only in terms of coverage is not sufficient;
we again find only weak agreement. The increase in agreement
with statistical significance is not observed when we measure bug
finding using the time-to-error. We also find that the variance of the
agreement reduces as more programs are used, and that results of
1h campaigns do not strongly agree with results of 23h campaigns.
ICSE’22
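To make the excerpt's distinction between correlation and agreement concrete, here is a minimal sketch; it is not taken from the paper, and the fuzzer names and per-benchmark averages are invented. It ranks a handful of fuzzers by mean coverage and by mean bugs found, then reports a rank correlation alongside a simple pairwise agreement rate:

```python
# Minimal sketch: correlation vs. agreement between two ways of ranking fuzzers.
# The fuzzer names and averages below are invented for illustration only.
from itertools import combinations
from scipy.stats import spearmanr

avg_branches = {"A": 14200, "B": 13900, "C": 13100, "D": 12800, "E": 11500}
avg_bugs     = {"A": 9,     "B": 11,    "C": 7,     "D": 8,     "E": 4}

fuzzers  = sorted(avg_branches)
coverage = [avg_branches[f] for f in fuzzers]
bugs     = [avg_bugs[f] for f in fuzzers]

# Correlation: do fuzzers that cover more code tend to find more bugs?
rho, _ = spearmanr(coverage, bugs)
print(f"Spearman rank correlation: {rho:.2f}")

# Agreement: for how many fuzzer pairs do both measures pick the same winner?
pairs = list(combinations(fuzzers, 2))
agree = sum((avg_branches[a] > avg_branches[b]) == (avg_bugs[a] > avg_bugs[b])
            for a, b in pairs)
print(f"Pairwise agreement on the better fuzzer: {agree}/{len(pairs)}")
```

With these made-up numbers the correlation is high (0.80), yet two of the ten fuzzer pairs flip winners; that gap between correlation and agreement is exactly what the paper quantifies.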
Why Preregistration
Many pitfalls of experimental design! Newcomers find out
only when receiving the reviews and after conducting
costly experiments following a flawed methodology.
Symptomatic plus-one comments.
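The sketch below illustrates the "statistical significance, effect size" item from the evaluation checklist above: comparing two fuzzers over repeated campaigns with a Mann-Whitney U test and a Vargha-Delaney A12 effect size. It is only an illustration, not a protocol prescribed by the preregistration model; the coverage samples and the small a12 helper are made up for the example.

```python
# Minimal sketch of a pairwise fuzzer comparison over repeated trials.
# The branch-coverage samples are invented; a real evaluation would use
# ~20 repetitions of 24-hour campaigns per fuzzer and per target program.
from scipy.stats import mannwhitneyu

fuzzer_a = [10412, 10388, 10501, 10290, 10455, 10367, 10420, 10398, 10489, 10350]
fuzzer_b = [10120, 10233, 10098, 10310, 10175, 10240, 10155, 10199, 10280, 10130]

# Mann-Whitney U test: is the difference in final coverage statistically significant?
_, p_value = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

def a12(x, y):
    """Vargha-Delaney A12: probability that a random run of x beats a random run of y."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties    = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))

print(f"Mann-Whitney U p-value: {p_value:.4f}")
print(f"Vargha-Delaney A12 effect size: {a12(fuzzer_a, fuzzer_b):.2f}")
```

In a real evaluation this comparison would typically be repeated for every fuzzer pair and every target program, which is where the "CPU centuries" come from.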
Why Preregistration
• Address both issues by switching to a 2-stage publication process that
  separates the review of (i) the methodology & ideas and (ii) the evidence.
• If the Registered Report is in-principle accepted and the proposed experimental design is
  followed without unexplained deviations, the results will be accepted as they are.
  • Minimizes the incentive to overclaim (while not reducing the quality of the evaluation).
  • Allows publication of interesting ideas and investigations irrespective of results.
• Early feedback for newcomers.
  • On significance and novelty of the problem/approach/hypothesis.
  • On soundness and reproducibility of the experimental methodology.
  • To further lower the barrier, Google pledges help with fuzzer evaluation via FuzzBench.
• We hope our initiative will turn the focus of the peer-reviewing process
  back to the innovation and key claims in a paper, while leaving the burden of
  evidence until after the in-principle acceptance.
• Reviewers go from gate-keeping to productive feedback.
  Authors and reviewers work together to ensure the best possible study design.
Why Preregistration
Your thoughts or experience?
Why Preregistration
• What do you see as the main strengths of the model?
• More reproducibility.
• Fewer overclaims; mitigates publication bias; less unhealthy focus on positive results.
• Publications are more sound. The publication process is more fair.
• Allows interesting negative results, no forced positive result, less duplicated effort.
• Ideas and methodology above positive results.
Why Preregistration
• What do you see as the main strengths of the model?
"The main draws for me are the removal of the unhealthy focus on positive results
(bad for students, bad for reproducibility, bad for impact) as well as the fact that
the furthering of the field is pushed forward with negative results regarding newly
attempted studies that have already been performed by others. Lastly, it removes
the questionable aspect of changing the approach until something working
appears, with no regard for a validation step. In ML lingo, we only have a test set,
no validation set, and are implicitly overfitting to it with our early stopping."
Why Preregistration
• What do you see as the main weaknesses of the model?
• Time to publish is too long. Increased author / reviewing load.
  "At first hand maybe longer publication process because of the pre-registration,
  but overall it could be even faster, when someone also includes the time for
  rejection and re-work etc."
• Sound experimental designs may be hard to create and vet / review.
  • For the first time, preregistration enables conversations about the soundness of
    experimental design. It naturally creates and communicates community standards.
  • Previously, experimental design was either accepted as is
    or criticized with a high cost to authors.
• Is the model flexible enough to accommodate changes in experimental design?
  • Yes. Deviations from the agreed protocol are allowed but must be explained.
• Ideas that look bad theoretically may work well in practice.
  • Without performing the experiment, we can't say if it could be useful or not.
  • The model is not meant to substitute the traditional publication model, but to augment it.
  • This model might not work very well for exploratory research (hypothesis generation).
  • This model might work better for confirmatory research (hypothesis testing).
Why Preregistration
• In your opinion, how could this publication model be improved?
• Stage 2 publication in a conference, instead of a journal.
  • We see the conference as a forum for discussion (which happens in this workshop).
  • Maybe Stage 1 in a conference, Stage 2 in a journal (+ conference presentation)?
• Fast-track through Stage 1 and Stage 2 when results exist.
  • Sounds like a more traditional publication, not preregistration :)
• Flexible author-list within reason, to incentivize post-announcement collaboration.
  • Preregistration (where Stage 1 is published) would also allow early deconflicting or
    lead to increased collaboration between people with similar ideas and goals.