An Implementation of Preregistration
We think that the incentive structure for fuzzing research is broken;
so we would like to introduce preregistration to fix this.

Preregistration

Stage 1
• Establish significance.
• Motivate the problem.
• Establish novelty.
• Discuss hypothesis for solution.
• Discuss related work.
• Establish soundness.
• Experimental design.
• Research questions & claims.
• Benchmarks & baselines.

Outcomes of Stage 1:
• In-principle Accepted! Go to Stage 2.
• Major / Minor Revision. Back to Stage 1.
• Rejected.

Stage 2
• Establish conformity.
• Execute the agreed experimental protocol.
• Explain small deviations from the protocol.
• Investigate unexpected results.
• Establish reproducibility.
• Submit evidence towards the key claims in the paper.

Outcomes of Stage 2:
• Accept.
• Major / Minor Revision: Explain deviations / unexpected results. Improve artifact / reproducibility.
• Reject: Severe deviations from the experimental protocol.
Why Preregistration
• To get your fuzzing paper published, you need strong positive results.
• We believe this unhealthy focus is a substantial inhibitor of scientific progress.
• Duplicated Efforts: Important investigations are never published.
  • A hypothesis or approach may be perfectly reasonable and scientifically appealing;
    if the hypothesis proves to be invalid or the approach ineffective, other groups will never know.
• Overclaims: Incentive to overclaim the benefits of an approach.
  • Difficult to reproduce the results; misinforms future investigations by the community.
  • Authors are uncomfortable sharing their research prototypes.
    In 2020, only 35 of 60 fuzzing papers we surveyed published code with the paper.
Why Preregistration
• Sound fuzzer evaluation imposes a high barrier to entry for newcomers.
1. A well-designed experimental methodology.
2. Substantial computational resources.
• Huge variance due to randomness.
• Repeat 20x, 24 hrs, X fuzzers, Y programs.
• Statistical significance, effect size (see the sketch further below).
• CPU centuries.
On the Reliability of Coverage-Based Fuzzer Benchmarking
Marcel Böhme
MPI-SP, Germany
Monash University, Australia
László Szekeres
Google, USA
Jonathan Metzman
Google, USA
ABSTRACT
Given a program where none of our fuzzers finds any bugs, how do
we know which fuzzer is better? In practice, we often look to code
coverage as a proxy measure of fuzzer effectiveness and consider
the fuzzer which achieves more coverage as the better one.
Indeed, evaluating 10 fuzzers for 23 hours on 24 programs, we
find that a fuzzer that covers more code also finds more bugs. There
is a very strong correlation between the coverage achieved and the
number of bugs found by a fuzzer. Hence, it might seem reasonable
to compare fuzzers in terms of coverage achieved, and from that
derive empirical claims about a fuzzer’s superiority at finding bugs.
Curiously enough, however, we find no strong agreement on
which fuzzer is superior if we compared multiple fuzzers in terms
of coverage achieved instead of the number of bugs found. The
fuzzer best at achieving coverage, may not be best at finding bugs.
ACM Reference Format:
Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the Reliability
of Coverage-Based Fuzzer Benchmarking. In 44th International Conference
on Software Engineering (ICSE ’22), May 21–29, 2022, Pittsburgh, PA, USA.
ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3510003.3510230
1 INTRODUCTION
In the recent decade, fuzzing has found widespread interest. In
industry, we have large continuous fuzzing platforms employing
100k+ machines for automatic bug finding [23, 24, 46]. In academia,
in 2020 alone, almost 50 fuzzing papers were published in the top
conferences for Security and Software Engineering [62].
Imagine, we have several fuzzers available to test our program.
Hopefully, none of them finds any bugs. If indeed they don’t, we
might have some confidence in the correctness of the program.
Then again, even a perfectly non-functional fuzzer would find no
bugs in our program. So, how do we know which fuzzer has the
highest “potential” of finding bugs? A widely used proxy measure
of fuzzer effectiveness is the code coverage that is achieved. After
all, a fuzzer cannot find bugs in code that it does not cover.
Indeed, in our experiments we identify a very strong positive
correlation between the coverage achieved and the number of bugs
found by a fuzzer. Correlation assesses the strength of the association
between two random variables or measures. We conduct our
empirical investigation on 10 fuzzers × 24 C programs × 20 fuzzing
campaigns of 23 hours (≈ 13 CPU years). We use three measures of
coverage and two measures of bug finding, and our results suggest:
As the fuzzer covers more code, it also discovers more bugs.
[Figure 1: Scatterplot of the ranks of 10 fuzzers applied to 24 programs for (a) 1 hour
and (b) 23 hours, when ranking each fuzzer in terms of the avg. number of branches
covered (x-axis) and in terms of the avg. number of bugs found (y-axis).
(a) 1 hour fuzzing campaigns (d = 0.38); (b) 1 day fuzzing campaigns (d = 0.49).]
Hence, it might seem reasonable to conjecture that the fuzzer
which is better in terms of code coverage is also better in terms
of bug finding—but is this really true? In Figure 1, we show the
ranking of these fuzzers across all programs in terms of the average
coverage achieved and the average number of bugs found in each
benchmark. The ranks are visibly different. To be sure, we also
conducted a pair-wise comparison between any two fuzzers where
the difference in coverage and the difference in bug finding are
statistically significant. The results are similar.
We identify no strong agreement on the superiority or ranking
of a fuzzer when compared in terms of mean coverage versus mean
bug finding. Inter-rater agreement assesses the degree to which
two raters, here both types of benchmarking, agree on the superiority
or ranking of a fuzzer when evaluated on multiple programs.
Indeed, two measures of the same construct are likely to exhibit a
high degree of correlation but can at the same time disagree
substantially [41, 55]. We evaluate the agreement on fuzzer superiority
when comparing any two fuzzers where the differences in terms of
coverage and bug finding are statistically significant. We evaluate
the agreement on fuzzer ranking when comparing all the fuzzers.
Concretely, our results suggest a moderate agreement. For fuzzer
pairs, where the differences in terms of coverage and bug finding
are statistically significant, the results disagree for 10% to 15% of
programs. Only when measuring the agreement between branch
coverage and the number of bugs found and when we require the
differences to be statistically significant at p ≤ 0.0001 for coverage
and bug finding, do we find a strong agreement. However, statistical
significance at p ≤ 0.0001 only in terms of coverage is not sufficient;
we again find only weak agreement. The increase in agreement
with statistical significance is not observed when we measure bug
finding using the time-to-error. We also find that the variance of the
agreement reduces as more programs are used, and that results of
1h campaigns do not strongly agree with results of 23h campaigns.
ICSE’22
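To make the excerpt's distinction between correlation and agreement concrete, here is a minimal sketch; it is not taken from the paper, and the fuzzer names and per-benchmark averages are invented. It ranks a handful of fuzzers by mean coverage and by mean bugs found, then reports a rank correlation alongside a simple pairwise agreement rate:

```python
# Minimal sketch: correlation vs. agreement between two ways of ranking fuzzers.
# The fuzzer names and averages below are invented for illustration only.
from itertools import combinations
from scipy.stats import spearmanr

avg_branches = {"A": 14200, "B": 13900, "C": 13100, "D": 12800, "E": 11500}
avg_bugs     = {"A": 9,     "B": 11,    "C": 7,     "D": 8,     "E": 4}

fuzzers  = sorted(avg_branches)
coverage = [avg_branches[f] for f in fuzzers]
bugs     = [avg_bugs[f] for f in fuzzers]

# Correlation: do fuzzers that cover more code tend to find more bugs?
rho, _ = spearmanr(coverage, bugs)
print(f"Spearman rank correlation: {rho:.2f}")

# Agreement: for how many fuzzer pairs do both measures pick the same winner?
pairs = list(combinations(fuzzers, 2))
agree = sum((avg_branches[a] > avg_branches[b]) == (avg_bugs[a] > avg_bugs[b])
            for a, b in pairs)
print(f"Pairwise agreement on the better fuzzer: {agree}/{len(pairs)}")
```

With these made-up numbers the correlation is high (0.80), yet two of the ten fuzzer pairs flip winners; that gap between correlation and agreement is exactly what the paper quantifies.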
Why Preregistration
Many pitfalls of experimental design! Newcomers find out
only when receiving the reviews and after conducting
costly experiments following a flawed methodology.
Symptomatic plus-one comments.
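The sketch below illustrates the "statistical significance, effect size" item from the evaluation checklist above: comparing two fuzzers over repeated campaigns with a Mann-Whitney U test and a Vargha-Delaney A12 effect size. It is only an illustration, not a protocol prescribed by the preregistration model; the coverage samples and the small a12 helper are made up for the example.

```python
# Minimal sketch of a pairwise fuzzer comparison over repeated trials.
# The branch-coverage samples are invented; a real evaluation would use
# ~20 repetitions of 24-hour campaigns per fuzzer and per target program.
from scipy.stats import mannwhitneyu

fuzzer_a = [10412, 10388, 10501, 10290, 10455, 10367, 10420, 10398, 10489, 10350]
fuzzer_b = [10120, 10233, 10098, 10310, 10175, 10240, 10155, 10199, 10280, 10130]

# Mann-Whitney U test: is the difference in final coverage statistically significant?
_, p_value = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

def a12(x, y):
    """Vargha-Delaney A12: probability that a random run of x beats a random run of y."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties    = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))

print(f"Mann-Whitney U p-value: {p_value:.4f}")
print(f"Vargha-Delaney A12 effect size: {a12(fuzzer_a, fuzzer_b):.2f}")
```

In a real evaluation this comparison would typically be repeated for every fuzzer pair and every target program, which is where the "CPU centuries" come from.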
Why Preregistration
• Address both issues by switching to a 2-stage publication process that
  separates the review of (i) the methodology & ideas and (ii) the evidence.
• If the Registered Report is in-principle accepted and the proposed experimental design is
  followed without unexplained deviations, the results will be accepted as they are.
  • Minimizes the incentive to overclaim (while not reducing the quality of the evaluation).
  • Allows publication of interesting ideas and investigations irrespective of results.
• Early feedback for newcomers.
  • On significance and novelty of the problem/approach/hypothesis.
  • On soundness and reproducibility of the experimental methodology.
  • To further lower the barrier, Google pledges help with fuzzer evaluation via FuzzBench.
• We hope our initiative will turn the focus of the peer-reviewing process
  back to the innovation and key claims in a paper, while leaving the burden of
  evidence until after the in-principle acceptance.
• Reviewers go from gate-keeping to productive feedback.
  Authors and reviewers work together to ensure the best possible study design.
Why Preregistration
Your thoughts or experience?
Why Preregistration
• What do you see as the main strengths of the model?
• More reproducibility.
• Fewer overclaims; mitigates publication bias; less unhealthy focus on positive results.
• Publications are more sound. The publication process is more fair.
• Allows interesting negative results, no forced positive result, less duplicated effort.
• Ideas and methodology above positive results.
Why Preregistration
• What do you see as the main strengths of the model?
"The main draws for me are the removal of the unhealthy focus on positive results
(bad for students, bad for reproducibility, bad for impact) as well as the fact that
the furthering of the field is pushed forward with negative results regarding newly
attempted studies that have already been performed by others. Lastly, it removes
the questionable aspect of changing the approach until something working
appears, with no regard for a validation step. In ML lingo, we only have a test set,
no validation set, and are implicitly overfitting to it with our early stopping."
Why Preregistration
• What do you see as the main weaknesses of the model?
• Time to publish is too long. Increased author / reviewing load.
  "At first hand maybe longer publication process because of the pre-registration,
  but overall it could be even faster, when someone also includes the time for
  rejection and re-work etc."
• Sound experimental designs may be hard to create and vet / review.
  • For the first time, preregistration enables conversations about the soundness of
    experimental design. It naturally creates and communicates community standards.
  • Previously, experimental design was either accepted as is
    or criticized with a high cost to authors.
• Is the model flexible enough to accommodate changes in experimental design?
  • Yes. Deviations from the agreed protocol are allowed but must be explained.
• Ideas that look bad theoretically may work well in practice.
  • Without performing the experiment, we can't say if it could be useful or not.
  • The model is not meant to substitute the traditional publication model, but to augment it.
  • This model might not work very well for exploratory research (hypothesis generation).
  • This model might work better for confirmatory research (hypothesis testing).
Why Preregistration
• In your opinion, how could this publication model be improved?
• Stage 2 publication in a conference, instead of a journal.
  • We see the conference as a forum for discussion (which happens in this workshop).
  • Maybe Stage 1 in a conference, Stage 2 in a journal (+ conference presentation)?
• Fast-track through Stage 1 and Stage 2 when results exist.
  • Sounds like a more traditional publication, not preregistration :)
• Flexible author-list within reason, to incentivize post-announcement collaboration.
  • Preregistration (where Stage 1 is published) would also allow early deconflicting or
    lead to increased collaboration between people with similar ideas and goals.