Statistical "Reforms":
Fixing Science or Threats to
Replication and Falsification
Deborah G Mayo
June 7, 2022
The Du Bois-Wells Symposium on Socially Aware Data Science
and Ethical Issues in Modeling
Mounting failures of replication give a new urgency
to critically appraising proposed statistical reforms.
• While many are welcome
o preregistration
o replication
o avoiding cookbook statistics
• Others are quite radical!
Replication Crisis Paradox
Critic of P-values: It’s much too easy to get
a small P-value
Crisis of replication: It is much too difficult to
replicate small P-values (with prespecified
hypotheses)
Is it easy or is it hard?
• R.A. Fisher: it’s easy to lie with statistics by selective reporting (the “political principle that anything can be proved by statistics,” 1955, 75)
• Sufficient finagling—cherry-picking, data-
dredging, multiple testing, optional stopping—
may practically guarantee a preferred claim H
appears supported, even if it’s unwarranted
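A minimal simulation sketch of this point (assuming numpy and scipy; all numbers illustrative): peeking after each new observation and stopping as soon as P < 0.05 drives the actual Type I error rate well above the nominal 5%, even though H0 is true.

```python
# Sketch: optional stopping inflates the Type I error rate.
# Illustrative setup: H0: mu = 0 is true, sigma = 1, one-sided test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_max, alpha, n_trials = 100, 0.05, 5000
false_rejections = 0

for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_max)        # data generated under H0
    for n in range(10, n_max + 1):         # peek after every new observation
        z = x[:n].mean() * np.sqrt(n)
        if 1 - stats.norm.cdf(z) < alpha:  # stop as soon as P < 0.05
            false_rejections += 1
            break

print(f"Actual Type I error rate: {false_rejections / n_trials:.2f}")
# Typically several times the nominal 0.05.
```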
• “We knew many researchers - including ourselves - who readily admitted to [biasing selection effects] … but they thought it was wrong the way it's wrong to jaywalk. …simulations revealed it was wrong the way it's wrong to rob a bank.” (Simmons, Nelson and Simonsohn 2018, 255)
• The “21 word solution” (Simmons, Nelson and Simonsohn 2012)
This underwrites key features of significance tests:
• to bound the probabilities of misleading inferences (error probabilities)
• to constrain the human tendency to selectively favor views one believes in.
They should not be replaced with methods less
able to control erroneous interpretations of data.
Not if we want socially aware data science.
(Simple) Statistical significance tests
Significance tests (R.A. Fisher) are a small part of a
rich error statistical methodology:
“…to test the conformity of the particular data
under analysis with H0 in some respect….”
…the P-value: the probability of an even larger value of t_obs merely from background variability or noise (Mayo and Cox 2006, 81)
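As a minimal sketch of this definition (a one-sample z-test of H0: μ = 0 with known σ = 1; the data are hypothetical), the P-value is the probability, computed under H0, of a test statistic even larger than the one observed:

```python
# Sketch: P-value = Pr(T >= t_obs; H0) for a one-sample z-test.
import numpy as np
from scipy import stats

x = np.array([0.3, 0.5, -0.1, 0.8, 0.4, 0.6])  # hypothetical data
n, sigma = len(x), 1.0
t_obs = x.mean() / (sigma / np.sqrt(n))        # observed test statistic
p_value = 1 - stats.norm.cdf(t_obs)            # chance of an even larger value under H0

print(f"t_obs = {t_obs:.2f}, P-value = {p_value:.3f}")
```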
Testing reasoning, as we see it
• If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of inconsistency with H0
• A small P-value indicates H1, some underlying discrepancy from H0, because very probably (with probability 1 − P) you would have seen a smaller difference than t_obs were H0 true.
• Even if the small P-value is valid, it isn’t evidence
of a scientific conclusion H*
The statistical-substantive (Stat-Sub) fallacy: H1 ⇒ H*
Neyman-Pearson (N-P) put
Fisherian tests on firmer
footing (1933):
Introduces explicit alternative hypotheses:
H0: μ ≤ 0 vs. H1: μ > 0
• Constrains tests by requiring control of both the Type I error (erroneously rejecting H0) and the Type II error (erroneously failing to reject H0), along with power, as sketched below
(Neyman also developed confidence interval
estimation at the same time)
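A sketch of this error control for the test just stated (σ = 1 known; the choices of n, α, and μ1 are illustrative, not from the talk): fixing the Type I error at α determines the cutoff, and power is then evaluated at an alternative of interest.

```python
# Sketch: Type I error control and power for H0: mu <= 0 vs. H1: mu > 0.
import numpy as np
from scipy import stats

n, alpha = 25, 0.05
c = stats.norm.ppf(1 - alpha)  # reject when z >= c; Type I error fixed at alpha
mu1 = 0.5                      # illustrative alternative of interest
power = 1 - stats.norm.cdf(c - mu1 * np.sqrt(n))  # Pr(reject; mu = mu1)

print(f"cutoff c = {c:.2f}, power at mu = {mu1}: {power:.2f}")
# The Type II error probability at mu1 is 1 - power.
```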
N-P tests as tools for optimal performance:
Their success in optimal control of error
probabilities gives a new paradigm for
statistics
Yet a major criticism: they are only “accept/reject” rules that should be “retired” from science, fit at most for quality control decisions (e.g., Amrhein et al. 2019)
• Our view is that reliable error control is crucial
for warranting specific inferences (not mere
long-run quality control)
• What bothers you about selective reporting, cherry picking, stopping when the data look good, P-hacking?
• Not a problem about long-run performance:
• We cannot say the test has done its job, in the case at hand, of avoiding sources of misinterpreting data
Basis for the severe testing
philosophy of evidence
• Use error probabilities to assess capabilities
of tools to probe various flaws (“probativism”)
• Data supply good evidence for a claim only if it
has been subjected to and passes a test that
probably would have found it flawed or
specifiably false (if it is).
• Altering this capability changes the evidence (on
our view)
On a rival view of evidence…
“Two problems that plague frequentist [error
statistical] inference: multiple comparisons and
multiple looks, or, …data dredging and peeking at the
data. The frequentist solution to both problems
involves adjusting the P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(Meta-Research Innovation Center at Stanford)
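For concreteness, a sketch of the kind of adjustment Goodman is objecting to (Bonferroni, the simplest case; the P-values are hypothetical): with m hypotheses tried on the same data, each P-value is multiplied by m to hold the family-wise Type I error at the nominal level.

```python
# Sketch: Bonferroni adjustment for m looks at the same data.
raw_p = [0.004, 0.03, 0.2, 0.6]  # hypothetical P-values from m = 4 hypotheses
m = len(raw_p)
adjusted = [min(1.0, p * m) for p in raw_p]
print(adjusted)                   # [0.016, 0.12, 0.8, 1.0]
```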
This view of evidence
(probabilist) is based on the
Likelihood Principle (LP)
All the evidence is contained in the ratio of
likelihoods:
Pr(x0;H0)/Pr(x0;H1)
Data x0 support H0 less well than H1 if H0 is less likely than H1 in this technical sense
Likelihood Principle (LP) vs error
statistics
• Any hypothesis that perfectly fits the data is
maximally likely
• Pr(H0 is less well supported than H1; H0) is high
for some H1 or other
• So even with strong support, the error
probability associated with the inference is high
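A small sketch of the middle bullet (numpy/scipy assumed; setup illustrative): generate data under H0: μ = 0 and compare its likelihood with that of the alternative tailored to the data, H1: μ = x̄. The tailored hypothesis is “better supported” in essentially every sample, so a favorable likelihood ratio by itself carries a high error probability.

```python
# Sketch: under H0 (mu = 0), the data-dredged alternative H1: mu = x_bar
# is (almost) always more likely than H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, trials, wins = 20, 2000, 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n)                       # H0 is true
    lik_H0 = stats.norm.pdf(x, 0.0, 1.0).prod()
    lik_H1 = stats.norm.pdf(x, x.mean(), 1.0).prod()  # maximally likely alternative
    wins += lik_H1 > lik_H0

print(f"H1 beats H0 in {wins / trials:.0%} of samples generated under H0")
```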
All error probabilities
violate the Likelihood Principle
(LP):
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
Many “reforms” offered as alternatives to significance tests follow the LP
• “Bayes factors can be used in the complete absence
of a sampling plan…” (Bayarri, Benjamin, Berger,
Sellke 2016, 100)
• “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.” (Berger and Wolpert, The Likelihood Principle 1988, 78)
In testing the mean of a standard normal distribution
Bayesian sequential (adaptive)
analysts say:
“The [regulatory] requirement of type I error control
for Bayesian adaptive designs causes them to lose
many of their philosophical advantages, such as
compliance with the likelihood principle” (Ryan et al.
2020, radiation oncology)
Our view: Why buy a philosophy that relinquishes
error probability control?
This is of relevance to AI/ML
If the field follows critics of explainable AI/ML:
“regulators should place more emphasis on
well-designed clinical trials, at least for some
higher-risk devices, and less on whether the
AI/ML system can be explained”. (Babic et al.
2021)
“Beware Explanations From AI in Health Care” (the title of Babic et al. 2021)
Bayesians may (indirectly) block
implausible inferences
• With a low prior degree of belief in H (e.g., that there is a real effect), the Bayesian can block inferring H
Concerns:
• Doesn’t show what has gone wrong—it’s the
multiplicity
• The believability of post hoc hypotheses is
what makes them so seductive
• Claims can be highly probable (or even
known) while poorly probed.
How to obtain
and interpret Bayesian priors?
• Most (?) use nonsubjective or default priors, to prevent prior beliefs from influencing posteriors (letting the data dominate)
• There is no agreement on which of rival systems
to use. (They may not even be probabilities.)
(e.g., maximum entropy, invariance, maximizing the
missing information, coverage matching)
There may be ways to combine
Bayesian and error statistical accounts
(Gelman: Falsificationist Bayesian; Shalizi: error
statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in which
certain of the model’s implications are compared
directly to the data.” (Gelman and Shalizi 2013, 10,
20).
• Can’t also champion “abandoning statistical significance”, as in a recent “reform”
A recent recommended “reform”:
Don’t say ‘significance’, don’t use
P-value thresholds
• In 2019, the executive director of the American Statistical Association (ASA) (and 2 co-authors*) announced that “declarations of ‘statistical significance’ be abandoned”
• “It is time to stop using the term ‘statistically
significant’ entirely.”
*Wasserstein, Schirm & Lazar
• We agree the actual P-value should be
reported (as all the founders of tests
recommended)
• But the 2019 Editorial says prespecified P-
value thresholds should not be used at all in
interpreting results.
• Many who signed on to the “no threshold” view think that by removing P-value thresholds, researchers lose an incentive to data-dredge, multiple-test, and otherwise exploit researcher flexibility
• D. Hand and I (2022) argue: banning the use of
P-value thresholds in interpreting data does not
diminish but rather exacerbates data-dredging
• In a world without predesignated thresholds,
it would be hard to hold the data dredgers
accountable for reporting a nominally small P-
value through ransacking, data dredging,
trying and trying again
• What distinguishes data-dredged P-values from valid ones is that they fail to meet a prespecified error probability
“Statistical Significance and its Critics: Practicing damaging
science, or damaging scientific practice?” (Mayo and Hand
2022)
No tests, no falsification
• If you cannot say about any results, ahead of time, that they will not be allowed to count in favor of a claim C, then you do not have a test of C
• Why insist on replications if at no point can you say the effect has failed to replicate?
• Nor can you test the assumptions of
statistical models and likelihood functions
ASA (President’s) Task Force on
Statistical Significance and
Replicability
• In 2019, the ASA President appointed a task force of 14 statisticians, put in the odd position of needing:
“to address concerns that [the ASA executive
director’s “no significance” editorial] might be
mistakenly interpreted as official ASA policy”
“P-values and significance testing, properly applied
and interpreted, are important tools that should not
be abandoned.” (Benjamini et al. 2021)
The ASA President’s Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
The Task Force also states:
“P-values and significance tests are among the
most studied and best understood statistical
procedures in the statistics literature”.
As an aside, Stephen Stigler asks:
“…[of] which of the exciting new methods of modern data science and machine learning can the same be said?” (Stigler 2022)
Our view: Reformulate Tests
• Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not well warranted
• Avoids fallacies of significance and
nonsignificance, and improves on confidence
interval estimation
Severity Reformulation
Severity function: SEV(Test T, data x, claim C)
In a nutshell: one tests several discrepancies
from a test hypothesis and infers those well or
poorly warranted
Akin to confidence distributions
To avoid Fallacies of Rejection
(e.g., magnitude error)
Testing the mean of a Normal distribution, H0: μ ≤ 0 vs. H1: μ > 0, consider
H0: μ ≤ μ1 vs. H1: μ > μ1
for μ1 = μ0 + γ. If you very probably would have observed a more impressive (smaller) P-value were μ = μ1, then the data are poor evidence that μ > μ1:
SEV(μ > μ1) is low
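A sketch of the computation (σ = 1 known; n and the observed mean are illustrative): for a just-significant result, SEV(μ > μ1) = Pr(X̄ ≤ x̄_obs; μ = μ1), which drops off quickly as the claimed discrepancy μ1 grows.

```python
# Sketch: severity for claims mu > mu1 after a just-significant result.
import numpy as np
from scipy import stats

n, sigma, x_bar = 100, 1.0, 0.2  # z = 2.0, P ~ 0.023 (illustrative)
for mu1 in [0.0, 0.1, 0.2, 0.3]:
    sev = stats.norm.cdf((x_bar - mu1) * np.sqrt(n) / sigma)
    print(f"SEV(mu > {mu1:.1f}) = {sev:.2f}")
# High for mu > 0 (~0.98), but low for mu > 0.3 (~0.16):
# poor evidence of a discrepancy that large.
```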
Power vs. Severity for μ > μ1
To give an informal construal,
consider how severity avoids the
“large n problem”
• Fixing the P-value while increasing the sample size n, the cut-off for rejection gets smaller
• You reach a point where the observed mean is closer to the null than to various alternatives
Severity tells us:
• an observed difference just statistically significant at level α indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that doesn’t
go off unless the house is fully ablaze?
• The larger sample size is like the one that goes off with
burnt toast
[assumptions are presumed to hold]
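A numeric sketch of the burnt-toast point (σ = 1 known; α = 0.025 and the benchmark discrepancy μ1 = 0.2 are illustrative): the same just-significant result arising from a much larger sample warrants a much smaller discrepancy.

```python
# Sketch: a just-significant x_bar at two sample sizes, same alpha.
import numpy as np
from scipy import stats

for n in [25, 400]:
    x_bar = 1.96 / np.sqrt(n)                         # just significant at alpha = 0.025
    sev = stats.norm.cdf((x_bar - 0.2) * np.sqrt(n))  # SEV(mu > 0.2)
    print(f"n = {n:3d}: x_bar = {x_bar:.3f}, SEV(mu > 0.2) = {sev:.2f}")
# n = 25 gives SEV ~ 0.83; n = 400 gives SEV ~ 0.02:
# the larger-n alarm is the one that goes off with burnt toast.
```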
What About Fallacies of
Non-Significant Results?
• They don’t warrant a 0 discrepancy
• Using severity reasoning: rule out discrepancies that very probably would have resulted in larger differences than observed, setting upper bounds
• If you very probably would have observed a larger value of the test statistic (a smaller P-value) were μ = μ1, then the data indicate that μ < μ1:
SEV(μ < μ1) is high
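A companion sketch for non-significant results (same illustrative setup, σ = 1 known): SEV(μ < μ1) = Pr(X̄ > x̄_obs; μ = μ1) gives the upper bounds that are, and are not, warranted.

```python
# Sketch: upper bounds from a non-significant result.
import numpy as np
from scipy import stats

n, x_bar = 100, 0.05  # insignificant result: z = 0.5
for mu1 in [0.1, 0.2, 0.3]:
    sev = 1 - stats.norm.cdf((x_bar - mu1) * np.sqrt(n))
    print(f"SEV(mu < {mu1:.1f}) = {sev:.2f}")
# mu < 0.3 passes with high severity (~0.99): discrepancies that large
# are ruled out, without claiming a zero effect.
```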
Brief overview
• The sources of irreplication are not
mysterious: in many fields, latitude in
collecting and interpreting data makes it too
easy to dredge up impressive looking
findings even when spurious.
• Some of the reforms intended to fix science enable rather than reveal illicit inferences due to multiple testing and data-dredging (either they obey the LP or they ban thresholds)
• Banning the use of P-value thresholds in
interpreting data does not diminish but rather
exacerbates data-dredging and biasing
selection effects.
• If an account cannot specify outcomes that
will not be allowed to count as evidence for a
claim—if all thresholds are abandoned—then
there is no test of that claim
• We should instead reformulate tests so as to
avoid fallacies and report the extent of
discrepancies that are and are not indicated
with severity.
Mayo (1996, 2018); Mayo and Cox (2006):
Frequentist Principle of Evidence (FEV); SEV: Mayo
and Spanos (2006), Mayo and Hand (2022)
FEV/SEV (significant result): A small P-value is evidence of a discrepancy γ from H0 if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as γ absent.
FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy γ to exist.
References
• Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Scientists rise up against statistical significance. Nature, 567, 305–307. https://doi.org/10.1038/d41586-019-00857-9
• Babic, B., Gerke, S., Evgeniou, T., & Cohen, I. G. (2021). Beware explanations from AI in health care. Science, 373(6552), 284–286. https://doi.org/10.1126/science.abg1834
• Bayarri, M., Benjamin, D., Berger, J., & Sellke, T. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72, 90–103.
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)
• Berger, J. O., & Wolpert, R. (1988). The Likelihood Principle (2nd ed.). Vol. 6, Lecture Notes–Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B (Methodological), 17(1), 69–78.
• Gelman, A., & Shalizi, C. (2013). Philosophy and the practice of Bayesian statistics (and Rejoinder). British Journal of Mathematical and Statistical Psychology, 66(1), 8–38; 76–80.
• Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130, 1005–1013.
• Lindley, D. V. (1971). The estimation of many parameters. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of Statistical Inference (pp. 435–455). Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
• Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
References cont.
• Mayo, D. G., & Cox, D. R. (2006). Frequentist statistics as a theory of inductive inference. In J. Rojo (Ed.), The Second Erich L. Lehmann Symposium: Optimality (Lecture Notes–Monograph Series, Vol. 49, pp. 247–275). Institute of Mathematical Statistics.
• Mayo, D. G., & Hand, D. (2022). Statistical significance and its critics: Practicing damaging science, or damaging scientific practice? Synthese, 200, 220. https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. British Journal for the Philosophy of Science, 57(2), 323–357.
• Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337. Reprinted in Joint Statistical Papers, 140–185.
• Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1), 1–9.
• Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4–7.
• Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives on Psychological Science, 13(2), 255–259. https://doi.org/10.1177/1745691617698146
• Stigler, S. (2022). Comment at the NISS Awards Ceremony & Affiliate Luncheon Program (discussion leading to the Task Force’s final report), August 2, 2021. https://www.niss.org/events/niss-awards-ceremony-affiliate-luncheon-program
• Wasserstein, R., & Lazar, N. (2016). The ASA’s statement on p-values: Context, process and purpose (and supplemental materials). The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05” [Editorial]. The American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913