Statistical Inference as Severe
Testing: Beyond Performance
and Probabilism
Deborah G Mayo
Dept of Philosophy, Virginia Tech
Seminar in Advanced Research Methods
Dept of Psychology, Princeton University
November 14, 2023
1
Philosophical controversies in
statistics
Both ancient and up to the minute:
• How do humans learn about the world despite
threats of error due to incomplete and variable data?
2
Role of Probability: performance or
probabilism?
• Despite “unifications,” long-standing battles
simmer below the surface in today’s
“statistical crisis in science”
• What’s behind it?
3
Minimal principle of evidence
• We set sail with a minimal principle:
• We don’t have evidence for a claim C if little,
if anything, has been done that would have
found C flawed, even if it is
4
Statistical inference as severe testing
• Probability is used to assess error-probing
capabilities
• Excavation tool for appraising reforms
5
Replication crisis leads to “reforms”
Several are welcome:
• preregistration of protocol, replication
checks, avoid cookbook statistics
Others are radical
• and even lead to violating our minimal
requirement for evidence
6
• How to subject today’s “reforms” to your
own severe, critical examination?
7
Most often used tools are most
criticized
“Several methodologists have pointed out
that the high rate of nonreplication of
research discoveries is a consequence of
the convenient, yet ill-founded strategy of
claiming conclusive research findings solely
on the basis of a single study assessed by
formal statistical significance, typically for a
p-value less than 0.05. …” (Ioannidis 2005,
696)
8
R.A. Fisher
“[W]e need, not an isolated record, but a
reliable method of procedure. In relation to
the test of significance, we may say that a
phenomenon is experimentally
demonstrable when we know how to
conduct an experiment which will rarely fail
to give us a statistically significant result.”
(Fisher 1947, 14)
9
Simple significance tests (Fisher)
“to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the
test statistic, such that
• the larger the value of T the more
inconsistent are the data with H0;
p = Pr(T ≥ tobs; H0)”
(Mayo and Cox 2006, 81) 10
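(A minimal sketch of this computation, not from the slides: a one-sided test of H0: μ = 0 for a normal mean with known σ; the numbers are illustrative.)

```python
# Sketch: p = Pr(T >= t_obs; H0) for H0: mu = 0 vs. H1: mu > 0,
# normal data with known sigma. Illustrative values only.
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(xbar, mu0, sigma, n):
    t_obs = sqrt(n) * (xbar - mu0) / sigma  # larger T: more inconsistent with H0
    return 1 - normal_cdf(t_obs)

# n = 25, sigma = 1, observed mean 0.4: t_obs = 2, p ~ 0.023
print(p_value(xbar=0.4, mu0=0.0, sigma=1.0, n=25))
```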
Testing Reasoning
• Small P-value indicates some underlying
discrepancy from H0 because very
probably you would have seen a less
impressive difference than tobs were
H0 true.
• This still isn’t evidence of a genuine
statistical effect H1, let alone a scientific
conclusion H*
11
Fallacy of rejection
• H* makes claims that haven’t been
probed by the statistical test
• The moves from experimental
interventions to H* don’t get enough
attention–but statistical accounts
should block them
12
13
Neyman and Pearson tests (1933) put
Fisherian tests on firmer ground:
Introduce alternative hypotheses H0, H1
H0: μ = 0 vs. H1: μ > 0
• Trade-off between Type I errors and Type II errors
• Restricts the inference to statistical alternatives (in a model)
14
Fisher-Neyman (pathological) battles
(after 1935)
• The success of N-P optimal error control
led to a new paradigm in statistics,
overshadowing Fisher.
15
Contemporary casualties of Fisher-
Neyman (N-P) battles
• N-P & Fisher tests claimed to be an “inconsistent
hybrid” (Gigerenzer 2004, 590):
• Fisherians can’t use power; N-P testers can’t report
P-values, only fixed error probabilities (e.g., P < .05)
• In fact, Fisher & N-P recommended both pre-
data error probabilities and post-data P-value
They are mathematically
essentially identical
• They both fall under tools for “appraising
and bounding the probabilities of seriously
misleading interpretations of data”
(Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian
tests, resampling, randomization
16
Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking,
significance seeking, multiple testing,
post-data subgroups, trying and trying
again—may practically guarantee an
impressive-looking effect, even if it’s
unwarranted by evidence
• Violates severity
17
18
Severity Requirement
• We have evidence for a claim C only to
the extent C has been subjected to and
passes a test that would probably have
found C flawed, just if it is.
• This probability is the stringency or
severity with which it has passed the
test.
Requires a third role for probability
Probabilism. To assign a degree of confirmation,
support or belief in a hypothesis, given data x0 (absolute
or comparative)
(e.g., Bayesian, likelihoodist, Fisher at times)
Performance. Ensure long-run reliability of methods,
coverage probabilities (frequentist, behavioristic
Neyman-Pearson, Fisher)
Only probabilism is thought to be inferential or evidential
19
What happened to using probability to
assess error-probing capacity?
• Neither “probabilism” nor “performance” directly
captures error probing capacity
• Good long-run performance is a necessary, not a
sufficient, condition for severity
20
Key to solving a central problem
for frequentists
• Why is good performance relevant for
inference in the case at hand?
• What bothers you about selective
reporting, cherry picking, stopping when
the data look good, P-hacking?
• Not problems about long runs—
21
We cannot say the case at hand has
done a good job of avoiding the
sources of misinterpreting data
A claim C is not warranted _______
• Probabilism: unless C is true or
probable (gets a probability boost, made
comparatively firmer)
• Performance: unless it stems from a
method with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done
to probe ways we can be wrong about C
23
Fishing for significance alters
probative capacity
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at the
5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent!* (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!) [*Pr(no success) = (.95)²⁰ ≈ .36]
24
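(Selvin’s 64 percent is easy to verify; a minimal sketch:)

```python
# Sketch: the actual significance level when the most promising of 20
# independent differences is selected and tested at the nominal .05 level.
alpha, k = 0.05, 20
pr_no_success = (1 - alpha) ** k        # (.95)^20 ~ 0.358
print(round(pr_no_success, 3))          # 0.358
print(round(1 - pr_no_success, 3))      # 0.642: Selvin's 64 percent
```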
• Frequentists (error statisticians) need to
adjust P-values to avoid being “fooled
by randomness”
25
• In a classic example in the volume,
adjustments for multiplicity led to
falsifying claimed infant training benefits
on personality (from the 1940s)
• Only 18 of 460 tests were found
statistically significant (11 in the
direction expected by Freudian theory
of the day)
26
Today’s Meta-research is not free of
philosophy of statistics
From the Bayesian perspective:
“adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman
1999, 1010)
(Co-director of the Meta-Research Innovation Center
at Stanford)
27
Likelihood Principle (LP)
A pivotal disagreement about statistical
evidence
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses
vary
28
Logic of Support
• Ian Hacking (1965) “Law of Likelihood”: x
supports hypothesis H0 less well than H1 if
Pr(x;H0) < Pr(x;H1)
(rejects in 1980)
• “there always is such a rival hypothesis
viz., that things just had to turn out the
way they actually did” (Barnard 1972,
129).
29
Error Probability
• Pr(H0 is less well supported than H1; H0)
is high
for some H1 or other
30
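(A hedged illustration of Barnard’s point, not from the slides: simulate a fair coin under H0 and compare its likelihood with the data-dredged rival that sets θ at the observed proportion.)

```python
# Sketch: under a true H0 (fair coin, theta = 0.5), some rival hypothesis
# almost always fits the data better, namely theta = observed proportion.
import random
from math import comb

def lik(theta, y, n):
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

random.seed(1)
n, trials, beaten = 100, 1000, 0
for _ in range(trials):
    y = sum(random.random() < 0.5 for _ in range(n))  # data generated by H0
    if lik(y / n, y, n) > lik(0.5, y, n):             # data-dredged rival
        beaten += 1
print(beaten / trials)  # ~0.92: H0 'less well supported' except when y = 50
```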
On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance
levels, power, all depend on something
more [than the likelihood function]–
something that is irrelevant in Bayesian
inference–namely the sample space”
(Lindley 1971, 436)
31
32
Familiar “reforms” offered as alternative
to significance tests follow the LP
• “The Bayes factor only depends on the actually
observed data, and not on whether they have been
collected from an experiment with fixed or variable
sample size.”
(van Dongen, Sprenger, and Wagenmakers 2022)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for
itself”.
(Berger and Wolpert 1988, 78)
In testing the mean of a standard
normal distribution
33
Optional Stopping
34
• “if an experimenter uses this [optional
stopping] procedure, then with probability
1 he will eventually reject any sharp null
hypothesis, even though it be true”.
(Edwards, Lindman, and Savage in
Psychological Review 1963, 239)
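(A simulation sketch of this point, assuming a standard normal null and a nominal two-sided .05 test after each observation; my illustration, not from the slides.)

```python
# Sketch: optional stopping against a true null (mu = 0, sigma = 1).
# Test after every new observation; stop at the first |z| >= 1.96.
import random
from math import sqrt

random.seed(2)

def rejects_within(n_max):
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0, 1)
        if abs(total / sqrt(n)) >= 1.96:   # nominal .05 two-sided test
            return True
    return False

trials = 2000
rate = sum(rejects_within(1000) for _ in range(trials)) / trials
print(rate)  # far above .05, and it approaches 1 as n_max grows
```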
The Stopping Rule Principle
From their Bayesian standpoint the stopping
rule is irrelevant
• “[the] irrelevance of stopping rules to
statistical inference restores a simplicity
and freedom to experimental design that
had been lost by classical emphasis on
significance levels (in the sense of Neyman
and Pearson).” (Edwards, Lindman, and
Savage 1963, 239)
35
Contrast this with: The 21 Word
Solution
• Replication researchers (re)discovered that
data-dependent hypotheses and stopping are a
major source of spurious significance levels.
• Simmons, Nelson, and Simonsohn (2011) place
at the top of their list the need to block flexible
stopping
• “Authors must decide the rule for terminating
data collection before data collection begins and
report this rule in the articles” (ibid. 1362).
36
37
You might think replication researchers
disavow the stopping rule principle
“[I]f the sampling plan is ignored, the
researcher is able to always reject the null
hypothesis, even if it is true. ..Some people
feel that ‘optional stopping’ amounts to
cheating…. This feeling is, however,
contradicted by a mathematical analysis.”
(Eric-Jan Wagenmakers 2007, 785)
The mathematical analysis assumes the
likelihood principle
Replication Paradox
• Significance test critic: It’s too easy to satisfy
standard statistical significance thresholds
• You: Why is it so hard to replicate significance
thresholds with preregistered protocols?
• Significance test critic: Obviously the initial studies
were guilty of P-hacking, cherry-picking, data-
dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on these biasing selection effects.
• Significance test critic: Actually, “reforms”
recommend methods with no need to adjust P-values
due to multiplicity 38
39
But the value of preregistered reports
is error statistical
• Your appraisal is altered by considering the
probability that some hypothesis, stopping
point, subgroup, etc. could have led to a
false positive–even if informal
• True, there are many ways to correct P-
values (Bonferroni, false discovery rates).
• The main thing is to have an alert that the
reported P-values are invalid
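(For concreteness, a sketch of two standard corrections, Bonferroni and the Benjamini-Hochberg false-discovery-rate step-up; the p-values are made up for illustration.)

```python
# Sketch: Bonferroni and Benjamini-Hochberg (FDR) adjustments.
def bonferroni(pvals, alpha=0.05):
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0                                    # largest rank with p_(i) <= (i/m)q
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k
    return reject

pvals = [0.001, 0.012, 0.03, 0.04, 0.20]     # made-up p-values
print(bonferroni(pvals))                     # only 0.001 survives 0.05/5
print(benjamini_hochberg(pvals))             # BH rejects the first four
```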
Probabilists can still block intuitively
unwarranted inferences
(without error probabilities)
• Supplement with subjective beliefs
• Likelihoods + prior probabilities
40
Problems
• Could work in some cases, but doesn’t show
what researchers had done wrong—battle of
beliefs
• The believability of data-dredged hypotheses
is what makes them so seductive
• Additional source of flexibility: priors and
biasing selection effects
41
Most Bayesians (last decade) use
“default” priors
• Default priors are supposed to prevent prior beliefs
from influencing the posteriors–data dominant
42
How should we interpret them?
• “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Conventional priors may not even be
probabilities…” (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-
subjective priors
(maximum entropy, invariance, maximizing the
missing information, coverage matching.)
43
Criticisms of P-hackers lose force
• In promoting an account that downplays
error probabilities, the data dredger who
deserves criticism is given a life-raft:
44
Bem’s “Feeling the future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than
chance at predicting the (erotic) picture
shown in the future
• Bem admits data dredging, but Bayesian
critics resort to a default Bayesian prior on a
(point) null hypothesis
Wagenmakers et al. 2011 “Why psychologists must
change the way they analyze their data”
45
Bem’s response
“Whenever the null hypothesis is sharply defined but the
prior distribution on the alternative hypothesis is diffused
over a wide range of values, as it is in...
Wagenmakers et al. (2011), it boosts the probability that
any observed data will be higher under the null
hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: ... strong
[frequentist] evidence in support of the experimental
hypothesis be contradicted by a ... Bayesian analysis.”
(Bem et al. 2011, 717)
*Bayes-Fisher disagreement 46
Many of Today’s Significance Test
Debates Trace to Bayes/Fisher
Disagreement
• The posterior probability Pr(H0|x) can be
high while the P-value is low (2-sided test)
47
Bayes/Fisher Disagreement
With a lump of prior given to a point null, and
the rest appropriately spread over the
alternative [spike and smear], an α significant
result can correspond to
Pr(H0|x) = (1 − α)! (e.g., 0.95)
with large n
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
48
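(A sketch of the computation, assuming σ = 1 and a smear μ ~ N(0, τ²) with τ = 1; the choice of τ and the n values are illustrative.)

```python
# Sketch: spike-and-smear posterior for a result just significant at .05.
# H0: mu = 0 (prior 1/2); H1: mu ~ N(0, tau^2) (prior 1/2); sigma = 1.
from math import exp, pi, sqrt

def normal_pdf(x, var):
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def posterior_H0(n, z=1.96, tau=1.0):
    xbar = z / sqrt(n)                       # exactly 1.96 SEs from 0
    m0 = normal_pdf(xbar, 1 / n)             # marginal likelihood under H0
    m1 = normal_pdf(xbar, tau**2 + 1 / n)    # marginal under the smear
    return m0 / (m0 + m1)

for n in (10, 100, 1000, 10000, 100000):
    print(n, round(posterior_H0(n), 3))
# Same p-value (.05), yet Pr(H0|x) climbs toward 1 as n grows:
# the Jeffreys-Lindley (Bayes/Fisher) disagreement.
```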
• To the Bayesian, the P-value exaggerates
the evidence against H0
• The significance tester balks at taking low p-
values as no evidence against, or even
evidence for, H0
49
“Concentrating mass on the point null
hypothesis is biasing the prior in favor
of H0 as much as possible” (Casella
and R. Berger 1987, 111)
50
“Redefine Statistical Significance”
“Spike and smear” is the basis for the move
to lower the P-value threshold to .005
(Benjamin et al. 2018)
Opposing megateam: Lakens et al. (2018)
51
• The problem isn’t lowering the probability
of type I errors
• The problem is assuming there should be
agreement between quantities measuring
different things
52
53
Whether P-values exaggerate,
“depends on one’s philosophy of
statistics …
…based on directly comparing P values
against certain quantities (likelihood ratios
and Bayes factors) that play a central role as
evidence measures in Bayesian analysis …
other statisticians do not accept these
quantities as gold standards,”… (Greenland,
Senn, Rothman, Carlin, Poole, Goodman,
Altman 2016, 342)
54
• A silver lining to distinguishing highly
probable and highly probed–can use
different methods for different contexts
55
“A Bayesian Perspective on Severity” van
Dongen, Sprenger, Wagenmakers [VSW]
(Psychonomic Bulletin and Review 2022):
“As Mayo emphasizes, the Bayes factor is insensitive
to variations in the sampling protocol that affect the
error rates, i.e., optional stopping” ...
They argue that Bayesians can satisfy severity
“regardless of whether the test has been conducted in
a severe or less severe fashion”. (VSW 2022)
56
What they mean is that data can be much more
probable on hypothesis H1 than on H0
But severity in their comparative subjective Bayesian
sense does not mean H1 was well probed (in the error
statistical sense)
Bayes Factors (BF)
57
58
The Bayes factor proponents and
severe testers are on the same
side against
• “declarations of ‘statistical significance’ be
abandoned” (Wasserstein, Schirm & Lazar
2019).
• “whether a p-value passes any arbitrary
threshold should not be considered at all” in
interpreting data (ibid., 2)
No significance/no threshold view
59
ASA (President’s) Task Force on
Statistical Significance and
Replicability (2019-2021)
The ASA executive director’s “no threshold”
view is not ASA policy:
“P-values and significance testing, properly
applied and interpreted, are important tools
that should not be abandoned.” (Benjamini et
al. 2021)
60
Severity: Reformulate tests
in terms of discrepancies (effect sizes) that
are and are not severely tested
SEV(Test T, data x, claim C)
• In a nutshell: one tests several
discrepancies from a test hypothesis and
infers those well or poorly warranted
Mayo 1981-2018; Mayo and Spanos (2006); Mayo and
Cox (2006); Mayo and Hand (2022)
61
Avoid misinterpreting a 2 SE
significant result (let SE = 1)
62
Severity vs Power for μ > μ1
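(A sketch of the computation behind such curves, using the slide’s setup: H0: μ = 0 vs. H1: μ > 0, SE = 1, observed x̄ = 2, where SEV(μ > μ1) = Pr(X̄ ≤ x̄obs; μ = μ1).)

```python
# Sketch: severity for claims mu > mu1 after a 2 SE significant result.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity(mu1, xbar_obs=2.0, se=1.0):
    # SEV(mu > mu1) = Pr(Xbar <= xbar_obs; mu = mu1)
    return normal_cdf((xbar_obs - mu1) / se)

for mu1 in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(mu1, round(severity(mu1), 3))
# mu > 0 passes severely (.977); mu > 2 does not (.5); mu > 3 is
# poorly warranted (.159): blocking the fallacy of rejection.
```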
In the same way, severity avoids
the “large n” problem
• Fixing the P-value, increasing sample
size n, the cut-off gets smaller
• Large n is the basis for the Jeffreys-
Lindley paradox
63
Severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast] 64
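(A sketch of the fire-alarm point with illustrative numbers: hold the result just significant at .05 and vary n; the discrepancy warranted with severity ≈ .84 shrinks as n grows.)

```python
# Sketch: same just-significant result (z = 1.96), different n (sigma = 1).
# The discrepancy warranted with severity ~.84 is mu1 = xbar - SE.
from math import sqrt

for n in (25, 100, 400, 10000):
    se = 1 / sqrt(n)
    xbar = 1.96 * se            # result just significant at the .05 level
    mu1 = xbar - se             # SEV(mu > mu1) = Phi(1) ~ .84
    print(n, round(mu1, 4))
# n = 25 warrants mu > 0.192; n = 10000 only mu > 0.0096: the
# burnt-toast alarm (large n) indicates a smaller discrepancy.
```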
65
Setting upper bounds
What About Fallacies of
Non-Significant Results?
• They don’t warrant 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed: set upper bounds
66
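(A sketch of the upper-bound reasoning, assuming the earlier setup with SE = 1 and a non-significant x̄ = 0.5, where SEV(μ ≤ μ1) = Pr(X̄ > x̄obs; μ = μ1).)

```python
# Sketch: non-significant xbar = 0.5 (SE = 1) testing H0: mu = 0.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_upper(mu1, xbar_obs=0.5, se=1.0):
    # SEV(mu <= mu1) = Pr(Xbar > xbar_obs; mu = mu1)
    return 1 - normal_cdf((xbar_obs - mu1) / se)

for mu1 in (0.5, 1.5, 2.5):
    print(mu1, round(severity_upper(mu1), 3))
# mu <= 2.5 passes with severity .977, so a discrepancy of 2.5 is ruled
# out; mu <= 0.5 has severity only .5, so no warrant for "no effect".
```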
Confidence intervals (CIs) are also
improved
Duality between tests and intervals: values within the
(1 − α) CI are non-rejectable at the α level.
• Differentiate the warrant for claims within the
interval
• Move away from fixed confidence levels (e.g., .95)
• Provide an inferential rationale
67
68
We get an inferential rationale
CI Estimate:
CI-lower < μ < CI-upper
Performance rationale: Because it came from a
procedure with good coverage probability
Severe Tester:
μ > CI-lower, because with high probability (.975) we
would have observed a smaller x̄ if μ ≤ CI-lower;
likewise for warranting μ < CI-upper
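(A sketch connecting the two rationales, with σ known, SE = 1, and an illustrative x̄ = 2.)

```python
# Sketch: severity rationale for the lower bound of a 95% CI (SE = 1).
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

xbar, se = 2.0, 1.0
ci_lower = xbar - 1.96 * se
# SEV(mu > CI-lower) = Pr(Xbar < xbar_obs; mu = CI-lower) = Phi(1.96)
sev = normal_cdf((xbar - ci_lower) / se)
print(round(ci_lower, 2), round(sev, 3))   # 0.04 0.975
# Claims mu > mu1 with mu1 below CI-lower pass with even higher
# severity, differentiating the warrant for points within the interval.
```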
• I begin with a simple tool: the minimal
requirement for evidence
• We have evidence for C only to the
extent C has been subjected to and
passes a test it probably would have
failed if false
69
• Biasing selection effects make it easy to
find impressive-looking effects
erroneously
• They alter a method’s error probing
capacities
• They do not alter evidence (in traditional
probabilisms): Likelihood Principle (LP)
• On the LP, error probabilities consider
“imaginary data” and “intentions”
70
• To the severe tester, probabilists are
robbed of a main way to block spurious
results
• Probabilists may block inferences without
appeal to error probabilities: a high prior on
H0 (no effect) can result in a high posterior
probability for H0
• Problems: Increased flexibility, puts blame
in the wrong place, unclear how to interpret
• Gives a life-raft to the P-hacker and cherry
picker 71
• The aim of severe testing directs the
reinterpretation of significance tests and other
methods
• Severe probing (formal or informal) must take
place at every level: from data to statistical
hypothesis; to substantive claims
• A silver lining to distinguishing highly probable
and highly probed–can use different methods
for different contexts
72
Thank you!
I’d be glad to have questions
73
References
• Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of
Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science
23(2), 123–32.
• Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. (2011). Must psychologists change the way they
analyze their data? Journal of Personality and Social Psychology 101(4), 716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). Redefine statistical
significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-
017-0189-z
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task
force statement on statistical significance and replicability. The Annals of Applied
Statistics. https://doi.org/10.1080/09332480.2021.2003631.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6
Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical
Statistics.
• Birnbaum, A. (1970). Statistical methods in scientific inference (letter to the
Editor). Nature, 225(5237) (March 14), 1033.
74
• Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence
in the one-sided testing problem. Journal of the American Statistical Association,
82(397), 106-11.
• Cox, D. R. (2006). Principles of statistical inference. Cambridge University Press.
https://doi.org/10.1017/CBO9780511813559
• Cox, D. R., and Mayo, D. G. (2010). Objectivity and conditionality in frequentist
inference. In D. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges
on Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science, pp. 276–304. Cambridge: Cambridge University Press.
• Edwards, W., Lindman, H., and Savage, L. (1963). Bayesian statistical inference for
psychological research. Psychological Review, 70(3), 193-242.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and
Boyd.
• Gigerenzer, G. 2004. Mindless statistics. Journal of Socio-Economics, 33(5), 587–
606.
• Goodman SN. (1999). Toward evidence-based medical statistics. 2: The Bayes
factor. Annals of Internal Medicine, 130, 1005 –1013.
• Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values,
confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 31,
337–350. https://doi.org/10.1007/s10654-016-0149-3 75
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University
Press.
• Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and
Braithwaite. In D. Mellor (Ed.), Science, Belief and Behavior: Essays in Honour of
R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Ioannidis, J. (2005). Why most published research findings are false. PLoS
Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). Justify your alpha. Nature Human Behaviour 2, 168-71.
• Lindley, D. V. (1971). The estimation of many parameters. In V. Godambe & D.
Sprott, (Eds.), Foundations of Statistical Inference pp. 435–455. Toronto: Holt,
Rinehart and Winston.
• Mayo, D. (1981). In Defense of the Neyman-Pearson Theory of Confidence
Intervals. Philosophy of Science, 48(2), 269-280.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science
and Its Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond
the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2020). Significance tests: Vitiated or vindicated by the replication
crisis in psychology? Review of Philosophy and Psychology 12, 101-120.
DOI https://doi.org/10.1007/s13164-020-00501-w
76
77
• Mayo, D. G. (2020). P-values on trial: Selective reporting of (best practice guides
against) selective reporting. Harvard Data Science Review 2.1.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest.
Conservation Biology : The Journal of the Society for Conservation Biology, 36(1),
13861. https://doi.org/10.1111/cobi.13861.
• Mayo, D. G. and Cox, D. R. (2006). Frequentist statistics as a theory of inductive
inference. In J. Rojo, (Ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, pp. 247-275. Lecture Notes-Monograph Series, Volume 49, Institute of
Mathematical Statistics.
• Mayo, D.G., Hand, D. (2022). Statistical significance and its critics: practicing
damaging science, or damaging scientific practice? Synthese 200, 220.
• Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences.
In D. Cornfield & J. Williamson (Eds.) Foundations of Bayesianism, pp. 381-403.
Dordrecht: Kluwer Academic Publishers.
• Mayo, D. G., and A. Spanos. (2006). Severe testing as a basic concept in a
Neyman–Pearson philosophy of induction.” British Journal for the Philosophy of
Science 57(2) (June 1), 323–357.
• Mayo, D. G., and A. Spanos (2011). Error statistics. In P. Bandyopadhyay and M.
Forster (Eds.), Philosophy of Statistics, 7, pp. 152–198. Handbook of the Philosophy
of Science. The Netherlands: Elsevier.
78
• Morrison, D. E., and R. E. Henkel, (Eds.), (1970). The Significance Test Controversy:
A Reader. Chicago: Aldine De Gruyter.
• Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of
statistical hypotheses. Philosophical Transactions of the Royal Society of London
Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. & Pearson, E. (1967). Joint statistical papers of J. Neyman and E. S.
Pearson. University of California Press.
• Open Science Collaboration (2015). Estimating the reproducibility of psychological
science. Science 349(6251), 943-51.
• Pearson, E. S. & Neyman, J. (1967). On the problem of two samples. In J. Neyman &
E.S. Pearson (Eds.) Joint Statistical Papers, pp. 99-115 (Berkeley: U. of Calif.
Press). First published 1930 in Bul. Acad. Pol.Sci. 73-96.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London:
Methuen.
• Selvin, H. (1970). A critique of tests of significance in survey research. In D. Morrison
and R. Henkel (Eds.). The Significance Test Controversy, pp., 94-106. Chicago:
Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2011). A false-positive psychology:
Undisclosed flexibility in data collection and analysis allows presenting anything as
significant. Psychological Science, 22(11), 1359-66.
79
• Simmons, J., et al. 2012. A 21 word solution. Dialogue: The Official Newsletter of the
Society for Personality and Social Psychology 26 (2), 4–7.
• van Dongen, N., Sprenger, J. & Wagenmakers, EJ. (2022). A Bayesian perspective
on severity: Risky predictions and specific hypotheses. Psychon Bull Rev 30, 516–
533. https://doi.org/10.3758/s13423-022-02069-1
• Wagenmakers, E-J., (2007). A practical solution to the pervasive problems of p
values. Psychonomic Bulletin & Review 14(5), 779-804.
• Wagenmakers et al. (2011). Why psychologists must change the way they analyze
their data: The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100(3): 426-32.
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”
(Editorial). The American Statistician 73(S1), 1–19.
https://doi.org/10.1080/00031305.2019.1583913
Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is
the datum of some other experiment,
and if it happens that P(x|µ) and
P(y|µ) are proportional functions of
µ (that is, constant multiples of each
other), then each of the two data x
and y have exactly the same thing to
say about the values of µ…” (Savage
1962, 17)
80