The Statistics Wars and Their Casualties
September 2022
The two statistical pillars of replicability: Addressing
selective inference and irrelevant variability
Yoav Benjamini
Statistics & O.R., School of Mathematical Sciences
The Sagol School of Neuroscience
Tel Aviv University
Outline
• Tukey’s two pillars of replicability
Addressing Selective Inference
Relevant Variability
• Some recent evidence
• The relation of StatWars to replicability issues
• Frequentist & Bayesian responses to the pillars
• NEJM guidelines: a casualty of the wars
Tukey’s last published work
A puzzling encyclopedic entry on Multiple Comparisons1
Opens: “a diversity of issues … that tend to be
important, difficult, and often unresolved.”
Details:
• The FDR approach in pairwise comparisons2
• The Random Effects vs Fixed Effects analysis 3
What’s that to do with multiple comparisons?
1 Jones et al (‘22) International Encyclopedia of Statistics in the Social Sciences
2 Williams Jones & Tukey ( ’99) 3 Cornfield & Tukey (’56)
Genes & Behaviour: Crabbe et al (Science, ‘99)
Compared 12 measures across strains at 3 labs
In spite of strict standardization,
Significant Lab*Genotype Interaction
“Thus, experiments characterizing mutants may yield
results that are idiosyncratic to a particular laboratory.”
We thought that using our computational tools would solve the
problem
Comparing 17 measures between 8 inbred strains of mice
At 3 labs: Golani at TAU, Elmer MPRC, Kafkafi NIDA1
MCP 2002, 1Kafkafi et al PNAS 2004
YB Berkeley ʼ10
Behavioral Endpoint Labs Fixed Labs Mixed
Prop. Lingering Time 0.00001 0.0029
# Progression segments 0.00001 0.0068
Median Turn Radius (scaled) 0.00001 0.0092
Time away from wall 0.00001 0.0108
Distance traveled 0.00001 0.0144
Acceleration 0.00001 0.0146
# Excursions 0.00001 0.0178
Time to half max speed 0.00001 0.0204
Max speed wall segments 0.00001 0.0257
Median Turn rate 0.00001 0.0320
Spatial spread 0.00001 0.0388
Lingering mean speed 0.00001 0.0588
Homebase occupancy 0.001 0.0712
# stops per excursion 0.0028 0.1202
Stop diversity 0.027 0.1489
Length of progression segments 0.44 0.5150
Activity decrease 0.67 0.8875
Significance of the 8 strain differences, with endpoints grouped by
whether the Strain × Lab interaction is significant or not (FDR ≤ .05)
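The FDR ≤ .05 screening used in this table can be illustrated with the Benjamini–Hochberg step-up procedure. A minimal sketch in Python, applied to the mixed-model p-value column above (the q = 0.05 level matches the slide; treating the column as a single family is an assumption of this sketch):

```python
# Benjamini-Hochberg step-up procedure at level q:
# find the largest k with p_(k) <= k*q/m and reject hypotheses 1..k.
def bh_reject(pvals, q=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank satisfying the step-up condition
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# "Labs Mixed" p-values for the 17 behavioral endpoints in the table
mixed = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178,
         0.0204, 0.0257, 0.0320, 0.0388, 0.0588, 0.0712, 0.1202,
         0.1489, 0.5150, 0.8875]
print(sum(bh_reject(mixed)))  # number of endpoints passing FDR <= .05
```

Note the step-up direction: a p-value above its own threshold can still be rejected if some larger p-value meets its threshold.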
Recalling Mann’s warning in “Behavior Genetics in Transition”
(Science, ’94)
“…jumping too soon to discoveries..” (and press discoveries)
“raises the issue of Replicability”
Tukey’s entry is about replicability of discoveries1
Addressing : Selective Inference
The relevant variability
1 Kafkafi et al (PNAS ’04)
Intensified due to the industrialization of the scientific process
Testing the approach1
• Took single-lab experimental results involving comparisons
between mouse strains from the Mouse Phenotyping Database
• Carried out similar experiments in 3 labs: JAX, TAUL, and TAUM,
without standardization
• Used a random-lab mixed-model analysis to assess the replicability of
the original result
• Estimated γ² = σ²GxL / σ²within for each endpoint, from the Database or
from our experiments,
and used it to adjust the single-lab results (by inflating the sd)
60% of single-lab rejected results were non-replicable
12% of adjusted single-lab rejected results were non-replicable
Jaljuli, Kafkafi et al ’22+ BioRxiv
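The “inflating the sd” step can be sketched as follows. This is an illustration of the idea, not the exact GxL-adjustment of Jaljuli et al: the single-lab standard error is inflated by the interaction variance, taken here as 2·γ²·σ²within for a two-strain contrast, and the test statistic is recomputed:

```python
import math

def gxl_adjust(diff, se_single, gamma2, s2_within):
    """Inflate a single-lab SE by the strain-by-lab interaction
    variance (2 * gamma2 * s2_within for a two-strain contrast),
    then recompute the z-statistic and two-sided normal p-value."""
    se_adj = math.sqrt(se_single**2 + 2 * gamma2 * s2_within)
    z = diff / se_adj
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P(|Z| > |z|)
    return z, p

# Illustrative numbers: a single-lab difference that looks highly
# significant becomes non-significant once interaction variability
# (gamma^2 = 0.2) is added to the standard error.
z0, p0 = gxl_adjust(1.0, 0.3, 0.0, 1.0)  # no interaction variance
z1, p1 = gxl_adjust(1.0, 0.3, 0.2, 1.0)  # with interaction variance
print(p0, p1)
```

The numeric inputs are hypothetical; only the inflation mechanism follows the slide.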
Reading the ’56 paper again
[Diagram: EXPERIMENT, analyzed as Fixed; the island of Scientific
Knowledge reached by statistical inference]
Reading the ’56 paper again
[Diagram: EXPERIMENT, analyzed as Mixed; the island of Scientific
Knowledge reached by statistical inference]
Inference on a selected subset of the parameters that turned out to
be of interest after viewing the data!
Relevant to all statistical methods – hurting replicability
Out-of-study selection - not evident in the published work
File drawer problem / publication bias
The garden of forking paths, p-hacking, cherry picking,
significance chasing, HARKing, data dredging.
All are widely discussed and addressed,
e.g. by Transparency & Reproducibility standards
Selective inference
In-study selection - evident in the published work:
Selection by the Abstract, Discussion
Table, Figure
Selection by highlighting those passing a threshold
p<.05, p<.005, p<5×10⁻⁸, *, **, 2-fold
Selection by modeling: AIC, Cp, BIC, LASSO,…
In complex research problems - in-study selection is unavoidable!
Selective inference
• Giovannucci et al. (1995) looked for relationships between more
than a hundred types of food intake and the risk of prostate
cancer
• The abstract reports three (marginal) 95% confidence intervals
(CIs), apparently only for those relative risks whose CIs do not
cover 1.
“Eat Ketchup and Pizza and avoid Prostate Cancer”
Selective inference hampers replicability
“Although the pooled RR for raw tomato consumption
was initially significant in 1995, this association has
remained nonsignificant since 2000 after the addition
of 7 studies…” Meta-analysis by Rowles et al (2017)
Any adjustment for selection yields confidence intervals covering 1
Error-rates for selective inference
Secondary endpoints from the effect of fish oil consumption on CHD, stroke,
or death from CVD (Manson et al ’18), used by the NEJM editorial
• Simultaneous over all possible selections
• Simultaneous over the selected
• Conditional on being selected
• On the average over the selected
• On the average over all
Addressing selective inference in psychology
Transparency problems 1/100
Reproducibility problems 6/100
Reproducible & selected at p ≤ .05:
56/88 (=64%) failed.
Evident selection
Adjusting via hierarchical FDR
22 with padj > .05; and screened
Of them
21 non-replicable results
1 replicable discovery lost
Failure rate 36/67 (=52%)
Power loss ~1/3
100 replication efforts, 64/100 failed
Addressing selective inference in psychology
Reducing level to p ≤ 0.005
Benjamin et al + 200 signatures
32 with p > .005; of them
21 non-replicable results
11 replicable discoveries lost
Failure rate 25/47 (=54%)
Power loss =1/3
Zeevi, et al, (‘21+)
The statistical wars
A. Identifying problems of non-evident selective inference
with the use of p-values and statistical testing
Ending with the ‘New Statistics’ and bans on p-values & NHST
The statistical wars
A. Identifying problems of non-evident selective inference
with p-values and statistical testing
B. The 2016 ASA guidelines regarding the p-values1
“… some statisticians prefer …to even replace p-values
with other approaches”
e.g. Bayes factors, confidence intervals, credence intervals
1Wasserstein & Lazar (Am. Stat. ‘16)
The statistical wars
A. Identifying problems of non-evident selective inference
with p-values and statistical testing
B. The 2016 ASA guidelines regarding the p-values1
C. The 2019 ASA conference and Editorial2
Don’t use p<.05; Don’t say “statistically significant”
1Wasserstein & Lazar (Am. Stat. ’16) 2Wasserstein, Schirm & Lazar (Am. Stat. ’19)
The statistical wars
A. Identifying problems of non-evident selective inference
with p-values and statistical testing
B. The 2016 ASA guidelines regarding the p-values
C. The 2019 ASA conference and Editorial
D. The ASA president’s task force statement3
3 YB et al AOS ‘21
E. The disclaimer
By 2022, MacNaughton documented 41 explicit references
to the 2019 editorial as official ASA policy.
Past-president Kafadar’s letter to ASA board required:
• Either the board approves the editorial as policy; or
• A disclaimer is added
The editorial was written by the three editors acting as
individuals and reflects their scientific views not an
endorsed position of the American Statistical Association.
May 2022 in the online version only
Scientific Reproducibility and Statistical Significance
Symposium, Convened by ASA on June 3:
“The goals of the symposium are:”
1. To disseminate the stance of the ASA on the
appropriate use of results from null hypothesis
significance testing
2. To offer alternatives to such testing
3. To discuss changes to publication policies that
would benefit both individual scientists and science
writ large.
The war goes on
• The war between Bayesians and frequentists was fierce
in the 1950’s.
• It receded to co-existence and mutual respect
• The replicability crisis was used by zealous Bayesians to
restart the war
• On the eve of the meeting preparing the 2016
statement I expressed my opinion that the statement
should be about statistics and replicability in general,
not merely focused on the p-value
• Unfortunately, I see no connection between the
StatWars and replicability.
When P values are reported for multiple outcomes without
adjustment for multiplicity, the probability of declaring a treatment
difference when none exists can be much higher than 5%. (July ‘19)
The Casualties: NEJM guidelines
1. P-values may not be reported (for secondary endpoints) if a
multiplicity correction method was not specified in the protocol or in
the statistical analysis plan
2. Unadjusted (marginal) 95% CIs reported for all secondary
endpoints
The Casualties: NEJM guidelines
Wu et al, citing results of Manson et al NEJM 2018
Nature Reviews Cardiology (2019)
Fish oil supplementation …
had no significant effect on the composite primary end point of
CHD, stroke or death from CVD
but reduced the risk of
total CHD* (HR 0.83, 95% CI 0.71–0.97),
percutaneous coronary intervention (HR 0.78, 95% CI 0.63–0.95),
total myocardial infarction* (HR 0.72, 95% CI 0.59–0.90),
fatal myocardial infarction (HR 0.50, 95% CI 0.26–0.97).
These are the only 4 of the 17 that excluded 1 and were not exploratory.
What’s the problem?
• CI as a decision tool:
not crossing the no-effect value ⇔ significance testing at 0.05.
The issue of multiplicity, as recognized by NEJM, does not disappear
• CIs coverage:
Coverage deteriorates. Taking the 17 estimators from above as
parameters:
generated random means with the SNR as estimated above, times k;
selected the CIs not crossing 1, as above; 10,000 simulations;
checked the average non-coverage over the so-selected.
For k=1: 0.11; for k=0.5: 0.18; for k=0.01: 0.56
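The coverage deterioration just described can be reproduced with a small simulation; a minimal sketch, in which the parameter values are illustrative rather than the 17 estimated SNRs from the slide:

```python
import math, random

def selected_noncoverage(mus, sigma=1.0, n_sim=10_000, z=1.96, seed=0):
    """Among nominal 95% CIs that are selected for excluding 0,
    estimate the average non-coverage of their own parameter."""
    rng = random.Random(seed)
    selected = noncover = 0
    for _ in range(n_sim):
        for mu in mus:
            y = rng.gauss(mu, sigma)
            lo, hi = y - z * sigma, y + z * sigma
            if lo > 0 or hi < 0:          # CI excludes 0 -> selected
                selected += 1
                if not (lo <= mu <= hi):  # misses its own parameter
                    noncover += 1
    return noncover / selected

# Shrinking the true effects (k = 1, 0.5, 0.1) worsens the coverage
# of the selected intervals, far beyond the nominal 5%.
for k in (1.0, 0.5, 0.1):
    mus = [k * m for m in (0.2, 0.5, 1.0, 1.5, 2.0)]
    print(k, round(selected_noncoverage(mus), 3))
```

Unselected, each interval covers its parameter 95% of the time; the deterioration comes entirely from conditioning on selection.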
Selective inference by CIs is totally ignored
We1 suggest:
Select if the nominal 95% CI
does not cross 0 (= log(1)).
Assure False Coverage Rate
control over the so-selected
via the general BY-CIs:
* Replace the 4 nominal CIs
that do not cross 1 by
100(1−0.05·4/17)% ≈ 98.8% CIs
* For the others use nominal CIs
1YB, Heller, Panagiotou arXiv ‘21
To offer NEJM default guidelines retaining power for exploration
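The BY FCR-adjusted intervals suggested above can be sketched as follows, assuming normality of the log hazard ratio and back-solving its SE from the printed nominal 95% CI (the numbers are the Manson et al endpoints quoted earlier; the back-solving step is an assumption of this sketch):

```python
import math
from statistics import NormalDist

def fcr_adjusted_ci(hr, lo95, hi95, n_selected, m, q=0.05):
    """Widen a nominal 95% CI for a hazard ratio to the FCR-controlling
    BY level 1 - n_selected*q/m, recovering the SE of log(HR)
    from the reported nominal CI."""
    z95 = NormalDist().inv_cdf(0.975)
    se = (math.log(hi95) - math.log(lo95)) / (2 * z95)
    level = 1 - n_selected * q / m            # 1 - 4*0.05/17, about 0.988
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.exp(math.log(hr) - z * se), math.exp(math.log(hr) + z * se)

# Total myocardial infarction: HR 0.72, nominal 95% CI 0.59-0.90
lo, hi = fcr_adjusted_ci(0.72, 0.59, 0.90, n_selected=4, m=17)
print(round(lo, 2), round(hi, 2))  # the widened, FCR-adjusted interval
```

Running this on the four selected endpoints shows which of them still exclude 1 after the FCR widening.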
An epidemiologist with 13k followers
The Casualties: 2
From a Lancet reviewer
• How can statistical hypotheses and strategies for addressing
multiplicity issues be pre-specified in an observational
study, which typically requires exploratory analyses to
reduce bias?
• Can an author claim to have performed a confirmatory test
without this pre-specification?
Logical Fallacy:
Confirmatory analysis => Multiplicity adjustment
Therefore:
Multiplicity adjustment => Confirmatory analysis
Addressing the relevant variability
Frequentists: Ignored by many, but increasing awareness
Bayesians: Recognized (hierarchical modelling)
Addressing evident selective inference
Frequentists: Recognized, more for p-values, less for CIs &
exploratory research
Bayesians: Ignored as a matter of principle
The 2 pillars in Frequentist & Bayesian eyes
E.g. Gelman, Hill & Yajima (‘12):
Why we (usually) don’t have to worry about multiple comparisons.
The underlying theoretical justification:
Since we condition on all the data,
Any selection after the data is viewed is already reflected in the
posterior distribution.
Are Bayesian intervals immune from selection’s harms?
Assumed prior: µi ~ N(0, 0.5²); yi ~ N(µi, 1); i = 1, 2, …, 10⁶ (Gelman’s example)
Parameters generated by: N(0, .5²), or by 0.999·N(0, .5²) + 0.001·N(0, .5² + 3²)
Type of 95% confidence/credence interval: Marginal | Bayesian Credibility | Marginal FCR-adjusted | Bayesian Credibility | BH-Selected FCR-adjusted
Intervals not covering their parameter: 5.0% | 5.0% | 5.1% | 2.1%
Intervals not covering 0 (selected): 7.3% | 0.01% | 0.03% | 0.03%
Intervals not covering their parameter, out of the selected: 48% | 3.4% | 1.0% | 71.5% | 2.1%
YB HDSR ‘19
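The contaminated-prior column can be reproduced with a short Monte-Carlo sketch. This is an illustration of the setup above, not the original computation; the 95% credible level, sample size, and seed are choices of this sketch:

```python
import math, random

def bayes_selected_noncoverage(n=1_000_000, tau=0.5, eps=0.001, seed=1):
    """Analyst's model: mu ~ N(0, tau^2), y ~ N(mu, 1), so the posterior
    is N(s*y, s) with shrinkage s = tau^2/(1+tau^2).  Data actually come
    from the contaminated prior (1-eps)*N(0,tau^2) + eps*N(0,tau^2+3^2).
    Among 95% credible intervals excluding 0, count misses of true mu."""
    rng = random.Random(seed)
    s = tau**2 / (1 + tau**2)
    half = 1.959964 * math.sqrt(s)          # credible-interval half-width
    selected = miss = 0
    for _ in range(n):
        sd = math.sqrt(tau**2 + 9.0) if rng.random() < eps else tau
        mu = rng.gauss(0.0, sd)
        y = rng.gauss(mu, 1.0)
        lo, hi = s * y - half, s * y + half
        if lo > 0 or hi < 0:                # interval excludes 0: selected
            selected += 1
            if not (lo <= mu <= hi):
                miss += 1
    return selected, miss / selected

sel, rate = bayes_selected_noncoverage()
print(sel, round(rate, 2))  # non-coverage among selected far exceeds 5%
```

When the assumed prior is exactly correct (eps = 0), the selected intervals keep their 95% coverage; the rare wide-component effects are what the misspecified posterior over-shrinks.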
From Gelman’s blog, August ‘22
Eric van Zwet offers a solution to the above example:
“We can mix the N(0, 0.5) with almost any wider normal
distribution, with almost any probability, and then very large
effects will hardly be shrunken.”
He demonstrates it with the prior 0.99·N(0, 0.5²) + 0.01·N(0, 6²).
Of 741 credible intervals not covering 0, the proportion not
covering the parameter is 0.07 (CI: 0.05 to 0.09).
Gelman’s response: “I continue to think that Bayesian inference
completely solves the multiple comparisons problem.”
So, I generated data from this prior, to which I added a 5% N(4, 0.5²) component,
and got 29% of the so-selected not covering their parameter
(with an order-of-magnitude power loss relative to frequentist
FDR-adjusted testing and FCR CIs).
My point: Bayesians should worry about selective inference if
they care about replicability,
And cannot hide behind the theoretical guarantees.
Some Bayesians do that
Connections with FDR in large inferential problems:
Genovese & Wasserman ’02, Storey et al ’03, …
Fdr and fdr variations on FDR in an empirical Bayes framework:
Efron et al ’13, …
A purely Bayes model where selection should be addressed:
Yekutieli et al ’13
Thresholding of posterior odds using BH
Take-away messages
Replicability can be enhanced mainly by addressing
Selective inference
• Evident selective inference is as harmful as non-evident
• Needed in exploratory research even when the pool is small
• Needed for CIs
• The ASA attitude against p-values and statistical significance is
political and harms replicability
The relevant variability
• Prefer random-effects (mixed-model) analysis
• Many small studies are better than one/few large ones
Thanks to JWT for the insight
Take-away messages
Most Bayesians ignore selective inference
but appreciate addressing the relevant variability;
for frequentists it’s the opposite.
In both research communities practitioners try to avoid
addressing them:
until the research complexity is so large that selective
inference is addressed,
until results are so non-replicable that the relevant
variability is addressed.
This usually takes too long
Thanks to JWT for the insight
Thanks!
www.replicability.tau.ac.il
The industrialization of the scientific process
[Images: 1888, 1950, 1999, 2010]

More Related Content

PDF
The ASA president Task Force Statement on Statistical Significance and Replic...
PPT
25_Anderson_Biostatistics_and_Epidemiology.ppt
PPT
Copenhagen 2008
PPT
Copenhagen 23.10.2008
PDF
Effective strategies to monitor clinical risks using biostatistics - Pubrica.pdf
PPT
Quantitative Synthesis I
PDF
Advice On Statistical Analysis For Circulation Research
The ASA president Task Force Statement on Statistical Significance and Replic...
25_Anderson_Biostatistics_and_Epidemiology.ppt
Copenhagen 2008
Copenhagen 23.10.2008
Effective strategies to monitor clinical risks using biostatistics - Pubrica.pdf
Quantitative Synthesis I
Advice On Statistical Analysis For Circulation Research

Similar to The two statistical cornerstones of replicability: addressing selective inference and irrelevant variability (20)

PPT
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
PPT
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
PDF
Common statistical pitfalls & errors in biomedical research (a top-5 list)
PPTX
Basic of Biostatistics The Second Part.pptx
PDF
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
PPTX
Statistics in meta analysis
PDF
Glymour aaai
PDF
Lemeshow samplesize
PPT
Analysis and Interpretation
PPTX
Metaanalysis copy
PPTX
Test of significance
PPTX
Effective strategies to monitor clinical risks using biostatistics - Pubrica....
PDF
Choosing appropriate statistical test RSS6 2104
PDF
David Moher - MedicReS World Congress 2012
PPT
Oac guidelines
PPTX
Causal inference lecture to Texas Children's fellows
PPTX
Biostatistics-for-5th-Year-Medical-Students-in-South-Sudan-_Refresher.pptx
PPT
Metanalysis Lecture
PPTX
Dr. RM Pandey -Importance of Biostatistics in Biomedical Research.pptx
PPT
Prague 02.10.2008
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterpri...
Common statistical pitfalls & errors in biomedical research (a top-5 list)
Basic of Biostatistics The Second Part.pptx
Projecting ‘time to event’ outcomes in technology assessment: an alternative ...
Statistics in meta analysis
Glymour aaai
Lemeshow samplesize
Analysis and Interpretation
Metaanalysis copy
Test of significance
Effective strategies to monitor clinical risks using biostatistics - Pubrica....
Choosing appropriate statistical test RSS6 2104
David Moher - MedicReS World Congress 2012
Oac guidelines
Causal inference lecture to Texas Children's fellows
Biostatistics-for-5th-Year-Medical-Students-in-South-Sudan-_Refresher.pptx
Metanalysis Lecture
Dr. RM Pandey -Importance of Biostatistics in Biomedical Research.pptx
Prague 02.10.2008
Ad

More from jemille6 (20)

PDF
What is the Philosophy of Statistics? (and how I was drawn to it)
PDF
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
PDF
Severity as a basic concept in philosophy of statistics
PDF
“The importance of philosophy of science for statistical science and vice versa”
PDF
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
PDF
D. Mayo JSM slides v2.pdf
PDF
reid-postJSM-DRC.pdf
PDF
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
PDF
Causal inference is not statistical inference
PDF
What are questionable research practices?
PDF
What's the question?
PDF
The neglected importance of complexity in statistics and Metascience
PDF
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
PDF
On Severity, the Weight of Evidence, and the Relationship Between the Two
PDF
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
PDF
Comparing Frequentists and Bayesian Control of Multiple Testing
PPTX
Good Data Dredging
PDF
The Duality of Parameters and the Duality of Probability
PDF
Error Control and Severity
PDF
The Statistics Wars and Their Causalities (refs)
What is the Philosophy of Statistics? (and how I was drawn to it)
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
Severity as a basic concept in philosophy of statistics
“The importance of philosophy of science for statistical science and vice versa”
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
D. Mayo JSM slides v2.pdf
reid-postJSM-DRC.pdf
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Causal inference is not statistical inference
What are questionable research practices?
What's the question?
The neglected importance of complexity in statistics and Metascience
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
On Severity, the Weight of Evidence, and the Relationship Between the Two
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Comparing Frequentists and Bayesian Control of Multiple Testing
Good Data Dredging
The Duality of Parameters and the Duality of Probability
Error Control and Severity
The Statistics Wars and Their Causalities (refs)
Ad

Recently uploaded (20)

PDF
Classroom Observation Tools for Teachers
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Basic Mud Logging Guide for educational purpose
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
RMMM.pdf make it easy to upload and study
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
Cell Structure & Organelles in detailed.
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Complications of Minimal Access Surgery at WLH
PPTX
master seminar digital applications in india
Classroom Observation Tools for Teachers
Final Presentation General Medicine 03-08-2024.pptx
Basic Mud Logging Guide for educational purpose
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Renaissance Architecture: A Journey from Faith to Humanism
RMMM.pdf make it easy to upload and study
FourierSeries-QuestionsWithAnswers(Part-A).pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
Module 4: Burden of Disease Tutorial Slides S2 2025
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
VCE English Exam - Section C Student Revision Booklet
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Cell Structure & Organelles in detailed.
102 student loan defaulters named and shamed – Is someone you know on the list?
O5-L3 Freight Transport Ops (International) V1.pdf
Complications of Minimal Access Surgery at WLH
master seminar digital applications in india

The two statistical cornerstones of replicability: addressing selective inference and irrelevant variability

  • 1. The Statistics Wars and Their Casualties September 2022 The two statistical pillars of replicability: Addressing selective inference and irrelevant variability Yoav Benjamini Statistics & O.R., School of Mathematical Sciences The Sagol School of Neuroscience Tel Aviv University
  • 2. Outline • Tukey’s two pillars of replicability Addressing Selective Inference Relevant Variability • Some recent evidence • The relation of StatWars to replicability issues • Frequentist & Bayesian responses to the pillars • NEJM guidelines: a casualty of the wars
  • 3. Tukey’s last published work A puzzling encyclopedic entry on Multiple Comparisons. Opens : ``a diversity of issues … that tend to be important, difficult, and often unresolved.” Details: • The FDR approach in pairwise comparisons2 • The Random Effects vs Fixed Effects analysis 3 What’s that to do with multiple comparisons? 1 Jones et al (‘22) International Encyclopedia of Statistics in the Social Sciences 2 Williams Jones & Tukey ( ’99) 3 Cornfield & Tukey (’56)
  • 4. Genes & Behaviour: Crabbe et a (Science, ‘99) Compared 12 measures across strains at 3 labs In spite of strict standardization, Significant Lab*Genotype Interaction “Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory.” We thought that using our computational tools will solve the problem Comparing 17 measures between 8 inbred strains of mice At 3 labs: Golani at TAU, Elmer MPRC, Kafkafi NIDA1 MCP 2002, 1Kafkafi et al PNAS 2004
  • 5. YB Berkeley ʼ10 Behavioral Endpoint Labs Fixed Labs Mixed Prop. Lingering Time 0.00001 0.0029 # Progression segments 0.00001 0.0068 Median Turn Radius (scaled) 0.00001 0.0092 Time away from wall 0.00001 0.0108 Distance traveled 0.00001 0.0144 Acceleration 0.00001 0.0146 # Excursions 0.00001 0.0178 Time to half max speed 0.00001 0.0204 Max speed wall segments 0.00001 0.0257 Median Turn rate 0.00001 0.0320 Spatial spread 0.00001 0.0388 Lingering mean speed 0.00001 0.0588 Homebase occupancy 0.001 0.0712 # stops per excursion 0.0028 0.1202 Stop diversity 0.027 0.1489 Length of progression segments 0.44 0.5150 Activity decrease 0.67 0.8875 Significance of 8 Strain differences Strain x Lab Interaction significant Strain x Lab Interaction not significant FDR ≤ .05
  • 6. Recalling Mann’s warning in “Behavior Genetics in transition”1 (Science, 94) “…jumping too soon to discoveries..” (and press discoveries) “raises the issue of Replicability” Tukey’s entry is about replicability of discoveries1 Addressing : Selective Inference The relevant variability 1 Kafkafi et al (PNAS ’04) Intensified due to the industrialization of the scientific process
  • 7. Testing the approach1 • Took Single lab experimental results involving comparisons between mouse strains from Mouse Phenotyping Database • Carried similar experiments in 3 labs : JAX, TAUL, and TAUM without standardization • Used Random Lab Mixed Model Analysis to assess replicability of original result • Estimated g2=s2 GxL / s2 within for each endpoint from Database or from our experiments And used it to adjust the single lab results (by inflating sd) 60% of single lab rejected results were non-replicable 12% of adjusted single lab rejected results were non-replicable Jaljuli, Kafkafi et al ’22+ BioRxiv
  • 8. Reading ‘56 paper again EXPERIMENT Fixed Scientific Knowledge Island reached by statist Inference
  • 9. Reading ‘56 paper again EXPERIMENT Mixed Scientific Knowledge Island reached by statist Inference
  • 10. Inference on a selected subset of the parameters that turned out to be of interest after viewing the data! Relevant to all statistical methods – hurting replicability Out-of-study selection - not evident in the published work File drawer problem / publication bias The garden of forking paths, p-hacking, cherry picking significance chasing, HARKing, Data dredging, All are widely discussed and addressed e.g. by Transparency & Reproducibility standards Selective inference
  • 11. In-study selection - evident in the published work: Selection by the Abstract, Discussion Table, Figure Selection by highlighting those passing a threshold p<.05, p<.005, p<5*10-8, *,**,2 fold Selection by modeling: AIC, Cp, BIC, LASSO,… In complex research problems - in-study selection is unavoidable! Selective inference
  • 12. • Giovannucci et al. (1995) look for relationships between more than a hundred types of food intakes and the risk of prostate cancer • The abstract reports three (marginal) 95% confidence intervals (CIs), apparently only for those relative risks whose CIs do not cover 1. “Eat Ketchup and Pizza and avoid Prostate Cancer” 12 Selective inference hampers replicability
  • 13. “Although the pooled RR for raw tomato consumption was initially significant in 1995, this association has remained nonsignificant since 2000 after the addition of 7 studies…” Meta-analysis by Rowles et al (2017) Any adjustment for selection yields Conf. Intervals covering 1
  • 14. Error-rates for selective inference Secondary endpoints from effect of fish oil consumption on CHD, stroke or death from CVD . Manson et al ‘18 used by NEJM editorial • Simultaneous over all possible selections • Simultaneous over the selected • Conditional on being selected • On the average over the selected • On the average over all
  • 15. Addressing selective inference in psychology Transparency problems 1/100 Reproducibility problems 6/100 Reproducible & selected p ≤. 05 56/88 (=64%) failed. Evident selection Adjusting via hierarchical FDR 22 with padj > .05; and screened Of them 21 non-replicable results 1 replicable discovery lost Failure rate 36/67 (=52%) Power loss ~1/31 100 replications efforts, 64/100 failed
  • 16. Addressing selective inference in psychology Reducing level to p ≤ 0.005 Benjamin et al + 200 signatures 32 with p > .005; of them 21 non-replicable results 11 replicable discovery Failure rate 25/47 (=54%) Power loss =1/3 Zeevi, et al, (‘21+)
  • 17. The statistical wars A. Identifying problems of non-evident selective inference with the use of p-values and statistical testing Ending with the ‘New Statistics’ and bans on p-values &NHST
  • 18. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values1 “… some statisticians prefer …to even replace p-values with other approaches” e.g. Bayes factors, confidence intervals, credence intervals 1Wasserstein& Lazar (Amm. Stat. ‘16)
  • 19. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values1 C. The 2019 ASA conference and Editorial2 Don’t use p<.05; Don’t say “statistically significant” 1Wasserstein& Lazar (Am. Stat. ‘16) 2Wasserstein,Schirm & Lazar (Am. Stat ’19)
  • 20. The statistical wars A. Identifying problems of non-evident selective inference with p-values and statistical testing B. The 2016 ASA guidelines regarding the p-values C. The 2019 ASA conference and Editorial D. The ASA president’s task force statement3 3 YB et al AOS ‘21
  • 21. E. The disclaimer E. By 2022, MacNaughton documented 41 explicit references to 2019 editorial as official ASA policy. Past-president Kafadar’s letter to ASA board required: • Either the board approves the editorial as policy; or • A disclaimer is added The editorial was written by the three editors acting as individuals and reflects their scientific views not an endorsed position of the American Statistical Association. May 2022 in the online version only
  • 22. Scientific Reproducibility and Statistical Significance Symposium, Convened by ASA on June 3: “The goals of the symposium are:” 1. To disseminate the stance of the ASA on the appropriate use of results from null hypothesis significance testing 2. To offer alternatives to such testing 3. To discuss changes to publication policies that would benefit both individual scientists and science writ large. The war goes on
  • 23. • The war between Bayesian and frequentists was fierce in the 1950’s. • It receded to co-existence and mutual respect • The replicability crisis was used by zealous Bayesians to restart the war • On the eve of the meeting preparing the 2016 statement I expressed my opinion that the statement should be about statistics and replicabiliy in general - not merely focused on the p-value • Unfortunately, I see no connection between the StatWars and replicability.
  • 24. When P values are reported for multiple outcomes without adjustment for multiplicity, the probability of declaring a treatment difference when none exists can be much higher than 5%. (July ‘19) The Casualties: NEJM guidelines
  • 25. 1. P-values may not be reported (for secondary endpoints) if multiplicity correction method was not specified in the protocol or in the statistical analysis plan 2. Unadjusted (marginal) 95% CIs reported for all secondary endpoints The Casualties: NEJM guidelines
  • 26. Wu et al, citing results of Manson et al NEJM 2018 Nature Reviews Cardiology (2019) Fish oil supplementation … had no significant effect on the composite primary end point of CHD, stroke or death from CVD but reduced the risk of total CHD* (HR 0.83, 95% CI 0.71–0.97), percutaneous corona intervention (HR 0.78, 95% CI 0.63–0.95), total myocardial infarction* (HR 0.72, 95% CI 0.59–0.90), fatal myocardial infarction (HR 0.50, 95% CI 0.26–0.97). The only 4 out of 17 that excluded 1 and were not exploratory.
  • 27. What’s the problem? • CI as decision tool not crossing the no-effect value ó significance testing at 0.05 The issue of multiplicity, as recognized by NEJM, does not disappear • CIs coverage Coverage deteriorates: Taking the 17 estimators from above as parameters. Generated random means with SNR as estimated above times k Selected the CIs not crossing above 1 ; 10,000 simulation Checked the average non-coverage over the so selected For K=1 0.11 ; For K=0.5 0.18 ; For K=0.01 0.56
  • 28. Selective inference by CIs is totally ignored We1 suggest: Select if nominal 95%CI does not cross 0 (=log(1)) Assure False Coverage-Rate control over the so selected via the general BY-CIs: * Replace the 4 nominal CIs that do not cross 1 by 95(1-0.05*4/17)% CIs *For others use nominal CIs 1YB, Heller, Panagiotou arXiv ‘21 To offer NEJM default guidelines retaining power for exploration
  • 29. An epidemiologist with 13k followers The Casualties: 2
  • 31. From a Lancet reviewer • How can statistical hypotheses and strategies for addressing multiplicity issues be pre-specified in an observational study, which typically requires exploratory analyses to reduce bias? • Can an author claim to have performed a confirmatory test without this pre-specification? The logical fallacy: from “Confirmatory analysis ⇒ Multiplicity adjustment” the reviewer concludes “Multiplicity adjustment ⇒ Confirmatory analysis”
  • 32. Addressing the relevant variability — Frequentists: ignored by many, but with increasing awareness; Bayesians: recognized (hierarchical modelling). Addressing evident selective inference — Frequentists: recognized, more for p-values, less for CIs & exploratory research; Bayesians: ignored as a matter of principle. The 2 pillars in Frequentist & Bayesian eyes
  • 33. E.g. Gelman, Hill & Yajima (‘12), “Why we (usually) don’t have to worry about multiple comparisons.” The underlying theoretical justification: since we condition on all the data, any selection after the data is viewed is already reflected in the posterior distribution.
  • 34. Are Bayesian intervals immune from selection’s harms? Assumed prior: µi ~ N(0, 0.5²); yi ~ N(µi, 1); i = 1, 2, …, 10⁶ (Gelman’s example). Parameters generated either by N(0, 0.5²) or by 0.999·N(0, 0.5²) + 0.001·N(0, 0.5² + 3²).
Percentages for the types of 95% confidence/credence intervals (under N(0, 0.5²): Marginal | Bayesian credibility | Marginal FCR-adjusted; under the mixture: Bayesian credibility | BH-selected FCR-adjusted):
Intervals not covering their parameter: 5.0% | 5.0% | – | 5.1% | 2.1%
Intervals not covering 0 (i.e., selected): 7.3% | 0.01% | – | 0.03% | 0.03%
Intervals not covering their parameter, out of the selected: 48% | 3.4% | 1.0% | 71.5% | 2.1%
YB HDSR ‘19
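The mechanism behind these numbers can be sketched in a small simulation: credible intervals computed under the assumed N(0, 0.5²) prior, with the true means drawn either from that prior or from the slightly contaminated mixture. Function and variable names are my own and the sample size is smaller than the slide's 10⁶, so exact percentages will differ, but the pattern (near-nominal conditional coverage when the prior is correct, severe under-coverage among the selected under 0.1% contamination) should appear.

```python
import math
import random
import statistics

TAU2 = 0.5 ** 2                    # assumed prior variance (Gelman's example)
SHRINK = TAU2 / (TAU2 + 1.0)       # posterior-mean factor when y ~ N(mu, 1)
PSD = math.sqrt(SHRINK)            # posterior standard deviation
Z = statistics.NormalDist().inv_cdf(0.975)

def noncoverage_of_selected(contam=0.0, n=400_000, seed=1):
    """Among 95% credible intervals that exclude 0, the share missing
    their parameter. With probability `contam`, mu is drawn from the
    wider N(0, 0.5^2 + 3^2) component instead of the assumed prior."""
    rng = random.Random(seed)
    sel = miss = 0
    for _ in range(n):
        var = TAU2 + 3 ** 2 if rng.random() < contam else TAU2
        mu = rng.gauss(0.0, math.sqrt(var))
        y = rng.gauss(mu, 1.0)
        lo = SHRINK * y - Z * PSD  # credible interval computed under the
        hi = SHRINK * y + Z * PSD  # (possibly wrong) N(0, 0.5^2) prior
        if lo > 0 or hi < 0:       # selected: interval excludes 0
            sel += 1
            miss += not (lo <= mu <= hi)
    return miss / sel if sel else float("nan")

print(noncoverage_of_selected(0.0))    # prior correct: close to nominal
print(noncoverage_of_selected(0.001))  # 0.1% contamination: far worse
```

The design point: selection picks out exactly the observations most likely to come from the unmodelled wide component, so the aggressive shrinkage that is harmless on average becomes disastrous on the selected few.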
  • 35. From Gelman’s blog, August ‘22 Eric van Zwet offers a solution to the above example: “We can mix the N(0, 0.5²) with almost any wider normal distribution, with almost any probability, and then very large effects will hardly be shrunken.” He demonstrates it with the prior 0.99·N(0, 0.5²) + 0.01·N(0, 6²): of 741 credible intervals not covering 0, the proportion not covering the parameter is 0.07 (CI: 0.05 to 0.09). Gelman’s response: “I continue to think that Bayesian inference completely solves the multiple comparisons problem.” So, I generated data from this prior, to which I added a 5% N(4, 0.5²) component, and got 29% of the so-selected not covering their parameter (with an order-of-magnitude power loss relative to frequentist FDR-adjusted testing and FCR CIs)
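Van Zwet's claim, that a small wide component leaves very large effects almost unshrunk, can be checked directly from the posterior mean under a two-component normal mixture prior. This helper is my own sketch of that standard conjugate calculation; the default component weights and variances are the ones quoted on the slide.

```python
import math

def mixture_posterior_mean(y, comps=((0.99, 0.5 ** 2), (0.01, 6.0 ** 2))):
    """Posterior mean of mu given y ~ N(mu, 1) under the mixture prior
    sum_k p_k * N(0, tau_k^2); defaults match van Zwet's
    0.99*N(0, 0.5^2) + 0.01*N(0, 6^2)."""
    weights, comp_means = [], []
    for p, tau2 in comps:
        v = tau2 + 1.0                              # marginal variance of y
        weights.append(p * math.exp(-y * y / (2 * v)) / math.sqrt(v))
        comp_means.append(y * tau2 / (tau2 + 1.0))  # per-component shrinkage
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, comp_means)) / total

print(mixture_posterior_mean(1.0))  # small |y|: shrunk hard toward 0
print(mixture_posterior_mean(8.0))  # large |y|: hardly shrunk at all
```

A large observation makes the wide component's marginal likelihood dominate, so its mild shrinkage factor 36/37 takes over; this is exactly why such a prior rescues the extreme intervals yet, as the slide argues, still fails once the data contain a component the prior never anticipated.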
  • 36. My point: Bayesians should worry about selective inference if they care about replicability, and cannot hide behind the theoretical guarantees. Some Bayesians do: • Connections with FDR in large inferential problems — Genovese & Wasserman ‘02, Storey et al. ‘03, … • Fdr and fdr variations on FDR in an empirical Bayes framework — Efron et al. ‘13, … • A purely Bayes model where selection should be addressed, thresholding posterior odds using BH — Yekutieli et al. ‘13
  • 37. Take-away messages Replicability can be enhanced mainly by addressing: Selective inference • Evident selective inference is as harmful as non-evident • Needed in exploratory research even when the pool is small • Needed for CIs • The ASA attitude against p-values and statistical significance is political and harms replicability The relevant variability • Prefer random-effects (mixed-model) analysis • Many small studies are better than one/few large ones Thanks to JWT for the insight
  • 38. Take-away messages Most Bayesians ignore selective inference but appreciate addressing the relevant variability; for frequentists it’s the opposite. In both research communities practitioners try to avoid addressing them: selective inference is addressed only once the research complexity grows too large, and the relevant variability only once results prove too non-replicable. This usually takes too long. Thanks to JWT for the insight
  • 39. Thanks! www.replicability.tau.ac.il [Figure: The industrialization of the scientific process, 1888–2010]