Introduction
Some basics
More understanding helped by mathematics (or not)
Interpretative and effective hypotheses
Some further issues
Overview
On the interpretation of the mathematical
characteristics of statistical tests
Christian Hennig
Christian Hennig Interpretation of tests
1. Introduction
Misunderstanding of statistical tests
and what they can tell us about reality
is a major reason for the current controversy around them.
Is it in the nature of tests to be misunderstood?
I’d say statistical reasoning as a whole
(not only tests, also all proposed alternatives)
is difficult and prone to misinterpretation.
Overview
- How mathematical modelling can help with understanding;
- how mathematical modelling can inspire misunderstanding.
Warning: Messages in this talk are ambivalent!
Much of what follows will tell the practitioner:
“There are good reasons to do X,
but X can also go badly wrong.”
What is going on?
Statistical inference is based on mathematical reasoning
in the “model world”.
The model world is essentially different from the real world.
Data connect model world and real world,
but it is far from trivial to understand
what model world results mean for the real world.
“Model-based statistical inference is valid
if and only if the model is true.”
This is misleading!
It’s not the job of models to be “true”.
Models are tools for thinking.
2. Some basics of statistical testing
Some data: Comparing course results from two years.
[Figure: two histograms, "Teacher A results" and "Teacher B results"; x-axis: marks out of 100, y-axis: frequency.]
Did the students do substantially better
with one of the teachers?
x̄ = 58.6, ȳ = 56.9, teacher A students do better on average,
but is the difference meaningful?
“How large a difference is too large?”
Key idea: Set up problem in model world!
$X_1, \dots, X_n \sim N(\mu_1, \sigma_1^2)$ i.i.d.,
$Y_1, \dots, Y_m \sim N(\mu_2, \sigma_2^2)$ i.i.d.,
derive the t-distribution of
$$T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}},$$
evaluate t = 0.75, p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2.
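The "model world" calculation can be sketched in code. This is a minimal illustration, not the computation from the talk: the class sizes n = 30 and m = 40 are hypothetical (the slides do not state them), and the p-value is estimated by simulating the null model rather than read off the t-distribution.

```python
import math
import random


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation S_p."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    s_p = math.sqrt((ss_x + ss_y) / (n + m - 2))
    return (mean_x - mean_y) / (s_p * math.sqrt(1 / n + 1 / m))


def model_world_p_value(t_obs, n, m, n_sim=4000, seed=1):
    """Estimate P{|T| >= |t_obs|} by simulating the model with mu1 = mu2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # With equal means and variances T does not depend on their values,
        # so N(0, 1) is used for both groups.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(m)]
        if abs(pooled_t(x, y)) >= abs(t_obs):
            hits += 1
    return hits / n_sim


# Hypothetical class sizes; only t = 0.75 is taken from the slide.
p_hat = model_world_p_value(t_obs=0.75, n=30, m=40)
print(f"simulated p-value for t = 0.75: {p_hat:.2f}")
```

Everything in this sketch happens inside the model: the simulated data are i.i.d. normal by construction, which is exactly what cannot be taken for granted for real marks.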
p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2.
That’s a big probability!
Observed mean differences like this or bigger
can easily happen given µ1 = µ2.
Data are compatible with µ1 = µ2!
The idea of tests is very elementary.
Set up a mathematical model for the real process,
with µ1 = µ2 corresponding to “no meaningful difference”,
then we check whether |T| is so big
that we wouldn’t expect it to happen
under “no meaningful difference” model.
Elementary general principle
for checking compatibility of data with models!
Now consider the model. . .
$X_1, \dots, X_n \sim N(\mu_1, \sigma_1^2)$ i.i.d.,
$Y_1, \dots, Y_m \sim N(\mu_2, \sigma_2^2)$ i.i.d.
Reality is not like this!
Sometimes issues can be seen from the data.
[Figure: two histograms, "Teacher A results" and "Teacher B results"; x-axis: marks out of 100, y-axis: frequency.]
The Shapiro-Wilk test rejects normality for Teacher B.
Sometimes issues cannot be seen from the data.
Constant correlation.
$X_1, \dots, X_n$ marginally $N(\mu, \sigma^2)$,
$\rho(X_i, X_j) = 0.1$ for all $i \neq j$.
[Figure: two panels, each plotting x against observation number for 1000 observations.]
This is pretty bad for inference. . .
but it’s indistinguishable from i.i.d.! (Hennig, 2021)
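One way to see why such data look i.i.d. is to simulate them. The sketch below assumes a particular construction of the constant correlation model (a single shared normal component plus independent noise, with µ = 0 and σ = 1); the construction and the choice of the lag-1 autocorrelation as a diagnostic are illustrative assumptions, not taken from the talk.

```python
import math
import random


def constant_correlation_sample(n, rho, rng):
    """Draw X_1..X_n, each N(0, 1), with corr(X_i, X_j) = rho for i != j."""
    shared = rng.gauss(0.0, 1.0)  # one component common to all observations
    return [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n)]


def lag1_autocorrelation(x):
    """Sample correlation between neighbouring observations."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


rng = random.Random(0)
x = constant_correlation_sample(1000, rho=0.1, rng=rng)
r1 = lag1_autocorrelation(x)
print(f"lag-1 autocorrelation within the sample: {r1:.3f}")
```

Within a single realisation the shared component is just a constant shift, so the observed values behave like an i.i.d. normal sample with a slightly different mean and variance. The dependence only shows across repeated realisations, which real data do not provide.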
Some correlation between students in same class
is actually realistic,
as they communicate and learn together.
But unless we have information about individual behaviour,
there is no way to see this from the data.
Sometimes issues can be seen from data
(or background knowledge) but are irrelevant.
E.g., student marks are integer numbers between 0 and 100.
Data sets with only integer numbers between 0 and 100
can never happen under normal distribution!
Normality assumption is routinely made for discrete data
with limited value range.
3. More understanding helped by mathematics (or not)
What happens to our test if the model is not true?
Remember I claimed:
correlation “pretty bad for inference”,
discrete data, limited value range “irrelevant”.
How can I know?
Mathematics (or simulation) can tell us!
We can model deviations from assumed nominal model,
then derive what our method will deliver.
(Even though a modelled deviation from nominal model
isn’t really true either.)
E.g. model data as normal with correlation 0.1,
or discretised normal between 0 and 100,
compute distribution of T.
Does it still have (roughly) same characteristics
as under nominal model?
No (correlation),
approximately yes (discretisation)
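A simulation along these lines could look as follows. It is a sketch under stated assumptions: mean 57, standard deviation 15, and 50 students per class are hypothetical values, the correlated model uses one shared component per class, and the critical value 1.984 is an approximation to the two-sided 5% point of the t-distribution with 98 degrees of freedom.

```python
import math
import random

T_CRIT = 1.984  # approximate two-sided 5% critical value, t with 98 d.f.


def pooled_t(x, y):
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / math.sqrt(ss / (n + m - 2) * (1 / n + 1 / m))


def nominal(n, rng):
    """i.i.d. normal marks (the nominal model)."""
    return [rng.gauss(57.0, 15.0) for _ in range(n)]


def correlated(n, rng, rho=0.1):
    """Normal marks with constant correlation rho within a class."""
    shared = rng.gauss(0.0, 1.0)
    return [57.0 + 15.0 * (math.sqrt(rho) * shared
                           + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0))
            for _ in range(n)]


def discretised(n, rng):
    """Normal marks rounded to integers and clipped to [0, 100]."""
    return [float(min(100, max(0, round(rng.gauss(57.0, 15.0)))))
            for _ in range(n)]


def rejection_rate(sampler, n=50, n_sim=2000, seed=1):
    """Share of simulated data sets with |T| > T_CRIT although mu1 = mu2."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        rejections += abs(pooled_t(sampler(n, rng), sampler(n, rng))) > T_CRIT
    return rejections / n_sim


rates = {f.__name__: rejection_rate(f) for f in (nominal, correlated, discretised)}
for name, rate in rates.items():
    print(f"{name:12s} rejection rate under mu1 = mu2: {rate:.3f}")
```

Under these assumptions the discretised model should stay close to the nominal 5% level, while the correlated model rejects far more often than 5%.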
- “If truth is close to the assumed model,
  distribution of T will be close to assumed.”
  Not necessarily!
  And depends on formal definition of “close”.
  E.g., gross error model $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$,
  x very far from µ.
- “If data look like typical data generated
  from assumed model,
  distribution of T will be close to assumed.”
  Not necessarily (e.g., correlation model above).
- “If assumed model is clearly violated,
  distribution of T will be very different from assumed.”
  Not necessarily either (e.g., Central Limit Theorem).
Need to understand which violations of assumed model
lead to problems, and which don’t.
(Standard misspecification testing isn’t always good at that;
Bancroft 1944, Shamsudheen & Hennig 2021)
Need to look at data, but also background information
to know potential issues that data won’t show.
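The Central Limit Theorem point can be illustrated with clearly non-normal data. The sketch below assumes both groups follow the same Exponential(1) distribution with 50 observations each; these choices are hypothetical and only serve to show a strong violation of normality that leaves the rejection rate roughly intact.

```python
import math
import random

T_CRIT = 1.984  # approximate two-sided 5% critical value, t with 98 d.f.


def pooled_t(x, y):
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / math.sqrt(ss / (n + m - 2) * (1 / n + 1 / m))


def skewed_rejection_rate(n=50, n_sim=2000, seed=5):
    """Share of rejections when both groups are Exponential(1)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.expovariate(1.0) for _ in range(n)]
        y = [rng.expovariate(1.0) for _ in range(n)]
        rejections += abs(pooled_t(x, y)) > T_CRIT
    return rejections / n_sim


rate = skewed_rejection_rate()
print(f"rejection rate with skewed data and equal means: {rate:.3f}")
```

A normality test would reject for such data at these sample sizes, yet the behaviour of T under equal means stays close to what the nominal model promises.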
Neyman-Pearson Optimality
Given a testing problem like H0 : µ1 = µ2 above,
what is the best way to construct a test?
NP: Define alternative hypothesis, optimise power against it.
“Non-rejection indicates the H0,
rejection indicates the alternative.”
I’m afraid not!
Being a model, the alternative can’t be true either.
The alternative is a device to guide test construction
via enabling the optimality statement.
This is a clever and sensible idea,
but in a real situation we need to question the model.
John W. Tukey (1962): “Danger only comes from mathematical
optimisation when the results are taken too seriously. It offers
guidance, not the answer”
Optimal test is good only if it is good for a wider range
of situations than the one where it’s optimal.
Non-optimal tests can be preferable
if robust for a larger class of models of interest.
Misinterpretation of mathematics
Mathematical statements are proved, uncontroversial,
“objective”.
Objectivity is a key aim of science!
Temptation to identify reality with mathematics,
and to take mathematics as saying more about reality (science)
than it actually does.
Mathematics does not say how reality really is,
neither does it say what a scientist should do!
Mathematics characterises methods;
what to make of the characteristics is context-dependent.
Mathematics: Test T is optimal for testing null hypothesis
against alternative in a specific model setup.
Misinterpretation 1: Test T is optimal in reality.
Misinterpretation 2: Either null hypothesis or alternative is true
in reality.
Misinterpretation 3: We have to make sure the model is true.
Misinterpretation 4: As the model is not true anyway, the test is
not informative.
Mathematics: Optimality/good performance of test T is
assured for a binary decision problem.
Misinterpretation: Binary decisions should be made in science.
Mathematics: Test T can reject a model of “no effect” against
an alternative model of effect.
Misinterpretation 1: It is necessary to reject “no effect”.
Misinterpretation 2: It is sufficient to reject “no effect”.
4. Interpretative and effective hypotheses
Extending hypotheses to non-nominal models
Inference target parameter is defined in “model world”;
but we’re interested in real world.
µ1, µ2 are thought constructs
defined within the normal model.
The real hypothesis of interest is about whether
one of the teachers gives systematically higher marks.
There’s no i.i.d., and no distribution shape implied.
If we’re curious about how test performs
if nominal model doesn’t hold (e.g., error probabilities),
we need to define what an “error” is, i.e.,
when we should reject.
This is normally only defined within nominal model!
Amounts to deciding what parameters belong to
“interpretative H0/interpretative alternative”.
Interpretative H0/H1: All distributions that model
real (unformalised) null/alternative hypothesis of interest.
Promote awareness that real hypotheses are informal
and could be modelled by many distributions.
E.g., Beta-distributions on scale between 0 and 100:
[Figure: Beta densities on the scale from 0 to 100; x-axis: x, y-axis: Beta density.]
Means are same:
[Figure: Beta densities on the scale from 0 to 100, with E(X) = E(Y) marked; x-axis: x, y-axis: Beta density.]
Medians are different - what is relevant to us?
[Figure: Beta densities on the scale from 0 to 100, with Med(X) and Med(Y) marked; x-axis: x, y-axis: Beta density.]
Test based on means will likely not reject H0,
test based on medians will likely reject.
[Figure: Beta densities on the scale from 0 to 100, with Med(X) and Med(Y) marked; x-axis: x, y-axis: Beta density.]
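Equal means with different medians are easy to produce with Beta distributions. The parameters below are hypothetical (the slides do not give the ones used in the figure); they are chosen so that both distributions have mean 80 on the scale from 0 to 100.

```python
import random
import statistics

rng = random.Random(2)
n = 5000

# Hypothetical parameters: Beta(40, 10) and Beta(1, 0.25) both have mean 0.8.
x = [100 * rng.betavariate(40, 10) for _ in range(n)]    # nearly symmetric
y = [100 * rng.betavariate(1, 0.25) for _ in range(n)]   # strongly left-skewed

mean_gap = statistics.fmean(x) - statistics.fmean(y)
median_gap = statistics.median(x) - statistics.median(y)
print(f"difference in sample means:   {mean_gap:6.2f}")
print(f"difference in sample medians: {median_gap:6.2f}")
```

A test statistic built on means sees almost no difference here, while one built on medians sees a clear one. Which of the two matches the question about the teachers is a matter of interpretation, not of the data.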
E.g., $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$
[Figure: gross error model; density against x, with the x-axis running from −4 to 4 and a separate tick at 1000.]
Are we interested in. . .
- E(X) = 0.99µ + 0.01x (potentially far from µ),
- or µ,
- or maybe the median?
This needs judgment;
data cannot decide this, neither can mathematics!
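The three candidate targets can be compared numerically. This sketch assumes µ = 0, σ = 1 and a gross error at x = 1000, which are illustrative values only; with them E(X) = 0.99 · 0 + 0.01 · 1000 = 10.

```python
import random
import statistics


def gross_error_sample(n, mu, sigma, outlier, eps, rng):
    """Draw from (1 - eps) N(mu, sigma^2) + eps * (point mass at outlier)."""
    return [outlier if rng.random() < eps else rng.gauss(mu, sigma)
            for _ in range(n)]


rng = random.Random(3)
sample = gross_error_sample(10_000, mu=0.0, sigma=1.0, outlier=1000.0,
                            eps=0.01, rng=rng)

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)
print(f"sample mean:   {sample_mean:.2f}  (estimates E(X) = 10)")
print(f"sample median: {sample_median:.2f}  (stays close to mu = 0)")
```

The sample mean follows E(X), the median stays near µ. Neither is wrong as a computation; they answer different questions.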
Note that Central Limit Theorem is about estimating E(X),
which may not be in line with interpretative hypothesis,
so what the t-test does based on CLT may be misleading
even though CLT applies.
Interpretative similarity under nominal model:
Tests of point null hypotheses are often criticised
for rejecting H0 for too large n in presence of
substantially meaningless deviations from H0.
This is a problem because test ignores that
parameter values very close to H0 are often
interpretatively more similar to H0 than H1.
(Need to consider effect size, severity etc.
to not misinterpret rejection of formal H0.)
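How quickly a substantially meaningless deviation gets rejected can be worked out with a normal approximation to the power. The numbers are hypothetical: a true difference of half a mark, standard deviation 15, and n students per group.

```python
import math
from statistics import NormalDist


def approx_power(delta, sigma, n, alpha=0.05):
    """Normal approximation to the power of a two-sided two-sample test
    with n observations per group and true mean difference delta."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * math.sqrt(2 / n))
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


for n in (100, 10_000, 1_000_000):
    power = approx_power(delta=0.5, sigma=15.0, n=n)
    print(f"n = {n:>9,d} per group: rejection probability {power:.3f}")
```

The difference of half a mark is the same in every row; only n changes. With small n the test hardly ever rejects, with very large n it rejects almost always, although the interpretation of half a mark has not changed.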
What do tests actually do
if we don’t take model assumptions for granted?
Rejection region R ⇒ tests
- “effective H0”: any P for which P(R) ≤ α, against
- “effective H1”: any P for which P(R) large.
This provides a nonparametric definition of a test
that originally might well be parametric.
Note that under P with α < P(R) but P(R) not large,
the test will reject more easily than under H0,
but can’t be expected to reject.
Such distributions are in a “grey area” w.r.t. the test.
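The effective hypotheses can be explored by estimating P(R) for any distribution P one can simulate from. The sketch below is one possible way to do this: the cut-off 0.8 for "large", the Monte Carlo tolerance, the sample size of 20 per group, and the use of shifted normal distributions as examples are all assumptions made for illustration.

```python
import math
import random

ALPHA = 0.05
T_CRIT = 2.024        # approximate two-sided 5% critical value, t with 38 d.f.
LARGE = 0.8           # hypothetical cut-off for calling P(R) "large"
MC_TOLERANCE = 0.015  # allowance for Monte Carlo error around ALPHA


def pooled_t(x, y):
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / math.sqrt(ss / (n + m - 2) * (1 / n + 1 / m))


def estimate_p_reject(draw_x, draw_y, n=20, n_sim=4000, seed=4):
    """Monte Carlo estimate of P(R) for R = {|T| > T_CRIT} under a given P."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [draw_x(rng) for _ in range(n)]
        y = [draw_y(rng) for _ in range(n)]
        rejections += abs(pooled_t(x, y)) > T_CRIT
    return rejections / n_sim


def classify(p_reject):
    if p_reject <= ALPHA + MC_TOLERANCE:
        return "effective H0"
    if p_reject >= LARGE:
        return "effective H1"
    return "grey area"


def normal(mean):
    """Return a function drawing one N(mean, 1) observation."""
    return lambda rng: rng.gauss(mean, 1.0)


results = {shift: classify(estimate_p_reject(normal(shift), normal(0.0)))
           for shift in (0.0, 0.5, 1.5)}
print(results)
```

Any other pair of samplers (skewed, discrete, correlated) can be passed to `estimate_p_reject` in the same way, which is what makes the definition nonparametric.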
t-test with $T = \frac{\bar{X} - \bar{Y}}{S_p/\sqrt{n}}$,
rejecting H0 for |T| > cα,
can be interpreted as testing general nonparametric
effective H0 : P is such that P{|T| > cα} ≤ α against
effective H1 : P is such that P{|T| > cα} large.
The key issue then is:
Does definition of T indicate the desired direction
of deviation from the interpretative H0?
Rather than “are the assumptions fulfilled”? (Which they aren’t.)
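Whether T points in the desired direction can be probed by simulation. The following sketch is an illustration only; the distributions, sample size and function names are assumptions, not taken from the talk. It compares an exponential sample with a normal sample that has either the same mean or the same median. Since T aggregates the data through means, the first pair should behave like a member of the effective H0 and the second should not, although an informal hypothesis of "no difference in typical value" could be read either way.

```python
import math
import random
from statistics import NormalDist


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation S_p."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((a - m1) ** 2 for a in x)
    ss2 = sum((b - m2) ** 2 for b in y)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def rejection_rate(y_location, n=100, alpha=0.05, reps=1000, seed=7):
    """Estimate P{|T| > c_alpha} for X ~ Exp(1) versus Y ~ N(y_location, 1)."""
    rng = random.Random(seed)
    c_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # large-sample approximation
    hits = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(n)]
        y = [rng.gauss(y_location, 1.0) for _ in range(n)]
        hits += abs(pooled_t(x, y)) > c_alpha
    return hits / reps


same_mean = rejection_rate(1.0)            # Exp(1) has mean 1
same_median = rejection_rate(math.log(2))  # Exp(1) has median ln 2
print(f"same mean:   estimated P(R) = {same_mean:.3f}")
print(f"same median: estimated P(R) = {same_median:.3f}")
```

If the informal hypothesis is about medians rather than means, the second estimate shows that T does not indicate the desired direction of deviation in this setup.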
This amounts to understanding whether $T = \frac{\bar{X} - \bar{Y}}{S_p/\sqrt{n}}$,
as an aggregation of the information in the data,
is “interpretatively correct”, i.e. whether
effective H0/H1 correspond well to interpretative H0/H1.
Need to understand properties of X̄, Ȳ, and Sp such as
breakdown under gross outliers,
behaviour under skewness.
Statisticians tend to think of these statistics as
optimal under certain models,
but they also have a data-analytic meaning,
and this is crucial to understand for use in inference
without taking the model for granted.
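The breakdown under gross outliers can be made concrete with a small numerical sketch. It is an illustration under assumptions that are not from the talk: the pooled two-sample statistic, simulated normal samples of size 40 with a mean difference of 1.5, and a single hypothetical gross outlier of size 10^6 added to the first sample.

```python
import math
import random


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation S_p."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((a - m1) ** 2 for a in x)
    ss2 = sum((b - m2) ** 2 for b in y)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


rng = random.Random(0)
x = [rng.gauss(1.5, 1.0) for _ in range(40)]  # assumed clear mean difference
y = [rng.gauss(0.0, 1.0) for _ in range(40)]

t_clean = pooled_t(x, y)
# One gross outlier inflates the mean difference and S_p at the same time.
t_outlier = pooled_t(x + [1e6], y)

print(f"T without outlier:        {t_clean:.2f}")
print(f"T with one gross outlier: {t_outlier:.2f}")
```

In this setup T with the outlier ends up close to 1 however large the outlier is made, so a single observation can prevent rejection even though the bulk of the data shows a clear mean difference.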
Robust and nonparametric methods
Binary thinking
Multiple testing
5. Some further issues
Robust and nonparametric methods
Good options, but not always more in line
with interpretative hypotheses.
Binary thinking
“Data are either compatible with model or not?”
Actually it’s gradual, reflected by p-value.
It makes a difference whether $p = 10^{-6}$ or p = 0.035.
It does not make much of a difference
whether p = 0.45 or p = 0.75.
Decision thresholds?
Sometimes decisions have to be made.
Concepts such as error probabilities,
false discovery rates, replication rely on thresholds.
Interpretation in language is essentially discrete,
implicitly requires thresholds.
It seems hard to swallow
that thresholds are essentially arbitrary,
yet are needed!
Multiple testing
increases the probability of “false positives”.
Methods can mathematically control
overall type I error probability (Bonferroni)
or “false discovery rate” (Benjamini-Yekutieli).
When to control how?
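As a minimal sketch of what the two kinds of control look like in practice, the code below computes adjusted p-values for both methods. It uses the standard textbook forms: Bonferroni multiplies each p-value by the number of tests k, and the Benjamini-Yekutieli step-up adjustment multiplies the p-value of rank j by k·c(k)/j with c(k) = 1 + 1/2 + ... + 1/k, then enforces monotonicity. The function names and the example p-values are made up for illustration.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values (control of the overall type I error)."""
    k = len(pvals)
    return [min(1.0, k * p) for p in pvals]


def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted p-values (false discovery rate control)."""
    k = len(pvals)
    c_k = sum(1.0 / i for i in range(1, k + 1))
    order = sorted(range(k), key=lambda i: pvals[i])  # indices by increasing p
    adjusted = [0.0] * k
    running_min = 1.0
    # Step-up: go from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(k, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, k * c_k / rank * pvals[idx])
        adjusted[idx] = running_min
    return adjusted


example_p = [0.04, 0.001, 0.5, 0.01, 0.03]  # hypothetical p-values
for p, b, by in zip(example_p, bonferroni(example_p), benjamini_yekutieli(example_p)):
    print(f"p = {p:<6} Bonferroni = {b:.4f}  BY = {by:.4f}")
```

The code only shows how each adjustment is computed; which one is appropriate for a given study is exactly the question the mathematics leaves open.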
“In our study we run k tests (of several kinds).
How should we adjust our p-values for multiple testing?”
“k research groups run the same k tests
and publish the results in k papers.
Should they adjust for multiple testing in the same way?”
Mathematics doesn’t address this,
and there is no unique objective answer!
Problem is binary thinking.
Researchers want to know whether their results
are ultimately significant discoveries or not
so they feel there should be an unambiguous rule.
But once more, no way around judgment.
Even with multiple tests, an individual test with p = 0.045
indicates a tendency against its specific H0,
if a quite weak one.
Multiple testing corrections trade
assurance against false positives off against power.
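The trade-off can be quantified in an idealised setting. The sketch below assumes a two-sided z-test, a known standardised effect δ = 2.8, and k = 20 independent tests; all of these are assumptions chosen for illustration, as are the function names. It compares the power of a single test at level α with its power at the Bonferroni level α/k, together with the chance of at least one false positive among k true null hypotheses.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided_z(delta, alpha):
    """Power of a two-sided z-test when the standardised effect is delta."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z - delta)) + STD_NORMAL.cdf(-z - delta)


def familywise_error(alpha, k):
    """P(at least one false positive) for k independent true null hypotheses."""
    return 1 - (1 - alpha) ** k


alpha, k, delta = 0.05, 20, 2.8  # assumed values
power_uncorrected = power_two_sided_z(delta, alpha)
power_bonferroni = power_two_sided_z(delta, alpha / k)
fwer_uncorrected = familywise_error(alpha, k)
fwer_bonferroni = familywise_error(alpha / k, k)

print(f"power:            {power_uncorrected:.3f} -> {power_bonferroni:.3f}")
print(f"familywise error: {fwer_uncorrected:.3f} -> {fwer_bonferroni:.3f}")
```

Under these assumptions the correction brings the familywise error from roughly 0.64 down to below 0.05, while the power for the assumed effect drops from about 0.80 to about 0.41.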
6. Overview
Message 1 When working with models,
always keep difference
between models and reality in mind.
Message 2 Objectivity of mathematics implies
temptation to identify
mathematical definitions and results with reality.
Idea 1 Model the deviation of reality from the nominal model.
Message 3 We are not safe.
Dangerous deviations from model
cannot be reliably detected
(it makes sense to try though).
Idea 2 Interpretative hypotheses:
models corresponding to real informal hypotheses;
far “bigger” than nominal hypotheses
Need to understand them
to understand test performance
if nominal model does not hold.
Idea 3 Effective hypotheses:
By definition, parametric test distinguishes
nonparametric classes of distributions!
Understand how this relates
to interpretative hypotheses.
Message 4 Beware of binary thinking,
use thresholds anyway.
Message 5 Robustness considerations are central,
but robust/nonparametric methods are not always
better.
Message 6 There’s no unique objective correction
for multiple testing.
It needs subjective and context-dependent
judgment,
as does much if not all of statistics.
Experience from statistical advisory
It’s often the most intelligent clients
who believe they don’t understand statistics.
Why? Because they don’t understand
what cannot be understood
(e.g., why should we believe in a model?),
but what they’re made to believe they have to accept.
Statistics is hard and can be confusing;
if we’re honest, we don’t present it as easy.
References
Bancroft, T. A. (1944) On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics 15, 190-204.
Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing with dependency. Annals of Statistics 29, 1165-1188.
Davies, P. L. (2014) Data Analysis and Approximate Models. Chapman and Hall/CRC, New York.
Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society. Series A: Statistics in Society 180, 967-1033.
Hampel, F. R. (1998) Is Statistics Too Difficult? The Canadian Journal of Statistics 26, 497-513.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Hennig, C. (2010) Mathematical models and reality: A constructivist perspective. Foundations of Science 15, 29-48.
Hennig, C. (2020) Frequentism-as-model. arXiv:2007.05748.
Hennig, C. (2021) Parameters not identifiable or distinguishable from data, including correlation between Gaussian observations. arXiv:2108.09227.
Mayo, D. G. (2018) Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press, Cambridge.
Shamsudheen, M. I. and Hennig, C. (2021) Should we test the model assumptions before running a model-based test? arXiv:1908.02218.
Tukey, J. W. (1962) The future of data analysis. Annals of Mathematical Statistics 33, 1-67.
Tukey, J. W. (1997) More honest foundations for data analysis. Journal of Statistical Planning and Inference 57, 21-28.
Christian Hennig Interpretation of tests

More Related Content

PDF
A. spanos slides ch14-2013 (4)
PDF
Testing &amp; measurement
PDF
Research Methodology Module-05
PDF
D. G. Mayo Columbia slides for Workshop on Probability &Learning
PPTX
theory testing in psychology: risky predictions and that pesky data prior
DOCX
Data Mining Avoiding False DiscoveriesLecture Notes for Chapt
PPTX
Pearson Correlation
PPT
EEX 501 Assessment
A. spanos slides ch14-2013 (4)
Testing &amp; measurement
Research Methodology Module-05
D. G. Mayo Columbia slides for Workshop on Probability &Learning
theory testing in psychology: risky predictions and that pesky data prior
Data Mining Avoiding False DiscoveriesLecture Notes for Chapt
Pearson Correlation
EEX 501 Assessment

Similar to On the interpretation of the mathematical characteristics of statistical tests .pdf (20)

PPT
EEX 501 Assess Ch4,5,6,7,All
PPTX
Research Hypotheses Saroj (1).pptxztkkxlfludyyodyo
PPT
MELJUN CORTES research lectures_evaluating_data_statistical_treatment
PPTX
Seminar iv
PPTX
Crash Course in A/B testing
PDF
Basic Statistical Concepts.pdf
PPTX
Inferential Statistics - DAY 4 - B.Ed - AIOU
PPTX
Inferential Statistics
PPTX
Practical significance of effect size in O I evaluation.pptx
PPTX
Using SPSS in Education Part 2
PDF
ECONOMETRICS I ASA
PPTX
CTT, Reliability, IRT, Factor Analysis.pptx
PDF
What is the Philosophy of Statistics? (and how I was drawn to it)
DOCX
Parametric Statistics
PPT
1. Understanding research and statistics.ppt
PDF
SPD-531 Professional Development Presentation_ Descriptive Statistics.pdf
PPT
PPTX
Ability tests and Achievement tests
PPT
Quantitative measurement
EEX 501 Assess Ch4,5,6,7,All
Research Hypotheses Saroj (1).pptxztkkxlfludyyodyo
MELJUN CORTES research lectures_evaluating_data_statistical_treatment
Seminar iv
Crash Course in A/B testing
Basic Statistical Concepts.pdf
Inferential Statistics - DAY 4 - B.Ed - AIOU
Inferential Statistics
Practical significance of effect size in O I evaluation.pptx
Using SPSS in Education Part 2
ECONOMETRICS I ASA
CTT, Reliability, IRT, Factor Analysis.pptx
What is the Philosophy of Statistics? (and how I was drawn to it)
Parametric Statistics
1. Understanding research and statistics.ppt
SPD-531 Professional Development Presentation_ Descriptive Statistics.pdf
Ability tests and Achievement tests
Quantitative measurement
Ad

More from jemille6 (20)

PDF
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
PDF
Severity as a basic concept in philosophy of statistics
PDF
“The importance of philosophy of science for statistical science and vice versa”
PDF
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
PDF
D. Mayo JSM slides v2.pdf
PDF
reid-postJSM-DRC.pdf
PDF
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
PDF
Causal inference is not statistical inference
PDF
What are questionable research practices?
PDF
What's the question?
PDF
The neglected importance of complexity in statistics and Metascience
PDF
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
PDF
On Severity, the Weight of Evidence, and the Relationship Between the Two
PDF
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
PDF
Comparing Frequentists and Bayesian Control of Multiple Testing
PPTX
Good Data Dredging
PDF
The Duality of Parameters and the Duality of Probability
PDF
Error Control and Severity
PDF
The Statistics Wars and Their Causalities (refs)
PDF
The Statistics Wars and Their Casualties (w/refs)
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
Severity as a basic concept in philosophy of statistics
“The importance of philosophy of science for statistical science and vice versa”
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
D. Mayo JSM slides v2.pdf
reid-postJSM-DRC.pdf
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Causal inference is not statistical inference
What are questionable research practices?
What's the question?
The neglected importance of complexity in statistics and Metascience
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
On Severity, the Weight of Evidence, and the Relationship Between the Two
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Comparing Frequentists and Bayesian Control of Multiple Testing
Good Data Dredging
The Duality of Parameters and the Duality of Probability
Error Control and Severity
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Casualties (w/refs)
Ad

Recently uploaded (20)

PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Computing-Curriculum for Schools in Ghana
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PPTX
Lesson notes of climatology university.
PDF
Classroom Observation Tools for Teachers
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
01-Introduction-to-Information-Management.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Pharma ospi slides which help in ospi learning
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
O5-L3 Freight Transport Ops (International) V1.pdf
Computing-Curriculum for Schools in Ghana
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
202450812 BayCHI UCSC-SV 20250812 v17.pptx
O7-L3 Supply Chain Operations - ICLT Program
Module 4: Burden of Disease Tutorial Slides S2 2025
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Lesson notes of climatology university.
Classroom Observation Tools for Teachers
Supply Chain Operations Speaking Notes -ICLT Program
Pharmacology of Heart Failure /Pharmacotherapy of CHF
01-Introduction-to-Information-Management.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Final Presentation General Medicine 03-08-2024.pptx
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Anesthesia in Laparoscopic Surgery in India
Pharma ospi slides which help in ospi learning

On the interpretation of the mathematical characteristics of statistical tests .pdf

  • 1. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview On the interpretation of the mathematical characteristics of statistical tests Christian Hennig Christian Hennig Interpretation of tests
  • 2. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview 1. Introduction Misunderstanding of statistical tests and what they can tell us about reality is a major reason for the current controversy around them. Is it in the nature of tests to be misunderstood? Christian Hennig Interpretation of tests
  • 3. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview 1. Introduction Misunderstanding of statistical tests and what they can tell us about reality is a major reason for the current controversy around them. Is it in the nature of tests to be misunderstood? I’d say statistical reasoning as a whole (not only tests, also all proposed alternatives) is difficult and prone to misinterpretation. Christian Hennig Interpretation of tests
  • 4. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview I How mathematical modelling can help with understanding; I how mathematical modelling can inspire misunderstanding. Christian Hennig Interpretation of tests
  • 5. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview I How mathematical modelling can help with understanding; I how mathematical modelling can inspire misunderstanding. Warning: Messages in this talk are ambivalent! Much of what follows will tell the practitioner: “There are good reasons to do X, but X can also go badly wrong.” Christian Hennig Interpretation of tests
  • 6. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview What is going on? Christian Hennig Interpretation of tests
  • 7. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview Christian Hennig Interpretation of tests
  • 8. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview Christian Hennig Interpretation of tests
  • 9. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview Christian Hennig Interpretation of tests
  • 10. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview Statistical inference is based on mathematical reasoning in the “model world”. The model world is essentially different from the real world. Data connect model world and real world, but it is far from trivial to understand what model world results mean for the real world. Christian Hennig Interpretation of tests
  • 11. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview “Model-based statistical inference is valid if and only if the model is true.” Christian Hennig Interpretation of tests
  • 12. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview “Model-based statistical inference is valid if and only if the model is true.” This is misleading! It’s not the job of models to be “true”. Models are tools for thinking. Christian Hennig Interpretation of tests
  • 13. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model 2. Some basics of statistical testing Some data: Comparing course results from two years. Teacher A results Marks out of 100 Frequency 0 20 40 60 80 100 0 1 2 3 4 5 10 20 30 40 50 60 70 80 90 100 Teacher B results Marks out of 100 Frequency 0 20 40 60 80 100 0 5 10 15 10 20 30 40 50 60 70 80 90 100 Christian Hennig Interpretation of tests
  • 14. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Did the students do substantially better with one of the teachers? x̄ = 58.6, ȳ = 56.9, teacher A students do better on average, but is the difference meaningful? Christian Hennig Interpretation of tests
  • 15. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Did the students do substantially better with one of the teachers? x̄ = 58.6, ȳ = 56.9, teacher A students do better on average, but is the difference meaningful? “How large a difference is too large?” Christian Hennig Interpretation of tests
  • 16. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Key idea: Set up problem in model world! X1, . . . , Xn ∼ N(µ1, σ2 1) i.i.d., Y1, . . . , Ym ∼ N(µ2, σ2 2) i.i.d., derive t-distribution of T = X̄ − Ȳ Sp q 1 n1 + 1 n2 , evaluate t = 0.75, p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2. Christian Hennig Interpretation of tests
  • 17. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2. That’s a big probability! Observed mean differences like this or bigger can easily happen given µ1 = µ2. Data are compatible with µ1 = µ2! Christian Hennig Interpretation of tests
  • 18. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model The idea of tests is very elementary. Set up a mathematical model for the real process, with µ1 = µ2 corresponding to “no meaningful difference”, then we check whether |T| is so big that we wouldn’t expect it to happen under “no meaningful difference” model. Elementary general principle for checking compatibility of data with models! Christian Hennig Interpretation of tests
  • 19. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Now consider the model. . . X1, . . . , Xn ∼ N(µ1, σ2 1) i.i.d., Y1, . . . , Ym ∼ N(µ2, σ2 2) i.i.d.. Reality is not like this! Christian Hennig Interpretation of tests
  • 20. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Sometimes issues can be seen from the data. Teacher A results Marks out of 100 Frequency 0 20 40 60 80 100 0 1 2 3 4 5 10 20 30 40 50 60 70 80 90 100 Teacher B results Marks out of 100 Frequency 0 20 40 60 80 100 0 5 10 15 10 20 30 40 50 60 70 80 90 100 Shapiro-Wilks rejects normality for Teacher B. Christian Hennig Interpretation of tests
  • 21. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Sometimes issues cannot be seen from the data. Constant correlation. X1, . . . , Xn marginally N(µ, σ2), ρ(Xi, Xj) = 0.1 ∀i, j. 0 200 400 600 800 1000 −3 −2 −1 0 1 2 3 Observation x 0 200 400 600 800 1000 −2 −1 0 1 2 Observation x This is pretty bad for inference. . . but it’s indistinguishable from i.i.d.! (Hennig, 2021) Christian Hennig Interpretation of tests
  • 22. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Some correlation between students in same class is actually realistic, as they communicate and learn together. But unless we have information about individual behaviour, there is no way to see this from the data. Christian Hennig Interpretation of tests
  • 23. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview The key idea Reality is not like the model Sometimes issues can be seen from data (or background knowledge) but are irrelevant. E.g., student marks are integer numbers between 0 and 100. Data sets with only integer numbers between 0 and 100 can never happen under normal distribution! Normality assumption is routinely made for discrete data with limited value range. Christian Hennig Interpretation of tests
  • 24. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview What if the model is not true? Neyman-Pearson Optimality Misinterpretation of mathematics 3. More understanding helped by mathematics (or not) What happens to our test if the model is not true? Remember I claimed: correlation “pretty bad for inference”, discrete data, limited value range “irrelevant”. How can I know? Christian Hennig Interpretation of tests
  • 25. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview What if the model is not true? Neyman-Pearson Optimality Misinterpretation of mathematics Mathematics (or simulation) can tell us! We can model deviations from assumed nominal model, then derive what our method will deliver. (Even though a modelled deviation from nominal model isn’t really true either.) Christian Hennig Interpretation of tests
  • 26. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview What if the model is not true? Neyman-Pearson Optimality Misinterpretation of mathematics Mathematics (or simulation) can tell us! We can model deviations from assumed nominal model, then derive what our method will deliver. (Even though a modelled deviation from nominal model isn’t really true either.) E.g. model data as normal with correlation 0.1, or discretised normal between 0 and 100, compute distribution of T. Does it still have (roughly) same characteristics as under nominal model? Christian Hennig Interpretation of tests
  • 27. Introduction Some basics More understanding helped by mathematics (or not) Interpretative and effective hypotheses Some further issues Overview What if the model is not true? Neyman-Pearson Optimality Misinterpretation of mathematics Mathematics (or simulation) can tell us! We can model deviations from assumed nominal model, then derive what our method will deliver. (Even though a modelled deviation from nominal model isn’t really true either.) E.g. model data as normal with correlation 0.1, or discretised normal between 0 and 100, compute distribution of T. Does it still have (roughly) same characteristics as under nominal model? No (correlation), approximately yes (discretisation) Christian Hennig Interpretation of tests
  • 32. Claim: “If truth is close to the assumed model, distribution of T will be close to assumed.” Not necessarily! And it depends on the formal definition of “close”. E.g., gross error model $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$, with x very far from µ. Claim: “If data look like typical data generated from assumed model, distribution of T will be close to assumed.” Not necessarily (e.g., correlation model above).
  • 35. Claim: “If assumed model is clearly violated, distribution of T will be very different from assumed.” Not necessarily either (e.g., Central Limit Theorem). We need to understand which violations of the assumed model lead to problems, and which don’t. (Standard misspecification testing isn’t always good at that; Bancroft 1944, Shamsudheen & Hennig 2021.) We need to look at the data, but also at background information, to know about potential issues that the data won’t show.
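The Central Limit Theorem remark can be illustrated with a small simulation. This is a sketch under assumptions not taken from the talk: both groups are i.i.d. exponential with rate 1 (clearly non-normal), 50 observations per group, the pooled equal-n t statistic, and an approximate critical value of 1.984.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, crit = 50, 5000, 1.984  # crit: approx. 0.975 quantile of t with 98 df

# Both groups follow the same strongly skewed distribution, so H0 of equal means holds.
x = rng.exponential(1.0, size=(reps, n))
y = rng.exponential(1.0, size=(reps, n))

# Pooled t statistic, computed for all replications at once.
sp2 = (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2.0
t = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(sp2 * 2.0 / n)
rate = np.mean(np.abs(t) > crit)

# Sample skewness of the raw data, to document how non-normal they are.
z = (x - x.mean()) / x.std()
skew = np.mean(z ** 3)

print(f"skewness of raw data: {skew:.2f}")
print(f"estimated P(reject H0): {rate:.3f}")
```

The raw data are far from normal, yet the estimated rejection rate can be compared directly with the nominal 5% level. This only addresses the level of the test; it says nothing about whether the mean is the quantity of interest.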
  • 39. Neyman-Pearson Optimality. Given a testing problem like H0: $\mu_1 = \mu_2$ above, what is the best way to construct a test? NP: Define an alternative hypothesis, optimise power against it. Claim: “Non-rejection indicates the H0, rejection indicates the alternative.” I’m afraid not!
  • 40. Being a model, the alternative can’t be true either. The alternative is a device to guide test construction by enabling the optimality statement. This is a clever and sensible idea, but in a real situation we need to question the model.
  • 41. John W. Tukey (1962): “Danger only comes from mathematical optimisation when the results are taken too seriously. It offers guidance, not the answer.” An optimal test is good only if it is good for a wider range of situations than the one where it’s optimal. Non-optimal tests can be preferable if they are robust for a larger class of models of interest.
  • 42. Misinterpretation of mathematics. Mathematical statements are proved, uncontroversial, “objective”. Objectivity is a key aim of science! This creates a temptation to identify reality with mathematics, and to take mathematics as saying more about reality (science) than it actually does.
  • 43. Mathematics does not say how reality really is, nor does it say what a scientist should do! Mathematics characterises methods; what to make of the characteristics is context-dependent.
  • 50. Mathematics: Test T is optimal for testing null hypothesis against alternative in a specific model setup. Misinterpretation 1: Test T is optimal in reality. Misinterpretation 2: Either null hypothesis or alternative is true in reality. Misinterpretation 3: We have to make sure the model is true. Misinterpretation 4: As the model is not true anyway, the test is not informative. Mathematics: Optimality/good performance of test T is assured for a binary decision problem. Misinterpretation: Binary decisions should be made in science. Mathematics: Test T can reject a model of “no effect” against an alternative model of effect. Misinterpretation 1: It is necessary to reject “no effect”. Misinterpretation 2: It is sufficient to reject “no effect”.
  • 51. 4. Interpretative and effective hypotheses. Extending hypotheses to non-nominal models. The inference target parameter is defined in the “model world”, but we’re interested in the real world. $\mu_1$, $\mu_2$ are thought constructs defined within the normal model. The real hypothesis of interest is about whether one of the teachers gives systematically higher marks. There’s no i.i.d., and no distribution shape implied.
  • 52. If we’re curious about how a test performs if the nominal model doesn’t hold (e.g., error probabilities), we need to define what an “error” is, i.e., when we should reject. This is normally only defined within the nominal model! It amounts to deciding which parameters belong to the “interpretative H0/interpretative alternative”.
  • 53. Interpretative H0/H1: All distributions that model real (unformalised) null/alternative hypothesis of interest. Promote awareness that real hypotheses are informal and could be modelled by many distributions.
  • 55. E.g., Beta distributions on a scale between 0 and 100. [Figure: Beta densities; x-axis: x from 0 to 100; y-axis: Beta density.]
  • 56. Means are the same. [Figure: the same Beta densities, annotated E(X)=E(Y).]
  • 57. Medians are different: what is relevant to us? [Figure: the same Beta densities, with Med(X) and Med(Y) marked.]
  • 58. A test based on means will likely not reject H0; a test based on medians will likely reject. [Figure: the same Beta densities, with Med(X) and Med(Y) marked.]
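The contrast between a mean-based and a median-based test can be sketched as below. The Beta parameters used in the talk are not given, so the pair Beta(1, 3) and Beta(0.2, 0.6), both rescaled to 0..100 and both with mean 25, is a hypothetical choice. The mean-based test is a Welch-type z test and the median-based test is Mood's median test, written out by hand; both use large-sample critical values (1.96 and 3.841), which is an assumption that seems reasonable at 200 observations per group.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameter choices: both distributions have mean a/(a+b) = 0.25.
def draw_x(size):
    return 100.0 * rng.beta(1.0, 3.0, size)

def draw_y(size):
    return 100.0 * rng.beta(0.2, 0.6, size)

# Large samples to approximate the population means and medians.
big_x, big_y = draw_x(200_000), draw_y(200_000)
mean_gap = abs(big_x.mean() - big_y.mean())
median_gap = abs(np.median(big_x) - np.median(big_y))

def mean_test_rejects(x, y, crit=1.96):
    # Welch-type statistic with a normal critical value (large-sample approximation).
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return abs(x.mean() - y.mean()) / se > crit

def median_test_rejects(x, y, crit=3.841):
    # Mood's median test: 2x2 chi-square on counts above/below the pooled median.
    m = np.median(np.concatenate([x, y]))
    obs = np.array([[np.sum(x > m), np.sum(x <= m)],
                    [np.sum(y > m), np.sum(y <= m)]], dtype=float)
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    return np.sum((obs - exp) ** 2 / exp) > crit

reps, n = 1000, 200
mean_rate = np.mean([mean_test_rejects(draw_x(n), draw_y(n)) for _ in range(reps)])
median_rate = np.mean([median_test_rejects(draw_x(n), draw_y(n)) for _ in range(reps)])

print(f"gap between means:   {mean_gap:.2f}")
print(f"gap between medians: {median_gap:.2f}")
print(f"mean-based test rejects in   {mean_rate:.1%} of samples")
print(f"median-based test rejects in {median_rate:.1%} of samples")
```

Which of the two answers is the right one depends on whether the interpretative hypothesis is about means or about medians; the code cannot settle that.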
  • 59. E.g., $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$. [Figure: “Gross error model”; density against x, with x-axis ticks from −4 to 4 and at 1000.] Are we interested in $E(X) = 0.99\mu + 0.01x$ (potentially far from µ), or in µ, or maybe in the median? This needs judgment: data cannot decide this, neither can mathematics!
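A short numerical sketch of the gross error model may help to see how far the candidate targets can be apart. The values µ = 0, σ = 1 and x = 1000 are assumptions chosen for illustration (the tick at 1000 in the plot suggests, but does not confirm, a similar setup).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, x_out, eps, n = 0.0, 1.0, 1000.0, 0.01, 200_000

# Draw from 0.99 * N(mu, sigma^2) + 0.01 * (point mass at x_out).
is_outlier = rng.random(n) < eps
sample = np.where(is_outlier, x_out, rng.normal(mu, sigma, n))

expected_mean = (1.0 - eps) * mu + eps * x_out  # E(X) = 0.99*mu + 0.01*x

print(f"E(X) from the formula: {expected_mean:.2f}")
print(f"sample mean:           {sample.mean():.2f}")
print(f"sample median:         {np.median(sample):.3f}")
print(f"mu:                    {mu:.2f}")
```

The sample mean tracks E(X), while the sample median stays close to µ. Neither is wrong as an estimate; they simply estimate different things.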
  • 60. Note that the Central Limit Theorem is about estimating E(X), which may not be in line with the interpretative hypothesis, so what the t-test does based on the CLT may be misleading even though the CLT applies.
  • 61. Interpretative similarity under the nominal model: Tests of point null hypotheses are often criticised for rejecting H0 at very large n in the presence of substantially meaningless deviations from H0. This is a problem because the test ignores that parameter values very close to H0 are often interpretatively more similar to H0 than to H1. (We need to consider effect size, severity etc. so as not to misinterpret rejection of the formal H0.)
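The large-n effect can be made concrete with a normal approximation to the power of a two-sided two-sample test. This is a rough sketch: the standardised effect size 0.02 is a hypothetical example of a “substantially meaningless” deviation, n is the size of each group, and `approx_power` is an illustrative helper, not a function from any library.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def approx_power(delta, n, z_crit=1.96):
    """Approximate P(reject H0) for standardised mean difference delta, n per group."""
    ncp = delta * sqrt(n / 2.0)  # expected value of the test statistic
    return (1.0 - norm_cdf(z_crit - ncp)) + norm_cdf(-z_crit - ncp)

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,d}: P(reject) for delta = 0.02 is about {approx_power(0.02, n):.3f}")
```

The same tiny deviation goes from practically undetectable to practically certain rejection as n grows, which is why a rejection needs to be read together with the effect size.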
  • 63. What do tests actually do if we don’t take model assumptions for granted? Rejection region R ⇒ the test tests the “effective H0”: any P for which P(R) ≤ α, against the “effective H1”: any P for which P(R) is large. This provides a nonparametric definition of a test that originally might well be parametric.
  • 64. Note that under P with α < P(R) but P(R) not large, the test will reject more easily than under H0, but can’t be expected to reject. Such distributions are in a “grey area” w.r.t. the test.
  • 68. The t-test with $T = \frac{\bar{X} - \bar{Y}}{S_p/\sqrt{n}}$, rejecting H0 for $|T| > c_\alpha$, can be interpreted as testing the general nonparametric effective H0: P is such that $P\{|T| > c_\alpha\} \le \alpha$, against the effective H1: P is such that $P\{|T| > c_\alpha\}$ is large. The key issue then is: Does the definition of T indicate the desired direction of deviation from the interpretative H0? Rather than “are the assumptions fulfilled?” (Which they aren’t.)
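One way to explore effective hypotheses is to estimate P(R) by simulation for a candidate distribution P and then sort P into effective H0, grey area, or effective H1. The sketch below rests on several assumptions that are not in the talk: the usual pooled equal-n t statistic, 50 observations per group, an approximate critical value of 1.984, a hypothetical cut-off of 0.8 for “large”, and a small tolerance for Monte Carlo error.

```python
import numpy as np

def estimate_p_reject(draw_pair, n=50, reps=4000, crit=1.984, seed=4):
    """Monte Carlo estimate of P(R) = P(|T| > crit) under the distribution in draw_pair."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x, y = draw_pair(rng, n)
        sp2 = (x.var(ddof=1) + y.var(ddof=1)) / 2.0
        t = (x.mean() - y.mean()) / np.sqrt(sp2 * 2.0 / n)
        hits += abs(t) > crit
    return hits / reps

def classify(p_reject, alpha=0.05, large=0.8, tol=0.015):
    # `large` and `tol` are hypothetical choices, not values from the talk.
    if p_reject <= alpha + tol:
        return "effective H0"
    if p_reject >= large:
        return "effective H1"
    return "grey area"

def shifted_normal(shift):
    return lambda rng, n: (rng.normal(0.0, 1.0, n), rng.normal(shift, 1.0, n))

def uniform_pair(rng, n):
    # Not normal at all, but both groups have the same distribution.
    return rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)

cases = {
    "normal, no shift": shifted_normal(0.0),
    "uniform, no shift": uniform_pair,
    "normal, shift 0.3": shifted_normal(0.3),
    "normal, shift 1.0": shifted_normal(1.0),
}
p_reject = {name: estimate_p_reject(draw) for name, draw in cases.items()}
labels = {name: classify(p) for name, p in p_reject.items()}
for name in cases:
    print(f"{name:18s} P(R) ~ {p_reject[name]:.3f} -> {labels[name]}")
```

The classification depends only on P(R), not on whether P belongs to the nominal normal family, which is the sense in which the effective hypotheses are nonparametric.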
  • 70. This amounts to understanding whether $T = \frac{\bar{X} - \bar{Y}}{S_p/\sqrt{n}}$, as an aggregation of the information in the data, is “interpretatively correct”, i.e., whether the effective H0/H1 correspond well to the interpretative H0/H1. We need to understand properties of $\bar{X}$, $\bar{Y}$, and $S_p$ such as breakdown under gross outliers and behaviour under skewness. Statisticians tend to think of these statistics as optimal under certain models, but they have a data-analytic meaning on top of that, and this is crucial to understand for use in inference without taking the model for granted.
  • 71. 5. Some further issues. Robust and nonparametric methods: Good options, but not always better in line with interpretative hypotheses.
  • 74. Binary thinking. “Data are either compatible with model or not?” Actually it’s gradual, reflected by the p-value. It makes a difference whether $p = 10^{-6}$ or p = 0.035. It does not make much of a difference whether p = 0.45 or p = 0.75.
  • 77. Decision thresholds? Sometimes decisions have to be made. Concepts such as error probabilities, false discovery rates, and replication rely on thresholds. Interpretation in language is essentially discrete and implicitly requires thresholds. It seems hard to swallow that thresholds are essentially arbitrary, yet are needed!
  • 78. Multiple testing increases the probability of “false positives”. Methods can mathematically control the overall type I error probability (Bonferroni) or the “false discovery rate” (Benjamini-Yekutieli). When to control how?
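The two corrections named on this slide can be sketched in a few lines. This is a hand-written illustration of the Bonferroni adjustment and of the Benjamini-Yekutieli step-up adjustment (with the harmonic-sum factor for arbitrary dependence); the p-values are made up, and the function names are hypothetical rather than taken from any package.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted p-values (false discovery rate under dependence)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)  # harmonic sum 1 + 1/2 + ... + 1/m
    scaled = p[order] * m * c_m / ranks
    # Enforce monotonicity, working down from the largest p-value, then cap at 1.
    adj_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted
    return adjusted

pvals = [0.041, 0.001, 0.6, 0.008, 0.039]  # made-up example values
adj_bonf = bonferroni(pvals)
adj_by = benjamini_yekutieli(pvals)

for p, b, y in zip(pvals, adj_bonf, adj_by):
    print(f"raw p = {p:.3f}   Bonferroni = {b:.3f}   Benjamini-Yekutieli = {y:.3f}")
```

The code only shows how each adjustment is computed. Which family of tests to adjust over, and which error rate to control, is exactly the question that the mathematics leaves open.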
  • 80. “In our study we run k tests (of several kinds). How should we adjust our p-values for multiple testing?” “k research groups run the same k tests and publish the results in k papers. Should they adjust for multiple testing in the same way?”
  • 82. Mathematics doesn’t address this, and there is no unique objective answer! The problem is binary thinking. Researchers want to know whether their results are ultimately significant discoveries or not, so they feel there should be an unambiguous rule.
  • 83. But once more, there is no way around judgment. Even with multiple tests, an individual test with p = 0.045 indicates a tendency against the specific H0, if a quite weak one. Multiple testing corrections trade off assurance against false positives against power.
  • 87. 6. Overview. Message 1: When working with models, always keep the difference between models and reality in mind. Message 2: Objectivity of mathematics implies a temptation to identify mathematical definitions and results with reality. Idea 1: Model the deviation of reality from the nominal model. Message 3: We are not safe. Dangerous deviations from the model cannot be reliably detected (it makes sense to try though).
  • 89. Idea 2: Interpretative hypotheses: models corresponding to real informal hypotheses; far “bigger” than nominal hypotheses. We need to understand them to understand test performance if the nominal model does not hold. Idea 3: Effective hypotheses: By definition, a parametric test distinguishes nonparametric classes of distributions! Understand how this relates to interpretative hypotheses.
  • 92. Message 4: Beware of binary thinking; use thresholds anyway. Message 5: Robustness considerations are central, but robust/nonparametric methods are not always better. Message 6: There’s no unique objective correction for multiple testing. It needs subjective and context-dependent judgment, as does much if not all of statistics.
  • 94. Experience from statistical advisory work: it is often the most intelligent clients who believe they don’t understand statistics. Why? Because they don’t understand what cannot be understood (e.g., why should we believe in a model?), but which they are made to believe they have to accept. Statistics is hard and can be confusing; if we’re honest, we don’t present it as easy.
  • 95. References
    Bancroft, T. A. (1944) On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics 15, 190-204.
    Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing with dependency. Annals of Statistics 29, 1165-1188.
    Davies, P. L. (2014) Data Analysis and Approximate Models. Chapman and Hall/CRC, New York.
    Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the Royal Statistical Society, Series A: Statistics in Society 180, 967-1033.
    Hampel, F. R. (1998) Is Statistics Too Difficult? The Canadian Journal of Statistics 26, 497-513.
    Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
    Hennig, C. (2010) Mathematical models and reality: A constructivist perspective. Foundations of Science 15, 29-48.
    Hennig, C. (2020) Frequentism-as-model. arXiv:2007.05748.
    Hennig, C. (2021) Parameters not identifiable or distinguishable from data, including correlation between Gaussian observations. arXiv:2108.09227.
    Mayo, D. G. (2018) Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press, Cambridge.
    Shamsudheen, M. I. and Hennig, C. (2021) Should we test the model assumptions before running a model-based test? arXiv:1908.02218.
    Tukey, J. W. (1962) The future of data analysis. Annals of Mathematical Statistics 33, 1-67.
    Tukey, J. W. (1997) More honest foundations for data analysis. Journal of Statistical Planning and Inference 57, 21-28.