
Introduction
Some basics
More understanding helped by mathematics (or not)
Interpretative and effective hypotheses
Some further issues
Overview
On the interpretation of the mathematical
characteristics of statistical tests
Christian Hennig
Christian Hennig Interpretation of tests

1. Introduction
Misunderstanding of statistical tests
and what they can tell us about reality
is a major reason for the current controversy around them.
Is it in the nature of tests to be misunderstood?

I’d say statistical reasoning as a whole
(not only tests, also all proposed alternatives)
is difficult and prone to misinterpretation.

- How mathematical modelling can help with understanding;
- how mathematical modelling can inspire misunderstanding.

Warning: Messages in this talk are ambivalent!
Much of what follows will tell the practitioner:
“There are good reasons to do X,
but X can also go badly wrong.”

What is going on?

Statistical inference is based on mathematical reasoning
in the “model world”.
The model world is essentially different from the real world.
Data connect model world and real world,
but it is far from trivial to understand
what model world results mean for the real world.

“Model-based statistical inference is valid
if and only if the model is true.”

This is misleading!
It’s not the job of models to be “true”.
Models are tools for thinking.

The key idea
Reality is not like the model
2. Some basics of statistical testing
Some data: Comparing course results from two years.
[Figure: two histograms, “Teacher A results” and “Teacher B results”; x-axis: marks out of 100, y-axis: frequency.]

Did the students do substantially better
with one of the teachers?
x̄ = 58.6, ȳ = 56.9, teacher A students do better on average,
but is the difference meaningful?

“How large a difference is too large?”

Key idea: Set up problem in model world!
$X_1, \dots, X_n \sim N(\mu_1, \sigma_1^2)$ i.i.d.,
$Y_1, \dots, Y_m \sim N(\mu_2, \sigma_2^2)$ i.i.d.,
derive t-distribution of
$$T = \frac{\bar X - \bar Y}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}},$$
evaluate t = 0.75, p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2.
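
The model-world calculation above can be sketched in code. The following minimal Python example (standard library only) computes the pooled statistic T and approximates P{|T| ≥ t} under µ1 = µ2 by simulating from the normal model rather than using the t-distribution directly. The function names, the class sizes n = m = 30 and the number of simulations are assumptions for illustration; the slides do not state the class sizes.

```python
import math
import random


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation S_p."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    # Pooled sum of squares around the two group means.
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    s_p = math.sqrt(ss / (n + m - 2))
    return (mean_x - mean_y) / (s_p * math.sqrt(1 / n + 1 / m))


def simulated_p_value(t_obs, n, m, n_sim=2000, seed=1):
    """Approximate P{|T| >= |t_obs|} when both samples are i.i.d. N(0, 1).

    Under mu1 = mu2 and equal variances the distribution of T does not
    depend on the common mean and variance, so N(0, 1) is used here.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(m)]
        if abs(pooled_t(x, y)) >= abs(t_obs):
            hits += 1
    return hits / n_sim


# Hypothetical class sizes; the observed t = 0.75 is taken from the slide.
p_hat = simulated_p_value(0.75, n=30, m=30)
print(f"simulated p-value: {p_hat:.2f}")
```

The simulated value should land near the p = 0.45 reported on the slide, up to Monte Carlo error and the unknown true sample sizes.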

p = P{|T| ≥ t} = 0.45 assuming µ1 = µ2.
That’s a big probability!
Observed mean differences like this or bigger
can easily happen given µ1 = µ2.
Data are compatible with µ1 = µ2!

The idea of tests is very elementary.
Set up a mathematical model for the real process,
with µ1 = µ2 corresponding to “no meaningful difference”,
then we check whether |T| is so big
that we wouldn’t expect it to happen
under “no meaningful difference” model.
Elementary general principle
for checking compatibility of data with models!

Now consider the model. . .
$X_1, \dots, X_n \sim N(\mu_1, \sigma_1^2)$ i.i.d.,
$Y_1, \dots, Y_m \sim N(\mu_2, \sigma_2^2)$ i.i.d.
Reality is not like this!

Sometimes issues can be seen from the data.
[Figure: the histograms of the Teacher A and Teacher B results (marks out of 100 against frequency), shown again.]
The Shapiro-Wilk test rejects normality for Teacher B.

Sometimes issues cannot be seen from the data.
Constant correlation.
$X_1, \dots, X_n$ marginally $N(\mu, \sigma^2)$,
$\rho(X_i, X_j) = 0.1 \;\; \forall i \neq j$.
[Figure: two plots of x against observation number (0 to 1000).]
This is pretty bad for inference. . .
but it’s indistinguishable from i.i.d.! (Hennig, 2021)
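
Why such a small correlation is "pretty bad for inference" can be seen from a short calculation under the constant-correlation model stated above. This is a standard derivation added here for illustration; it is not taken from the slides.

```latex
\operatorname{Var}(\bar X)
  = \frac{1}{n^2}\Bigl[\sum_{i=1}^n \operatorname{Var}(X_i)
      + \sum_{i \neq j} \operatorname{Cov}(X_i, X_j)\Bigr]
  = \frac{1}{n^2}\bigl[n\sigma^2 + n(n-1)\rho\sigma^2\bigr]
  = \frac{\sigma^2}{n}\bigl[1 + (n-1)\rho\bigr],
\qquad
E(S^2) = \sigma^2 (1 - \rho).
```

With, say, n = 1000 and ρ = 0.1, the variance of the mean is 100.9 times the i.i.d. value σ²/n, whereas the sample variance S² even slightly underestimates σ². The usual standard error is therefore far too small.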

Some correlation between students in same class
is actually realistic,
as they communicate and learn together.
But unless we have information about individual behaviour,
there is no way to see this from the data.

Sometimes issues can be seen from data
(or background knowledge) but are irrelevant.
E.g., student marks are integer numbers between 0 and 100.
Data sets with only integer numbers between 0 and 100
can never happen under normal distribution!
Normality assumption is routinely made for discrete data
with limited value range.

What if the model is not true?
Neyman-Pearson Optimality
Misinterpretation of mathematics
3. More understanding helped by mathematics (or not)
What happens to our test if the model is not true?
Remember I claimed:
correlation “pretty bad for inference”,
discrete data, limited value range “irrelevant”.
How can I know?

Mathematics (or simulation) can tell us!
We can model deviations from assumed nominal model,
then derive what our method will deliver.
(Even though a modelled deviation from nominal model
isn’t really true either.)

E.g. model data as normal with correlation 0.1,
or discretised normal between 0 and 100,
compute distribution of T.
Does it still have (roughly) same characteristics
as under nominal model?

No (correlation),
approximately yes (discretisation)
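
The two answers can be checked by simulation. The sketch below estimates how often the pooled t-test rejects at the nominal 5% level when both groups follow the same model, so every rejection is a false positive. All specifics are assumptions for illustration: 100 students per class, marks modelled as N(57, 15²) before rounding and clipping, a correlation induced by one component shared within each class, and a hard-coded approximate critical value.

```python
import math
import random

T_CRIT = 1.972  # approx. 97.5% quantile of the t-distribution with 198 d.f.


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / (math.sqrt(ss / (n + m - 2)) * math.sqrt(1 / n + 1 / m))


def sample_iid(rng, n):
    """Nominal model: i.i.d. N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def sample_correlated(rng, n, rho=0.1):
    """Marginally N(0, 1) values with correlation rho between every pair."""
    shared = rng.gauss(0.0, 1.0)  # component shared by the whole class
    return [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n)]


def sample_marks(rng, n, mu=57.0, sigma=15.0):
    """Normal values rounded to integers and clipped to the range 0..100."""
    return [min(100, max(0, round(rng.gauss(mu, sigma)))) for _ in range(n)]


def rejection_rate(sampler, n=100, n_sim=1000, seed=1):
    """Share of simulated data sets (same model in both groups) with |T| > T_CRIT."""
    rng = random.Random(seed)
    rejections = sum(
        abs(pooled_t(sampler(rng, n), sampler(rng, n))) > T_CRIT
        for _ in range(n_sim)
    )
    return rejections / n_sim


rate_iid = rejection_rate(sample_iid)
rate_corr = rejection_rate(sample_correlated)
rate_marks = rejection_rate(sample_marks)
print(rate_iid, rate_corr, rate_marks)
```

Under these assumptions the i.i.d. and discretised cases should stay close to the nominal 5%, while the correlated case should reject far more often.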

- “If truth is close to the assumed model,
distribution of T will be close to assumed.”

Not necessarily!
And depends on formal definition of “close”.
E.g., gross error model $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$,
with $x$ very far from $\mu$.

- “If data look like typical data generated
from assumed model,
distribution of T will be close to assumed.”

Not necessarily (e.g., correlation model above).

- “If assumed model is clearly violated,
distribution of T will be very different from assumed.”

Not necessarily either (e.g., Central Limit Theorem).

Need to understand which violations of assumed model
lead to problems, and which don’t.
(Standard misspecification testing isn’t always good at that;
Bancroft 1944, Shamsudheen & Hennig 2021)
Need to look at data, but also background information
to know potential issues that data won’t show.

Given a testing problem like H0 : µ1 = µ2 above,
what is the best way to construct a test?
NP: Define alternative hypothesis, optimise power against it.

“Non-rejection indicates the H0,
rejection indicates the alternative.”

I’m afraid not!

Being a model, the alternative can’t be true either.
The alternative is a device to guide test construction
via enabling the optimality statement.
This is a clever and sensible idea,
but in a real situation we need to question the model.

John W. Tukey (1962): “Danger only comes from mathematical
optimisation when the results are taken too seriously. It offers
guidance, not the answer”
Optimal test is good only if it is good for a wider range
of situations than the one where it’s optimal.
Non-optimal tests can be preferable
if robust for a larger class of models of interest.

Mathematical statements are proved, uncontroversial,
“objective”.
Objectivity is a key aim of science!
Temptation to identify reality with mathematics,
and to take mathematics as saying more about reality (science)
than it actually does.

Mathematics does not say how reality really is,
neither does it say what a scientist should do!
Mathematics characterises methods;
what to make of the characteristics is context-dependent.

Mathematics: Test T is optimal for testing null hypothesis
against alternative in a specific model setup.
Misinterpretation 1: Test T is optimal in reality.

Misinterpretation 2: Either null hypothesis or alternative is true
in reality.

Misinterpretation 3: We have to make sure the model is true.

Misinterpretation 4: As the model is not true anyway, the test is
not informative.

Mathematics: Optimality/good performance of test T is
assured for a binary decision problem.
Misinterpretation: Binary decisions should be made in science.

Mathematics: Test T can reject a model of “no effect” against
an alternative model of effect.
Misinterpretation 1: It is necessary to reject “no effect”.

Misinterpretation 2: It is sufficient to reject “no effect”.

Extending hypotheses to non-nominal models
What do tests actually do?
4. Interpretative and effective hypotheses
Inference target parameter is defined in “model world”;
but we’re interested in real world.
µ1, µ2 are thought constructs
defined within the normal model.
The real hypothesis of interest is about whether
one of the teachers gives systematically higher marks.
There’s no i.i.d., and no distribution shape implied.

If we’re curious about how test performs
if nominal model doesn’t hold (e.g., error probabilities),
we need to define what an “error” is, i.e.,
when we should reject.
This is normally only defined within nominal model!
Amounts to deciding what parameters belong to
“interpretative H0/interpretative alternative”.

Interpretative H0/H1: All distributions that model
real (unformalised) null/alternative hypothesis of interest.
Promote awareness that real hypotheses are informal
and could be modelled by many distributions.

E.g., Beta-distributions on scale between 0 and 100:
[Figure: Beta densities plotted against x on the scale from 0 to 100.]

Means are same:
[Figure: the Beta densities on the scale from 0 to 100, with the common mean E(X) = E(Y) marked.]

Medians are different - what is relevant to us?
[Figure: the Beta densities on the scale from 0 to 100, with the medians Med(X) and Med(Y) marked.]

Test based on means will likely not reject H0,
test based on medians will likely reject.
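
A small numerical illustration of equal means with different medians follows. The parameters of the densities in the figure are not recoverable from the slides, so the sketch uses two hypothetical Beta distributions, Beta(1, 3) and Beta(1/3, 1), rescaled to 0..100. They are chosen only because their medians have closed forms and because a·b = 1 makes the two means equal.

```python
def beta_mean(a, b, scale=100.0):
    """Mean of a Beta(a, b) distribution rescaled to [0, scale]."""
    return scale * a / (a + b)


def median_beta_1_b(b, scale=100.0):
    """Median of Beta(1, b), whose CDF is 1 - (1 - x)**b."""
    return scale * (1.0 - 0.5 ** (1.0 / b))


def median_beta_a_1(a, scale=100.0):
    """Median of Beta(a, 1), whose CDF is x**a."""
    return scale * 0.5 ** (1.0 / a)


# Hypothetical choices, not the densities shown in the figure.
mean_x, med_x = beta_mean(1.0, 3.0), median_beta_1_b(3.0)
mean_y, med_y = beta_mean(1.0 / 3.0, 1.0), median_beta_a_1(1.0 / 3.0)

print(mean_x, mean_y)  # both means are 25, up to rounding error
print(med_x, med_y)    # the medians differ
```

A test built on means targets a difference that does not exist here, whereas a test built on medians targets one that does; which of the two matches the real question is not something the computation can settle.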

E.g., $0.99\,N(\mu, \sigma^2) + 0.01\,\delta_x$
[Figure: “Gross error model”, density against x; x-axis ticks from −4 to 4 and at 1000.]
Are we interested in. . .
- E(X) = 0.99µ + 0.01x (potentially far from µ),
- or µ,
- or maybe the median?
This needs judgment
- data cannot decide this, neither can mathematics!
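
How far apart the three candidate targets can be is easy to see on a toy data set. The sketch below builds a hypothetical sample of 99 clean values centred at µ = 0 plus one gross outlier at x = 1000, mimicking the 0.99/0.01 mixture weights; the data and the helper name are made up for illustration.

```python
import statistics


def gross_error_sample(clean_values, outlier, n_outliers):
    """Clean observations plus a few copies of one gross outlier."""
    return list(clean_values) + [outlier] * n_outliers


# Hypothetical data: 99 clean values symmetric around mu = 0 and one outlier
# at x = 1000, mimicking the 0.99 / 0.01 mixture weights.
clean = [(i - 49) / 25 for i in range(99)]  # -1.96, ..., 0, ..., 1.96
data = gross_error_sample(clean, outlier=1000.0, n_outliers=1)

sample_mean = statistics.fmean(data)     # about 10 = 0.99 * 0 + 0.01 * 1000
sample_median = statistics.median(data)  # stays near mu = 0
print(sample_mean, sample_median)
```

The mean follows E(X) and is pulled to about 10, while the median stays near µ; which of these is "the" quantity of interest remains a matter of judgment.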

Note that Central Limit Theorem is about estimating E(X),
which may not be in line with interpretative hypothesis,
so what the t-test does based on CLT may be misleading
even though CLT applies.

Interpretative similarity under nominal model:
Tests of point null hypotheses are often criticised
for rejecting H0 for too large n in presence of
substantially meaningless deviations from H0.
This is a problem because test ignores that
parameter values very close to H0 are often
interpretatively more similar to H0 than H1.
(Need to consider effect size, severity etc.
so as not to misinterpret rejection of formal H0.)
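
The large-n effect can be made concrete with a simplified calculation. The sketch below uses a two-sided one-sample z-test with known σ = 1 as a stand-in for the t-test (an assumption made to keep the formula closed-form) and a hypothetical deviation of 0.01 standard deviations from the H0 value.

```python
import math
from statistics import NormalDist


def z_test_power(delta, n, alpha=0.05):
    """Rejection probability of a two-sided one-sample z-test (sigma = 1 known)
    when the true mean deviates from the H0 value by delta."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n)
    # P(Z + shift > z) + P(Z + shift < -z) for standard normal Z.
    return std_normal.cdf(-z + shift) + std_normal.cdf(-z - shift)


# A deviation of 0.01 standard deviations: hypothetical and arguably meaningless.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_test_power(0.01, n), 3))
```

For small n the rejection probability stays near α, but it approaches 1 as n grows, even though the deviation itself has not become any more meaningful.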

What do tests actually do
if we don’t take model assumptions for granted?
Rejection region R ⇒ tests
- “effective H0”: any P for which P(R) ≤ α against
- “effective H1”: any P for which P(R) is large.
This provides a nonparametric definition of a test
that originally might well be parametric.

Note that under P with α < P(R) but P(R) not large,
the test will reject more easily than under H0,
but can’t be expected to reject.
Such distributions are in a “grey area” w.r.t. the test.
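
The three regions can be explored by estimating P(R) for any data-generating distribution P that one can simulate from. The sketch below does this for the pooled t-test with rejection region |T| > c. The cut-off 0.8 for "large", the sample size of 50 per group, the hard-coded critical value and all function names are assumptions for illustration. A Monte Carlo estimate also cannot decide exactly whether P(R) ≤ α, so the classification is only indicative.

```python
import math
import random

C_ALPHA = 1.984  # approx. 97.5% quantile of the t-distribution with 98 d.f.


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / (math.sqrt(ss / (n + m - 2)) * math.sqrt(1 / n + 1 / m))


def estimate_p_reject(sample_x, sample_y, n_sim=1000, seed=1):
    """Monte Carlo estimate of P(R) = P{|T| > C_ALPHA} for data generated by P."""
    rng = random.Random(seed)
    hits = sum(abs(pooled_t(sample_x(rng), sample_y(rng))) > C_ALPHA
               for _ in range(n_sim))
    return hits / n_sim


def classify(p_reject, alpha=0.05, large=0.8):
    """Place P relative to the test; the cut-off `large` is an arbitrary assumption."""
    if p_reject <= alpha:
        return "effective H0"
    if p_reject >= large:
        return "effective H1"
    return "grey area"


def normal_sampler(mu, n=50):
    """Sampler for n i.i.d. N(mu, 1) observations."""
    return lambda rng: [rng.gauss(mu, 1.0) for _ in range(n)]


results = {}
for shift in (0.2, 1.0):  # hypothetical mean differences in standard deviations
    results[shift] = estimate_p_reject(normal_sampler(shift), normal_sampler(0.0))
    print(shift, results[shift], classify(results[shift]))
```

Any sampler can be plugged in, including skewed or contaminated distributions, which is what makes this description of the test nonparametric even though T was derived under the normal model.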

t-test with $T = \frac{\bar X - \bar Y}{S_p/\sqrt{n}}$,
rejecting H0 for |T| > cα
can be interpreted as testing general nonparametric
effective H0 : P is such that P{|T| > cα} ≤ α against
effective H1 : P is such that P{|T| > cα} large.

The key issue then is:
Does definition of T indicate the desired direction
of deviation from the interpretative H0?

Rather than “are the assumptions fulfilled”? (Which they aren’t.)

This amounts to understanding whether $T = \frac{\bar X - \bar Y}{S_p/\sqrt{n}}$
as aggregation of the information in the data
is “interpretatively correct”;
effective H0/H1 correspond well to interpretative H0/H1.
Need to understand properties of X̄, Ȳ, and Sp such as
breakdown under gross outliers,
behaviour under skewness.

Statisticians tend to think of these statistics as
optimal under certain models,
but they have a data analytic meaning on top of it,
and this is crucial to understand for use in inference
without taking model for granted.
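
The data analytic meaning of breakdown under gross outliers can be illustrated with a toy example. The marks below are hypothetical, and the outlier 1e9 is of course not a possible mark; it stands in for an arbitrarily large gross error. The sketch uses the pooled two-sample form of T with equal group sizes.

```python
import math


def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation."""
    n, m = len(x), len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / m
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    return (mean_x - mean_y) / (math.sqrt(ss / (n + m - 2)) * math.sqrt(1 / n + 1 / m))


y = [50, 52, 54, 56, 58]
x_clean = [60, 62, 64, 66, 68]     # clearly higher values than y
x_outlier = [60, 62, 64, 66, 1e9]  # one value replaced by a gross outlier

t_clean = pooled_t(x_clean, y)      # 5, up to rounding error
t_outlier = pooled_t(x_outlier, y)  # close to 1
print(t_clean, t_outlier)
```

The outlier inflates S_p at the same rate as the mean difference, so T is pulled towards 1 and the test no longer signals the clear shift between the groups. This is a property of the statistic as an aggregation of the data, independent of any model under which it is optimal.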

Robust and nonparametric methods
Binary thinking
Multiple testing
5. Some further issues
Robust and nonparametric methods:
good options, but not always better in line
with interpretative hypotheses.

Binary thinking
“Data are either compatible with model or not?”

Actually it’s gradual, reflected by p-value.
It makes a difference whether $p = 10^{-6}$ or $p = 0.035$.
It does not make much of a difference
whether p = 0.45 or p = 0.75.

Decision thresholds?
Sometimes decisions have to be made.
Concepts such as error probabilities,
false discovery rates, replication rely on thresholds.

Interpretation in language is essentially discrete,
implicitly requires thresholds.

It seems hard to swallow
that thresholds are essentially arbitrary,
yet are needed!

Multiple testing
increases probability for “false positives”.
Methods can mathematically control
overall type I error probability (Bonferroni)
or “false discovery rate” (Benjamini-Yekutieli).
When to control how?
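
For reference, the two corrections named above can be sketched in a few lines. This is a minimal illustration with hypothetical p-values and made-up function names, not a recommendation of either method; the Benjamini-Yekutieli version follows the step-up rule with the harmonic-sum correction from the 2001 paper listed in the references.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m (controls the overall type I error probability)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_yekutieli_reject(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate under dependence."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic-sum correction
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= k * alpha / (m * c_m).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the hypotheses belonging to the k_max smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


p_values = [0.001, 0.002, 0.004, 0.015, 0.5]  # hypothetical p-values
print(bonferroni_reject(p_values))
print(benjamini_yekutieli_reject(p_values))
```

On these made-up values the two rules disagree about the fourth p-value, which shows that the choice of what to control changes the conclusions drawn from the same data.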

“In our study we run k tests (of several kinds).
How should we adjust our p-values for multiple testing?”

“k research groups run the same k tests
and publish the results in k papers.
Should they adjust for multiple testing in the same way?”

Mathematics doesn’t address this,
and there is no unique objective answer!

Problem is binary thinking.
Researchers want to know whether their results
are ultimately significant discoveries or not
so they feel there should be an unambiguous rule.

But once more, no way around judgment.
Even with multiple tests, individual test with p = 0.045
indicates a tendency against specific H0,
if quite weak.
Multiple testing corrections trade
assurance against false positives against power.

6. Overview
Message 1 When working with models,
always keep difference
between models and reality in mind.

Message 2 Objectivity of mathematics implies
temptation to identify
mathematical definitions and results with reality.

Idea 1 Model deviation of reality from nominal model.

Message 3 We are not safe.
Dangerous deviations from model
cannot be reliably detected
(it makes sense to try though).

Idea 2 Interpretative hypotheses:
models corresponding to real informal hypotheses;
far “bigger” than nominal hypotheses
Need to understand them
to understand test performance
if nominal model does not hold.

Idea 3 Effective hypotheses:
By definition, parametric test distinguishes
nonparametric classes of distributions!
Understand how this relates
to interpretative hypotheses.

Message 4 Beware of binary thinking,
use thresholds anyway.

Message 5 Robustness considerations are central,
but robust/nonparametric methods are not always
better.

Message 6 There’s no unique objective correction
for multiple testing.
Needs subjective and context-dependent
judgment,
as much if not all of statistics.

Experience from statistical advisory
It’s often the most intelligent clients
who believe they don’t understand statistics.

Why? Because they don’t understand
what cannot be understood
(e.g., why should we believe in a model?),
but what they’re made to believe they have to accept.
Statistics is hard and can be confusing;
if we’re honest, we don’t present it as easy.

References
Bancroft, T. A. (1944) On biases in estimation due to the use of preliminary tests of significance. Annals of
Mathematical Statistics 15, 190-204.
Benjamini, Y., Yekutieli, D. (2001) The control of the false discovery rate in multiple testing with dependency.
Annals of Statistics 29, 1165-1188.
Davies, P. L. (2014) Data Analysis and Approximate Models. Chapman and Hall/CRC, New York
Gelman, A. and Hennig, C. (2017) Beyond subjective and objective in statistics (with discussion). Journal of the
Royal Statistical Society. Series A: Statistics in Society 180, 967-1033.
Hampel, F. R. (1998) Is Statistics Too Difficult? The Canadian Journal of Statistics, 26, 497-513.
Hampel, F. R., Ronchetti E. M., Rousseeuw P. J., Stahel W. A. (1986) Robust Statistics: The Approach Based on
Influence Functions. Wiley, New York
Hennig, C. (2010) Mathematical models and reality: A constructivist perspective. Foundations of Science 15,
29-48.
Hennig, C. (2020) Frequentism-as-model. arXiv:2007.05748
Hennig, C. (2021) Parameters not identifiable or distinguishable from data, including correlation between
Gaussian observations. arXiv:2108.09227
Mayo, D. G. (2018) Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge
University Press, Cambridge
Shamsudheen, M. I. and Hennig, C. (2021) Should we test the model assumptions before running a model-based
test? arXiv:1908.02218.
Tukey, J. W. (1962) The future of data analysis. Annals of Mathematical Statistics 33, 1-67.
Tukey, J. W. (1997) More honest foundations for data analysis. Journal of Statistical Planning and Inference 57,
21-28.
