PHIL 6334 - Probability/Statistics Lecture Notes 6:
An Introduction to Mis-Specification (M-S) Testing
Aris Spanos [Spring 2014]
1 Introduction
The primary objective of empirical modeling is ‘to learn
from data’ about observable stochastic phenomena of interest
using a statistical model Mθ(x). An important precondi-
tion for learning in statistical inference is that the probabilistic
assumptions of Mθ(x) representing the statistical premises,
are valid for the particular data x0.
1.1 Statistical adequacy
The generic form of a statistical model is:
Mθ(x)={f(x; θ), θ∈Θ}, x∈Rⁿ, for θ∈Θ⊂Rᵐ,
where f(x; θ), x∈Rⁿ, denotes the (joint) distribution of the
sample X:=(X₁, X₂, ..., Xₙ).
The link between Mθ(x) and the phenomenon of interest
comes in the form of viewing data x₀:=(x₁, x₂, ..., xₙ) as a
typical realization of the process {Xₜ, t∈N}. The ‘typicality’
of x0 can — and should — be assessed using trenchant Mis-
Specification (M-S) testing.
¥ Statistical adequacy. Testing the validity of the prob-
abilistic assumptions of the statistical model Mθ(x) vis-a-vis
data x0 is of paramount importance in practice because with-
out it the error reliability of inference is at best dubious. Why?
When any of the model assumptions are invalid, the nom-
inal (assumed) error probabilities used to calibrate the
‘reliability’ of inductive inferences are likely to be very dif-
ferent from the actual ones, rendering the inference results
unreliable.
I Rejecting a null hypothesis at a nominal α=.05 when
the actual type I error probability is closer to .90 provides
the surest way to an erroneous inference!
H It is important to note that all statistical methods (frequentist,
Bayesian, nonparametric) rely on an underlying statis-
tical model Mθ(z), and thus they are equally vulnerable to
statistical misspecification.
What goes wrong when Mθ(z) is statistically mis-
specified? Since the likelihood function is defined via the
distribution of the sample:
L(θ; z₀) ∝ f(z₀; θ), θ∈Θ,
an invalid f(x; θ) yields an invalid L(θ; z₀), and hence:
(a) for frequentist inference: incorrect error probabilities
and incorrect fit/prediction measures;
(b) for Bayesian inference: an erroneous posterior:
π(θ|z₀) ∝ π(θ)·L(θ; z₀).
Error statistics proposes a methodology for specifying
(Specification) and validating statistical models by probing
model assumptions (Mis-Specification (M-S) testing),
isolating the sources of departures, and accounting for them
in a respecified model (Respecification), with a view to securing
statistical adequacy. Such a model is then used to probe
the substantive hypotheses of interest.
Model validation plays a pivotal role in providing an
objective scrutiny of the reliability of inductive procedures;
objectivity in scientific inference is inextricably bound up
with the reliability of its methods.
1.2 Misspecification and the unreliability of inference
Before we discuss M-S testing, it is important to see how par-
ticular departures from the model assumptions can affect the
reliability of inference by distorting the nominal error proba-
bilities and rendering them non-ascertainable.
Table 1 - The simple (one parameter) Normal model
Statistical GM: Xₜ = μ + uₜ, t∈N
[1] Normal: Xₜ ∼ N(·, ·)
[2] Constant mean: E(Xₜ)=μ, for all t∈N
[3] Constant variance: Var(Xₜ)=σ² (σ² known), for all t∈N
[4] Independence: {Xₜ, t∈N} - independent process.
To simplify the discussion that follows, let us focus on the
simple Normal (one parameter) model (table 1).
It was shown above that for testing the hypotheses:
H₀: μ=μ₀ vs. H₁: μ > μ₀ (1)
there is an α-level UMP test Tα:={d(X), C₁(α)}:
d(X)=√n(X̄ₙ−μ₀)/σ, C₁(α)={x: d(x) > cα} (2)
where X̄ₙ=(1/n)Σₜ₌₁ⁿXₜ and cα is the threshold rejection value. Given
that:
(i) (X)=
√
(−0)

=0
v N(0 1) (3)
one can evaluate the type I error probability (significance level)
α using:
P(d(X) > cα; H₀ true)=α,
where α denotes the type I error. To evaluate the type II error
probability and the power one needs to know the sampling
distribution of (X) when 0 is false. However, since 0 is
3
false refers to 1 :   0 this evaluation will involve all
values of  greater than 0 (i.e. 10) :
(1) =P((X) ≤ ; =1)
(1) =1−(1)=P((X)  ; =1)
¾
∀(10)
The relevant sampling distribution takes the form:
(ii) (X)=
√
(−0)

=1
v N(1 1) 1=
√
(1−0)

 ∀10
(4)
What is often insufficiently emphasized in statistics text-
books is that the above nominal error probabilities, i.e.
the significance , as well as the power of test  will be
different from the actual error probabilities when any of
the assumptions [1]-[4] are invalid for data x₀. Indeed, such de-
partures are likely to create significant discrepancies between
the nominal and actual error probabilities that often render
inferences based on (2) unreliable.
To illustrate how the nominal and actual error probabilities
can differ when any of the assumptions [1]-[4] are invalid, let
us take the case where the independence assumption [4] is false
for the underlying process {Xₜ, t∈N} and instead:
Corr(Xᵢ, Xⱼ)=ρ, 0 < ρ < 1, for all i≠j, i, j=1, ..., n. (5)
How does such a misspecification affect the reliability of test Tα?
The actual distributions of d(X) under H₀ and H₁ are:
(i)* d(X) ∼ N(0, cₙ(ρ)) under μ=μ₀,
(ii)* d(X) ∼ N(√n(μ₁−μ₀)/σ, cₙ(ρ)) under μ=μ₁, (6)
where cₙ(ρ)=(1+(n−1)ρ) > 1, for 0 < ρ < 1 and n > 1.
How does this change affect the relevant error probabilities?
Example 1. Consider the case: α=.05 (cα=1.645), σ=1,
and n=100. To find the actual type I error probability
we need to evaluate the tail area of the distribution in (i)*
beyond cα=1.645:
α* = P(d(X) > cα; H₀) = P(Z > 1.645/√cₙ(ρ); μ=μ₀),
where Z ∼ N(0, 1). The results in table 2 for different values
of ρ indicate that test Tα has now become ‘unreliable’ because
α* > α. One will apply test Tα thinking that it will reject a
true H₀ only 5% of the time when, in fact, it is much higher.
Table 2 - Type I error of Tα when Corr(Xᵢ, Xⱼ)=ρ
ρ:  .0   .05  .1   .2   .3   .5   .75  .8   .9
α*: .05  .249 .309 .359 .383 .408 .425 .427 .431
The actual power should now be evaluated using:
π*(μ₁) = P(Z > [cα − √n(μ₁−μ₀)/σ]/√cₙ(ρ); μ=μ₁),
giving rise to the results in table 3.
Table 3 - Power π*(μ₁) of Tα when Corr(Xᵢ, Xⱼ)=ρ
ρ      π*(.01) π*(.02) π*(.05) π*(.1)  π*(.2)  π*(.3)  π*(.4)
.0      .061    .074    .121    .258    .637    .911    .991
.05     .262    .276    .318    .395    .557    .710    .832
.1      .319    .330    .364    .422    .542    .659    .762
.3      .390    .397    .418    .453    .525    .596    .664
.5      .414    .419    .436    .464    .520    .575    .630
.8      .431    .436    .449    .471    .515    .560    .603
.9      .435    .439    .452    .473    .514    .556    .598
For small values of μ₁ (.01, .02, .05, .1) the power increases
as ρ → 1, but for larger values of μ₁ (.2, .3, .4) the power
decreases, ruining the ‘probativeness’ of the test! It has become
like a defective smoke alarm that tends to go off when toast
burns, but will not be triggered by real smoke
until the house is fully ablaze; Mayo (1996).
The above example is only indicative of an actual situa-
tion in practice where several of the model assumptions are
often invalid, rendering the unreliability of inference far
more dire than this example might suggest; see Spanos and
McGuirk (2001).
1.3 On the reluctance to validate statistical models
The key reason why model validation is extremely important
is that No trustworthy evidence for or against a sub-
stantive claim (or theory) can be secured on the basis of a
statistically misspecified model.
In light of this, why has model validation been neglected?
There are several reasons, including the following.
(a) Inadequate appreciation of the serious implications of
statistical misspecification for the reliability of inference.
(b) Inadequate understanding of how one can secure statis-
tical adequacy using thorough M-S testing.
(c) Inadequate understanding of M-S testing and its confusion
with N-P testing render it vulnerable to charges like: (i) infi-
nite regress and circularity, and (ii) illicit double-use of data.
(d) There is an erroneous impression that statistical mis-
specification is inevitable since modeling involves abstraction,
simplification and approximation. Hence, the slogan "All
models are wrong, but some are useful" is used as the excuse
for neglecting model validation.
This aphorism is especially pernicious because it confuses two
different aspects of empirical modeling:
(i) the adequacy of the substantive (structural) model Mϕ(z)
(substantive adequacy), vis-a-vis the phenomenon of interest,
(ii) the validity of the (implicit) statistical model Mθ(z) (sta-
tistical adequacy) vis-a-vis the data z₀.
It’s one thing to claim that the structural model Mϕ(z)
is ‘wrong’ in the sense that it is not an exact picture of
reality in a substantive sense, and quite another
to claim that the implicit statistical model Mθ(z) could not
have generated data z₀ because its probabilistic assumptions
are invalid for z₀. In cases where we may arrive at statistically
adequate models, we can learn true things even with idealized
and partial substantive models.
When one imposes the substantive information (theory) on
data z₀ at the outset, by estimating Mϕ(z), the end result is
often a statistically and substantively misspecified model, but
one has no way to delineate the two sources of error:
(a) the inductive premises are invalid, or
(b) the substantive information is inadequate,
and apportion blame with a view to addressing the unreliability
of inference problem.
The key to circumventing this Duhemian ambiguity is to
find a way to disentangle the statistical Mθ(z) from the sub-
stantive premises Mϕ(z). What is often insufficiently appre-
ciated is the fact that behind every substantive model Mϕ(z)
there is (often implicitly) a statistical model Mθ(z) which pro-
vides the inductive premises for the reliability of statistical
inference based on data z₀. The latter is just a set of proba-
bilistic assumptions pertaining to the chance regularities in
data z₀. Statistical adequacy ensures error reliability in the
sense that the actual error probabilities approximate closely
the nominal ones.
2 M-S testing: a first encounter
To get some idea of what M-S testing is all about, let us focus
on a few simple tests to assess assumptions [1]-[4] of the simple
Normal model (table 4).
Table 4 - The simple Normal model
Statistical GM: Xₜ = μ + uₜ, t∈N
[1] Normal: Xₜ ∼ N(·, ·)
[2] Constant mean: E(Xₜ)=μ, for all t∈N
[3] Constant variance: Var(Xₜ)=σ², for all t∈N
[4] Independence: {Xₜ, t∈N} - independent process.
Mis-Specification (M-S) testing differs from Neyman-Pearson
(N-P) testing in several respects, the most important of which
is that the latter is testing within the boundaries of the assumed
statistical model Mθ(x), but the former is testing outside
those boundaries. N-P testing partitions the assumed model
using the parameters as an index. Conceptually, M-S testing
partitions the set P(x) of all possible statistical models that
could have given rise to data x₀ into Mθ(x) and its comple-
ment P(x) − Mθ(x). However, P(x) − Mθ(x) cannot be
expressed in a parametric form and thus M-S testing is more
open-ended than N-P testing.
Fig. 1: N-P testing within Mθ(x) (partitioned by H₀ and H₁);
Fig. 2: M-S testing outside Mθ(x), within P(x).
2.1 Omnibus (nonparametric) M-S tests
2.1.1 The ‘Runs M-S test’ for the IID assumptions [2]-[4]
The hypothesis of interest concerns the ordering of the sam-
ple X:=(X₁, X₂, ..., Xₙ), in the sense that the distribution of
the sample remains the same under any random reordering
of X, i.e.
H₀: f(X₁, X₂, ..., Xₙ; θ)=f(Xᵢ₁, Xᵢ₂, ..., Xᵢₙ; θ),
for any permutation (i₁, i₂, ..., iₙ) of the index (t=1, 2, ..., n).
Step 1: transform the data x₀:=(x₁, x₂, ..., xₙ) into the sequence
of differences (xₜ − xₜ₋₁), t=2, 3, ..., n.
Step 2: replace each (xₜ−xₜ₋₁) > 0 with ‘+’ and each (xₜ−xₜ₋₁) < 0
with ‘-’. A ‘run’ is a segment of the sequence consisting of ad-
jacent identical elements which are preceded and followed by
a different symbol.
The transformation takes the form:
(x₁, ..., xₙ) → {(xₜ−xₜ₋₁), t=2, ..., n} → (+, +, −, +, ···, +, −, −, +) (7)
Step 3: count the number of runs.
Example: the sequence
++ | − | +++ | −− | ++ | −−− | + | − | + | −− | +++++ | − | ···
consists of 12 runs; the first is a run of 2 positive signs, the
second a run of 1 negative sign, etc.
Runs test. One of the simplest runs tests is based on
comparing the actual number of runs R with the number of
runs expected if the data were a realization
of an IID process {Xₜ, t∈N}. The test takes the form:
d(X)=[R−E(R)]/√Var(R), C₁(α)={x: |d(x)| > cα/2}.
Using simple combinatorics with a sample of size n, one can
derive:
E(R)=(2n−1)/3, Var(R)=(16n−29)/90,
and show that the distribution of d(X) for n ≥ 40 is:
d(X)=[R−E(R)]/√Var(R) ≈ N(0, 1) under IID.
Note that this test is insensitive to departures from Nor-
mality because all distributional information has been lost in
the transformation (7).
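A minimal Python sketch of the runs up-and-down test (illustrative; it assumes no ties between consecutive observations, since a zero difference produces neither ‘+’ nor ‘−’):

```python
# Sketch of the runs up-and-down M-S test for the IID assumptions [2]-[4].
import numpy as np
from scipy.stats import norm

def runs_test(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    signs = np.sign(np.diff(x))                   # the '+'/'-' pattern in (7)
    r = 1 + int(np.sum(signs[1:] != signs[:-1]))  # runs = 1 + sign changes
    e_r = (2 * n - 1) / 3                         # E(R) under IID
    v_r = (16 * n - 29) / 90                      # Var(R) under IID
    d = (r - e_r) / np.sqrt(v_r)
    p = 2 * (1 - norm.cdf(abs(d)))                # two-sided N(0,1) p-value
    return r, d, p
```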
Example - exam scores. Let us return to the exam
score data, shown below in both the alphabetical and sitting
arrangement.
Case 1. For the exam scores data arranged in alphabetical
order (fig. 3) we observe the following runs up (₊) and
down (₋), listed by their lengths:
{1,1,4,1,1,3,1,1,1,1,2,1,1,1,1,1,1,2,2,1,1,2,3,1,1}₊
{1,1,3,1,1,2,2,1,2,1,2,2,3,1,2,1,2,1,2,1,2,1,1,1,1}₋ (8)
Hence, the actual number of runs is 50, which is close to the
number of runs expected under IID: (2(70)−1)/3 ≈ 46. Ap-
plying the above runs test yields:
d(x₀)=[50−(2(70)−1)/3]/√[(16(70)−29)/90]=1.053 [.292],
where the p-value is in square brackets. This indicates no
departure from the IID ([2]-[4]) assumptions.
Fig. 3: −alphabetical order Fig. 4: −sitting order
10
Case 2. Consider the scores data ordered according to the
sitting arrangement in figure 4. These data exhibit cycles which
yield the following runs up and down:
{3,2,4,4,1,4,3,6,1,4}₊, {2,2,2,4,3,3,7,4,6,1,3}₋ (9)
The difference between the patterns in (8) and (9) is that
there is more clustering, and thus fewer runs, in the latter case.
The actual number of runs is 21, less than half the number
expected under IID:
d(x₀)=[21−(2(70)−1)/3]/√[(16(70)−29)/90]=−7.276 [.0000],
which clearly indicates strong departures from the IID ([2]-[4])
assumptions.
2.1.2 Kolmogorov’s M-S test for Normality ([1])
The Kolmogorov M-S test assesses the validity of a dis-
tributional assumption under two key conditions:
(i) the data x0:=(1 2  ) can be viewed as a realiza-
tion of a random (IID) sample X:=( 1 2  ), and
(ii) the random variables 1 2   are continuous (not
discrete).
The test relies on the empirical cumulative distribution
function (ecdf):
b()= [no of (12) that do not exceed ]

 ∀∈R
Under (i)-(ii), the ecdf is a strongly consistent estimator of
the cumulative distribution function (cdf): F(x)=P(X ≤ x),
∀x∈R.
The generic hypothesis being tested takes the form:
0: ∗
()=0() ∈R (10)
where ∗
() denotes the true cdf, and 0() the cdf assumed
by the statistical model Mθ(x)
11
Kolmogorov (1933) proposed the distance function:
∆(X)= sup∈R | b() − 0()|
and proved that under (i)-(ii):
lim
→∞
P(
√
∆(X) ≤ )=() for   0 uniformly in  (11)
where () denotes the cdf of the Kolmogorov distribution:
()=1−2
P∞
=1(−1)+1
−222
' 1 − 2 exp(−22
)
Since () is known (approximated), one can define a M-S
test based on the test statistic (X)=
√
∆(X) giving rise
to the p-value:
P((X)  (x0); 0)=(x0)
Example. Applying the Kolmogorov test to the scores
data in fig. 3 yielded:
P((X)  039; 0)=15
which does not indicate any serious departures from the Nor-
mality assumption. The graph below provides a pictorial de-
piction of what this test is measuring in terms of the discrep-
ancies from the line to the observed points.
Fig. 5: Normal probability plot of the exam scores
(Mean 71.69, StDev 13.61, N=70, KS=0.039, p-value > 0.150)
Note that this particular test might be too sensitive to out-
liers because it picks up only the biggest distance!
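For reference, a hedged sketch of this test using scipy (the data file name is hypothetical; note that estimating μ and σ from the data, as the Minitab output above does, strictly takes one outside the Kolmogorov limit distribution, which assumes a fully specified F₀):

```python
# Sketch: Kolmogorov-type test of Normality for the exam scores.
import numpy as np
from scipy.stats import kstest

x = np.loadtxt("scores.txt")                   # hypothetical data file
D, p = kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
print(f"sup-distance = {D:.3f}, p-value = {p:.3f}")
```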
2.1.3 The role of omnibus M-S tests
The key advantage of the above omnibus tests is that they
probe more broadly around Mθ(x) than directional (para-
metric) M-S tests at the expense of lower power. However,
tests with low power are useful in M-S testing because when
they detect a departure, they provide better evidence for its
presence than a test with very high power!
A key weakness of the above omnibus tests is that when
the null hypothesis is rejected, the test does not provide any
information as to the direction of departure. Such information
is needed for the next stage of modeling, that of respecifying
the original model Mθ(x) with a view to account for the sys-
tematic information not accounted for by Mθ(x).
2.2 Directional (parametric) M-S tests
2.2.1 A parametric M-S test for independence ([4])
A general approach to deriving M-S tests is to return to the
original probabilistic assumptions of the process {Xₜ, t∈N}
underlying data x₀:=(x₁, x₂, ..., xₙ), replace one or more
assumptions with more general ones and derive relevant dis-
tance functions using the two statistical Generating Mecha-
nisms (GMs).
In the case of the simple Normal model, the process {Xₜ, t∈N}
is assumed to be NIID. Let us relax the IID assumptions to
Markov dependence and stationarity, which gives rise to the
AutoRegressive (AR(1)) model, based on f(xₜ|xₜ₋₁; θ), whose
statistical GM is:
 = 0 + 1−1 +  vN(0 2
0) ∈N (12)
where 0=(1−1)∈R 1=(1)
(0)
∈(−1 1) 2
0=(0)(1−2
1)∈R+;
13
=() (0)= () (1)=( −1) =1   
Fig. 6: M-S testing by encompassing
The AR(1) parametrically nests (includes as a special case)
the simple Normal model, because when α₁=0:
α₀=μ(1−α₁)|α₁=0=μ, σ₀²=c(0)(1−α₁²)|α₁=0=c(0),
and the AR(1) reduces to the simple Normal:
Xₜ = α₀ + α₁Xₜ₋₁ + uₜ → [α₁=0] → Xₜ = μ + uₜ, t∈N.
This suggests that a way to assess assumption [4] (table 4) is
to test the hypotheses:
H₀: α₁=0 vs. H₁: α₁ ≠ 0 (13)
in the context of the AR(1) model. This gives rise to a
t-type test Tα:={τ(X), C₁(α)}:
τ(X)=(α̂₁−0)/√Var(α̂₁) ≈ St(n−2) under H₀, C₁(α)={x: |τ(x)| > cα},
where:
α̂₁=[Σₜ₌₂ⁿ(Xₜ−X̄)(Xₜ₋₁−X̄)]/[Σₜ₌₂ⁿ(Xₜ₋₁−X̄)²],
Var(α̂₁)=s²/Σₜ₌₂ⁿ(Xₜ₋₁−X̄)²,
s²=(1/(n−2))Σₜ₌₂ⁿ(Xₜ−α̂₀−α̂₁Xₜ₋₁)², α̂₀=X̄(1−α̂₁).
Example. For the data in figure 4, estimating (12) yields
(standard errors in parentheses):
Xₜ = 39.593(7.790) + 0.441(0.106)Xₜ₋₁ + ûₜ, R²=.2, s²=143.42, n=69.
The M-S t-test for (13) yields:
τ(x₀)=(.441/.106)=4.160, p(x₀)=.0000,
indicating a clear departure from assumption [4].
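A minimal sketch of this auxiliary-autoregression t-test using statsmodels (the data file name is hypothetical):

```python
# Sketch: M-S t-test for independence via the auxiliary AR(1) regression (12).
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
y, Z = x[1:], sm.add_constant(x[:-1])          # regress X_t on (1, X_{t-1})
fit = sm.OLS(y, Z).fit()
print(fit.tvalues[1], fit.pvalues[1])          # t-ratio and p-value for alpha_1
```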
It is straightforward to extend the above test to Markov(l)
dependence by estimating the auxiliary regression:
Xₜ=α₀ + Σⱼ₌₁ˡ αⱼXₜ₋ⱼ + uₜ, t∈N, (14)
and testing the coefficient restrictions:
H₀: α₁=α₂= ··· =αₗ=0, for l < (n − 1). (15)
This gives rise to an F-type test, analogous to the Ljung and
Box (1978) test, with one big difference: the estimated coefficients
in (14) can also be assessed individually using t-tests in order
to avoid the large l problem raised above. For the case l =
2, the auxiliary regression is:
Xₜ=α₀ + α₁Xₜ₋₁ + α₂Xₜ₋₂ + uₜ, t∈N, (16)
and the F-test for the joint significance of α₁ and α₂ takes
the form:
F(x)=[(RRSS−URSS)/URSS]·[(n−3)/2] ∼ F(2, n−3) under H₀,
where RRSS=Σₜ₌₁ⁿ(Xₜ−X̄)² denotes the Restricted [re-
strictions α₁=α₂=0 imposed] Residual Sum of Squares,
URSS=Σₜ₌₁ⁿûₜ², with ûₜ=Xₜ−α̂₀−α̂₁Xₜ₋₁−α̂₂Xₜ₋₂, the Unre-
stricted Residual Sum of Squares, and F(2, n−3) denotes the F
distribution with 2 and n − 3 degrees of freedom.
One of the key advantages of this approach is that it can
easily be extended to derive joint M-S tests that assess more
than one assumption.
2.2.2 A parametric M-S test for IID ([2]-[3])
The above t-type parametric test based on the auxiliary au-
toregression (12) can be extended to provide a joint test for
assumptions [2] and [4]: replacing the stationarity assumption
of {Xₜ, t∈N} with mean non-stationarity, E(Xₜ)=β₀+β₁t, gives
rise to a heterogeneous AR(1) model with statistical GM:
Xₜ=δ₀ + δ₁t + α₁Xₜ₋₁ + uₜ, t∈N, (17)
where the trend term δ₁t relates to [2] and the lag term α₁Xₜ₋₁
to [4], with δ₀=β₀(1−α₁)+α₁β₁, δ₁=(1−α₁)β₁, α₁=(c(1)/c(0)),
σ₀²=c(0)(1−α₁²).
The AR(1) with a trend nests the simple Normal model:
Xₜ=δ₀ + δ₁t + α₁Xₜ₋₁ + uₜ → [δ₁=0, α₁=0] → Xₜ = μ + uₜ, t∈N.
This suggests that a way to assess assumptions [2]&[4] (table
4) jointly is to test the hypotheses:
0: 1=0 and 1=0 vs. 1: 16=0 or 16=0
This will give rise to a F-type test :={(X) 1()}:
(X)=RRSS-URSS
URSS
¡−3
2
¢ 0
≈ F(2 −3) 1()={x: (x)}
URSS=
P
=1(−b0−b1−b1−1)2
 RRSS=
P
=1(−)2

where URSS and RRSS denote the Unrestricted and Restricted
Residual Sum of Squares, respectively, and F(2,n-3) denotes
the F distribution with 2 and −3 degrees of freedom.
Example. For the data in figure 4, the restricted and
unrestricted models yielded, respectively (standard errors in
parentheses):
Xₜ = 71.69(1.631) + ûₜ, s²=185.23, n=69,
Xₜ = 38.156(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ûₜ,
s²=144.34, n=69, (18)
where RRSS=26845, URSS=21543, yielding:
F(x₀)=[(26845−21543)/21543]·(67/2)=8.245, p(x₀)=.0006,
indicating a clear departure from the null ([2]&[4]).
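The same F-statistic can be computed directly from the two residual sums of squares; a sketch (same hypothetical data file as above):

```python
# Sketch: joint F-type M-S test for [2]&[4] via the trend + lag regression (17).
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
y = x[1:]
n = len(y)
trend = np.arange(2, len(x) + 1, dtype=float)
Z = sm.add_constant(np.column_stack([trend, x[:-1]]))
urss = sm.OLS(y, Z).fit().ssr                  # unrestricted RSS
rrss = float(np.sum((y - y.mean()) ** 2))      # restricted RSS (constant only)
F = ((rrss - urss) / urss) * ((n - 3) / 2)
print(F)                                       # compare with F(2, n-3)
```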
What is particularly notable about the auxiliary autore-
gression (18) is that a closer look at the t-ratios indicates
that the source of the problem is dependence and not mean
(trend) heterogeneity. The t-ratio of the coefficient of t is statistically
insignificant:
τ(x₀)=(.055/.073)=0.753, p(x₀)=.226,
but the coefficient of −1 is statistically significant:
τ(x₀)=(.434/.107)=4.056, p(x₀)=.0000,
indicating a clear departure from assumption [4], but not from
[2]. This information, which enables one to apportion blame,
cannot be gleaned from the runs test.
An alternative, and preferable, way to specify the
above auxiliary regressions is in terms of the residuals:
ûₜ= (Xₜ − X̄) = (Xₜ − 71.69), t=1, 2, ..., n,
in the sense that the auxiliary regression:
ûₜ = −33.534(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ε̂ₜ,
s²=144.34, n=69, (19)
is a mirror image of (18):
Xₜ = 38.156(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ûₜ,
s²=144.34, n=69, (20)
with identical parameter estimates, apart from the constant
(−33.534=38.156−71.69), which is irrelevant for M-S testing
purposes.
2.2.3 A parametric M-S test for assumptions [3]-[4]
In light of the fact that 2
= ()=(2
 ) one can test the
variance constancy [3] and independence [4] assumptions using
the residuals squared in the context of the auxiliary regression:
b2
 =0 +
[3]
z}|{
1 +
[4]
z }| {
22
−1 +  =1 2  
Using the above data, this gives rise to:
ûₜ² = 295.26(89.43) − 1.035(1.353)t − 0.16(0.14)ûₜ₋₁² + v̂ₜ. (21)
The non-significance of the coefficients c₁ and c₂ indicates no
departures from assumptions [3] and [4].
Note that one could test assumption [3] individually using
the auxiliary regression:
ûₜ² = 243.84(55.30) − 1.728(1.354)t + v̂ₜ, (22)
where the t-test for the coefficient of t yields: 1.728/1.354=1.276 [.206];
the p-value is given in square brackets.
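A sketch of this squared-residual regression (same hypothetical data file as above):

```python
# Sketch: variance auxiliary regression (22), u2_t = c0 + c1*t + v_t.
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
u2 = (x - x.mean()) ** 2                       # squared residuals
t = np.arange(1, len(x) + 1, dtype=float)
fit = sm.OLS(u2, sm.add_constant(t)).fit()
print(fit.tvalues[1], fit.pvalues[1])          # t-test on the trend coefficient
```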
2.2.4 Extending the above auxiliary regression
The auxiliary regression (17), providing the basis of the joint
test for assumptions [2]-[4], can easily be extended to include
higher-order trends (up to order m ≥ 1) and additional lags
(l ≥ 1):
Xₜ = δ₀ + Σᵢ₌₁ᵐ δᵢtⁱ + Σⱼ₌₁ˡ αⱼXₜ₋ⱼ + uₜ, t∈N. (23)
2.2.5 A parametric M-S test for Normality ([1])
An alternative way to test Normality is to use parametric tests
relying on key features of the distribution. An example of this
type of test is the Skewness-Kurtosis test.
A key feature of the Pearson family is that it is specified
using the first four moments. Within this family we can char-
acterize several distributions using the skewness and kurtosis
coefficients:
3=(−())3
³√
 ()
´3  4=(−())4
³√
 ()
´4 
The skewness is the standardized third central moment and
provides a measure of the asymmetry of f(x); the kurtosis is
the standardized fourth central moment and is a measure of
the peakedness in relation to the tails of f(x).
The Normal distribution is characterized within the Pearson
family via the restrictions:
(3=0 4=3) ⇒ ∗
()=() for all ∈R
where ∗
() and () denote the true density and the Normal
density, respectively.
These moments can be used to derive an M-S test for the
Normality assumption [1] (table 4), using the hypotheses:
H₀: α₃=0 and α₄=3 vs. H₁: α₃≠0 or α₄≠3.
The Skewness-Kurtosis test is given by:
SK(X)=(n/6)α̂₃² + (n/24)(α̂₄−3)² ∼ χ²(2) under H₀,
P(SK(X) > SK(x₀); H₀)=p(x₀), (24)
where χ²(2) denotes the chi-square distribution with 2 degrees
of freedom, and:
α̂₃=[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)³]/[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)²]^(3/2),
α̂₄=[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)⁴]/[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)²]².
Example. For the scores data in fig. 3: α̂₃=−.03, α̂₄=2.62:
SK(x₀)=(70/6)(−.03)² + (70/24)(2.62−3)²=.432, p(x₀)=.806,
indicating no departure from the Normality assumption [1].
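A minimal sketch of the SK test in (24):

```python
# Sketch of the Skewness-Kurtosis M-S test for Normality.
import numpy as np
from scipy.stats import chi2

def sk_test(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = x - x.mean()
    s2 = np.mean(z ** 2)                       # (biased) variance, as in (24)
    a3 = np.mean(z ** 3) / s2 ** 1.5           # skewness coefficient
    a4 = np.mean(z ** 4) / s2 ** 2             # kurtosis coefficient
    sk = (n / 6) * a3 ** 2 + (n / 24) * (a4 - 3) ** 2
    return sk, 1 - chi2.cdf(sk, df=2)          # statistic and chi-square(2) p-value
```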
How is this test different from Kolmogorov’s nonparamet-
ric test? Depending on whether α̂₃ ≠ 0 or α̂₄ ≠ 3, one
can conclude whether the underlying distribution f(x) is non-
symmetric or leptokurtic, and that information can be useful
at the respecification stage.
2.3 Simple Normal model: a summary of M-S testing
The first auxiliary regression specifies how departures from
different assumptions might affect the mean:
(i) b=10 +
[2]
z }| {
11 + 122
+
[4]
z }| {
13−1 + 14−2 + 1
0 : 11=12=13=14=0
The second auxiliary regression specifies how departures
from different assumptions might affect the variance:
(ii) b2
 =20 +
[3]
z }| {
21 + 222
+
[4]
z }| {
232
−1 + 242
−2 + 2
0 : 21=22=23=24=0
When NO departures from assumptions [2]-[4] are detected
one can proceed to test the Normality assumption using tests
like the skewness-kurtosis, the Kolmogorov or the Anderson-
Darling. Otherwise, one uses the residuals from auxiliary re-
gression (i) as a basis for a Normality test.
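A sketch of this two-equation battery (the helper name and data file are hypothetical):

```python
# Sketch: joint F-tests for the mean (i) and variance (ii) auxiliary regressions.
import numpy as np
import statsmodels.api as sm

def aux_f_test(u):
    """F-test that the trend (t, t^2) and lag (u_{t-1}, u_{t-2}) coefficients
    are all zero in a regression of u_t on (1, t, t^2, u_{t-1}, u_{t-2})."""
    y = u[2:]
    t = np.arange(3, len(u) + 1, dtype=float)
    Z = sm.add_constant(np.column_stack([t, t ** 2, u[1:-1], u[:-2]]))
    fit = sm.OLS(y, Z).fit()
    r = np.eye(5)[1:]                          # restrictions: all but the constant
    res = fit.f_test(r)
    return float(res.fvalue), float(res.pvalue)

x = np.loadtxt("data.txt")                     # hypothetical data file
u = x - x.mean()                               # residuals from the simple model
print("mean eq. (i):     ", aux_f_test(u))      # probes [2] & [4]
print("variance eq. (ii):", aux_f_test(u ** 2)) # probes [3] & [4]
```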
Example. Consider the casting of two dice data in table 5.
Evaluating the sample mean, variance, skewness and kurtosis
for the dice data yields:
=1

P
=1 =7080 2
= 1
−1
P
=1( − )2
=5993
b3= − 035 b4=2362
20
Table 5 - Observed data on dice casting
3 10 11 5 6 7 10 8 5 11 2 9 9 6 8 4 7 6 5 12
7 8 5 4 6 11 7 10 5 8 7 5 9 8 10 2 7 3 8 10
11 8 9 5 7 3 4 9 10 4 7 4 6 9 7 6 12 8 11 9
10 3 6 9 7 5 8 6 2 9 6 4 7 8 10 5 8 7 9 6
5 7 7 6 12 9 10 4 8 6 5 4 7 8 6 7 11 7 8 3
Fig. 7: t-plot of the dice data
(a) Testing assumptions [2]-[4] using the runs test requires
counting the runs:
+ + - + + + - - + - + + - + - + - - + - + - - + + - + - + - - + - + - + - + + + - + - + - + + + -
+ - + + - - + - + - + - + + - - + - - + - - + + + - + - + - - + + - + - + - + - - - + + - + + - + -
For  ≥ 40, the type I error probability evaluation is based
on:
(X)= −([2−1]3)
√
[16−29]90
[1]-[4]
v N(0 1)
For the above data: =100 =50:
()=(200−1)3=66333  ()=(16(100)−29)90=17456
(X)=72−66333√
17456
=1356 P(|(X)|  1356; )=175
This does not indicate any departure from the IID assump-
tions.
(b) Test the independence assumption [4] using the auxil-
iary regression:
=0 + 1−1 +  =1 2  
21
=7856
(759)
− 103
(101)
−1 + b
(2425)

and the t-test for the significance of 1 yields: (x)=103
101
=1021[310]
where the p-value in square brackets indicates no clear depar-
ture from the Independence assumption; see Spanos (1999),
p. 774.
(c) Test the identically distributed assumptions [2]-[3] using
the auxiliary regression:
=0 + 1 +  =1 2  
=7193
(496)
− 002
(008)
 + b
(2460)

and the t-test for the significance of 1 yields: (x)=0022
0085
=259[793]
where the p-value indicates no departure from the ID assump-
tion; see Spanos (1999), p. 774.
(d) One can test the IID assumptions [2]-[4] jointly using
the auxiliary regression:
=0 + 1 + 2−1 +  =1 2  
=8100
(877)
− 0048
(0086)
 − 103
(101)
−1 + b
(2434)

where the F-test for the joint significance of 1 and 2 i.e.
0: 1=2=0 vs. 1: 16=0 or 26=0
(x)=−

¡−3

¢
=5766−568540
568540
¡96
2
¢
=680[511]
where =
P
=1( −)2
 denote the Restricted Resid-
uals Sum of Squares [the sum of squares of the residuals with
the restrictions imposed], and =
P
=1 b2
  the Unre-
stricted Residuals Sum of Squares [the sum of squares of the
residuals without the restrictions], respectively; note that
(−) is often called the Explained Sum of Squares
(ESS). The p-value in square brackets indicates no departure
from the IID assumptions, confirming the previous M-S test-
ing results.
22
Fig. 8: Histogram of the dice data
(e) Testing the Normality assumption [1] using the SK test
yields:
SK(x₀)=(100/6)(−.035)² + (100/24)(2.362 − 3)²=1.716 [.424].
The p-value indicates no departure from the Normality as-
sumption, but as shown in Spanos (1999), p. 775, this does
not mean that the assumption is valid; the test has very low
power. This is to be expected because the data come from a
discrete triangular distribution with values from 2 to 12, as
shown by the histogram (fig. 8).
Using the more powerful Anderson and Darling (1952) test,
which for the ordered sample X₍₁₎ ≤ ··· ≤ X₍ₙ₎ simplifies to:
A-D(X)= −n − (1/n)Σₜ₌₁ⁿ(2t−1){ln Φ(ẑ₍ₜ₎) + ln[1−Φ(ẑ₍ₙ₊₁₋ₜ₎)]},
where ẑ₍ₜ₎ denotes the standardized ordered observations,
however, provides evidence against Normality:
A-D(x₀)=.772 [.041].
In light of the M-S results in (a)-(e) one needs to replace the
Normality assumption with a triangular discrete distribution
in order to get a more adequate statistical model.
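For reference, scipy implements the Anderson-Darling statistic (critical-value based rather than exact p-values; the data file name is hypothetical):

```python
# Sketch: Anderson-Darling Normality test for the dice data.
import numpy as np
from scipy.stats import anderson

x = np.loadtxt("dice.txt")                     # hypothetical data file
res = anderson(x, dist="norm")
print(res.statistic)                           # compare with the critical values
print(res.critical_values, res.significance_level)
```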
3 Mis-Specification (M-S) testing: a formalization
3.1 The nature of M-S testing
The basic question posed by M-S testing is whether or not the
particular data x₀:=(x₁, x₂, ..., xₙ) constitute a ‘truly typical
realization’ of the stochastic process {Xₜ, t∈N} underlying
the (predesignated) statistical model:
Mθ(x)={f(x; θ), θ∈Θ}, x∈Rⁿ.
Remember also that the primary aim of the frequentist ap-
proach is to learn from data x₀ about the true statistical Data-
Generating Mechanism (DGM): M*(x)={f(x; θ*)}, x∈Rⁿ.
Fig. 9: N-P testing within Mθ(x); Fig. 10: M-S testing outside
Mθ(x), within P(x).
Hence, the primary role of M-S testing is to probe, vis-
a-vis data x₀, for possible departures from Mθ(x) beyond its
boundaries, but within P(x), the set of all possible statistical
models that could have given rise to x₀. In this sense, the
generic form of M-S testing is probing outside Mθ(x):
H₀: f*(x)∈Mθ(x) vs. H̄₀: f*(x)∈[P(x)−Mθ(x)],
where f*(x)=f(x; θ*) denotes the ‘true’ distribution of the
sample.
In contrast, N-P testing is always within the boundaries
of Mθ(x). It presupposes that Mθ(x) is statistically ade-
quate, and its hypotheses are ultimately concerned with learn-
ing from data about the ‘true’ θ, say θ*, that could have
given rise to data x₀. In general, the expression ‘θ* denotes
the true value of θ’ is shorthand for saying that ‘data x₀
constitute a realization of the sample X with distribution
f(x; θ*)’. By defining the partition of Θ=(−∞, ∞) in terms of
Θ₀=(−∞, μ₀] and Θ₁=(μ₀, ∞), and the associated partition of
Mθ(x), M₀(x)={f(x; μ), μ∈Θ₀} and M₁(x)={f(x; μ), μ∈Θ₁},
the hypotheses in (1) can be framed equivalently, but more
perceptively, as:
H₀: f(x; μ*)∈M₀(x) vs. H₁: f(x; μ*)∈M₁(x), x∈Rⁿ.
Indeed, the test statistic d(X)=√n(X̄ₙ−μ₀)/σ for the optimal
(UMP) N-P test is, in essence, the standardized difference be-
tween μ* and μ₀, with μ* replaced by its best estimator X̄ₙ.
The fact that M-S testing probes [P(x)−Mθ(x)] raises
certain technical and conceptual problems pertaining to how
one can operationalize such probing. In practice, one
needs to replace the broad H̄₀ with a more specific opera-
tional H₁. This operationalization has a very wide scope, ex-
tending from vague omnibus (local) to specific directional
(further-reaching) alternatives, like the tests based on the
auxiliary autoregressions and the Skewness-Kurtosis test. In
all cases, however, H₁ does not span H̄₀, and that raises
additional issues, including:
(a) The higher vulnerability of M-S testing to the fallacy of
rejection: (mis)interpreting ‘reject H₀’ [evidence against H₀] as
evidence for the specific H₁. Rejecting the null in an M-S test
provides evidence against the original model Mθ(x), but that
does not imply good evidence for the particular alternative
H₁. Hence, in practice, one should never accept H₁ without
further probing, because doing so would be a classic example
of the fallacy of rejection.
(b) In M-S testing the type II error [accepting the null when
false] is often the more serious of the two errors. This is be-
cause for the type I error [rejecting the null when true] one will
have another chance to correct the error at the respecification
stage. When one, after a battery of M-S tests, erroneously
concludes that M(x) is statistically adequate, one will pro-
ceed to draw inferences oblivious to the fact that the actual
error probabilities might be very different from the nominal
(assumed) ones.
(c) In M-S testing the objective is to probe [P(x)−Mθ(x)]
as exhaustively as possible, using a combination of om-
nibus M-S tests, whose probing is broader but has low
power, and directional M-S tests, whose probing is narrower
but goes much further and has higher power.
(d) Applying several M-S tests in probing the validity of one
or a combination of assumptions does not necessarily increase
the relevant type I error probability because the framing of the
hypotheses of interest renders them different from the multiple
hypothesis testing problem as construed in the N-P framework.
3.2 Respecification
After a reliable diagnosis of the sources of misspecification,
stemming from a reasoned scrutiny of the M-S testing re-
sults as a whole, one needs to respecify the original statisti-
cal model. Tracing the symptoms back to the source enables
one to return to the three-way partitioning, based on the three
types of probabilistic assumptions, and re-partition using more
appropriate reduction assumptions.
4 M-S testing: revisiting methodological issues
In this section we discuss some of the key criticisms of M-
S testing in order to bring out some of the confusions they
conceal.
4.1 Securing the effectiveness/reliability of M-S testing
There are a number of strategies designed to enhance the ef-
fectiveness/reliability of M-S probing and thus render the diagnosis
more reliable.
¥ A most efficient way to probe [P(x)−Mθ(x)] is to con-
struct M-S tests by modifying the original tripartite parti-
tioning that gave rise to Mθ(x) in directions of educated
departures gleaned from Exploratory Data Analysis. This gives
rise to encompassing models or directions of departure, which
enable one to eliminate an infinite number of alternative mod-
els at a time; Spanos (1999). This should be contrasted
with the most inefficient way of doing so: probing
[P(x)−Mθ(x)] one model at a time, Mϕᵢ(x), i=1, 2, .... This
is a hopeless task because there is an infinite number of such
alternative models to probe for and eliminate.
¥ Judicious combinations of omnibus (non-parametric),
directional (parametric) and simulation-based tests, probing
as broadly as possible and upholding dissimilar assumptions.
The interdependence of the model assumptions, stemming
from Mθ(x) being a parametrization of the process {Xₜ, t∈N},
plays a crucial role in the self-correction of M-S testing results.
¥ Astute ordering of M-S tests so as to exploit the in-
terrelationship among the model assumptions with a view to
‘correct’ each other’s diagnosis. For instance, the probabilistic
assumptions [1]-[3] of the Normal, Linear Regression model
(table 8) are interrelated because all three stem from the
assumption of Normality for the vector process {Zₜ, t∈N},
where Zₜ:=(Yₜ, Xₜ) is assumed to be NIID. This information is
also useful in narrowing down the possible alternatives. It is
important to note that the Normality assumption [1] should
be tested last because most of the M-S tests for it assume
that the other assumptions are valid, rendering the results
questionable when that clause is invalid.
¥ Joint M-S tests (testing several assumptions simul-
taneously) designed to avoid ‘erroneous’ diagnoses as well as
minimize the maintained assumptions.
The above strategies enable one to argue with severity
that when no departures from the model assumptions are de-
tected, the model provides a reliable basis for inference, in-
cluding appraising substantive claims (Mayo & Spanos, 2004).
4.2 The infinite regress and circularity charges
The infinite regress charge is often articulated by claiming
that each M-S test relies on a set of assumptions, and thus it
assesses the assumptions of the model Mθ(x) by invoking the
validity of its own assumptions, trading one set of assumptions
with another ad infinitum. Indeed, some go as far as to claim
that this reasoning is often circular because some M-S tests
inadvertently assume the validity of the very assumption they
aim to test!
A closer look at the reasoning underlying M-S testing re-
veals that both charges are misplaced.
¥ First, the scenario used in evaluating the type I error
invokes no assumptions beyond those of Mθ(x), since every
M-S test is evaluated under:
H₀: all the probabilistic assumptions of Mθ(x) are valid.
Moreover, when any one (or more) of the model assumptions
is rejected, the model Mθ(x) as a whole is considered mis-
specified.
Example. In the context of the simple Normal model
(table 6), the runs test is an example of an omnibus M-S test
for assumptions [2]-[4]. The original data, or the residuals, are
replaced with a ‘+’ when the next data point is a move up and
with a ‘−’ when it is a move down. A run is a sub-sequence of one
type (+ or −) immediately preceded and succeeded by an element
of the other type.
For  ≥ 40, the type I error probability evaluation is based
on:
(X)= −([2−1]3)
√
[16−29]90
[1]-[4]
v N(0 1)
It is important to emphasize that the runs test is insensitive
to departures from Normality, and thus the effective scenario
for deriving the type I error is under assumptions [2]-[4].
¥ Second, the power of any M-S test is determined by
evaluating the test statistic under certain forms of departures
from the assumptions being appraised [no circularity], but re-
taining the rest of the model assumptions.
For the runs test, the evaluation of power is based on:
d(X) ∼ N(δ, τ²), δ≠0, τ² > 0, under [1] & ¬([2]-[4]),
where ¬([2]-[4]) denotes specific departures from these assump-
tions considered by the test in question. However, since the
does not have any retained assumptions. One of the advan-
tages of nonparametric tests is that they are insensitive to
departures from certain retained assumptions.
Bottom line: in M-S testing the evaluations under the
null and alternative hypotheses invoke only the model assump-
tions; no additional assumptions are involved. Moreover, the
use of joint M-S tests aims to minimize the number of model
assumptions retained when evaluating under the alternative.
4.3 Illegitimate double-use of data charge
In the context of the error statistical approach it is certainly
true that the same data x0 are being used for two different
purposes:
I (a) to test primary hypotheses in terms of the unknown
parameter(s) θ, and
I (b) to assess the validity of the prespecified model Mθ(x),
but ‘does that constitute an illegitimate double-use of data?’
Mayo (1981) answered that question in the negative, ar-
guing that the original data x0 are commonly remodeled to
r₀=G(x₀), r₀∈Rᵐ, m ≤ n, and thus rendered distinct from x₀
when testing Mθ(x)’s assumptions:
“What is relevant for our purposes is that the data used to
test the probability of heads [primary hypothesis] is distinct from
the data used in the subsequent test of independence [model as-
sumption]. Hence, no illegitimate double use of data is required.”
(Mayo, 1981, p. 195).
Hendry (1995), p. 545, interpreted this statement to mean:
“... following Mayo (1981), diagnostic test information is ef-
fectively independent of the sufficient statistics, so ‘discounting’
for such tests is not necessary.”
Combining these two views offers a more formal answer.
First, (a) and (b) pose very different questions to data x₀,
and second, the probing takes place within vs. outside Mθ(x),
respectively.
Indeed, one can go further to argue that the answers to the
questions posed in (a) and (b) rely on distinct informa-
tion in data x0.
Under certain conditions, the sample can be split into two
components:
X → (S(X), R(X)),
inducing the following reduction in f(x; θ):
f(x; θ)=|J|·f(s; θ)·f(r), ∀(s, r)∈Rᵏ×Rⁿ⁻ᵏ,
where |J| is the Jacobian of the transformation X → (S(X), R(X)),
S(X):=(S₁, ..., Sₖ) is a complete sufficient statistic,
R(X):=(R₁, ..., Rₙ₋ₖ) a maximal ancillary statistic, and
S(X) and R(X) are independent.
What does this reduction mean?
f(x; θ) = |J| · f(s; θ) · f(r), (25)
where f(s; θ) underwrites the primary inference and f(r) the
model validation:
I [a] all primary inferences are based exclusively on f(s; θ),
and
I [b] f(r) can be used to validate Mθ(x) using error prob-
abilities that are free of θ.
Example. For the simple Normal model (table 4), the reduction
holds for S(X):=(X̄ₙ, s²):
X̄ₙ=(1/n)Σₜ₌₁ⁿXₜ, s²=(1/(n−1))Σₜ₌₁ⁿ(Xₜ−X̄ₙ)²,
the minimal sufficient statistic, and R(X)=(v̂₃, ..., v̂ₙ), where:
v̂ₜ=√n(Xₜ−X̄ₙ)/s ∼ St(n−1), t=1, 2, ..., n,
known as the studentized residuals, the maximal ancillary
statistic.
I This explains why M-S testing is often based on the
residuals, and confirms Mayo’s (1981) claim that R(X)=(v̂₃, ..., v̂ₙ)
provides information distinct from S(X), upon which the pri-
mary inferences are based.
The crucial argument for relying on f(r) for model valida-
tion purposes is that the probing for departures from Mθ(x)
is based on error probabilities that do not depend on θ.
Generality of result in (25). This result holds for al-
most all statistical models routinely used in statistical infer-
ence, including the simple Normal, the simple Bernoulli, the
Linear Regression and related models and all statistical mod-
els based on the (natural) Exponential family of distributions,
such as the Normal, exponential, gamma, chi-squared, beta,
Dirichlet, Bernoulli, Poisson, Wishart, geometric, Laplace,
Lévy, log-Normal, Pareto, Weibull, binomial (with fixed num-
ber of trials), multinomial (with fixed number of trials), and
negative binomial (with fixed number of failures) and many
others. Finally, the above result in (25) holds ‘approximately’
in all cases of statistical models whose inference relies on as-
ymptotic Normality.
5 Summary and Conclusions
Approximations, limited data and uncertainty lead to the use of
statistical models in learning from data about phenomena
of interest.
All statistical methods (frequentist, Bayesian, non-
parametric) rely on a prespecified statistical model Mθ(x)
as the primary basis of inference. The sound application and
the objectivity of their methods turn on the validity of these
assumed statistical models for the particular data.
Fundamental aim: How to specify and validate statisti-
cal models.
Unfortunately, model validation has been a neglected
aspect of empirical modeling. At best, one often finds more
of a grab-bag of techniques than a systematic account.
Error statistics attempts to remedy that by proposing a
coherent account of statistical model specification and valida-
tion that puts the entire process on a sounder philosophical
footing (Spanos, 1986, 1999, 2000, 2010; Mayo and Spanos,
2004).
Crucial strengths of frequentist error statistical methods
in this context:
I There is a clear goal to achieve: the statistical model
is sufficiently adequate so that the actual error probabilities
approximate well the nominal ones.
I It supplies a trenchant battery of Mis-Specification (M-S)
tests for model-validation (non-parametric and parametric)
with a view to minimize both types of errors and generate a
reliable diagnosis through self-correction.
I It offers a seamless transition from model validation
to subsequent use in the sense that the same error statistical
reasoning is used.
The focus is on the question: What is the nature and warrant
for frequentist error statistical model specification and validation?
Failing to grasp the correct rationale of M-S testing has
led many to think that merely finding a statistical model that
‘fits’ the data well in some sense is tantamount to showing it
is statistically adequate. It is not!
Minimal Principle of Evidence: if the procedure had
no capacity to uncover departures from a hypothesis H, then
not finding any is poor evidence for H.
Failing to satisfy so minimal a principle leads to models
which, while acceptable according to their own self-scrutiny, are
in fact inadequate and give rise to untrustworthy evi-
dence.
33

More Related Content

PDF
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
PDF
Probability/Statistics Lecture Notes 4: Hypothesis Testing
PDF
Spanos lecture+3-6334-estimation
PDF
Spurious correlation (updated)
PDF
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
PDF
Phil 6334 Mayo slides Day 1
PPTX
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
PPTX
Chap09 hypothesis testing
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Probability/Statistics Lecture Notes 4: Hypothesis Testing
Spanos lecture+3-6334-estimation
Spurious correlation (updated)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Phil 6334 Mayo slides Day 1
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Chap09 hypothesis testing

What's hot (20)

PPT
Fundamentals of Testing Hypothesis
PDF
IN ORDER TO IMPLEMENT A SET OF RULES / TUTORIALOUTLET DOT COM
PDF
Testing as estimation: the demise of the Bayes factor
PPTX
6 estimation hypothesis testing t test
PDF
C2 st lecture 11 the t-test handout
PDF
Discussion of Persi Diaconis' lecture at ISBA 2016
PDF
ISBA 2016: Foundations
PDF
testing as a mixture estimation problem
PPTX
Hypothesis testing
PPTX
Statistical computing2
PPTX
Statistical Inference Part II: Types of Sampling Distribution
PDF
Quantitative Analysis For Management 11th Edition Render Solutions Manual
PPT
Hypothesis Testing
PPT
Chapter11
PPTX
PDF
2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berge...
PDF
Big Data Analysis
PPTX
The siegel-tukey-test-for-equal-variability
PDF
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
Fundamentals of Testing Hypothesis
IN ORDER TO IMPLEMENT A SET OF RULES / TUTORIALOUTLET DOT COM
Testing as estimation: the demise of the Bayes factor
6 estimation hypothesis testing t test
C2 st lecture 11 the t-test handout
Discussion of Persi Diaconis' lecture at ISBA 2016
ISBA 2016: Foundations
testing as a mixture estimation problem
Hypothesis testing
Statistical computing2
Statistical Inference Part II: Types of Sampling Distribution
Quantitative Analysis For Management 11th Edition Render Solutions Manual
Hypothesis Testing
Chapter11
2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berge...
Big Data Analysis
The siegel-tukey-test-for-equal-variability
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
Ad

Similar to An Introduction to Mis-Specification (M-S) Testing (20)

PDF
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
PDF
A. Spanos slides-Ontology & Methodology 2013 conference
PPT
More Statistics
PDF
Probability and basic statistics with R
PPT
Morestatistics22 091208004743-phpapp01
PDF
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
PPTX
statistics assignment help
PPT
chap4_Parametric_Methods.ppt
PDF
hypothesis_testing-ch9-39-14402.pdf
PPTX
ders 5 hypothesis testing.pptx
PPTX
TEST OF SIGNIFICANCE.pptx
PPTX
Statistical tests of significance and Student`s T-Test
PDF
2013.03.26 Bayesian Methods for Modern Statistical Analysis
PDF
2013.03.26 An Introduction to Modern Statistical Analysis using Bayesian Methods
PDF
advanced_statistics.pdf
DOCX
HW1_STAT206.pdfStatistical Inference II J. Lee Assignment.docx
PPTX
Chapter_9.pptx
PDF
V. pacáková, d. brebera
PDF
Advanced Engineering Mathematics (Statistical Techniques - II)
PPTX
Hypothesis Test _One-sample t-test, Z-test, Proportion Z-test
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
A. Spanos slides-Ontology & Methodology 2013 conference
More Statistics
Probability and basic statistics with R
Morestatistics22 091208004743-phpapp01
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
statistics assignment help
chap4_Parametric_Methods.ppt
hypothesis_testing-ch9-39-14402.pdf
ders 5 hypothesis testing.pptx
TEST OF SIGNIFICANCE.pptx
Statistical tests of significance and Student`s T-Test
2013.03.26 Bayesian Methods for Modern Statistical Analysis
2013.03.26 An Introduction to Modern Statistical Analysis using Bayesian Methods
advanced_statistics.pdf
HW1_STAT206.pdfStatistical Inference II J. Lee Assignment.docx
Chapter_9.pptx
V. pacáková, d. brebera
Advanced Engineering Mathematics (Statistical Techniques - II)
Hypothesis Test _One-sample t-test, Z-test, Proportion Z-test
Ad

More from jemille6 (20)

PDF
What is the Philosophy of Statistics? (and how I was drawn to it)
PDF
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
PDF
Severity as a basic concept in philosophy of statistics
PDF
“The importance of philosophy of science for statistical science and vice versa”
PDF
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
PDF
D. Mayo JSM slides v2.pdf
PDF
reid-postJSM-DRC.pdf
PDF
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
PDF
Causal inference is not statistical inference
PDF
What are questionable research practices?
PDF
What's the question?
PDF
The neglected importance of complexity in statistics and Metascience
PDF
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
PDF
On Severity, the Weight of Evidence, and the Relationship Between the Two
PDF
Comparing Frequentists and Bayesian Control of Multiple Testing
PPTX
Good Data Dredging
PDF
The Duality of Parameters and the Duality of Probability
PDF
Error Control and Severity
PDF
The Statistics Wars and Their Causalities (refs)
PDF
The Statistics Wars and Their Casualties (w/refs)
What is the Philosophy of Statistics? (and how I was drawn to it)
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
Severity as a basic concept in philosophy of statistics
“The importance of philosophy of science for statistical science and vice versa”
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
D. Mayo JSM slides v2.pdf
reid-postJSM-DRC.pdf
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Causal inference is not statistical inference
What are questionable research practices?
What's the question?
The neglected importance of complexity in statistics and Metascience
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
On Severity, the Weight of Evidence, and the Relationship Between the Two
Comparing Frequentists and Bayesian Control of Multiple Testing
Good Data Dredging
The Duality of Parameters and the Duality of Probability
Error Control and Severity
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Casualties (w/refs)

Recently uploaded (20)

PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Insiders guide to clinical Medicine.pdf
PPTX
master seminar digital applications in india
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Business Ethics Teaching Materials for college
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PPTX
Cell Structure & Organelles in detailed.
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Classroom Observation Tools for Teachers
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Final Presentation General Medicine 03-08-2024.pptx
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Microbial diseases, their pathogenesis and prophylaxis
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Insiders guide to clinical Medicine.pdf
master seminar digital applications in india
Microbial disease of the cardiovascular and lymphatic systems
O7-L3 Supply Chain Operations - ICLT Program
Module 4: Burden of Disease Tutorial Slides S2 2025
Business Ethics Teaching Materials for college
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Cell Structure & Organelles in detailed.
102 student loan defaulters named and shamed – Is someone you know on the list?
2.FourierTransform-ShortQuestionswithAnswers.pdf
Classroom Observation Tools for Teachers
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
O5-L3 Freight Transport Ops (International) V1.pdf
Renaissance Architecture: A Journey from Faith to Humanism
PPH.pptx obstetrics and gynecology in nursing
Final Presentation General Medicine 03-08-2024.pptx

An Introduction to Mis-Specification (M-S) Testing

  • 1. PHIL 6334 - Probability/Statistics Lecture Notes 6: An Introduction to Mis-Specification (M-S) Testing Aris Spanos [Spring 2014] 1 Introduction The primary objective of empirical modeling is ‘to learn from data’ about observable stochastic phenomena of interest using a statistical model Mθ(x). An important precondi- tion for learning in statistical inference is that the probabilistic assumptions of Mθ(x) representing the statistical premises, are valid for the particular data x0. 1.1 Statistical adequacy The generic form of a statistical model is: Mθ(x)={(x; θ) θ∈Θ} x∈R  for θ∈Θ⊂R   where (x; θ) x∈R  denotes the (joint) distribution of the sample X:=(1  ) The link between Mθ(x) and the phenomenon of interest comes in the form of viewing data x0:=(1  ) as a typ- ical realization of the process { ∈N}. The ‘typicality’ of x0 can — and should — be assessed using trenchant Mis- Specification (M-S) testing. ¥ Statistical adequacy. Testing the validity of the prob- abilistic assumptions of the statistical model Mθ(x) vis-a-vis data x0 is of paramount importance in practice because with- out it the error reliability of inference is at best dubious. Why? When any of the model assumptions are invalid, the nom- inal (assumed) error probabilities used to calibrate the 1
  • 2. ‘reliability’ of inductive inferences are likely to be very dif- ferent from the actual ones, rendering the inference results unreliable. I Rejecting a null hypothesis at a nominal =05 when the actual type I error probability is closer to 90, provides the surest way for an erroneous inference! H It is important to note that all statistical methods (frequentist, Bayesian, nonparametric) rely on an underlying statis- tical model M(z), and thus they are equally vulnerable to statistical misspecification. What goes wrong when Mθ(z) is statistically mis- specified? Since the likelihood function is defined via the distribution of the sample: (; z0) ∝ (x0; θ) θ∈Θ invalid (x; θ) → invalid (; z0)⇒ ⎧ ⎪⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎪⎩ Frequentist inference incorrect error probabilities incorrect fit/prediction measure Bayesisia inference erroneous posterior: (|z0)∝()(; z0) Error statistics proposes a methodology on how to spec- ify (Specification) and validate statistical models by prob- ing model assumptions (Mis-Specification (M-S) test- ing), isolate the sources of departures, and account for them in a respecified model (Respecification) with a view to se- cure statistical adequacy. Such a model is then used to probe the substantive hypotheses of interest. Model validation plays a pivotal role in providing an objective scrutiny of the reliability of inductive procedures; objectivity in scientific inference is inextricably bound up with the reliability of its methods. 2
  • 3. For the hypotheses in (1), there is an α-level UMP test $T_\alpha := \{d(\mathbf{X}),\, C_1(\alpha)\}$ defined by:
  $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma}$, $C_1(\alpha) = \{\mathbf{x} : d(\mathbf{x}) > c_\alpha\}$, (2)
  where $\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t$ and $c_\alpha$ is the threshold rejection value. Given that:
  (i) $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \overset{\mu=\mu_0}{\sim} N(0, 1)$, (3)
  one can evaluate the type I error probability (significance level) α using $P(d(\mathbf{X}) > c_\alpha;\, H_0\ \text{true}) = \alpha$. To evaluate the type II error probability and the power, one needs to know the sampling distribution of $d(\mathbf{X})$ when $H_0$ is false. However, since '$H_0$ is false' refers to $H_1: \mu > \mu_0$, this evaluation will involve all values of μ greater than $\mu_0$ (i.e. $\mu_1 > \mu_0$):
  • 4. $\beta(\mu_1) = P(d(\mathbf{X}) \le c_\alpha;\, \mu=\mu_1)$, $\pi(\mu_1) = 1 - \beta(\mu_1) = P(d(\mathbf{X}) > c_\alpha;\, \mu=\mu_1)$, for all $\mu_1 > \mu_0$.
  The relevant sampling distribution takes the form:
  (ii) $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \overset{\mu=\mu_1}{\sim} N(\delta_1, 1)$, $\delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}$, for all $\mu_1 > \mu_0$. (4)
  What is often insufficiently emphasized in statistics textbooks is that the above nominal error probabilities, i.e. the significance level α as well as the power $\pi(\mu_1)$ of the test, will be different from the actual error probabilities when any of the assumptions [1]-[4] are invalid for data $\mathbf{x}_0$. Indeed, such departures are likely to create significant discrepancies between the nominal and actual error probabilities that often render inferences based on (2) unreliable.
  To illustrate how the nominal and actual error probabilities can differ when any of the assumptions [1]-[4] are invalid, let us take the case where the independence assumption [4] is false for the underlying process $\{X_t,\, t\in\mathbb{N}\}$ and instead:
  $Corr(X_i, X_j) = \rho$, $0 < \rho < 1$, for all $i \ne j$, $i, j = 1, \ldots, n$. (5)
  How does such a misspecification affect the reliability of test $T_\alpha$? The actual distributions of $d(\mathbf{X})$ under $H_0$ and $H_1$ are:
  (i)* $d(\mathbf{X}) \overset{\mu=\mu_0}{\sim} N(0,\, c_n(\rho))$, (ii)* $d(\mathbf{X}) \overset{\mu=\mu_1}{\sim} N\left(\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma},\, c_n(\rho)\right)$, (6)
  where $c_n(\rho) = 1 + (n-1)\rho > 1$ for $0 < \rho < 1$ and $n > 1$. How does this change affect the relevant error probabilities?
  • 5. Example 1. Consider the case α=.05 ($c_\alpha$=1.645), σ=1 and n=100. To find the actual type I error probability we need to evaluate the tail area of the distribution in (i)* beyond $c_\alpha$=1.645:
  $\alpha^* = P(d(\mathbf{X}) > c_\alpha;\, H_0) = P\left(Z > \frac{1.645}{\sqrt{c_n(\rho)}};\, \mu=\mu_0\right)$, where $Z \sim N(0,1)$.
  The results in table 2 for different values of ρ indicate that test $T_\alpha$ has now become 'unreliable' because $\alpha^* \gg \alpha$. One will apply test $T_\alpha$ thinking that it will reject a true $H_0$ only 5% of the time when, in fact, the probability is much higher.

  Table 2 - Type I error of $T_\alpha$ when $Corr(X_i, X_j) = \rho$
  ρ:    .0    .05   .1    .2    .3    .5    .75   .8    .9
  α*:   .05   .249  .309  .359  .383  .408  .425  .427  .431

  The actual power should now be evaluated using:
  $\pi^*(\mu_1) = P\left(Z > \frac{1}{\sqrt{c_n(\rho)}}\left[c_\alpha - \frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}\right];\, \mu=\mu_1\right)$,
  giving rise to the results in table 3.

  Table 3 - Power $\pi^*(\mu_1)$ of $T_\alpha$ when $Corr(X_i, X_j) = \rho$
  ρ      π*(.01)  π*(.02)  π*(.05)  π*(.1)  π*(.2)  π*(.3)  π*(.4)
  0      .061     .074     .121     .258    .637    .911    .991
  .05    .262     .276     .318     .395    .557    .710    .832
  .1     .319     .330     .364     .422    .542    .659    .762
  .3     .390     .397     .418     .453    .525    .596    .664
  .5     .414     .419     .436     .464    .520    .575    .630
  .8     .431     .436     .449     .471    .515    .560    .603
  .9     .435     .439     .452     .473    .514    .556    .598

  For small values of $\mu_1$ (.01, .02, .05, .1) the power increases as ρ → 1, but for larger values of $\mu_1$ (.2, .3, .4) the power decreases, ruining the 'probativeness' of the test!
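  To make the arithmetic behind tables 2 and 3 concrete, here is a minimal sketch (Python, assuming scipy is available; the function names are illustrative, not from the original notes) that evaluates $\alpha^*$ and $\pi^*(\mu_1)$ directly from (i)*-(ii)*:

```python
# A minimal sketch reproducing tables 2-3: the actual type I error and
# power of the test under Corr(X_i, X_j) = rho, using (i)*-(ii)*.
from scipy.stats import norm

C_ALPHA, N, SIGMA = norm.ppf(0.95), 100, 1.0   # alpha = .05, c_alpha ~ 1.645

def c_n(rho, n=N):
    """Variance inflation factor c_n(rho) = 1 + (n-1)*rho."""
    return 1.0 + (n - 1) * rho

def actual_alpha(rho):
    """Actual type I error: P(Z > c_alpha / sqrt(c_n(rho)))."""
    return norm.sf(C_ALPHA / c_n(rho) ** 0.5)

def actual_power(rho, mu1, mu0=0.0):
    """Actual power at mu1: P(Z > [c_alpha - sqrt(n)(mu1-mu0)/sigma]/sqrt(c_n))."""
    delta = N ** 0.5 * (mu1 - mu0) / SIGMA
    return norm.sf((C_ALPHA - delta) / c_n(rho) ** 0.5)

for rho in (0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 0.8, 0.9):
    print(f"rho={rho:4}: alpha* = {actual_alpha(rho):.3f}")
# matches table 2: .05 .249 .309 .359 .383 .408 .425 .427 .431
print(f"power* at mu1=.2, rho=.05: {actual_power(0.05, 0.2):.3f}")  # ~ .557
```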
  • 6. It has become like a defective smoke alarm which has the tendency to go off when burning toast, but will not be triggered by real smoke until the house is fully ablaze; Mayo (1996).
  The above example is only indicative of the actual situation in practice, where several of the model assumptions are often invalid, rendering the reliability of inference a lot more dire than this example might suggest; see Spanos and McGuirk (2001).
  1.3 On the reluctance to validate statistical models
  The key reason why model validation is extremely important is that no trustworthy evidence for or against a substantive claim (or theory) can be secured on the basis of a statistically misspecified model. In light of this, why has model validation been neglected? There are several reasons, including the following.
  (a) Inadequate appreciation of the serious implications of statistical misspecification for the reliability of inference.
  (b) Inadequate understanding of how one can secure statistical adequacy using thorough M-S testing.
  (c) Inadequate understanding of M-S testing and its confusion with N-P testing, which renders it vulnerable to charges like: (i) infinite regress and circularity, and (ii) illicit double-use of data.
  (d) The erroneous impression that statistical misspecification is inevitable since modeling involves abstraction, simplification and approximation. Hence, the slogan "All models are wrong, but some are useful" is used as an excuse for neglecting model validation. This aphorism is especially pernicious because it confuses two different aspects of empirical modeling:
  • 7. (i) the adequacy of the substantive (structural) model $\mathcal{M}_\varphi(\mathbf{z})$ (substantive adequacy) vis-a-vis the phenomenon of interest, and
  (ii) the validity of the (implicit) statistical model $\mathcal{M}_\theta(\mathbf{z})$ (statistical adequacy) vis-a-vis the data $\mathbf{z}_0$.
  It is one thing to claim that the structural model $\mathcal{M}_\varphi(\mathbf{z})$ is 'wrong' in the sense that it is false as an exact picture of reality in a substantive sense, and quite another to claim that the implicit statistical model $\mathcal{M}_\theta(\mathbf{z})$ could not have generated data $\mathbf{z}_0$ because its probabilistic assumptions are invalid for $\mathbf{z}_0$. In cases where we arrive at statistically adequate models, we can learn true things even with idealized and partial substantive models.
  When one imposes the substantive information (theory) on data $\mathbf{z}_0$ at the outset, by estimating $\mathcal{M}_\varphi(\mathbf{z})$, the end result is often a statistically and substantively misspecified model, but one has no way to delineate the two sources of error:
  (a) the inductive premises are invalid, or
  (b) the substantive information is inadequate,
  and apportion blame with a view to addressing the unreliability-of-inference problem. The key to circumventing this Duhemian ambiguity is to find a way to disentangle the statistical premises $\mathcal{M}_\theta(\mathbf{z})$ from the substantive premises $\mathcal{M}_\varphi(\mathbf{z})$. What is often insufficiently appreciated is the fact that behind every substantive model $\mathcal{M}_\varphi(\mathbf{z})$ there is an (often implicit) statistical model $\mathcal{M}_\theta(\mathbf{z})$ which provides the inductive premises for the reliability of statistical inference based on data $\mathbf{z}_0$. The latter is just a set of probabilistic assumptions pertaining to the chance regularities in data $\mathbf{z}_0$. Statistical adequacy ensures error reliability in the sense that the actual error probabilities closely approximate the nominal ones.
  • 8. 2 M-S testing: a first encounter
  To get some idea of what M-S testing is all about, let us focus on a few simple tests to assess assumptions [1]-[4] of the simple Normal model (table 4).

  Table 4 - The simple Normal model
  Statistical GM: $X_t = \mu + u_t$, t∈N
  [1] Normal: $X_t \sim N(\cdot, \cdot)$
  [2] Constant mean: $E(X_t) = \mu$, for all t∈N
  [3] Constant variance: $Var(X_t) = \sigma^2$, for all t∈N
  [4] Independence: $\{X_t,\, t\in\mathbb{N}\}$ is an independent process.

  Mis-Specification (M-S) testing differs from Neyman-Pearson (N-P) testing in several respects, the most important of which is that the latter tests within the boundaries of the assumed statistical model $\mathcal{M}_\theta(\mathbf{x})$, whereas the former tests outside those boundaries. N-P testing partitions the assumed model using the parameters as an index. Conceptually, M-S testing partitions the set $\mathcal{P}(\mathbf{x})$ of all possible statistical models that could have given rise to data $\mathbf{x}_0$ into $\mathcal{M}_\theta(\mathbf{x})$ and its complement $\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})$. However, $\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})$ cannot be expressed in a parametric form, and thus M-S testing is more open-ended than N-P testing.
  [Fig. 1: N-P testing within $\mathcal{M}_\theta(\mathbf{x})$. Fig. 2: M-S testing outside $\mathcal{M}_\theta(\mathbf{x})$, within $\mathcal{P}(\mathbf{x})$.]
  • 9. 2.1 Omnibus (nonparametric) M-S tests
  2.1.1 The 'runs' M-S test for the IID assumptions [2]-[4]
  The hypothesis of interest concerns the ordering of the sample $\mathbf{X} := (X_1, X_2, \ldots, X_n)$, in the sense that the distribution of the sample remains the same under any random reordering of $\mathbf{X}$, i.e.
  $H_0: f(x_1, x_2, \ldots, x_n; \theta) = f(x_{i_1}, x_{i_2}, \ldots, x_{i_n}; \theta)$, for any permutation $(i_1, i_2, \ldots, i_n)$ of the index $(t = 1, 2, \ldots, n)$.
  Step 1: transform data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ into the sequence of differences $(x_t - x_{t-1})$, $t = 2, 3, \ldots, n$.
  Step 2: replace each $(x_t - x_{t-1}) > 0$ with '+' and each $(x_t - x_{t-1}) < 0$ with '-'. A 'run' is a segment of the sequence consisting of adjacent identical elements which are preceded and followed by a different symbol. The transformation takes the form:
  $(x_1, \ldots, x_n) \to \{(x_t - x_{t-1}),\ t = 2, \ldots, n\} \to (+\ +\ -\ +\ \cdots\ +\ -\ -\ +)$. (7)
  Step 3: count the number of runs.
  Example: the sequence ++ | - | +++ | -- | ++ | --- | + | - | + | -- | +++++ | - | ... consists of 12 runs; the first is a run of 2 positive signs, the second a run of 1 negative sign, etc.
  Runs test. One of the simplest runs tests is based on comparing the actual number of runs R with the number of runs expected when the data constitute a realization of an IID process $\{X_t,\, t\in\mathbb{N}\}$. The test takes the form:
  $d(\mathbf{X}) = \frac{[R - E(R)]}{\sqrt{Var(R)}}$, $C_1(\alpha) = \{\mathbf{x} : |d(\mathbf{x})| > c_{\alpha/2}\}$.
  • 10. Using simple combinatorics, for a sample of size n one can derive:
  $E(R) = \frac{2n-1}{3}$, $Var(R) = \frac{16n-29}{90}$,
  and show that the distribution of $d(\mathbf{X})$ for n ≥ 40 is:
  $d(\mathbf{X}) = \frac{[R - E(R)]}{\sqrt{Var(R)}} \overset{\text{IID}}{\approx} N(0, 1)$.
  Note that this test is insensitive to departures from Normality because all distributional information has been lost in the transformation (7).
  Example - exam scores. Let us return to the exam scores data, shown below in both the alphabetical and the sitting arrangement.
  Case 1. For the exam scores data arranged in alphabetical order (fig. 3) we observe the following runs:
  runs up: {1 1 4 1 1 3 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 3 1 1}
  runs down: {1 1 3 1 1 2 2 1 2 1 2 2 3 1 2 1 2 1 2 1 2 1 1 1 1} (8)
  Hence, the actual number of runs is 50, which is close to the number of runs expected under IID: $(2(70)-1)/3 \simeq 46$. Applying the above runs test yields:
  $d(\mathbf{x}_0) = \frac{50 - \left(\frac{2(70)-1}{3}\right)}{\sqrt{\frac{16(70)-29}{90}}} = 1.053\ [.292]$,
  where the p-value is in square brackets. This indicates no departure from the IID ([2]-[4]) assumptions.
  [Fig. 3: scores data in alphabetical order. Fig. 4: scores data in sitting order.]
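  A minimal sketch of this runs test in Python follows, assuming the data sit in a 1-D numpy array; the helper name runs_test is illustrative, and the convention of dropping tied observations (zero differences) is an assumption, since the notes do not discuss ties:

```python
# Runs up-and-down M-S test for the IID assumptions [2]-[4].
import numpy as np
from scipy.stats import norm

def runs_test(x):
    """Return (number of runs, z statistic, two-sided p-value)."""
    signs = np.sign(np.diff(x))
    signs = signs[signs != 0]                   # drop ties (an assumed convention)
    n = len(x)
    runs = 1 + np.sum(signs[1:] != signs[:-1])  # sign changes + 1
    mean_r = (2 * n - 1) / 3                    # E(R) under IID
    var_r = (16 * n - 29) / 90                  # Var(R) under IID
    z = (runs - mean_r) / np.sqrt(var_r)
    return runs, z, 2 * norm.sf(abs(z))

# e.g. for hypothetical scores data: runs, z, p = runs_test(scores)
```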
  • 11. Case 2. Consider the scores data ordered according to the sitting arrangement in figure 4. These data exhibit cycles, which yield the following runs up and down:
  runs up: {3 2 4 4 1 4 3 6 1 4}
  runs down: {2 2 2 4 3 3 7 4 6 1 3} (9)
  The difference between the patterns in (8) and (9) is that there is more clustering, and thus fewer runs, in the latter case. The actual number of runs is 21, less than half of what would be expected under IID:
  $d(\mathbf{x}_0) = \frac{21 - \left(\frac{2(70)-1}{3}\right)}{\sqrt{\frac{16(70)-29}{90}}} = -7.276\ [.0000]$,
  which clearly indicates strong departures from the IID ([2]-[4]) assumptions.
  2.1.2 Kolmogorov's M-S test for Normality ([1])
  The Kolmogorov M-S test assesses the validity of a distributional assumption under two key conditions:
  (i) the data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ can be viewed as a realization of a random (IID) sample $\mathbf{X} := (X_1, X_2, \ldots, X_n)$, and
  (ii) the random variables $X_1, X_2, \ldots, X_n$ are continuous (not discrete).
  The test relies on the empirical cumulative distribution function (ecdf):
  $\hat{F}_n(x) = \frac{[\text{no. of } (X_1, X_2, \ldots, X_n) \text{ that do not exceed } x]}{n}$, for all x∈R.
  Under (i)-(ii), the ecdf is a strongly consistent estimator of the cumulative distribution function (cdf): $F(x) = P(X \le x)$, for all x∈R.
  The generic hypothesis being tested takes the form:
  $H_0: F^*(x) = F_0(x)$, x∈R, (10)
  where $F^*(x)$ denotes the true cdf, and $F_0(x)$ the cdf assumed by the statistical model $\mathcal{M}_\theta(\mathbf{x})$.
  • 12. Kolmogorov (1933) proposed the distance function:
  $\Delta_n(\mathbf{X}) = \sup_{x\in\mathbb{R}} |\hat{F}_n(x) - F_0(x)|$
  and proved that under (i)-(ii):
  $\lim_{n\to\infty} P(\sqrt{n}\,\Delta_n(\mathbf{X}) \le y) = H(y)$, for y > 0, uniformly in y, (11)
  where H(y) denotes the cdf of the Kolmogorov distribution:
  $H(y) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k+1} e^{-2k^2 y^2} \simeq 1 - 2\exp(-2y^2)$.
  Since H(y) is known (approximated), one can define a M-S test based on the test statistic $d(\mathbf{X}) = \sqrt{n}\,\Delta_n(\mathbf{X})$, giving rise to the p-value: $P(d(\mathbf{X}) > d(\mathbf{x}_0);\, H_0) = p(\mathbf{x}_0)$.
  Example. Applying the Kolmogorov test to the scores data in fig. 3 yielded:
  $P(d(\mathbf{X}) > .039;\, H_0) = .15$,
  which does not indicate any serious departures from the Normality assumption. The graph below provides a pictorial depiction of what this test measures in terms of the discrepancies between the fitted line and the observed points.
  [Fig. 5: Normal probability plot of the scores data (Mean = 71.69, StDev = 13.61, N = 70, KS = .039, P-Value > .150).]
  Note that this particular test might be too sensitive to outliers because it picks up only the biggest distance!
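  The Kolmogorov distance and an indicative p-value can be computed with scipy, as in the hedged sketch below. One caveat: estimating μ and σ from the same data (as the plot in fig. 5 does) changes the null distribution of the statistic (the Lilliefors effect), so the p-value returned here is only indicative:

```python
# A minimal sketch of a Kolmogorov-type Normality check, assuming the data
# are in a 1-D numpy array x; kolmogorov_normality is an illustrative name.
import numpy as np
from scipy.stats import kstest

def kolmogorov_normality(x):
    """Return (KS distance, indicative p-value) for H0: X ~ N(mu, sigma^2)."""
    mu, sigma = np.mean(x), np.std(x, ddof=1)   # parameters estimated from x
    return kstest(x, 'norm', args=(mu, sigma))

# e.g. stat, pval = kolmogorov_normality(scores)
```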
  • 13. 2.1.3 The role of omnibus M-S tests
  The key advantage of the above omnibus tests is that they probe more broadly around $\mathcal{M}_\theta(\mathbf{x})$ than directional (parametric) M-S tests, at the expense of lower power. However, tests with low power are useful in M-S testing because when they do detect a departure, they provide better evidence for its presence than a test with very high power!
  A key weakness of the above omnibus tests is that when the null hypothesis is rejected, the test does not provide any information as to the direction of departure. Such information is needed for the next stage of modeling, that of respecifying the original model $\mathcal{M}_\theta(\mathbf{x})$ with a view to accounting for the systematic information not accounted for by $\mathcal{M}_\theta(\mathbf{x})$.
  2.2 Directional (parametric) M-S tests
  2.2.1 A parametric M-S test for independence ([4])
  A general approach to deriving M-S tests is to return to the original probabilistic assumptions of the process $\{X_t,\, t\in\mathbb{N}\}$ underlying data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$, replace one or more assumptions with more general ones, and derive relevant distance functions using the two statistical Generating Mechanisms (GMs). In the case of the simple Normal model, the process $\{X_t,\, t\in\mathbb{N}\}$ is assumed to be NIID. Let us relax the IID assumptions to Markov dependence and stationarity, which gives rise to the AutoRegressive (AR(1)) model, based on $f(x_t | x_{t-1}; \theta)$, whose statistical GM is:
  $X_t = \alpha_0 + \alpha_1 X_{t-1} + u_t$, $u_t \sim \text{NIID}(0, \sigma_0^2)$, t∈N, (12)
  where $\alpha_0 = (1-\alpha_1)\mu \in \mathbb{R}$, $\alpha_1 = \frac{\sigma(1)}{\sigma(0)} \in (-1, 1)$, $\sigma_0^2 = \sigma(0)(1-\alpha_1^2) \in \mathbb{R}_+$, with $\mu = E(X_t)$, $\sigma(0) = Var(X_t)$, $\sigma(1) = Cov(X_t, X_{t-1})$, $t = 1, \ldots, n$.
  • 14. =() (0)= () (1)=( −1) =1    Fig. 6: M-S testing by encompassing The AR(1) parametrically nests (includes as a special case) the simple Normal model because when 1=0 : 0= (1−1)|1=0 = 2 0= (0)(1−2 1) ¯ ¯ 1=0 =(0) the AR(1) reduces to the simple Normal:  = 0 + 1−1 +  1=0 → = +  ∈N This suggests that a way to assess assumption [4] (table 4) is to test the hypotheses: 0: 1=0 vs. 1: 1 6= 0 (13) in the context of the AR(1) model. This will give rise to a t-type test :={(X) 1()}: (X)= (b1−0) √  (b1) 0 ≈ (−2) 1()={x: |(x)|  } where b1= P =1(−)(−1−) P =1(−1−)2   (b1)= 2 P =1(−1−)2  2 = 1 −2 P =1(−b0−b1−1)2  b0=(1−b1) Example. For the data in figure 4, (12) yields: =39593 (7790) + 0441 (0106) −1 + b 2 =2 2 =14342 =69 14
  • 15. The M-S t-test for (13) yields:
  $\tau(\mathbf{x}_0) = \left(\frac{.441}{.106}\right) = 4.160$, $p(\mathbf{x}_0) = .0000$,
  indicating a clear departure from assumption [4].
  It is straightforward to extend the above test to Markov(m) dependence by estimating the auxiliary regression:
  $X_t = \alpha_0 + \sum_{i=1}^{m}\alpha_i X_{t-i} + u_t$, t∈N, (14)
  and testing the coefficient restrictions:
  $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_m = 0$, for $m < (n-1)$. (15)
  This gives rise to an F-type test, analogous to the Ljung and Box (1978) test, with one big difference: the estimated coefficients in (14) can also be assessed individually using t-tests in order to avoid the large m problem raised above. For the case m = 2 the auxiliary regression is:
  $X_t = \alpha_0 + \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + u_t$, t∈N, (16)
  and the F-test for the joint significance of $\alpha_1$ and $\alpha_2$ takes the form:
  $F(\mathbf{x}) = \frac{\text{RRSS} - \text{URSS}}{\text{URSS}}\left(\frac{n-3}{2}\right) \overset{H_0}{\sim} F(2, n-3)$,
  where $\text{RRSS} = \sum_{t=1}^{n}(X_t - \bar{X})^2$ denotes the Restricted Residual Sum of Squares [restrictions $\alpha_1 = \alpha_2 = 0$ imposed], $\text{URSS} = \sum_{t=1}^{n}\hat{u}_t^2$, with $\hat{u}_t = X_t - \hat{\alpha}_0 - \hat{\alpha}_1 X_{t-1} - \hat{\alpha}_2 X_{t-2}$, the Unrestricted Residual Sum of Squares, and F(2, n-3) denotes the F distribution with 2 and n-3 degrees of freedom.
  One of the key advantages of this approach is that it can easily be extended to derive joint M-S tests that assess more than one assumption.
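  A minimal sketch of the auxiliary AR(1) M-S test follows (Python, assuming statsmodels is available; the data array name and the helper ar1_ms_test are illustrative):

```python
# Auxiliary-regression M-S test for independence [4]:
# regress x_t on x_{t-1} and t-test the lag coefficient, as in (12)-(13).
import numpy as np
import statsmodels.api as sm

def ar1_ms_test(x):
    """Fit x_t = a0 + a1*x_{t-1} + u_t; return (a1_hat, t-stat, p-value)."""
    y, ylag = x[1:], x[:-1]
    fit = sm.OLS(y, sm.add_constant(ylag)).fit()
    return fit.params[1], fit.tvalues[1], fit.pvalues[1]

# e.g. a1, tstat, pval = ar1_ms_test(scores_sitting_order)
# a small p-value indicates a departure from independence ([4])
```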
  • 16. 2.2.2 A parametric M-S test for assumptions [2] and [4]
  The above t-type parametric test based on the auxiliary autoregression (12) can be extended to provide a joint test for assumptions [2] and [4]: replacing the stationarity assumption of $\{X_t,\, t\in\mathbb{N}\}$ with mean non-stationarity gives rise to a heterogeneous AR(1) model with statistical GM:
  $X_t = \alpha_0 + \overbrace{\delta_1 t}^{[2]} + \overbrace{\alpha_1 X_{t-1}}^{[4]} + u_t$, t∈N, (17)
  $\alpha_0 = \mu(1-\alpha_1) + \alpha_1\gamma_1$, $\delta_1 = (1-\alpha_1)\gamma_1$, $\alpha_1 = \frac{\sigma(1)}{\sigma(0)}$, $\sigma_0^2 = \sigma(0)(1-\alpha_1^2)$.
  The AR(1) with a trend nests the simple Normal model:
  $X_t = \alpha_0 + \delta_1 t + \alpha_1 X_{t-1} + u_t \overset{\delta_1=0,\ \alpha_1=0}{\longrightarrow} X_t = \mu + u_t$, t∈N.
  This suggests that a way to assess assumptions [2] & [4] (table 4) jointly is to test the hypotheses:
  $H_0: \delta_1 = 0$ and $\alpha_1 = 0$ vs. $H_1: \delta_1 \ne 0$ or $\alpha_1 \ne 0$.
  This gives rise to an F-type test $T_\alpha := \{F(\mathbf{X}),\, C_1(\alpha)\}$:
  $F(\mathbf{X}) = \frac{\text{RRSS} - \text{URSS}}{\text{URSS}}\left(\frac{n-3}{2}\right) \overset{H_0}{\approx} F(2, n-3)$, $C_1(\alpha) = \{\mathbf{x} : F(\mathbf{x}) > c_\alpha\}$,
  $\text{URSS} = \sum_{t=1}^{n}(X_t - \hat{\alpha}_0 - \hat{\delta}_1 t - \hat{\alpha}_1 X_{t-1})^2$, $\text{RRSS} = \sum_{t=1}^{n}(X_t - \bar{X})^2$,
  where URSS and RRSS denote the Unrestricted and Restricted Residual Sum of Squares, respectively, and F(2, n-3) denotes the F distribution with 2 and n-3 degrees of freedom.
  Example. For the data in figure 4, the restricted and unrestricted models yielded, respectively (standard errors in parentheses):
  $x_t = \underset{(1.631)}{71.69} + \hat{u}_t$, $s^2 = 185.23$, n = 69,
  $x_t = \underset{(8.034)}{38.156} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{u}_t$, $s^2 = 144.34$, n = 69, (18)
  • 17. where RRSS = 26845 and URSS = 21543, yielding:
  $F(\mathbf{x}_0) = \left(\frac{26845 - 21543}{21543}\right)\left(\frac{67}{2}\right) = 8.245$, $p(\mathbf{x}_0) = .0006$,
  indicating a clear departure from the null ([2] & [4]).
  What is particularly notable about the auxiliary autoregression (18) is that a closer look at the t-ratios indicates that the source of the problem is dependence and not t-heterogeneity. The t-ratio of the coefficient of t is statistically insignificant:
  $\tau(\mathbf{x}_0) = \left(\frac{.055}{.073}\right) = .753$, $p(\mathbf{x}_0) = .226$,
  but the coefficient of $x_{t-1}$ is statistically significant:
  $\tau(\mathbf{x}_0) = \left(\frac{.434}{.107}\right) = 4.056$, $p(\mathbf{x}_0) = .0000$,
  indicating a clear departure from assumption [4], but not from [2]. This information, which enables one to apportion blame, cannot be gleaned from the runs test.
  An alternative, and preferable, way to specify the above auxiliary regressions is in terms of the residuals:
  $\hat{u}_t = (x_t - \bar{x}) = (x_t - 71.69)$, $t = 1, 2, \ldots, n$,
  in the sense that the auxiliary regression:
  $\hat{u}_t = -\underset{(8.034)}{33.534} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{\varepsilon}_t$, $s^2 = 144.34$, n = 69, (19)
  is a mirror image of (18):
  $x_t = \underset{(8.034)}{38.156} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{\varepsilon}_t$, $s^2 = 144.34$, n = 69, (20)
  with identical parameter estimates apart from the constant (-33.534 = 38.156 - 71.69), which is irrelevant for M-S testing purposes.
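  A minimal sketch of the joint F-type test in (17) follows (Python/statsmodels; the data array name and the helper trend_ar1_ms_test are illustrative). The same pattern carries over directly to the squared-residual regressions of section 2.2.3 below:

```python
# Joint M-S test for constant mean [2] and independence [4]:
# regress x_t on a trend and x_{t-1}, then F-test delta1 = alpha1 = 0.
import numpy as np
import statsmodels.api as sm

def trend_ar1_ms_test(x):
    """Return (F statistic, p-value) for H0: delta1 = alpha1 = 0."""
    y = x[1:]
    t = np.arange(2, len(x) + 1)                 # trend term for obs 2..n
    X = sm.add_constant(np.column_stack([t, x[:-1]]))
    fit = sm.OLS(y, X).fit()
    ftest = fit.f_test("x1 = 0, x2 = 0")         # x1 = trend, x2 = lag
    return ftest.fvalue, ftest.pvalue

# e.g. F, p = trend_ar1_ms_test(scores_sitting_order)
# the individual t-ratios in fit.tvalues help apportion blame between [2] and [4]
```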
  • 18. 2.2.3 A parametric M-S test for assumptions [3]-[4]
  In light of the fact that $\sigma^2 = Var(X_t) = E(u_t^2)$, one can test the variance constancy [3] and independence [4] assumptions using the squared residuals in the context of the auxiliary regression:
  $\hat{u}_t^2 = c_0 + \overbrace{c_1 t}^{[3]} + \overbrace{c_2 \hat{u}_{t-1}^2}^{[4]} + \varepsilon_t$, $t = 1, 2, \ldots, n$.
  Using the above data, this gives rise to (standard errors in parentheses):
  $\hat{u}_t^2 = \underset{(89.43)}{295.26} - \underset{(1.353)}{1.035}\, t - \underset{(.014)}{.016}\, \hat{u}_{t-1}^2 + \hat{\varepsilon}_t$. (21)
  The non-significance of the coefficients $c_1$ and $c_2$ indicates no departures from assumptions [3] and [4]. Note that one could test assumption [3] individually using the auxiliary regression:
  $\hat{u}_t^2 = \underset{(55.30)}{243.84} - \underset{(1.354)}{1.728}\, t + \hat{\varepsilon}_{1t}$, (22)
  where the t-test for the coefficient of t yields: $\frac{1.728}{1.354} = 1.276\ [.206]$; the p-value is given in square brackets.
  2.2.4 Extending the above auxiliary regressions
  The auxiliary regression (17), providing the basis of the joint test for assumptions [2]-[4], can easily be extended to include higher-order trends (up to order $\ell \ge 1$) and additional lags ($m \ge 1$):
  $X_t = \alpha_0 + \sum_{i=1}^{\ell}\delta_i t^i + \sum_{j=1}^{m}\alpha_j X_{t-j} + u_t$, t∈N. (23)
  2.2.5 A parametric M-S test for Normality ([1])
  An alternative way to test Normality is to use parametric tests relying on key features of the distribution. An example of this type of test is the Skewness-Kurtosis test.
  • 19. A key feature of the Pearson family is that it is specified using the first four moments. Within this family we can characterize several distributions using the skewness and kurtosis coefficients:
  $\alpha_3 = \frac{E(X - E(X))^3}{\left(\sqrt{Var(X)}\right)^3}$, $\alpha_4 = \frac{E(X - E(X))^4}{\left(\sqrt{Var(X)}\right)^4}$.
  The skewness is the standardized third central moment and provides a measure of the asymmetry of f(x); the kurtosis is the standardized fourth central moment and is a measure of the peakedness of f(x) in relation to its tails. The Normal distribution is characterized within the Pearson family via the restrictions:
  $(\alpha_3 = 0,\ \alpha_4 = 3) \Rightarrow f^*(x) = \phi(x)$, for all x∈R,
  where $f^*(x)$ and $\phi(x)$ denote the true density and the Normal density, respectively. These moments can be used to derive a M-S test for the Normality assumption [1] (table 4), using the hypotheses:
  $H_0: \alpha_3 = 0$ and $\alpha_4 = 3$ vs. $H_1: \alpha_3 \ne 0$ or $\alpha_4 \ne 3$.
  The Skewness-Kurtosis test is given by:
  $SK(\mathbf{X}) = \frac{n}{6}\hat{\alpha}_3^2 + \frac{n}{24}(\hat{\alpha}_4 - 3)^2 \overset{H_0}{\sim} \chi^2(2)$, $P(SK(\mathbf{X}) > SK(\mathbf{x}_0);\, H_0) = p(\mathbf{x}_0)$, (24)
  where $\chi^2(2)$ denotes the chi-square distribution with 2 degrees of freedom, and:
  $\hat{\alpha}_3 = \frac{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^3}{\left(\sqrt{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^2}\right)^3}$, $\hat{\alpha}_4 = \frac{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^4}{\left(\sqrt{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^2}\right)^4}$.
  • 20. Example. For the scores data in fig. 3, $\hat{\alpha}_3 = -.03$ and $\hat{\alpha}_4 = 2.62$:
  $SK(\mathbf{x}_0) = \frac{70}{6}(-.03)^2 + \frac{70}{24}(-.38)^2 = .432$, $p(\mathbf{x}_0) = .806$,
  indicating no departure from the Normality assumption [1].
  How is this test different from Kolmogorov's nonparametric test? Depending on whether $\hat{\alpha}_3 \ne 0$ or $\hat{\alpha}_4 \ne 3$, one can conclude whether the underlying distribution f(x) is non-symmetric or leptokurtic, and that information can be useful at the respecification stage.
  2.3 Simple Normal model: a summary of M-S testing
  The first auxiliary regression specifies how departures from different assumptions might affect the mean:
  (i) $\hat{u}_t = \gamma_{10} + \overbrace{\gamma_{11} t + \gamma_{12} t^2}^{[2]} + \overbrace{\gamma_{13} x_{t-1} + \gamma_{14} x_{t-2}}^{[4]} + \varepsilon_{1t}$, $H_0: \gamma_{11} = \gamma_{12} = \gamma_{13} = \gamma_{14} = 0$.
  The second auxiliary regression specifies how departures from different assumptions might affect the variance:
  (ii) $\hat{u}_t^2 = \gamma_{20} + \overbrace{\gamma_{21} t + \gamma_{22} t^2}^{[3]} + \overbrace{\gamma_{23} \hat{u}_{t-1}^2 + \gamma_{24} \hat{u}_{t-2}^2}^{[4]} + \varepsilon_{2t}$, $H_0: \gamma_{21} = \gamma_{22} = \gamma_{23} = \gamma_{24} = 0$.
  When NO departures from assumptions [2]-[4] are detected, one can proceed to test the Normality assumption using tests like the Skewness-Kurtosis, the Kolmogorov or the Anderson-Darling. Otherwise, one uses the residuals from auxiliary regression (i) as the basis for a Normality test.
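  Before turning to the dice example, here is a minimal sketch of the Skewness-Kurtosis test in (24) (Python; the helper name sk_test is illustrative). For the scores data of the example above it should return approximately (.432, .806):

```python
# Skewness-Kurtosis M-S test for Normality [1], as in (24).
import numpy as np
from scipy.stats import chi2

def sk_test(x):
    """Return (SK statistic, p-value) for H0: alpha3 = 0 and alpha4 = 3."""
    n = len(x)
    z = x - x.mean()
    s = np.sqrt(np.mean(z**2))                  # MLE standard deviation
    a3 = np.mean(z**3) / s**3                   # sample skewness
    a4 = np.mean(z**4) / s**4                   # sample kurtosis
    sk = (n / 6) * a3**2 + (n / 24) * (a4 - 3)**2
    return sk, chi2.sf(sk, df=2)                # chi-square(2) tail area

# e.g. sk, p = sk_test(scores)
```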
  • 21. Example. Consider the casting-of-two-dice data in table 5. Evaluating the sample mean, variance, skewness and kurtosis for the dice data yields:
  $\bar{x} = \frac{1}{n}\sum_{t=1}^{n}x_t = 7.080$, $s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(x_t - \bar{x})^2 = 5.993$, $\hat{\alpha}_3 = -.035$, $\hat{\alpha}_4 = 2.362$.

  Table 5 - Observed data on dice casting
  3 10 11 5 6 7 10 8 5 11
  2 9 9 6 8 4 7 6 5 12
  7 8 5 4 6 11 7 10 5 8
  7 5 9 8 10 2 7 3 8 10
  11 8 9 5 7 3 4 9 10 4
  7 4 6 9 7 6 12 8 11 9
  10 3 6 9 7 5 8 6 2 9
  6 4 7 8 10 5 8 7 9 6
  5 7 7 6 12 9 10 4 8 6
  5 4 7 8 6 7 11 7 8 3

  [Fig. 7: t-plot of the dice data.]
  (a) Testing assumptions [2]-[4] using the runs test requires counting the runs:
  + + - + + + - - + - + + - + - + - - + - + - - + + - + - + - - + - + - + - + + + - + - + - + + + - + - + + - - + - + - + - + + - - + - - + - - + + + - + - + - - + + - + - + - + - - - + + - + + - + -
  For n ≥ 40, the type I error probability evaluation is based on:
  $d(\mathbf{X}) = \frac{R - [(2n-1)/3]}{\sqrt{[16n-29]/90}} \overset{[1]\text{-}[4]}{\sim} N(0, 1)$.
  For the above data, n = 100 and R = 72:
  $E(R) = (200-1)/3 = 66.333$, $Var(R) = (16(100)-29)/90 = 17.456$,
  $d(\mathbf{X}) = \frac{72 - 66.333}{\sqrt{17.456}} = 1.356$, $P(|d(\mathbf{X})| > 1.356;\, H_0) = .175$.
  This does not indicate any departure from the IID assumptions.
  (b) Test the independence assumption [4] using the auxiliary regression:
  $x_t = \alpha_0 + \alpha_1 x_{t-1} + u_t$, $t = 1, 2, \ldots, n$:
  • 22. =7856 (759) − 103 (101) −1 + b (2425)  and the t-test for the significance of 1 yields: (x)=103 101 =1021[310] where the p-value in square brackets indicates no clear depar- ture from the Independence assumption; see Spanos (1999), p. 774. (c) Test the identically distributed assumptions [2]-[3] using the auxiliary regression: =0 + 1 +  =1 2   =7193 (496) − 002 (008)  + b (2460)  and the t-test for the significance of 1 yields: (x)=0022 0085 =259[793] where the p-value indicates no departure from the ID assump- tion; see Spanos (1999), p. 774. (d) One can test the IID assumptions [2]-[4] jointly using the auxiliary regression: =0 + 1 + 2−1 +  =1 2   =8100 (877) − 0048 (0086)  − 103 (101) −1 + b (2434)  where the F-test for the joint significance of 1 and 2 i.e. 0: 1=2=0 vs. 1: 16=0 or 26=0 (x)=−  ¡−3  ¢ =5766−568540 568540 ¡96 2 ¢ =680[511] where = P =1( −)2  denote the Restricted Resid- uals Sum of Squares [the sum of squares of the residuals with the restrictions imposed], and = P =1 b2   the Unre- stricted Residuals Sum of Squares [the sum of squares of the residuals without the restrictions], respectively; note that (−) is often called the Explained Sum of Squares (ESS). The p-value in square brackets indicates no departure from the IID assumptions, confirming the previous M-S test- ing results. 22
  • 23. [Fig. 8: histogram of the dice data.]
  (e) Testing the Normality assumption [1] using the SK test yields:
  $SK(\mathbf{x}_0) = \frac{100}{6}(-.035)^2 + \frac{100}{24}(2.362 - 3)^2 = 1.716\ [.424]$.
  The p-value indicates no departure from the Normality assumption but, as shown in Spanos (1999), p. 775, this does not mean that the assumption is valid; the test has very low power. This is to be expected because the data come from a discrete triangular distribution with values from 2 to 12, as shown by the histogram (fig. 8). Using the more powerful Anderson and Darling (1952) test, which for the ordered sample $X_{[1]} \le X_{[2]} \le \cdots \le X_{[n]}$ simplifies to:
  $A\text{-}D(\mathbf{X}) = -n - \frac{1}{n}\sum_{t=1}^{n}\left\{(2t-1)\left[\ln F(X_{[t]}) + \ln\left(1 - F(X_{[n+1-t]})\right)\right]\right\}$,
  however, provides evidence against Normality: $A\text{-}D(\mathbf{x}_0) = .772\ [.041]$.
  In light of the M-S results in (a)-(e), one needs to replace the Normality assumption with a discrete triangular distribution in order to arrive at a more adequate statistical model.
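  The Anderson-Darling check can be run on the table 5 data with scipy, as in the sketch below. One caveat: scipy's anderson returns the statistic together with critical values rather than a p-value, so the .041 p-value quoted above (from a different implementation) cannot be reproduced exactly with this call:

```python
# Anderson-Darling Normality check on the dice data from table 5.
import numpy as np
from scipy.stats import anderson

dice = np.array([
    3,10,11,5,6,7,10,8,5,11,  2,9,9,6,8,4,7,6,5,12,
    7,8,5,4,6,11,7,10,5,8,    7,5,9,8,10,2,7,3,8,10,
    11,8,9,5,7,3,4,9,10,4,    7,4,6,9,7,6,12,8,11,9,
    10,3,6,9,7,5,8,6,2,9,     6,4,7,8,10,5,8,7,9,6,
    5,7,7,6,12,9,10,4,8,6,    5,4,7,8,6,7,11,7,8,3])

res = anderson(dice, dist='norm')
print(res.statistic)                       # compare with the 5% critical value
print(dict(zip(res.significance_level, res.critical_values)))
```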
  • 24. 3 Mis-Specification (M-S) testing: a formalization
  3.1 The nature of M-S testing
  The basic question posed by M-S testing is whether or not the particular data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ constitute a 'truly typical realization' of the stochastic process $\{X_t,\, t\in\mathbb{N}\}$ underlying the (predesignated) statistical model:
  $\mathcal{M}_\theta(\mathbf{x}) = \{f(\mathbf{x}; \theta),\ \theta\in\Theta\}$, $\mathbf{x}\in\mathbb{R}^n$.
  Remember also that the primary aim of the frequentist approach is to learn from data $\mathbf{x}_0$ about the true statistical Data-Generating Mechanism (DGM) $\mathcal{M}^*(\mathbf{x}) = \{f(\mathbf{x}; \theta^*)\}$, $\mathbf{x}\in\mathbb{R}^n$.
  [Fig. 9: N-P testing within $\mathcal{M}_\theta(\mathbf{x})$. Fig. 10: M-S testing outside $\mathcal{M}_\theta(\mathbf{x})$, within $\mathcal{P}(\mathbf{x})$.]
  Hence, the primary role of M-S testing is to probe, vis-a-vis data $\mathbf{x}_0$, for possible departures from $\mathcal{M}_\theta(\mathbf{x})$ beyond its boundaries, but within $\mathcal{P}(\mathbf{x})$, the set of all possible statistical models that could have given rise to $\mathbf{x}_0$. In this sense, the generic form of M-S testing probes outside $\mathcal{M}_\theta(\mathbf{x})$:
  $H_0: f^*(\mathbf{x})\in\mathcal{M}_\theta(\mathbf{x})$ vs. $\bar{H}_0: f^*(\mathbf{x})\in[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$,
  where $f^*(\mathbf{x}) = f(\mathbf{x}; \theta^*)$ denotes the 'true' distribution of the sample.
  • 25. In contrast, N-P testing is always within the boundaries of $\mathcal{M}_\theta(\mathbf{x})$. It presupposes that $\mathcal{M}_\theta(\mathbf{x})$ is statistically adequate, and its hypotheses are ultimately concerned with learning from data about the 'true' θ, say $\theta^*$, that could have given rise to data $\mathbf{x}_0$. In general, the expression '$\theta^*$ denotes the true value of θ' is a shorthand for saying that 'data $\mathbf{x}_0$ constitute a realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x}; \theta^*)$'.
  By defining the partition of $\Theta = (-\infty, \infty)$ in terms of $\Theta_0 = (-\infty, \mu_0]$ and $\Theta_1 = (\mu_0, \infty)$, and the associated partition of $\mathcal{M}_\theta(\mathbf{x})$ into $\mathcal{M}_0(\mathbf{x}) = \{f(\mathbf{x}; \mu),\ \mu\in\Theta_0\}$ and $\mathcal{M}_1(\mathbf{x}) = \{f(\mathbf{x}; \mu),\ \mu\in\Theta_1\}$, the hypotheses in (1) can be framed equivalently, but more perceptively, as:
  $H_0: f(\mathbf{x}; \mu^*)\in\mathcal{M}_0(\mathbf{x})$ vs. $H_1: f(\mathbf{x}; \mu^*)\in\mathcal{M}_1(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^n$.
  Indeed, the test statistic $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma}$ for the optimal (UMP) N-P test is, in essence, the standardized difference between $\mu^*$ and $\mu_0$, with $\mu^*$ replaced by its best estimator $\bar{X}_n$.
  The fact that M-S testing probes $[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$ raises certain technical and conceptual problems pertaining to how one can operationalize such probing. In practice, one needs to replace the broad $\bar{H}_0$ with a more specific operational $H_1$. This operationalization has a very wide scope, extending from vague omnibus (local) to specific directional (broader) alternatives, like the tests based on the auxiliary autoregressions and the Skewness-Kurtosis test. In all cases, however, $H_1$ does not span $\bar{H}_0$, and that raises additional issues, including:
  (a) The higher vulnerability of M-S testing to the fallacy of rejection: (mis)interpreting 'reject $H_0$' [evidence against $H_0$] as evidence for the specific $H_1$. Rejecting the null in a M-S test provides evidence against the original model $\mathcal{M}_\theta(\mathbf{x})$, but that does not imply good evidence for the particular alternative $H_1$.
  • 26. Hence, in practice, one should never accept $H_1$ without further probing, because doing so would be a classic example of the fallacy of rejection.
  (b) In M-S testing the type II error [accepting the null when false] is often the more serious of the two errors. This is because, for the type I error [rejecting the null when true], one will have another chance to correct the error at the respecification stage. When one, after a battery of M-S tests, erroneously concludes that $\mathcal{M}_\theta(\mathbf{x})$ is statistically adequate, one will proceed to draw inferences oblivious to the fact that the actual error probabilities might be very different from the nominal (assumed) ones.
  (c) In M-S testing the objective is to probe $[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$ as exhaustively as possible, using a combination of omnibus M-S tests, whose probing is broader but which have low power, and directional M-S tests, whose probing is narrower but goes much further and has higher power.
  (d) Applying several M-S tests in probing the validity of one or a combination of assumptions does not necessarily increase the relevant type I error probability, because the framing of the hypotheses of interest renders them different from the multiple hypothesis testing problem as construed in the N-P framework.
  • 27. 4 M-S testing: revisiting methodological issues In this section we discuss some of the key criticisms of M- S testing in order to bring out some of the confusions they conceal. 4.1 Securing the effectiveness/reliability of M-S testing There are a number of strategies designed to enhance the ef- fectiveness/reliability of M-S probing thus render the diagnosis more reliable. ¥ A most efficient way to probe [P(x)−M(x)] is to con- struct M-S tests by modifying the original tripartite parti- tioning that gave rise to Mθ(x) in directions of educated departures gleaned from Exploratory Data Analysis. This gives rise to encompassing models or directions of departure, which enable one to eliminate an infinite number of alternative mod- els at a time; Spanos (1999). This should be contrasted with a most inefficient way to do this, that involves probing [P(x)−M(x)] one model at a time Mϕ (x) =1 2  This is a hopeless task because there is an infinite number of such alternative models to probe for and eliminate. ¥ Judicious combinations of omnibus (non-parametric), directional (parametric) and simulation-based tests, probing as broadly as possible and upholding dissimilar assumptions. The interdependence of the model assumptions, stemming fromM(x) being a parametrization of the process { ∈N} plays a crucial role in the self-correction of M-S testing results. ¥ Astute ordering of M-S tests so as to exploit the in- terrelationship among the model assumptions with a view to ‘correct’ each other’s diagnosis. For instance, the probabilistic assumptions [1]-[3] of the Normal, Linear Regression model (table 8) are interrelated because all three stem from the 27
  • 28. assumption of Normality for the vector process {Z ∈N} where Z:=( ) assumed to be NIID. This information is also useful in narrowing down the possible alternatives. It is important to note that the Normality assumption [1] should be tested last because most of the M-S tests for it assume that the other assumptions are valid, rendering the results questionable when that clause is invalid. ¥ Joint M-S tests (testing several assumptions simul- taneously) designed to avoid ‘erroneous’ diagnoses as well as minimize the maintained assumptions. The above strategies enable one to argue with severity that when no departures from the model assumptions are de- tected, the model provides a reliable basis for inference, in- cluding appraising substantive claims (Mayo & Spanos, 2004). 4.2 The infinite regress and circularity charges The infinite regress charge is often articulated by claiming that each M-S test relies on a set of assumptions, and thus it assesses the assumptions of the model Mθ(x) by invoking the validity of its own assumptions, trading one set of assumptions with another ad infinitum. Indeed, some go as far as to claim that this reasoning is often circular because some M-S tests inadvertently assume the validity of the very assumption they aim to test! A closer look at the reasoning underlying M-S testing re- veals that both charges are misplaced. ¥ First, the scenario used in evaluating the type I error invokes no assumptions beyond those of Mθ(x), since every M-S test is evaluated under: : all the probabilistic assumptions of Mθ(x) are valid. Moreover, when any one (or more) of the model assumptions 28
  • 29. Example. In the context of the simple Normal model (table 6), the runs test is an example of an omnibus M-S test for assumptions [2]-[4]. The original data, or the residuals, are replaced with a '+' when the next data point is an 'up' and with a '-' when it is a 'down'. A run is a sub-sequence of one type (+ or -) immediately preceded and succeeded by an element of the other type. For n ≥ 40, the type I error probability evaluation is based on:
  $d(\mathbf{X}) = \frac{R - [(2n-1)/3]}{\sqrt{[16n-29]/90}} \overset{[1]\text{-}[4]}{\sim} N(0, 1)$.
  It is important to emphasize that the runs test is insensitive to departures from Normality, and thus the effective scenario for deriving the type I error is under assumptions [2]-[4].
  ¥ Second, the power of any M-S test is determined by evaluating the test statistic under certain forms of departures from the assumptions being appraised [no circularity], while retaining the rest of the model assumptions. For the runs test, the evaluation of power is based on:
  $d(\mathbf{X}) \overset{[1]\,\&\,\overline{[2]\text{-}[4]}}{\sim} N(\delta, \tau^2)$, $\delta \ne 0$, $\tau^2 > 0$,
  where $\overline{[2]\text{-}[4]}$ denotes specific departures from these assumptions considered by the test in question. However, since the test is insensitive to departures from [1], the effective scenario does not involve any retained assumptions. One of the advantages of nonparametric tests is that they are insensitive to departures from certain retained assumptions.
  Bottom line: in M-S testing the evaluations under the null and alternative hypotheses invoke only the model assumptions; no additional assumptions are involved. Moreover, the use of joint M-S tests aims to minimize the number of model assumptions retained when evaluating under the alternative.
  • 30. 4.3 The illegitimate double-use of data charge
  In the context of the error statistical approach it is certainly true that the same data $\mathbf{x}_0$ are being used for two different purposes:
  I (a) to test primary hypotheses in terms of the unknown parameter(s) θ, and
  I (b) to assess the validity of the prespecified model $\mathcal{M}_\theta(\mathbf{x})$;
  but does that constitute an illegitimate double-use of data?
  Mayo (1981) answered that question in the negative, arguing that the original data $\mathbf{x}_0$ are commonly remodeled to $\mathbf{r}_0 = G(\mathbf{x}_0)$, $\mathbf{r}_0\in\mathbb{R}^m$, m ≤ n, and thus rendered distinct from $\mathbf{x}_0$ when testing $\mathcal{M}_\theta(\mathbf{x})$'s assumptions:
  "What is relevant for our purposes is that the data used to test the probability of heads [primary hypothesis] is distinct from the data used in the subsequent test of independence [model assumption]. Hence, no illegitimate double use of data is required." (Mayo, 1981, p. 195).
  Hendry (1995), p. 545, interpreted this statement to mean:
  "... following Mayo (1981), diagnostic test information is effectively independent of the sufficient statistics, so 'discounting' for such tests is not necessary."
  Combining these two views offers a more formal answer. First, (a) and (b) pose very different questions to data $\mathbf{x}_0$; second, the probing takes place within vs. outside $\mathcal{M}_\theta(\mathbf{x})$, respectively. Indeed, one can go further and argue that the answers to the questions posed in (a) and (b) rely on distinct information in data $\mathbf{x}_0$.
  • 31. Under certain conditions, the sample can be split into two components, $\mathbf{X} \to (S(\mathbf{X}), R(\mathbf{X}))$, inducing the following reduction in $f(\mathbf{x}; \theta)$:
  $f(\mathbf{x}; \theta) = |J| \cdot f_s(\mathbf{s}; \theta) \cdot f_r(\mathbf{r})$, for all $(\mathbf{s}, \mathbf{r})\in\mathbb{R}^m \times \mathbb{R}^{n-m}$,
  where $|J|$ is the Jacobian of the transformation $\mathbf{X} \to (S(\mathbf{X}), R(\mathbf{X}))$, $S(\mathbf{X}) := (S_1, \ldots, S_m)$ is a complete sufficient statistic, $R(\mathbf{X}) := (R_1, \ldots, R_{n-m})$ is a maximal ancillary statistic, and $S(\mathbf{X})$ and $R(\mathbf{X})$ are independent. What does this reduction mean?
  $f(\mathbf{x}; \theta) = |J| \cdot \overbrace{f_s(\mathbf{s}; \theta)}^{\text{inference}} \cdot \overbrace{f_r(\mathbf{r})}^{\text{model validation}}$ (25)
  I [a] All primary inferences are based exclusively on $f_s(\mathbf{s}; \theta)$, and
  I [b] $f_r(\mathbf{r})$ can be used to validate $\mathcal{M}_\theta(\mathbf{x})$ using error probabilities that are free of θ.
  Example. For the simple Normal model (table 4), (25) holds for $S := (\bar{X}_n, s^2)$, with
  $\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n}X_t$, $s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(X_t - \bar{X}_n)^2$,
  the minimal sufficient statistic, and $R(\mathbf{X}) = (\hat{u}_3, \ldots, \hat{u}_n)$, where:
  $\hat{u}_t = \frac{\sqrt{n}\,(X_t - \bar{X}_n)}{s\sqrt{n-1}} \sim \text{St}(n-1)$, $t = 1, 2, \ldots, n$,
  known as the studentized residuals, the maximal ancillary statistic.
  I This explains why M-S testing is often based on the residuals, and confirms Mayo's (1981) claim that $R(\mathbf{X}) = (\hat{u}_3, \ldots, \hat{u}_n)$ provides information distinct from $S(\mathbf{X})$, upon which the primary inferences are based.
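  A minimal numerical sketch (Python/numpy; the helper name studentized is illustrative) of why the studentized residuals are ancillary: they are invariant to the location and scale of the data, so their distribution cannot depend on θ = (μ, σ²):

```python
# Studentized residuals are free of (mu, sigma): rescaling and shifting
# the data leaves them unchanged.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=20)

def studentized(x):
    """(X_t - Xbar)/s: a location-scale-invariant function of the sample."""
    return (x - x.mean()) / x.std(ddof=1)

y = 3.0 * x - 7.0                       # change mu and sigma arbitrarily
print(np.allclose(studentized(x), studentized(y)))   # True: identical residuals
```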
  • 32. The crucial argument for relying on $f_r(\mathbf{r})$ for model validation purposes is that the probing for departures from $\mathcal{M}_\theta(\mathbf{x})$ is based on error probabilities that do not depend on θ.
  Generality of the result in (25). This result holds for almost all statistical models routinely used in statistical inference, including the simple Normal, the simple Bernoulli, the Linear Regression and related models, and all statistical models based on the (natural) Exponential family of distributions, such as the Normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, Poisson, Wishart, geometric, Laplace, Lévy, log-Normal, Pareto, Weibull, binomial (with fixed number of trials), multinomial (with fixed number of trials) and negative binomial (with fixed number of failures), among many others. Finally, the result in (25) holds 'approximately' in all cases of statistical models whose inference relies on asymptotic Normality.
  5 Summary and Conclusions
  Approximations, limited data and uncertainty lead to the use of statistical models in learning from data about phenomena of interest. All statistical methods (frequentist, Bayesian, nonparametric) rely on a prespecified statistical model $\mathcal{M}_\theta(\mathbf{x})$ as the primary basis of inference. The sound application and the objectivity of these methods turn on the validity of the assumed statistical model for the particular data.
  Fundamental aim: how to specify and validate statistical models.
  Unfortunately, model validation has been a neglected aspect of empirical modeling. At best, one often finds more of a grab-bag of techniques than a systematic account.
  • 33. Error statistics attempts to remedy that by proposing a coherent account of statistical model specification and validation that puts the entire process on a sounder philosophical footing (Spanos, 1986, 1999, 2000, 2010; Mayo and Spanos, 2004).
  Crucial strengths of frequentist error statistical methods in this context:
  I There is a clear goal to achieve: a statistical model that is sufficiently adequate so that the actual error probabilities approximate well the nominal ones.
  I It supplies a trenchant battery of Mis-Specification (M-S) tests for model validation (nonparametric and parametric) with a view to minimizing both types of errors and generating a reliable diagnosis through self-correction.
  I It offers a seamless transition from model validation to subsequent use, in the sense that the same error statistical reasoning is used throughout.
  The focus is on the question: what is the nature and warrant of frequentist error statistical model specification and validation?
  Failing to grasp the correct rationale of M-S testing has led many to think that merely finding a statistical model that 'fits' the data well in some sense is tantamount to showing it is statistically adequate. It is not!
  Minimal Principle of Evidence: if a procedure had no capacity to uncover departures from a hypothesis H, then not finding any is poor evidence for H.
  Failing to satisfy so minimal a principle leads to models which, while acceptable according to one's own self-scrutiny, are in fact inadequate and give rise to untrustworthy evidence.