PHIL 6334 - Probability/Statistics Lecture Notes 6:
An Introduction to Mis-Specification (M-S) Testing
Aris Spanos [Spring 2014]
1 Introduction
The primary objective of empirical modeling is ‘to learn
from data’ about observable stochastic phenomena of interest
using a statistical model Mθ(x). An important precondi-
tion for learning in statistical inference is that the probabilistic
assumptions of Mθ(x) representing the statistical premises,
are valid for the particular data x0.
1.1 Statistical adequacy
The generic form of a statistical model is:
Mθ(x)={f(x; θ), θ∈Θ}, x∈Rⁿ, for θ∈Θ⊂Rᵐ,
where f(x; θ), x∈Rⁿ, denotes the (joint) distribution of the
sample X:=(X₁, X₂, ..., Xₙ).
The link between Mθ(x) and the phenomenon of interest
comes in the form of viewing data x₀:=(x₁, x₂, ..., xₙ) as a
typical realization of the process {Xₜ, t∈N}. The ‘typicality’
of x0 can — and should — be assessed using trenchant Mis-
Specification (M-S) testing.
¥ Statistical adequacy. Testing the validity of the prob-
abilistic assumptions of the statistical model Mθ(x) vis-a-vis
data x0 is of paramount importance in practice because with-
out it the error reliability of inference is at best dubious. Why?
When any of the model assumptions are invalid, the nom-
inal (assumed) error probabilities used to calibrate the
‘reliability’ of inductive inferences are likely to be very dif-
ferent from the actual ones, rendering the inference results
unreliable.
I Rejecting a null hypothesis at a nominal α=.05 when
the actual type I error probability is closer to .90 provides
the surest way to an erroneous inference!
H It is important to note that all statistical methods (frequentist,
Bayesian, nonparametric) rely on an underlying statis-
tical model Mθ(z), and thus they are equally vulnerable to
statistical misspecification.
What goes wrong when Mθ(z) is statistically mis-
specified? Since the likelihood function is defined via the
distribution of the sample:
L(θ; z₀) ∝ f(z₀; θ), θ∈Θ,
an invalid f(x; θ) yields an invalid L(θ; z₀), and hence:
(a) for frequentist inference: incorrect error probabilities
and incorrect fit/prediction measures;
(b) for Bayesian inference: an erroneous posterior:
π(θ|z₀) ∝ π(θ)·L(θ; z₀).
Error statistics proposes a methodology for specifying
(Specification) and validating statistical models by probing
model assumptions (Mis-Specification (M-S) testing),
isolating the sources of departures, and accounting for them
in a respecified model (Respecification), with a view to securing
statistical adequacy. Such a model is then used to probe
the substantive hypotheses of interest.
Model validation plays a pivotal role in providing an
objective scrutiny of the reliability of inductive procedures;
objectivity in scientific inference is inextricably bound up
with the reliability of its methods.
1.2 Misspecification and the unreliability of inference
Before we discuss M-S testing, it is important to see how par-
ticular departures from the model assumptions can affect the
reliability of inference by distorting the nominal error proba-
bilities and rendering them non-ascertainable.
Table 1 - The simple (one parameter) Normal model
Statistical GM: Xₜ = μ + uₜ, t∈N
[1] Normal: Xₜ ∼ N(·, ·)
[2] Constant mean: E(Xₜ)=μ, for all t∈N
[3] Constant variance: Var(Xₜ)=σ² (σ² known), for all t∈N
[4] Independence: {Xₜ, t∈N} - independent process.
To simplify the discussion that follows, let us focus on the
simple Normal (one parameter) model (table 1).
It was shown above that for testing the hypotheses:
H₀: μ=μ₀ vs. H₁: μ > μ₀ (1)
there is an α-level UMP test Tα:={d(X), C₁(α)}:
d(X)=√n(X̄ₙ−μ₀)/σ, C₁(α)={x: d(x) > cα} (2)
where X̄ₙ=(1/n)Σₜ₌₁ⁿXₜ and cα is the threshold rejection value. Given
that:
(i) (X)=
√
(−0)

=0
v N(0 1) (3)
one can evaluate the type I error probability (significance level)
α using:
P(d(X) > cα; H₀ true)=α,
where α denotes the type I error. To evaluate the type II error
probability and the power one needs to know the sampling
distribution of (X) when 0 is false. However, since 0 is
3
false refers to 1 :   0 this evaluation will involve all
values of  greater than 0 (i.e. 10) :
(1) =P((X) ≤ ; =1)
(1) =1−(1)=P((X)  ; =1)
¾
∀(10)
The relevant sampling distribution takes the form:
(ii) (X)=
√
(−0)

=1
v N(1 1) 1=
√
(1−0)

 ∀10
(4)
What is often insufficiently emphasized in statistics text-
books is that the above nominal error probabilities, i.e.
the significance , as well as the power of test  will be
different from the actual error probabilities when any of
the assumptions [1]-[4] are invalid for data x₀. Indeed, such de-
partures are likely to create significant discrepancies between
the nominal and actual error probabilities that often render
inferences based on (2) unreliable.
To illustrate how the nominal and actual error probabilities
can differ when any of the assumptions [1]-[4] are invalid, let
us take the case where the independence assumption [4] is false
for the underlying process {Xₜ, t∈N} and instead:
Corr(Xᵢ, Xⱼ)=ρ, 0 < ρ < 1, for all i≠j, i, j=1, ..., n. (5)
How does such a misspecification affect the reliability of test Tα?
The actual distributions of d(X) under H₀ and H₁ are:
(i)* d(X) ∼ N(0, cₙ(ρ)) under μ=μ₀,
(ii)* d(X) ∼ N(√n(μ₁−μ₀)/σ, cₙ(ρ)) under μ=μ₁, (6)
where cₙ(ρ)=(1+(n−1)ρ) > 1, for 0 < ρ < 1 and n > 1.
How does this change affect the relevant error probabilities?
Example 1. Consider the case: α=.05 (cα=1.645), σ=1,
and n=100. To find the actual type I error probability
we need to evaluate the tail area of the distribution in (i)*
beyond cα=1.645:
α* = P(d(X) > cα; H₀) = P(Z > 1.645/√cₙ(ρ); μ=μ₀),
where Z ∼ N(0, 1). The results in table 2 for different values
of ρ indicate that test Tα has now become ‘unreliable’ because
α* > α. One will apply test Tα thinking that it will reject a
true H₀ only 5% of the time when, in fact, it is much higher.
Table 2 - Type I error of Tα when Corr(Xᵢ, Xⱼ)=ρ
ρ:  .0   .05  .1   .2   .3   .5   .75  .8   .9
α*: .05  .249 .309 .359 .383 .408 .425 .427 .431
The actual power should now be evaluated using:
π*(μ₁) = P(Z > [cα − √n(μ₁−μ₀)/σ]/√cₙ(ρ); μ=μ₁),
giving rise to the results in table 3.
Table 3 - Power π*(μ₁) of Tα when Corr(Xᵢ, Xⱼ)=ρ
ρ      π*(.01) π*(.02) π*(.05) π*(.1)  π*(.2)  π*(.3)  π*(.4)
.0      .061    .074    .121    .258    .637    .911    .991
.05     .262    .276    .318    .395    .557    .710    .832
.1      .319    .330    .364    .422    .542    .659    .762
.3      .390    .397    .418    .453    .525    .596    .664
.5      .414    .419    .436    .464    .520    .575    .630
.8      .431    .436    .449    .471    .515    .560    .603
.9      .435    .439    .452    .473    .514    .556    .598
For small values of μ₁ (.01, .02, .05, .1) the power increases
as ρ → 1, but for larger values of μ₁ (.2, .3, .4) the power
decreases, ruining the ‘probativeness’ of the test! It has become
like a defective smoke alarm that tends to go off when toast
burns, but will not be triggered by real smoke
until the house is fully ablaze; Mayo (1996).
The above example is only indicative of an actual situa-
tion in practice where several of the model assumptions are
often invalid, rendering the unreliability of inference far
more dire than this example might suggest; see Spanos and
McGuirk (2001).
1.3 On the reluctance to validate statistical models
The key reason why model validation is extremely important
is that No trustworthy evidence for or against a sub-
stantive claim (or theory) can be secured on the basis of a
statistically misspecified model.
In light of this, why has model validation been neglected?
There are several reasons, including the following.
(a) Inadequate appreciation of the serious implications of
statistical misspecification for the reliability of inference.
(b) Inadequate understanding of how one can secure statis-
tical adequacy using thorough M-S testing.
(c) Inadequate understanding of M-S testing and its confusion
with N-P testing render it vulnerable to charges like: (i) infi-
nite regress and circularity, and (ii) illicit double-use of data.
(d) There is an erroneous impression that statistical mis-
specification is inevitable since modeling involves abstraction,
simplification and approximation. Hence, the slogan "All
models are wrong, but some are useful" is used as the excuse
for neglecting model validation.
This aphorism is especially pernicious because it confuses two
different aspects of empirical modeling:
(i) the adequacy of the substantive (structural) model Mϕ(z)
(substantive adequacy), vis-a-vis the phenomenon of interest,
(ii) the validity of the (implicit) statistical model Mθ(z) (sta-
tistical adequacy) vis-a-vis the data z₀.
It’s one thing to claim that the structural model Mϕ(z)
is ‘wrong’ in the sense that it is not an exact picture of
reality in a substantive sense, and quite another
to claim that the implicit statistical model Mθ(z) could not
have generated data z₀ because its probabilistic assumptions
are invalid for z₀. In cases where we may arrive at statistically
adequate models, we can learn true things even with idealized
and partial substantive models.
When one imposes the substantive information (theory) on
data z₀ at the outset, by estimating Mϕ(z), the end result is
often a statistically and substantively misspecified model, but
one has no way to delineate the two sources of error:
(a) the inductive premises are invalid, or
(b) the substantive information is inadequate,
and apportion blame with a view to addressing the unreliability
of inference problem.
The key to circumventing this Duhemian ambiguity is to
find a way to disentangle the statistical Mθ(z) from the sub-
stantive premises Mϕ(z). What is often insufficiently appre-
ciated is the fact that behind every substantive model Mϕ(z)
there is (often implicitly) a statistical model Mθ(z) which pro-
vides the inductive premises for the reliability of statistical
inference based on data z₀. The latter is just a set of proba-
bilistic assumptions pertaining to the chance regularities in
data z₀. Statistical adequacy ensures error reliability in the
sense that the actual error probabilities approximate closely
the nominal ones.
2 M-S testing: a first encounter
To get some idea of what M-S testing is all about, let us focus
on a few simple tests to assess assumptions [1]-[4] of the simple
Normal model (table 4).
Table 4 - The simple Normal model
Statistical GM: Xₜ = μ + uₜ, t∈N
[1] Normal: Xₜ ∼ N(·, ·)
[2] Constant mean: E(Xₜ)=μ, for all t∈N
[3] Constant variance: Var(Xₜ)=σ², for all t∈N
[4] Independence: {Xₜ, t∈N} - independent process.
Mis-Specification (M-S) testing differs from Neyman-Pearson
(N-P) testing in several respects, the most important of which
is that the latter is testing within the boundaries of the assumed
statistical model Mθ(x), but the former is testing outside
those boundaries. N-P testing partitions the assumed model
using the parameters as an index. Conceptually, M-S testing
partitions the set P(x) of all possible statistical models that
could have given rise to data x₀ into Mθ(x) and its comple-
ment P(x) − Mθ(x). However, P(x) − Mθ(x) cannot be
expressed in a parametric form and thus M-S testing is more
open-ended than N-P testing.
Fig. 1: N-P testing within Mθ(x) (partitioned by H₀ and H₁);
Fig. 2: M-S testing outside Mθ(x), within P(x).
2.1 Omnibus (nonparametric) M-S tests
2.1.1 The ‘Runs M-S test’ for the IID assumptions [2]-[4]
The hypothesis of interest concerns the ordering of the sam-
ple X:=(X₁, X₂, ..., Xₙ), in the sense that the distribution of
the sample remains the same under any random reordering
of X, i.e.
H₀: f(X₁, X₂, ..., Xₙ; θ)=f(Xᵢ₁, Xᵢ₂, ..., Xᵢₙ; θ),
for any permutation (i₁, i₂, ..., iₙ) of the index (t=1, 2, ..., n).
Step 1: transform the data x₀:=(x₁, x₂, ..., xₙ) into the sequence
of differences (xₜ − xₜ₋₁), t=2, 3, ..., n.
Step 2: replace each (xₜ−xₜ₋₁) > 0 with ‘+’ and each (xₜ−xₜ₋₁) < 0
with ‘-’. A ‘run’ is a segment of the sequence consisting of ad-
jacent identical elements which are preceded and followed by
a different symbol.
The transformation takes the form:
(x₁, ..., xₙ) → {(xₜ−xₜ₋₁), t=2, ..., n} → (+, +, −, +, ···, +, −, −, +) (7)
Step 3: count the number of runs.
Example: the sequence
++ | − | +++ | −− | ++ | −−− | + | − | + | −− | +++++ | − | ···
consists of 12 runs; the first is a run of 2 positive signs, the
second a run of 1 negative sign, etc.
Runs test. One of the simplest runs tests is based on
comparing the actual number of runs R with the number of
runs expected if the data were a realization
of an IID process {Xₜ, t∈N}. The test takes the form:
d(X)=[R−E(R)]/√Var(R), C₁(α)={x: |d(x)| > cα/2}.
Using simple combinatorics with a sample of size n, one can
derive:
E(R)=(2n−1)/3, Var(R)=(16n−29)/90,
and show that the distribution of d(X) for n ≥ 40 is:
d(X)=[R−E(R)]/√Var(R) ≈ N(0, 1) under IID.
Note that this test is insensitive to departures from Nor-
mality because all distributional information has been lost in
the transformation (7).
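A minimal Python sketch of the runs up-and-down test (illustrative; it assumes no ties between consecutive observations, since a zero difference produces neither ‘+’ nor ‘−’):

```python
# Sketch of the runs up-and-down M-S test for the IID assumptions [2]-[4].
import numpy as np
from scipy.stats import norm

def runs_test(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    signs = np.sign(np.diff(x))                   # the '+'/'-' pattern in (7)
    r = 1 + int(np.sum(signs[1:] != signs[:-1]))  # runs = 1 + sign changes
    e_r = (2 * n - 1) / 3                         # E(R) under IID
    v_r = (16 * n - 29) / 90                      # Var(R) under IID
    d = (r - e_r) / np.sqrt(v_r)
    p = 2 * (1 - norm.cdf(abs(d)))                # two-sided N(0,1) p-value
    return r, d, p
```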
Example - exam scores. Let us return to the exam
score data, shown below in both the alphabetical and sitting
arrangement.
Case 1. For the exam scores data arranged in alphabetical
order (fig. 3) we observe the following runs up (₊) and
down (₋), listed by their lengths:
{1,1,4,1,1,3,1,1,1,1,2,1,1,1,1,1,1,2,2,1,1,2,3,1,1}₊
{1,1,3,1,1,2,2,1,2,1,2,2,3,1,2,1,2,1,2,1,2,1,1,1,1}₋ (8)
Hence, the actual number of runs is 50, which is close to the
number of runs expected under IID: (2(70)−1)/3 ≈ 46. Ap-
plying the above runs test yields:
d(x₀)=[50−(2(70)−1)/3]/√[(16(70)−29)/90]=1.053 [.292],
where the p-value is in square brackets. This indicates no
departure from the IID ([2]-[4]) assumptions.
Fig. 3: −alphabetical order Fig. 4: −sitting order
10
Case 2. Consider the scores data ordered according to the
sitting arrangement in figure 4. These data exhibit cycles which
yield the following runs up and down:
{3,2,4,4,1,4,3,6,1,4}₊, {2,2,2,4,3,3,7,4,6,1,3}₋ (9)
The difference between the patterns in (8) and (9) is that
there is more clustering, and thus fewer runs, in the latter case.
The actual number of runs is 21, less than half the number
expected under IID:
d(x₀)=[21−(2(70)−1)/3]/√[(16(70)−29)/90]=−7.276 [.0000],
which clearly indicates strong departures from the IID ([2]-[4])
assumptions.
2.1.2 Kolmogorov’s M-S test for Normality ([1])
The Kolmogorov M-S test assesses the validity of a dis-
tributional assumption under two key conditions:
(i) the data x0:=(1 2  ) can be viewed as a realiza-
tion of a random (IID) sample X:=( 1 2  ), and
(ii) the random variables 1 2   are continuous (not
discrete).
The test relies on the empirical cumulative distribution
function (ecdf):
b()= [no of (12) that do not exceed ]

 ∀∈R
Under (i)-(ii), the ecdf is a strongly consistent estimator of
the cumulative distribution function (cdf): F(x)=P(X ≤ x),
∀x∈R.
The generic hypothesis being tested takes the form:
0: ∗
()=0() ∈R (10)
where ∗
() denotes the true cdf, and 0() the cdf assumed
by the statistical model Mθ(x)
11
Kolmogorov (1933) proposed the distance function:
∆(X)= sup∈R | b() − 0()|
and proved that under (i)-(ii):
lim
→∞
P(
√
∆(X) ≤ )=() for   0 uniformly in  (11)
where () denotes the cdf of the Kolmogorov distribution:
()=1−2
P∞
=1(−1)+1
−222
' 1 − 2 exp(−22
)
Since () is known (approximated), one can define a M-S
test based on the test statistic (X)=
√
∆(X) giving rise
to the p-value:
P((X)  (x0); 0)=(x0)
Example. Applying the Kolmogorov test to the scores
data in fig. 3 yielded:
P((X)  039; 0)=15
which does not indicate any serious departures from the Nor-
mality assumption. The graph below provides a pictorial de-
piction of what this test is measuring in terms of the discrep-
ancies from the line to the observed points.
Fig. 5: Normal probability plot of the exam scores
(Mean 71.69, StDev 13.61, N=70, KS=0.039, p-value > 0.150)
Note that this particular test might be too sensitive to out-
liers because it picks up only the biggest distance!
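For reference, a hedged sketch of this test using scipy (the data file name is hypothetical; note that estimating μ and σ from the data, as the Minitab output above does, strictly takes one outside the Kolmogorov limit distribution, which assumes a fully specified F₀):

```python
# Sketch: Kolmogorov-type test of Normality for the exam scores.
import numpy as np
from scipy.stats import kstest

x = np.loadtxt("scores.txt")                   # hypothetical data file
D, p = kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
print(f"sup-distance = {D:.3f}, p-value = {p:.3f}")
```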
2.1.3 The role of omnibus M-S tests
The key advantage of the above omnibus tests is that they
probe more broadly around Mθ(x) than directional (para-
metric) M-S tests at the expense of lower power. However,
tests with low power are useful in M-S testing because when
they detect a departure, they provide better evidence for its
presence than a test with very high power!
A key weakness of the above omnibus tests is that when
the null hypothesis is rejected, the test does not provide any
information as to the direction of departure. Such information
is needed for the next stage of modeling, that of respecifying
the original model Mθ(x) with a view to account for the sys-
tematic information not accounted for by Mθ(x).
2.2 Directional (parametric) M-S tests
2.2.1 A parametric M-S test for independence ([4])
A general approach to deriving M-S tests is to return to the
original probabilistic assumptions of the process {Xₜ, t∈N}
underlying data x₀:=(x₁, x₂, ..., xₙ), replace one or more
assumptions with more general ones and derive relevant dis-
tance functions using the two statistical Generating Mecha-
nisms (GMs).
In the case of the simple Normal model, the process {Xₜ, t∈N}
is assumed to be NIID. Let us relax the IID assumptions to
Markov dependence and stationarity, which gives rise to the
AutoRegressive (AR(1)) model, based on f(xₜ|xₜ₋₁; θ), whose
statistical GM is:
 = 0 + 1−1 +  vN(0 2
0) ∈N (12)
where 0=(1−1)∈R 1=(1)
(0)
∈(−1 1) 2
0=(0)(1−2
1)∈R+;
13
=() (0)= () (1)=( −1) =1   
Fig. 6: M-S testing by encompassing
The AR(1) parametrically nests (includes as a special case)
the simple Normal model, because when α₁=0:
α₀=μ(1−α₁)|α₁=0=μ, σ₀²=c(0)(1−α₁²)|α₁=0=c(0),
and the AR(1) reduces to the simple Normal:
Xₜ = α₀ + α₁Xₜ₋₁ + uₜ → [α₁=0] → Xₜ = μ + uₜ, t∈N.
This suggests that a way to assess assumption [4] (table 4) is
to test the hypotheses:
H₀: α₁=0 vs. H₁: α₁ ≠ 0 (13)
in the context of the AR(1) model. This gives rise to a
t-type test Tα:={τ(X), C₁(α)}:
τ(X)=(α̂₁−0)/√Var(α̂₁) ≈ St(n−2) under H₀, C₁(α)={x: |τ(x)| > cα},
where:
α̂₁=[Σₜ₌₂ⁿ(Xₜ−X̄)(Xₜ₋₁−X̄)]/[Σₜ₌₂ⁿ(Xₜ₋₁−X̄)²],
Var(α̂₁)=s²/Σₜ₌₂ⁿ(Xₜ₋₁−X̄)²,
s²=(1/(n−2))Σₜ₌₂ⁿ(Xₜ−α̂₀−α̂₁Xₜ₋₁)², α̂₀=X̄(1−α̂₁).
Example. For the data in figure 4, estimating (12) yields
(standard errors in parentheses):
Xₜ = 39.593(7.790) + 0.441(0.106)Xₜ₋₁ + ûₜ, R²=.2, s²=143.42, n=69.
The M-S t-test for (13) yields:
τ(x₀)=(.441/.106)=4.160, p(x₀)=.0000,
indicating a clear departure from assumption [4].
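A minimal sketch of this auxiliary-autoregression t-test using statsmodels (the data file name is hypothetical):

```python
# Sketch: M-S t-test for independence via the auxiliary AR(1) regression (12).
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
y, Z = x[1:], sm.add_constant(x[:-1])          # regress X_t on (1, X_{t-1})
fit = sm.OLS(y, Z).fit()
print(fit.tvalues[1], fit.pvalues[1])          # t-ratio and p-value for alpha_1
```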
It is straightforward to extend the above test to Markov(l)
dependence by estimating the auxiliary regression:
Xₜ=α₀ + Σⱼ₌₁ˡ αⱼXₜ₋ⱼ + uₜ, t∈N, (14)
and testing the coefficient restrictions:
H₀: α₁=α₂= ··· =αₗ=0, for l < (n − 1). (15)
This gives rise to an F-type test, analogous to the Ljung and
Box (1978) test, with one big difference: the estimated coefficients
in (14) can also be assessed individually using t-tests in order
to avoid the large l problem raised above. For the case l =
2, the auxiliary regression is:
Xₜ=α₀ + α₁Xₜ₋₁ + α₂Xₜ₋₂ + uₜ, t∈N, (16)
and the F-test for the joint significance of α₁ and α₂ takes
the form:
F(x)=[(RRSS−URSS)/URSS]·[(n−3)/2] ∼ F(2, n−3) under H₀,
where RRSS=Σₜ₌₁ⁿ(Xₜ−X̄)² denotes the Restricted [re-
strictions α₁=α₂=0 imposed] Residual Sum of Squares,
URSS=Σₜ₌₁ⁿûₜ², with ûₜ=Xₜ−α̂₀−α̂₁Xₜ₋₁−α̂₂Xₜ₋₂, the Unre-
stricted Residual Sum of Squares, and F(2, n−3) denotes the F
distribution with 2 and n − 3 degrees of freedom.
One of the key advantages of this approach is that it can
easily be extended to derive joint M-S tests that assess more
than one assumption.
2.2.2 A parametric M-S test for IID ([2]-[3])
The above t-type parametric test based on the auxiliary au-
toregression (12) can be extended to provide a joint test for
assumptions [2] and [4]: replacing the stationarity assumption
of {Xₜ, t∈N} with mean non-stationarity, E(Xₜ)=β₀+β₁t, gives
rise to a heterogeneous AR(1) model with statistical GM:
Xₜ=δ₀ + δ₁t + α₁Xₜ₋₁ + uₜ, t∈N, (17)
where the trend term δ₁t relates to [2] and the lag term α₁Xₜ₋₁
to [4], with δ₀=β₀(1−α₁)+α₁β₁, δ₁=(1−α₁)β₁, α₁=(c(1)/c(0)),
σ₀²=c(0)(1−α₁²).
The AR(1) with a trend nests the simple Normal model:
Xₜ=δ₀ + δ₁t + α₁Xₜ₋₁ + uₜ → [δ₁=0, α₁=0] → Xₜ = μ + uₜ, t∈N.
This suggests that a way to assess assumptions [2]&[4] (table
4) jointly is to test the hypotheses:
0: 1=0 and 1=0 vs. 1: 16=0 or 16=0
This will give rise to a F-type test :={(X) 1()}:
(X)=RRSS-URSS
URSS
¡−3
2
¢ 0
≈ F(2 −3) 1()={x: (x)}
URSS=
P
=1(−b0−b1−b1−1)2
 RRSS=
P
=1(−)2

where URSS and RRSS denote the Unrestricted and Restricted
Residual Sum of Squares, respectively, and F(2,n-3) denotes
the F distribution with 2 and −3 degrees of freedom.
Example. For the data in figure 4, the restricted and
unrestricted models yielded, respectively (standard errors in
parentheses):
Xₜ = 71.69(1.631) + ûₜ, s²=185.23, n=69,
Xₜ = 38.156(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ûₜ,
s²=144.34, n=69, (18)
where RRSS=26845, URSS=21543, yielding:
F(x₀)=[(26845−21543)/21543]·(67/2)=8.245, p(x₀)=.0006,
indicating a clear departure from the null ([2]&[4]).
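The same F-statistic can be computed directly from the two residual sums of squares; a sketch (same hypothetical data file as above):

```python
# Sketch: joint F-type M-S test for [2]&[4] via the trend + lag regression (17).
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
y = x[1:]
n = len(y)
trend = np.arange(2, len(x) + 1, dtype=float)
Z = sm.add_constant(np.column_stack([trend, x[:-1]]))
urss = sm.OLS(y, Z).fit().ssr                  # unrestricted RSS
rrss = float(np.sum((y - y.mean()) ** 2))      # restricted RSS (constant only)
F = ((rrss - urss) / urss) * ((n - 3) / 2)
print(F)                                       # compare with F(2, n-3)
```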
What is particularly notable about the auxiliary autore-
gression (18) is that a closer look at the t-ratios indicates
that the source of the problem is dependence and not mean
(trend) heterogeneity. The t-ratio of the coefficient of t is statistically
insignificant:
τ(x₀)=(.055/.073)=0.753, p(x₀)=.226,
but the coefficient of −1 is statistically significant:
τ(x₀)=(.434/.107)=4.056, p(x₀)=.0000,
indicating a clear departure from assumption [4], but not from
[2]. This information, which enables one to apportion blame,
cannot be gleaned from the runs test.
An alternative, and preferable, way to specify the
above auxiliary regressions is in terms of the residuals:
ûₜ= (Xₜ − X̄) = (Xₜ − 71.69), t=1, 2, ..., n,
in the sense that the auxiliary regression:
ûₜ = −33.534(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ε̂ₜ,
s²=144.34, n=69, (19)
is a mirror image of (18):
Xₜ = 38.156(8.034) + 0.055(0.073)t + 0.434(0.107)Xₜ₋₁ + ûₜ,
s²=144.34, n=69, (20)
with identical parameter estimates, apart from the constant
(−33.534=38.156−71.69), which is irrelevant for M-S testing
purposes.
2.2.3 A parametric M-S test for assumptions [3]-[4]
In light of the fact that 2
= ()=(2
 ) one can test the
variance constancy [3] and independence [4] assumptions using
the residuals squared in the context of the auxiliary regression:
b2
 =0 +
[3]
z}|{
1 +
[4]
z }| {
22
−1 +  =1 2  
Using the above data, this gives rise to:
ûₜ² = 295.26(89.43) − 1.035(1.353)t − 0.16(0.14)ûₜ₋₁² + v̂ₜ. (21)
The non-significance of the coefficients c₁ and c₂ indicates no
departures from assumptions [3] and [4].
Note that one could test assumption [3] individually using
the auxiliary regression:
ûₜ² = 243.84(55.30) − 1.728(1.354)t + v̂ₜ, (22)
where the t-test for the coefficient of t yields: 1.728/1.354=1.276 [.206];
the p-value is given in square brackets.
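A sketch of this squared-residual regression (same hypothetical data file as above):

```python
# Sketch: variance auxiliary regression (22), u2_t = c0 + c1*t + v_t.
import numpy as np
import statsmodels.api as sm

x = np.loadtxt("scores_sitting.txt")           # hypothetical data file
u2 = (x - x.mean()) ** 2                       # squared residuals
t = np.arange(1, len(x) + 1, dtype=float)
fit = sm.OLS(u2, sm.add_constant(t)).fit()
print(fit.tvalues[1], fit.pvalues[1])          # t-test on the trend coefficient
```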
2.2.4 Extending the above auxiliary regression
The auxiliary regression (17), providing the basis of the joint
test for assumptions [2]-[4], can easily be extended to include
higher-order trends (up to order m ≥ 1) and additional lags
(l ≥ 1):
Xₜ = δ₀ + Σᵢ₌₁ᵐ δᵢtⁱ + Σⱼ₌₁ˡ αⱼXₜ₋ⱼ + uₜ, t∈N. (23)
2.2.5 A parametric M-S test for Normality ([1])
An alternative way to test Normality is to use parametric tests
relying on key features of the distribution. An example of this
type of test is the Skewness-Kurtosis test.
A key feature of the Pearson family is that it is specified
using the first four moments. Within this family we can char-
acterize several distributions using the skewness and kurtosis
coefficients:
3=(−())3
³√
 ()
´3  4=(−())4
³√
 ()
´4 
The skewness is the standardized third central moment and
provides a measure of the asymmetry of f(x); the kurtosis is
the standardized fourth central moment and is a measure of
the peakedness in relation to the tails of f(x).
The Normal distribution is characterized within the Pearson
family via the restrictions:
(3=0 4=3) ⇒ ∗
()=() for all ∈R
where ∗
() and () denote the true density and the Normal
density, respectively.
These moments can be used to derive an M-S test for the
Normality assumption [1] (table 4), using the hypotheses:
H₀: α₃=0 and α₄=3 vs. H₁: α₃≠0 or α₄≠3.
The Skewness-Kurtosis test is given by:
SK(X)=(n/6)α̂₃² + (n/24)(α̂₄−3)² ∼ χ²(2) under H₀,
P(SK(X) > SK(x₀); H₀)=p(x₀), (24)
where χ²(2) denotes the chi-square distribution with 2 degrees
of freedom, and:
α̂₃=[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)³]/[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)²]^(3/2),
α̂₄=[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)⁴]/[(1/n)Σₜ₌₁ⁿ(Xₜ−X̄)²]².
Example. For the scores data in fig. 3: α̂₃=−.03, α̂₄=2.62:
SK(x₀)=(70/6)(−.03)² + (70/24)(2.62−3)²=.432, p(x₀)=.806,
indicating no departure from the Normality assumption [1].
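A minimal sketch of the SK test in (24):

```python
# Sketch of the Skewness-Kurtosis M-S test for Normality.
import numpy as np
from scipy.stats import chi2

def sk_test(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = x - x.mean()
    s2 = np.mean(z ** 2)                       # (biased) variance, as in (24)
    a3 = np.mean(z ** 3) / s2 ** 1.5           # skewness coefficient
    a4 = np.mean(z ** 4) / s2 ** 2             # kurtosis coefficient
    sk = (n / 6) * a3 ** 2 + (n / 24) * (a4 - 3) ** 2
    return sk, 1 - chi2.cdf(sk, df=2)          # statistic and chi-square(2) p-value
```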
How is this test different from Kolmogorov’s nonparamet-
ric test? Depending on whether α̂₃ ≠ 0 or α̂₄ ≠ 3, one
can conclude whether the underlying distribution f(x) is non-
symmetric or leptokurtic, and that information can be useful
at the respecification stage.
2.3 Simple Normal model: a summary of M-S testing
The first auxiliary regression specifies how departures from
different assumptions might affect the mean:
(i) b=10 +
[2]
z }| {
11 + 122
+
[4]
z }| {
13−1 + 14−2 + 1
0 : 11=12=13=14=0
The second auxiliary regression specifies how departures
from different assumptions might affect the variance:
(ii) b2
 =20 +
[3]
z }| {
21 + 222
+
[4]
z }| {
232
−1 + 242
−2 + 2
0 : 21=22=23=24=0
When NO departures from assumptions [2]-[4] are detected
one can proceed to test the Normality assumption using tests
like the skewness-kurtosis, the Kolmogorov or the Anderson-
Darling. Otherwise, one uses the residuals from auxiliary re-
gression (i) as a basis for a Normality test.
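A sketch of this two-equation battery (the helper name and data file are hypothetical):

```python
# Sketch: joint F-tests for the mean (i) and variance (ii) auxiliary regressions.
import numpy as np
import statsmodels.api as sm

def aux_f_test(u):
    """F-test that the trend (t, t^2) and lag (u_{t-1}, u_{t-2}) coefficients
    are all zero in a regression of u_t on (1, t, t^2, u_{t-1}, u_{t-2})."""
    y = u[2:]
    t = np.arange(3, len(u) + 1, dtype=float)
    Z = sm.add_constant(np.column_stack([t, t ** 2, u[1:-1], u[:-2]]))
    fit = sm.OLS(y, Z).fit()
    r = np.eye(5)[1:]                          # restrictions: all but the constant
    res = fit.f_test(r)
    return float(res.fvalue), float(res.pvalue)

x = np.loadtxt("data.txt")                     # hypothetical data file
u = x - x.mean()                               # residuals from the simple model
print("mean eq. (i):     ", aux_f_test(u))      # probes [2] & [4]
print("variance eq. (ii):", aux_f_test(u ** 2)) # probes [3] & [4]
```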
Example. Consider the casting of two dice data in table 5.
Evaluating the sample mean, variance, skewness and kurtosis
for the dice data yields:
=1

P
=1 =7080 2
= 1
−1
P
=1( − )2
=5993
b3= − 035 b4=2362
20
Table 5 - Observed data on dice casting
3 10 11 5 6 7 10 8 5 11 2 9 9 6 8 4 7 6 5 12
7 8 5 4 6 11 7 10 5 8 7 5 9 8 10 2 7 3 8 10
11 8 9 5 7 3 4 9 10 4 7 4 6 9 7 6 12 8 11 9
10 3 6 9 7 5 8 6 2 9 6 4 7 8 10 5 8 7 9 6
5 7 7 6 12 9 10 4 8 6 5 4 7 8 6 7 11 7 8 3
Fig. 7: t-plot of the dice data
(a) Testing assumptions [2]-[4] using the runs test requires
counting the runs:
+ + - + + + - - + - + + - + - + - - + - + - - + + - + - + - - + - + - + - + + + - + - + - + + + -
+ - + + - - + - + - + - + + - - + - - + - - + + + - + - + - - + + - + - + - + - - - + + - + + - + -
For  ≥ 40, the type I error probability evaluation is based
on:
(X)= −([2−1]3)
√
[16−29]90
[1]-[4]
v N(0 1)
For the above data: =100 =50:
()=(200−1)3=66333  ()=(16(100)−29)90=17456
(X)=72−66333√
17456
=1356 P(|(X)|  1356; )=175
This does not indicate any departure from the IID assump-
tions.
(b) Test the independence assumption [4] using the auxil-
iary regression:
=0 + 1−1 +  =1 2  
21
=7856
(759)
− 103
(101)
−1 + b
(2425)

and the t-test for the significance of 1 yields: (x)=103
101
=1021[310]
where the p-value in square brackets indicates no clear depar-
ture from the Independence assumption; see Spanos (1999),
p. 774.
(c) Test the identically distributed assumptions [2]-[3] using
the auxiliary regression:
=0 + 1 +  =1 2  
=7193
(496)
− 002
(008)
 + b
(2460)

and the t-test for the significance of 1 yields: (x)=0022
0085
=259[793]
where the p-value indicates no departure from the ID assump-
tion; see Spanos (1999), p. 774.
(d) One can test the IID assumptions [2]-[4] jointly using
the auxiliary regression:
=0 + 1 + 2−1 +  =1 2  
=8100
(877)
− 0048
(0086)
 − 103
(101)
−1 + b
(2434)

where the F-test for the joint significance of 1 and 2 i.e.
0: 1=2=0 vs. 1: 16=0 or 26=0
(x)=−

¡−3

¢
=5766−568540
568540
¡96
2
¢
=680[511]
where =
P
=1( −)2
 denote the Restricted Resid-
uals Sum of Squares [the sum of squares of the residuals with
the restrictions imposed], and =
P
=1 b2
  the Unre-
stricted Residuals Sum of Squares [the sum of squares of the
residuals without the restrictions], respectively; note that
(−) is often called the Explained Sum of Squares
(ESS). The p-value in square brackets indicates no departure
from the IID assumptions, confirming the previous M-S test-
ing results.
22
Fig. 8: Histogram of the dice data
(e) Testing the Normality assumption [1] using the SK test
yields:
SK(x₀)=(100/6)(−.035)² + (100/24)(2.362 − 3)²=1.716 [.424].
The p-value indicates no departure from the Normality as-
sumption, but as shown in Spanos (1999), p. 775, this does
not mean that the assumption is valid; the test has very low
power. This is to be expected because the data come from a
discrete triangular distribution with values from 2 to 12, as
shown by the histogram (fig. 8).
Using the more powerful Anderson and Darling (1952) test,
which for the ordered sample X₍₁₎ ≤ ··· ≤ X₍ₙ₎ simplifies to:
A-D(X)= −n − (1/n)Σₜ₌₁ⁿ(2t−1){ln Φ(ẑ₍ₜ₎) + ln[1−Φ(ẑ₍ₙ₊₁₋ₜ₎)]},
where ẑ₍ₜ₎ denotes the standardized ordered observations,
however, provides evidence against Normality:
A-D(x₀)=.772 [.041].
In light of the M-S results in (a)-(e) one needs to replace the
Normality assumption with a triangular discrete distribution
in order to get a more adequate statistical model.
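For reference, scipy implements the Anderson-Darling statistic (critical-value based rather than exact p-values; the data file name is hypothetical):

```python
# Sketch: Anderson-Darling Normality test for the dice data.
import numpy as np
from scipy.stats import anderson

x = np.loadtxt("dice.txt")                     # hypothetical data file
res = anderson(x, dist="norm")
print(res.statistic)                           # compare with the critical values
print(res.critical_values, res.significance_level)
```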
3 Mis-Specification (M-S) testing: a formalization
3.1 The nature of M-S testing
The basic question posed by M-S testing is whether or not the
particular data x₀:=(x₁, x₂, ..., xₙ) constitute a ‘truly typical
realization’ of the stochastic process {Xₜ, t∈N} underlying
the (predesignated) statistical model:
Mθ(x)={f(x; θ), θ∈Θ}, x∈Rⁿ.
Remember also that the primary aim of the frequentist ap-
proach is to learn from data x₀ about the true statistical Data-
Generating Mechanism (DGM): M*(x)={f(x; θ*)}, x∈Rⁿ.
Fig. 9: N-P testing within Mθ(x); Fig. 10: M-S testing outside
Mθ(x), within P(x).
Hence, the primary role of M-S testing is to probe, vis-
a-vis data x₀, for possible departures from Mθ(x) beyond its
boundaries, but within P(x), the set of all possible statistical
models that could have given rise to x₀. In this sense, the
generic form of M-S testing is probing outside Mθ(x):
H₀: f*(x)∈Mθ(x) vs. H̄₀: f*(x)∈[P(x)−Mθ(x)],
where f*(x)=f(x; θ*) denotes the ‘true’ distribution of the
sample.
In contrast, N-P testing is always within the boundaries
of Mθ(x). It presupposes that Mθ(x) is statistically ade-
quate, and its hypotheses are ultimately concerned with learn-
ing from data about the ‘true’ θ, say θ*, that could have
given rise to data x₀. In general, the expression ‘θ* denotes
the true value of θ’ is shorthand for saying that ‘data x₀
constitute a realization of the sample X with distribution
f(x; θ*)’. By defining the partition of Θ=(−∞, ∞) in terms of
Θ₀=(−∞, μ₀] and Θ₁=(μ₀, ∞), and the associated partition of
Mθ(x), M₀(x)={f(x; μ), μ∈Θ₀} and M₁(x)={f(x; μ), μ∈Θ₁},
the hypotheses in (1) can be framed equivalently, but more
perceptively, as:
H₀: f(x; μ*)∈M₀(x) vs. H₁: f(x; μ*)∈M₁(x), x∈Rⁿ.
Indeed, the test statistic d(X)=√n(X̄ₙ−μ₀)/σ for the optimal
(UMP) N-P test is, in essence, the standardized difference be-
tween μ* and μ₀, with μ* replaced by its best estimator X̄ₙ.
The fact that M-S testing probes [P(x)−Mθ(x)] raises
certain technical and conceptual problems pertaining to how
one can operationalize such probing. In practice, one
needs to replace the broad H̄₀ with a more specific opera-
tional H₁. This operationalization has a very wide scope, ex-
tending from vague omnibus (local) to specific directional
(further-reaching) alternatives, like the tests based on the
auxiliary autoregressions and the Skewness-Kurtosis test. In
all cases, however, H₁ does not span H̄₀, and that raises
additional issues, including:
(a) The higher vulnerability of M-S testing to the fallacy of
rejection: (mis)interpreting ‘reject H₀’ [evidence against H₀] as
evidence for the specific H₁. Rejecting the null in an M-S test
provides evidence against the original model Mθ(x), but that
does not imply good evidence for the particular alternative
H₁. Hence, in practice, one should never accept H₁ without
further probing, because doing so would be a classic example
of the fallacy of rejection.
(b) In M-S testing the type II error [accepting the null when
false] is often the more serious of the two errors. This is be-
cause for the type I error [rejecting the null when true] one will
have another chance to correct the error at the respecification
stage. When one, after a battery of M-S tests, erroneously
concludes that M(x) is statistically adequate, one will pro-
ceed to draw inferences oblivious to the fact that the actual
error probabilities might be very different from the nominal
(assumed) ones.
(c) In M-S testing the objective is to probe [P(x)−Mθ(x)]
as exhaustively as possible, using a combination of om-
nibus M-S tests, whose probing is broader but has low
power, and directional M-S tests, whose probing is narrower
but goes much further and has higher power.
(d) Applying several M-S tests in probing the validity of one
or a combination of assumptions does not necessarily increase
the relevant type I error probability because the framing of the
hypotheses of interest renders them different from the multiple
hypothesis testing problem as construed in the N-P framework.
3.2 Respecification
After a reliable diagnosis of the sources of misspecification,
stemming from a reasoned scrutiny of the M-S testing re-
sults as a whole, one needs to respecify the original statisti-
cal model. Tracing the symptoms back to the source enables
one to return to the three-way partitioning, based on the three
types of probabilistic assumptions, and re-partition using more
appropriate reduction assumptions.
4 M-S testing: revisiting methodological issues
In this section we discuss some of the key criticisms of M-
S testing in order to bring out some of the confusions they
conceal.
4.1 Securing the effectiveness/reliability of M-S testing
There are a number of strategies designed to enhance the ef-
fectiveness/reliability of M-S probing and thus render the diagnosis
more reliable.
¥ A most efficient way to probe [P(x)−Mθ(x)] is to con-
struct M-S tests by modifying the original tripartite parti-
tioning that gave rise to Mθ(x) in directions of educated
departures gleaned from Exploratory Data Analysis. This gives
rise to encompassing models or directions of departure, which
enable one to eliminate an infinite number of alternative mod-
els at a time; Spanos (1999). This should be contrasted
with the most inefficient way of doing so: probing
[P(x)−Mθ(x)] one model at a time, Mϕᵢ(x), i=1, 2, .... This
is a hopeless task because there is an infinite number of such
alternative models to probe for and eliminate.
¥ Judicious combinations of omnibus (non-parametric),
directional (parametric) and simulation-based tests, probing
as broadly as possible and upholding dissimilar assumptions.
The interdependence of the model assumptions, stemming
from Mθ(x) being a parametrization of the process {Xₜ, t∈N},
plays a crucial role in the self-correction of M-S testing results.
¥ Astute ordering of M-S tests so as to exploit the in-
terrelationship among the model assumptions with a view to
‘correct’ each other’s diagnosis. For instance, the probabilistic
assumptions [1]-[3] of the Normal, Linear Regression model
(table 8) are interrelated because all three stem from the
assumption of Normality for the vector process {Zₜ, t∈N},
where Zₜ:=(Yₜ, Xₜ) is assumed to be NIID. This information is
also useful in narrowing down the possible alternatives. It is
important to note that the Normality assumption [1] should
be tested last because most of the M-S tests for it assume
that the other assumptions are valid, rendering the results
questionable when that clause is invalid.
¥ Joint M-S tests (testing several assumptions simul-
taneously) designed to avoid ‘erroneous’ diagnoses as well as
minimize the maintained assumptions.
The above strategies enable one to argue with severity
that when no departures from the model assumptions are de-
tected, the model provides a reliable basis for inference, in-
cluding appraising substantive claims (Mayo & Spanos, 2004).
4.2 The infinite regress and circularity charges
The infinite regress charge is often articulated by claiming
that each M-S test relies on a set of assumptions, and thus it
assesses the assumptions of the model Mθ(x) by invoking the
validity of its own assumptions, trading one set of assumptions
with another ad infinitum. Indeed, some go as far as to claim
that this reasoning is often circular because some M-S tests
inadvertently assume the validity of the very assumption they
aim to test!
A closer look at the reasoning underlying M-S testing re-
veals that both charges are misplaced.
¥ First, the scenario used in evaluating the type I error
invokes no assumptions beyond those of Mθ(x), since every
M-S test is evaluated under:
H₀: all the probabilistic assumptions of Mθ(x) are valid.
Moreover, when any one (or more) of the model assumptions
is rejected, the model Mθ(x) as a whole is considered mis-
specified.
Example. In the context of the simple Normal model
(table 6), the runs test is an example of an omnibus M-S test
for assumptions [2]-[4]. The original data, or the residuals, are
replaced with a ‘+’ when the next data point is a move up and
with a ‘−’ when it is a move down. A run is a sub-sequence of one
type (+ or −) immediately preceded and succeeded by an element
of the other type.
For  ≥ 40, the type I error probability evaluation is based
on:
(X)= −([2−1]3)
√
[16−29]90
[1]-[4]
v N(0 1)
It is important to emphasize that the runs test is insensitive
to departures from Normality, and thus the effective scenario
for deriving the type I error is under assumptions [2]-[4].
¥ Second, the power of any M-S test is determined by
evaluating the test statistic under certain forms of departures
from the assumptions being appraised [no circularity], but re-
taining the rest of the model assumptions.
For the runs test, the evaluation of power is based on:
d(X) ∼ N(δ, τ²), δ≠0, τ² > 0, under [1] & ¬([2]-[4]),
where ¬([2]-[4]) denotes specific departures from these assump-
tions considered by the test in question. However, since the
does not have any retained assumptions. One of the advan-
tages of nonparametric tests is that they are insensitive to
departures from certain retained assumptions.
Bottom line: in M-S testing the evaluations under the
null and alternative hypotheses invoke only the model assump-
tions; no additional assumptions are involved. Moreover, the
use of joint M-S tests aims to minimize the number of model
assumptions retained when evaluating under the alternative.
4.3 Illegitimate double-use of data charge
In the context of the error statistical approach it is certainly
true that the same data x0 are being used for two different
purposes:
I (a) to test primary hypotheses in terms of the unknown
parameter(s) θ, and
I (b) to assess the validity of the prespecified model Mθ(x),
but ‘does that constitute an illegitimate double-use of data?’
Mayo (1981) answered that question in the negative, ar-
guing that the original data x0 are commonly remodeled to
r₀=G(x₀), r₀∈Rᵐ, m ≤ n, and thus rendered distinct from x₀
when testing Mθ(x)’s assumptions:
“What is relevant for our purposes is that the data used to
test the probability of heads [primary hypothesis] is distinct from
the data used in the subsequent test of independence [model as-
sumption]. Hence, no illegitimate double use of data is required.”
(Mayo, 1981, p. 195).
Hendry (1995), p. 545, interpreted this statement to mean:
“... following Mayo (1981), diagnostic test information is ef-
fectively independent of the sufficient statistics, so ‘discounting’
for such tests is not necessary.”
Combining these two views offers a more formal answer.
First, (a) and (b) pose very different questions to data x₀,
and second, the probing takes place within vs. outside Mθ(x),
respectively.
Indeed, one can go further to argue that the answers to the
questions posed in (a) and (b) rely on distinct informa-
tion in data x0.
Under certain conditions, the sample can be split into two
components:
X → (S(X), R(X)),
inducing the following reduction in f(x; θ):
f(x; θ)=|J|·f(s; θ)·f(r), ∀(s, r)∈Rᵏ×Rⁿ⁻ᵏ,
where |J| is the Jacobian of the transformation X → (S(X), R(X)),
S(X):=(S₁, ..., Sₖ) is a complete sufficient statistic,
R(X):=(R₁, ..., Rₙ₋ₖ) a maximal ancillary statistic, and
S(X) and R(X) are independent.
What does this reduction mean?
f(x; θ) = |J| · f(s; θ) · f(r), (25)
where f(s; θ) underwrites the primary inference and f(r) the
model validation:
I [a] all primary inferences are based exclusively on f(s; θ),
and
I [b] f(r) can be used to validate Mθ(x) using error prob-
abilities that are free of θ.
Example. For the simple Normal model (table 4), the reduction
holds for S(X):=(X̄ₙ, s²):
X̄ₙ=(1/n)Σₜ₌₁ⁿXₜ, s²=(1/(n−1))Σₜ₌₁ⁿ(Xₜ−X̄ₙ)²,
the minimal sufficient statistic, and R(X)=(v̂₃, ..., v̂ₙ), where:
v̂ₜ=√n(Xₜ−X̄ₙ)/s ∼ St(n−1), t=1, 2, ..., n,
known as the studentized residuals, the maximal ancillary
statistic.
I This explains why M-S testing is often based on the
residuals, and confirms Mayo’s (1981) claim that R(X)=(v̂₃, ..., v̂ₙ)
provides information distinct from S(X), upon which the pri-
mary inferences are based.
The crucial argument for relying on f(r) for model valida-
tion purposes is that the probing for departures from Mθ(x)
is based on error probabilities that do not depend on θ.
Generality of result in (25). This result holds for al-
most all statistical models routinely used in statistical infer-
ence, including the simple Normal, the simple Bernoulli, the
Linear Regression and related models and all statistical mod-
els based on the (natural) Exponential family of distributions,
such as the Normal, exponential, gamma, chi-squared, beta,
Dirichlet, Bernoulli, Poisson, Wishart, geometric, Laplace,
Lévy, log-Normal, Pareto, Weibull, binomial (with fixed num-
ber of trials), multinomial (with fixed number of trials), and
negative binomial (with fixed number of failures) and many
others. Finally, the above result in (25) holds ‘approximately’
in all cases of statistical models whose inference relies on as-
ymptotic Normality.
5 Summary and Conclusions
Approximations, limited data and uncertainty lead to the use of
statistical models in learning from data about phenomena
of interest.
All statistical methods (frequentist, Bayesian, non-
parametric) rely on a prespecified statistical model Mθ(x)
as the primary basis of inference. The sound application and
the objectivity of their methods turn on the validity of these
assumed statistical models for the particular data.
Fundamental aim: How to specify and validate statisti-
cal models.
Unfortunately, model validation has been a neglected
aspect of empirical modeling. At best, one often finds more
of a grab-bag of techniques than a systematic account.
Error statistics attempts to remedy that by proposing a
coherent account of statistical model specification and valida-
tion that puts the entire process on a sounder philosophical
footing (Spanos, 1986, 1999, 2000, 2010; Mayo and Spanos,
2004).
Crucial strengths of frequentist error statistical methods
in this context:
I There is a clear goal to achieve: the statistical model
is sufficiently adequate so that the actual error probabilities
approximate well the nominal ones.
I It supplies a trenchant battery of Mis-Specification (M-S)
tests for model-validation (non-parametric and parametric)
with a view to minimize both types of errors and generate a
reliable diagnosis through self-correction.
I It offers a seamless transition from model validation
to subsequent use in the sense that the same error statistical
reasoning is used.
The focus is on the question: What is the nature and warrant
for frequentist error statistical model specification and validation?
Failing to grasp the correct rationale of M-S testing has
led many to think that merely finding a statistical model that
‘fits’ the data well in some sense is tantamount to showing it
is statistically adequate. It is not!
Minimal Principle of Evidence: if the procedure had
no capacity to uncover departures from a hypothesis H, then
not finding any is poor evidence for H.
Failing to satisfy so minimal a principle leads to models
which, while acceptable according to their own self-scrutiny, are
in fact inadequate and give rise to untrustworthy evi-
dence.
33

More Related Content

PDF
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
PDF
Probability/Statistics Lecture Notes 4: Hypothesis Testing
PDF
Spanos lecture+3-6334-estimation
PDF
Spurious correlation (updated)
PDF
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
PDF
Phil 6334 Mayo slides Day 1
PPTX
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
PPTX
Chap09 hypothesis testing
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Probability/Statistics Lecture Notes 4: Hypothesis Testing
Spanos lecture+3-6334-estimation
Spurious correlation (updated)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Phil 6334 Mayo slides Day 1
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Chap09 hypothesis testing

What's hot (20)

PPT
Fundamentals of Testing Hypothesis
PDF
IN ORDER TO IMPLEMENT A SET OF RULES / TUTORIALOUTLET DOT COM
PDF
Testing as estimation: the demise of the Bayes factor
PPTX
6 estimation hypothesis testing t test
PDF
C2 st lecture 11 the t-test handout
PDF
Discussion of Persi Diaconis' lecture at ISBA 2016
PDF
ISBA 2016: Foundations
PDF
testing as a mixture estimation problem
PPTX
Hypothesis testing
PPTX
Statistical computing2
PPTX
Statistical Inference Part II: Types of Sampling Distribution
PDF
Quantitative Analysis For Management 11th Edition Render Solutions Manual
PPT
Hypothesis Testing
PPT
Chapter11
PPTX
PDF
2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berge...
PDF
Big Data Analysis
PPTX
The siegel-tukey-test-for-equal-variability
PDF
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
Fundamentals of Testing Hypothesis
IN ORDER TO IMPLEMENT A SET OF RULES / TUTORIALOUTLET DOT COM
Testing as estimation: the demise of the Bayes factor
6 estimation hypothesis testing t test
C2 st lecture 11 the t-test handout
Discussion of Persi Diaconis' lecture at ISBA 2016
ISBA 2016: Foundations
testing as a mixture estimation problem
Hypothesis testing
Statistical computing2
Statistical Inference Part II: Types of Sampling Distribution
Quantitative Analysis For Management 11th Edition Render Solutions Manual
Hypothesis Testing
Chapter11
2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berge...
Big Data Analysis
The siegel-tukey-test-for-equal-variability
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
Ad

Similar to An Introduction to Mis-Specification (M-S) Testing (20)

PDF
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
PDF
A. Spanos slides-Ontology & Methodology 2013 conference
PPT
More Statistics
PDF
Probability and basic statistics with R
PPT
Morestatistics22 091208004743-phpapp01
PDF
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
PPTX
statistics assignment help
PPT
chap4_Parametric_Methods.ppt
PDF
hypothesis_testing-ch9-39-14402.pdf
PPTX
ders 5 hypothesis testing.pptx
PPTX
TEST OF SIGNIFICANCE.pptx
PPTX
Statistical tests of significance and Student`s T-Test
PDF
2013.03.26 Bayesian Methods for Modern Statistical Analysis
PDF
2013.03.26 An Introduction to Modern Statistical Analysis using Bayesian Methods
PDF
advanced_statistics.pdf
DOCX
HW1_STAT206.pdfStatistical Inference II J. Lee Assignment.docx
PPTX
Chapter_9.pptx
PDF
V. pacáková, d. brebera
PDF
Advanced Engineering Mathematics (Statistical Techniques - II)
PPTX
Hypothesis Test _One-sample t-test, Z-test, Proportion Z-test
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
A. Spanos slides-Ontology & Methodology 2013 conference
More Statistics
Probability and basic statistics with R
Morestatistics22 091208004743-phpapp01
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
statistics assignment help
chap4_Parametric_Methods.ppt
hypothesis_testing-ch9-39-14402.pdf
ders 5 hypothesis testing.pptx
TEST OF SIGNIFICANCE.pptx
Statistical tests of significance and Student`s T-Test
2013.03.26 Bayesian Methods for Modern Statistical Analysis
2013.03.26 An Introduction to Modern Statistical Analysis using Bayesian Methods
advanced_statistics.pdf
HW1_STAT206.pdfStatistical Inference II J. Lee Assignment.docx
Chapter_9.pptx
V. pacáková, d. brebera
Advanced Engineering Mathematics (Statistical Techniques - II)
Hypothesis Test _One-sample t-test, Z-test, Proportion Z-test
Ad

More from jemille6 (20)

PDF
What is the Philosophy of Statistics? (and how I was drawn to it)
PDF
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
PDF
Severity as a basic concept in philosophy of statistics
PDF
“The importance of philosophy of science for statistical science and vice versa”
PDF
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
PDF
D. Mayo JSM slides v2.pdf
PDF
reid-postJSM-DRC.pdf
PDF
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
PDF
Causal inference is not statistical inference
PDF
What are questionable research practices?
PDF
What's the question?
PDF
The neglected importance of complexity in statistics and Metascience
PDF
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
PDF
On Severity, the Weight of Evidence, and the Relationship Between the Two
PDF
Comparing Frequentists and Bayesian Control of Multiple Testing
PPTX
Good Data Dredging
PDF
The Duality of Parameters and the Duality of Probability
PDF
Error Control and Severity
PDF
The Statistics Wars and Their Causalities (refs)
PDF
The Statistics Wars and Their Casualties (w/refs)
What is the Philosophy of Statistics? (and how I was drawn to it)
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
Severity as a basic concept in philosophy of statistics
“The importance of philosophy of science for statistical science and vice versa”
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
D. Mayo JSM slides v2.pdf
reid-postJSM-DRC.pdf
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Causal inference is not statistical inference
What are questionable research practices?
What's the question?
The neglected importance of complexity in statistics and Metascience
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
On Severity, the Weight of Evidence, and the Relationship Between the Two
Comparing Frequentists and Bayesian Control of Multiple Testing
Good Data Dredging
The Duality of Parameters and the Duality of Probability
Error Control and Severity
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Casualties (w/refs)

Recently uploaded (20)

PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Insiders guide to clinical Medicine.pdf
PPTX
master seminar digital applications in india
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Business Ethics Teaching Materials for college
PPTX
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
PPTX
Cell Structure & Organelles in detailed.
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Classroom Observation Tools for Teachers
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
Final Presentation General Medicine 03-08-2024.pptx
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Microbial diseases, their pathogenesis and prophylaxis
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Insiders guide to clinical Medicine.pdf
master seminar digital applications in india
Microbial disease of the cardiovascular and lymphatic systems
O7-L3 Supply Chain Operations - ICLT Program
Module 4: Burden of Disease Tutorial Slides S2 2025
Business Ethics Teaching Materials for college
The Healthy Child – Unit II | Child Health Nursing I | B.Sc Nursing 5th Semester
Cell Structure & Organelles in detailed.
102 student loan defaulters named and shamed – Is someone you know on the list?
2.FourierTransform-ShortQuestionswithAnswers.pdf
Classroom Observation Tools for Teachers
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Origin of periodic table-Mendeleev’s Periodic-Modern Periodic table
O5-L3 Freight Transport Ops (International) V1.pdf
Renaissance Architecture: A Journey from Faith to Humanism
PPH.pptx obstetrics and gynecology in nursing
Final Presentation General Medicine 03-08-2024.pptx

An Introduction to Mis-Specification (M-S) Testing

  • 1. PHIL 6334 - Probability/Statistics Lecture Notes 6: An Introduction to Mis-Specification (M-S) Testing Aris Spanos [Spring 2014] 1 Introduction The primary objective of empirical modeling is ‘to learn from data’ about observable stochastic phenomena of interest using a statistical model Mθ(x). An important precondi- tion for learning in statistical inference is that the probabilistic assumptions of Mθ(x) representing the statistical premises, are valid for the particular data x0. 1.1 Statistical adequacy The generic form of a statistical model is: Mθ(x)={(x; θ) θ∈Θ} x∈R  for θ∈Θ⊂R   where (x; θ) x∈R  denotes the (joint) distribution of the sample X:=(1  ) The link between Mθ(x) and the phenomenon of interest comes in the form of viewing data x0:=(1  ) as a typ- ical realization of the process { ∈N}. The ‘typicality’ of x0 can — and should — be assessed using trenchant Mis- Specification (M-S) testing. ¥ Statistical adequacy. Testing the validity of the prob- abilistic assumptions of the statistical model Mθ(x) vis-a-vis data x0 is of paramount importance in practice because with- out it the error reliability of inference is at best dubious. Why? When any of the model assumptions are invalid, the nom- inal (assumed) error probabilities used to calibrate the 1
  • 2. ‘reliability’ of inductive inferences are likely to be very dif- ferent from the actual ones, rendering the inference results unreliable. I Rejecting a null hypothesis at a nominal =05 when the actual type I error probability is closer to 90, provides the surest way for an erroneous inference! H It is important to note that all statistical methods (frequentist, Bayesian, nonparametric) rely on an underlying statis- tical model M(z), and thus they are equally vulnerable to statistical misspecification. What goes wrong when Mθ(z) is statistically mis- specified? Since the likelihood function is defined via the distribution of the sample: (; z0) ∝ (x0; θ) θ∈Θ invalid (x; θ) → invalid (; z0)⇒ ⎧ ⎪⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎪⎩ Frequentist inference incorrect error probabilities incorrect fit/prediction measure Bayesisia inference erroneous posterior: (|z0)∝()(; z0) Error statistics proposes a methodology on how to spec- ify (Specification) and validate statistical models by prob- ing model assumptions (Mis-Specification (M-S) test- ing), isolate the sources of departures, and account for them in a respecified model (Respecification) with a view to se- cure statistical adequacy. Such a model is then used to probe the substantive hypotheses of interest. Model validation plays a pivotal role in providing an objective scrutiny of the reliability of inductive procedures; objectivity in scientific inference is inextricably bound up with the reliability of its methods. 2
  • 3. For the hypotheses in (1), there is an α-level UMP test $T_\alpha := \{d(\mathbf{X}),\, C_1(\alpha)\}$ defined by:
  $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma}$, $C_1(\alpha) = \{\mathbf{x} : d(\mathbf{x}) > c_\alpha\}$, (2)
  where $\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t$ and $c_\alpha$ is the threshold rejection value. Given that:
  (i) $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \overset{\mu=\mu_0}{\sim} N(0, 1)$, (3)
  one can evaluate the type I error probability (significance level) α using $P(d(\mathbf{X}) > c_\alpha;\, H_0\ \text{true}) = \alpha$. To evaluate the type II error probability and the power, one needs to know the sampling distribution of $d(\mathbf{X})$ when $H_0$ is false. However, since '$H_0$ is false' refers to $H_1: \mu > \mu_0$, this evaluation will involve all values of μ greater than $\mu_0$ (i.e. $\mu_1 > \mu_0$):
  • 4. $\beta(\mu_1) = P(d(\mathbf{X}) \le c_\alpha;\, \mu=\mu_1)$, $\pi(\mu_1) = 1 - \beta(\mu_1) = P(d(\mathbf{X}) > c_\alpha;\, \mu=\mu_1)$, for all $\mu_1 > \mu_0$.
  The relevant sampling distribution takes the form:
  (ii) $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \overset{\mu=\mu_1}{\sim} N(\delta_1, 1)$, $\delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}$, for all $\mu_1 > \mu_0$. (4)
  What is often insufficiently emphasized in statistics textbooks is that the above nominal error probabilities, i.e. the significance level α as well as the power $\pi(\mu_1)$ of the test, will be different from the actual error probabilities when any of the assumptions [1]-[4] are invalid for data $\mathbf{x}_0$. Indeed, such departures are likely to create significant discrepancies between the nominal and actual error probabilities that often render inferences based on (2) unreliable.
  To illustrate how the nominal and actual error probabilities can differ when any of the assumptions [1]-[4] are invalid, let us take the case where the independence assumption [4] is false for the underlying process $\{X_t,\, t\in\mathbb{N}\}$ and instead:
  $Corr(X_i, X_j) = \rho$, $0 < \rho < 1$, for all $i \ne j$, $i, j = 1, \ldots, n$. (5)
  How does such a misspecification affect the reliability of test $T_\alpha$? The actual distributions of $d(\mathbf{X})$ under $H_0$ and $H_1$ are:
  (i)* $d(\mathbf{X}) \overset{\mu=\mu_0}{\sim} N(0,\, c_n(\rho))$, (ii)* $d(\mathbf{X}) \overset{\mu=\mu_1}{\sim} N\left(\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma},\, c_n(\rho)\right)$, (6)
  where $c_n(\rho) = 1 + (n-1)\rho > 1$ for $0 < \rho < 1$ and $n > 1$. How does this change affect the relevant error probabilities?
  • 5. Example 1. Consider the case α=.05 ($c_\alpha$=1.645), σ=1 and n=100. To find the actual type I error probability we need to evaluate the tail area of the distribution in (i)* beyond $c_\alpha$=1.645:
  $\alpha^* = P(d(\mathbf{X}) > c_\alpha;\, H_0) = P\left(Z > \frac{1.645}{\sqrt{c_n(\rho)}};\, \mu=\mu_0\right)$, where $Z \sim N(0,1)$.
  The results in table 2 for different values of ρ indicate that test $T_\alpha$ has now become 'unreliable' because $\alpha^* \gg \alpha$. One will apply test $T_\alpha$ thinking that it will reject a true $H_0$ only 5% of the time when, in fact, the probability is much higher.

  Table 2 - Type I error of $T_\alpha$ when $Corr(X_i, X_j) = \rho$
  ρ:    .0    .05   .1    .2    .3    .5    .75   .8    .9
  α*:   .05   .249  .309  .359  .383  .408  .425  .427  .431

  The actual power should now be evaluated using:
  $\pi^*(\mu_1) = P\left(Z > \frac{1}{\sqrt{c_n(\rho)}}\left[c_\alpha - \frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}\right];\, \mu=\mu_1\right)$,
  giving rise to the results in table 3.

  Table 3 - Power $\pi^*(\mu_1)$ of $T_\alpha$ when $Corr(X_i, X_j) = \rho$
  ρ      π*(.01)  π*(.02)  π*(.05)  π*(.1)  π*(.2)  π*(.3)  π*(.4)
  0      .061     .074     .121     .258    .637    .911    .991
  .05    .262     .276     .318     .395    .557    .710    .832
  .1     .319     .330     .364     .422    .542    .659    .762
  .3     .390     .397     .418     .453    .525    .596    .664
  .5     .414     .419     .436     .464    .520    .575    .630
  .8     .431     .436     .449     .471    .515    .560    .603
  .9     .435     .439     .452     .473    .514    .556    .598

  For small values of $\mu_1$ (.01, .02, .05, .1) the power increases as ρ → 1, but for larger values of $\mu_1$ (.2, .3, .4) the power decreases, ruining the 'probativeness' of the test!
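  To make the arithmetic behind tables 2 and 3 concrete, here is a minimal sketch (Python, assuming scipy is available; the function names are illustrative, not from the original notes) that evaluates $\alpha^*$ and $\pi^*(\mu_1)$ directly from (i)*-(ii)*:

```python
# A minimal sketch reproducing tables 2-3: the actual type I error and
# power of the test under Corr(X_i, X_j) = rho, using (i)*-(ii)*.
from scipy.stats import norm

C_ALPHA, N, SIGMA = norm.ppf(0.95), 100, 1.0   # alpha = .05, c_alpha ~ 1.645

def c_n(rho, n=N):
    """Variance inflation factor c_n(rho) = 1 + (n-1)*rho."""
    return 1.0 + (n - 1) * rho

def actual_alpha(rho):
    """Actual type I error: P(Z > c_alpha / sqrt(c_n(rho)))."""
    return norm.sf(C_ALPHA / c_n(rho) ** 0.5)

def actual_power(rho, mu1, mu0=0.0):
    """Actual power at mu1: P(Z > [c_alpha - sqrt(n)(mu1-mu0)/sigma]/sqrt(c_n))."""
    delta = N ** 0.5 * (mu1 - mu0) / SIGMA
    return norm.sf((C_ALPHA - delta) / c_n(rho) ** 0.5)

for rho in (0.0, 0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 0.8, 0.9):
    print(f"rho={rho:4}: alpha* = {actual_alpha(rho):.3f}")
# matches table 2: .05 .249 .309 .359 .383 .408 .425 .427 .431
print(f"power* at mu1=.2, rho=.05: {actual_power(0.05, 0.2):.3f}")  # ~ .557
```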
  • 6. It has become like a defective smoke alarm which has the tendency to go off when burning toast, but will not be triggered by real smoke until the house is fully ablaze; Mayo (1996).
  The above example is only indicative of the actual situation in practice, where several of the model assumptions are often invalid, rendering the reliability of inference a lot more dire than this example might suggest; see Spanos and McGuirk (2001).
  1.3 On the reluctance to validate statistical models
  The key reason why model validation is extremely important is that no trustworthy evidence for or against a substantive claim (or theory) can be secured on the basis of a statistically misspecified model. In light of this, why has model validation been neglected? There are several reasons, including the following.
  (a) Inadequate appreciation of the serious implications of statistical misspecification for the reliability of inference.
  (b) Inadequate understanding of how one can secure statistical adequacy using thorough M-S testing.
  (c) Inadequate understanding of M-S testing and its confusion with N-P testing, which renders it vulnerable to charges like: (i) infinite regress and circularity, and (ii) illicit double-use of data.
  (d) The erroneous impression that statistical misspecification is inevitable since modeling involves abstraction, simplification and approximation. Hence, the slogan "All models are wrong, but some are useful" is used as an excuse for neglecting model validation. This aphorism is especially pernicious because it confuses two different aspects of empirical modeling:
  • 7. (i) the adequacy of the substantive (structural) model $\mathcal{M}_\varphi(\mathbf{z})$ (substantive adequacy) vis-a-vis the phenomenon of interest, and
  (ii) the validity of the (implicit) statistical model $\mathcal{M}_\theta(\mathbf{z})$ (statistical adequacy) vis-a-vis the data $\mathbf{z}_0$.
  It is one thing to claim that the structural model $\mathcal{M}_\varphi(\mathbf{z})$ is 'wrong' in the sense that it is false as an exact picture of reality in a substantive sense, and quite another to claim that the implicit statistical model $\mathcal{M}_\theta(\mathbf{z})$ could not have generated data $\mathbf{z}_0$ because its probabilistic assumptions are invalid for $\mathbf{z}_0$. In cases where we arrive at statistically adequate models, we can learn true things even with idealized and partial substantive models.
  When one imposes the substantive information (theory) on data $\mathbf{z}_0$ at the outset, by estimating $\mathcal{M}_\varphi(\mathbf{z})$, the end result is often a statistically and substantively misspecified model, but one has no way to delineate the two sources of error:
  (a) the inductive premises are invalid, or
  (b) the substantive information is inadequate,
  and apportion blame with a view to addressing the unreliability-of-inference problem. The key to circumventing this Duhemian ambiguity is to find a way to disentangle the statistical premises $\mathcal{M}_\theta(\mathbf{z})$ from the substantive premises $\mathcal{M}_\varphi(\mathbf{z})$. What is often insufficiently appreciated is the fact that behind every substantive model $\mathcal{M}_\varphi(\mathbf{z})$ there is an (often implicit) statistical model $\mathcal{M}_\theta(\mathbf{z})$ which provides the inductive premises for the reliability of statistical inference based on data $\mathbf{z}_0$. The latter is just a set of probabilistic assumptions pertaining to the chance regularities in data $\mathbf{z}_0$. Statistical adequacy ensures error reliability in the sense that the actual error probabilities closely approximate the nominal ones.
  • 8. 2 M-S testing: a first encounter
  To get some idea of what M-S testing is all about, let us focus on a few simple tests to assess assumptions [1]-[4] of the simple Normal model (table 4).

  Table 4 - The simple Normal model
  Statistical GM: $X_t = \mu + u_t$, t∈N
  [1] Normal: $X_t \sim N(\cdot, \cdot)$
  [2] Constant mean: $E(X_t) = \mu$, for all t∈N
  [3] Constant variance: $Var(X_t) = \sigma^2$, for all t∈N
  [4] Independence: $\{X_t,\, t\in\mathbb{N}\}$ is an independent process.

  Mis-Specification (M-S) testing differs from Neyman-Pearson (N-P) testing in several respects, the most important of which is that the latter tests within the boundaries of the assumed statistical model $\mathcal{M}_\theta(\mathbf{x})$, whereas the former tests outside those boundaries. N-P testing partitions the assumed model using the parameters as an index. Conceptually, M-S testing partitions the set $\mathcal{P}(\mathbf{x})$ of all possible statistical models that could have given rise to data $\mathbf{x}_0$ into $\mathcal{M}_\theta(\mathbf{x})$ and its complement $\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})$. However, $\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})$ cannot be expressed in a parametric form, and thus M-S testing is more open-ended than N-P testing.
  [Fig. 1: N-P testing within $\mathcal{M}_\theta(\mathbf{x})$. Fig. 2: M-S testing outside $\mathcal{M}_\theta(\mathbf{x})$, within $\mathcal{P}(\mathbf{x})$.]
  • 9. 2.1 Omnibus (nonparametric) M-S tests
  2.1.1 The 'runs' M-S test for the IID assumptions [2]-[4]
  The hypothesis of interest concerns the ordering of the sample $\mathbf{X} := (X_1, X_2, \ldots, X_n)$, in the sense that the distribution of the sample remains the same under any random reordering of $\mathbf{X}$, i.e.
  $H_0: f(x_1, x_2, \ldots, x_n; \theta) = f(x_{i_1}, x_{i_2}, \ldots, x_{i_n}; \theta)$, for any permutation $(i_1, i_2, \ldots, i_n)$ of the index $(t = 1, 2, \ldots, n)$.
  Step 1: transform data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ into the sequence of differences $(x_t - x_{t-1})$, $t = 2, 3, \ldots, n$.
  Step 2: replace each $(x_t - x_{t-1}) > 0$ with '+' and each $(x_t - x_{t-1}) < 0$ with '-'. A 'run' is a segment of the sequence consisting of adjacent identical elements which are preceded and followed by a different symbol. The transformation takes the form:
  $(x_1, \ldots, x_n) \to \{(x_t - x_{t-1}),\ t = 2, \ldots, n\} \to (+\ +\ -\ +\ \cdots\ +\ -\ -\ +)$. (7)
  Step 3: count the number of runs.
  Example: the sequence ++ | - | +++ | -- | ++ | --- | + | - | + | -- | +++++ | - | ... consists of 12 runs; the first is a run of 2 positive signs, the second a run of 1 negative sign, etc.
  Runs test. One of the simplest runs tests is based on comparing the actual number of runs R with the number of runs expected when the data constitute a realization of an IID process $\{X_t,\, t\in\mathbb{N}\}$. The test takes the form:
  $d(\mathbf{X}) = \frac{[R - E(R)]}{\sqrt{Var(R)}}$, $C_1(\alpha) = \{\mathbf{x} : |d(\mathbf{x})| > c_{\alpha/2}\}$.
  • 10. Using simple combinatorics, for a sample of size n one can derive:
  $E(R) = \frac{2n-1}{3}$, $Var(R) = \frac{16n-29}{90}$,
  and show that the distribution of $d(\mathbf{X})$ for n ≥ 40 is:
  $d(\mathbf{X}) = \frac{[R - E(R)]}{\sqrt{Var(R)}} \overset{\text{IID}}{\approx} N(0, 1)$.
  Note that this test is insensitive to departures from Normality because all distributional information has been lost in the transformation (7).
  Example - exam scores. Let us return to the exam scores data, shown below in both the alphabetical and the sitting arrangement.
  Case 1. For the exam scores data arranged in alphabetical order (fig. 3) we observe the following runs:
  runs up: {1 1 4 1 1 3 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 2 3 1 1}
  runs down: {1 1 3 1 1 2 2 1 2 1 2 2 3 1 2 1 2 1 2 1 2 1 1 1 1} (8)
  Hence, the actual number of runs is 50, which is close to the number of runs expected under IID: $(2(70)-1)/3 \simeq 46$. Applying the above runs test yields:
  $d(\mathbf{x}_0) = \frac{50 - \left(\frac{2(70)-1}{3}\right)}{\sqrt{\frac{16(70)-29}{90}}} = 1.053\ [.292]$,
  where the p-value is in square brackets. This indicates no departure from the IID ([2]-[4]) assumptions.
  [Fig. 3: scores data in alphabetical order. Fig. 4: scores data in sitting order.]
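  A minimal sketch of this runs test in Python follows, assuming the data sit in a 1-D numpy array; the helper name runs_test is illustrative, and the convention of dropping tied observations (zero differences) is an assumption, since the notes do not discuss ties:

```python
# Runs up-and-down M-S test for the IID assumptions [2]-[4].
import numpy as np
from scipy.stats import norm

def runs_test(x):
    """Return (number of runs, z statistic, two-sided p-value)."""
    signs = np.sign(np.diff(x))
    signs = signs[signs != 0]                   # drop ties (an assumed convention)
    n = len(x)
    runs = 1 + np.sum(signs[1:] != signs[:-1])  # sign changes + 1
    mean_r = (2 * n - 1) / 3                    # E(R) under IID
    var_r = (16 * n - 29) / 90                  # Var(R) under IID
    z = (runs - mean_r) / np.sqrt(var_r)
    return runs, z, 2 * norm.sf(abs(z))

# e.g. for hypothetical scores data: runs, z, p = runs_test(scores)
```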
  • 11. Case 2. Consider the scores data ordered according to the sitting arrangement in figure 4. These data exhibit cycles, which yield the following runs up and down:
  runs up: {3 2 4 4 1 4 3 6 1 4}
  runs down: {2 2 2 4 3 3 7 4 6 1 3} (9)
  The difference between the patterns in (8) and (9) is that there is more clustering, and thus fewer runs, in the latter case. The actual number of runs is 21, less than half of what would be expected under IID:
  $d(\mathbf{x}_0) = \frac{21 - \left(\frac{2(70)-1}{3}\right)}{\sqrt{\frac{16(70)-29}{90}}} = -7.276\ [.0000]$,
  which clearly indicates strong departures from the IID ([2]-[4]) assumptions.
  2.1.2 Kolmogorov's M-S test for Normality ([1])
  The Kolmogorov M-S test assesses the validity of a distributional assumption under two key conditions:
  (i) the data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ can be viewed as a realization of a random (IID) sample $\mathbf{X} := (X_1, X_2, \ldots, X_n)$, and
  (ii) the random variables $X_1, X_2, \ldots, X_n$ are continuous (not discrete).
  The test relies on the empirical cumulative distribution function (ecdf):
  $\hat{F}_n(x) = \frac{[\text{no. of } (X_1, X_2, \ldots, X_n) \text{ that do not exceed } x]}{n}$, for all x∈R.
  Under (i)-(ii), the ecdf is a strongly consistent estimator of the cumulative distribution function (cdf): $F(x) = P(X \le x)$, for all x∈R.
  The generic hypothesis being tested takes the form:
  $H_0: F^*(x) = F_0(x)$, x∈R, (10)
  where $F^*(x)$ denotes the true cdf, and $F_0(x)$ the cdf assumed by the statistical model $\mathcal{M}_\theta(\mathbf{x})$.
  • 12. Kolmogorov (1933) proposed the distance function:
  $\Delta_n(\mathbf{X}) = \sup_{x\in\mathbb{R}} |\hat{F}_n(x) - F_0(x)|$
  and proved that under (i)-(ii):
  $\lim_{n\to\infty} P(\sqrt{n}\,\Delta_n(\mathbf{X}) \le y) = H(y)$, for y > 0, uniformly in y, (11)
  where H(y) denotes the cdf of the Kolmogorov distribution:
  $H(y) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k+1} e^{-2k^2 y^2} \simeq 1 - 2\exp(-2y^2)$.
  Since H(y) is known (approximated), one can define a M-S test based on the test statistic $d(\mathbf{X}) = \sqrt{n}\,\Delta_n(\mathbf{X})$, giving rise to the p-value: $P(d(\mathbf{X}) > d(\mathbf{x}_0);\, H_0) = p(\mathbf{x}_0)$.
  Example. Applying the Kolmogorov test to the scores data in fig. 3 yielded:
  $P(d(\mathbf{X}) > .039;\, H_0) = .15$,
  which does not indicate any serious departures from the Normality assumption. The graph below provides a pictorial depiction of what this test measures in terms of the discrepancies between the fitted line and the observed points.
  [Fig. 5: Normal probability plot of the scores data (Mean = 71.69, StDev = 13.61, N = 70, KS = .039, P-Value > .150).]
  Note that this particular test might be too sensitive to outliers because it picks up only the biggest distance!
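  The Kolmogorov distance and an indicative p-value can be computed with scipy, as in the hedged sketch below. One caveat: estimating μ and σ from the same data (as the plot in fig. 5 does) changes the null distribution of the statistic (the Lilliefors effect), so the p-value returned here is only indicative:

```python
# A minimal sketch of a Kolmogorov-type Normality check, assuming the data
# are in a 1-D numpy array x; kolmogorov_normality is an illustrative name.
import numpy as np
from scipy.stats import kstest

def kolmogorov_normality(x):
    """Return (KS distance, indicative p-value) for H0: X ~ N(mu, sigma^2)."""
    mu, sigma = np.mean(x), np.std(x, ddof=1)   # parameters estimated from x
    return kstest(x, 'norm', args=(mu, sigma))

# e.g. stat, pval = kolmogorov_normality(scores)
```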
  • 13. 2.1.3 The role of omnibus M-S tests
  The key advantage of the above omnibus tests is that they probe more broadly around $\mathcal{M}_\theta(\mathbf{x})$ than directional (parametric) M-S tests, at the expense of lower power. However, tests with low power are useful in M-S testing because when they do detect a departure, they provide better evidence for its presence than a test with very high power!
  A key weakness of the above omnibus tests is that when the null hypothesis is rejected, the test does not provide any information as to the direction of departure. Such information is needed for the next stage of modeling, that of respecifying the original model $\mathcal{M}_\theta(\mathbf{x})$ with a view to accounting for the systematic information not accounted for by $\mathcal{M}_\theta(\mathbf{x})$.
  2.2 Directional (parametric) M-S tests
  2.2.1 A parametric M-S test for independence ([4])
  A general approach to deriving M-S tests is to return to the original probabilistic assumptions of the process $\{X_t,\, t\in\mathbb{N}\}$ underlying data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$, replace one or more assumptions with more general ones, and derive relevant distance functions using the two statistical Generating Mechanisms (GMs). In the case of the simple Normal model, the process $\{X_t,\, t\in\mathbb{N}\}$ is assumed to be NIID. Let us relax the IID assumptions to Markov dependence and stationarity, which gives rise to the AutoRegressive (AR(1)) model, based on $f(x_t | x_{t-1}; \theta)$, whose statistical GM is:
  $X_t = \alpha_0 + \alpha_1 X_{t-1} + u_t$, $u_t \sim \text{NIID}(0, \sigma_0^2)$, t∈N, (12)
  where $\alpha_0 = (1-\alpha_1)\mu \in \mathbb{R}$, $\alpha_1 = \frac{\sigma(1)}{\sigma(0)} \in (-1, 1)$, $\sigma_0^2 = \sigma(0)(1-\alpha_1^2) \in \mathbb{R}_+$, with $\mu = E(X_t)$, $\sigma(0) = Var(X_t)$, $\sigma(1) = Cov(X_t, X_{t-1})$, $t = 1, \ldots, n$.
  • 14. =() (0)= () (1)=( −1) =1    Fig. 6: M-S testing by encompassing The AR(1) parametrically nests (includes as a special case) the simple Normal model because when 1=0 : 0= (1−1)|1=0 = 2 0= (0)(1−2 1) ¯ ¯ 1=0 =(0) the AR(1) reduces to the simple Normal:  = 0 + 1−1 +  1=0 → = +  ∈N This suggests that a way to assess assumption [4] (table 4) is to test the hypotheses: 0: 1=0 vs. 1: 1 6= 0 (13) in the context of the AR(1) model. This will give rise to a t-type test :={(X) 1()}: (X)= (b1−0) √  (b1) 0 ≈ (−2) 1()={x: |(x)|  } where b1= P =1(−)(−1−) P =1(−1−)2   (b1)= 2 P =1(−1−)2  2 = 1 −2 P =1(−b0−b1−1)2  b0=(1−b1) Example. For the data in figure 4, (12) yields: =39593 (7790) + 0441 (0106) −1 + b 2 =2 2 =14342 =69 14
  • 15. The M-S t-test for (13) yields:
  $\tau(\mathbf{x}_0) = \left(\frac{.441}{.106}\right) = 4.160$, $p(\mathbf{x}_0) = .0000$,
  indicating a clear departure from assumption [4].
  It is straightforward to extend the above test to Markov(m) dependence by estimating the auxiliary regression:
  $X_t = \alpha_0 + \sum_{i=1}^{m}\alpha_i X_{t-i} + u_t$, t∈N, (14)
  and testing the coefficient restrictions:
  $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_m = 0$, for $m < (n-1)$. (15)
  This gives rise to an F-type test, analogous to the Ljung and Box (1978) test, with one big difference: the estimated coefficients in (14) can also be assessed individually using t-tests in order to avoid the large m problem raised above. For the case m = 2 the auxiliary regression is:
  $X_t = \alpha_0 + \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + u_t$, t∈N, (16)
  and the F-test for the joint significance of $\alpha_1$ and $\alpha_2$ takes the form:
  $F(\mathbf{x}) = \frac{\text{RRSS} - \text{URSS}}{\text{URSS}}\left(\frac{n-3}{2}\right) \overset{H_0}{\sim} F(2, n-3)$,
  where $\text{RRSS} = \sum_{t=1}^{n}(X_t - \bar{X})^2$ denotes the Restricted Residual Sum of Squares [restrictions $\alpha_1 = \alpha_2 = 0$ imposed], $\text{URSS} = \sum_{t=1}^{n}\hat{u}_t^2$, with $\hat{u}_t = X_t - \hat{\alpha}_0 - \hat{\alpha}_1 X_{t-1} - \hat{\alpha}_2 X_{t-2}$, the Unrestricted Residual Sum of Squares, and F(2, n-3) denotes the F distribution with 2 and n-3 degrees of freedom.
  One of the key advantages of this approach is that it can easily be extended to derive joint M-S tests that assess more than one assumption.
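  A minimal sketch of the auxiliary AR(1) M-S test follows (Python, assuming statsmodels is available; the data array name and the helper ar1_ms_test are illustrative):

```python
# Auxiliary-regression M-S test for independence [4]:
# regress x_t on x_{t-1} and t-test the lag coefficient, as in (12)-(13).
import numpy as np
import statsmodels.api as sm

def ar1_ms_test(x):
    """Fit x_t = a0 + a1*x_{t-1} + u_t; return (a1_hat, t-stat, p-value)."""
    y, ylag = x[1:], x[:-1]
    fit = sm.OLS(y, sm.add_constant(ylag)).fit()
    return fit.params[1], fit.tvalues[1], fit.pvalues[1]

# e.g. a1, tstat, pval = ar1_ms_test(scores_sitting_order)
# a small p-value indicates a departure from independence ([4])
```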
  • 16. 2.2.2 A parametric M-S test for assumptions [2] and [4]
  The above t-type parametric test based on the auxiliary autoregression (12) can be extended to provide a joint test for assumptions [2] and [4]: replacing the stationarity assumption of $\{X_t,\, t\in\mathbb{N}\}$ with mean non-stationarity gives rise to a heterogeneous AR(1) model with statistical GM:
  $X_t = \alpha_0 + \overbrace{\delta_1 t}^{[2]} + \overbrace{\alpha_1 X_{t-1}}^{[4]} + u_t$, t∈N, (17)
  $\alpha_0 = \mu(1-\alpha_1) + \alpha_1\gamma_1$, $\delta_1 = (1-\alpha_1)\gamma_1$, $\alpha_1 = \frac{\sigma(1)}{\sigma(0)}$, $\sigma_0^2 = \sigma(0)(1-\alpha_1^2)$.
  The AR(1) with a trend nests the simple Normal model:
  $X_t = \alpha_0 + \delta_1 t + \alpha_1 X_{t-1} + u_t \overset{\delta_1=0,\ \alpha_1=0}{\longrightarrow} X_t = \mu + u_t$, t∈N.
  This suggests that a way to assess assumptions [2] & [4] (table 4) jointly is to test the hypotheses:
  $H_0: \delta_1 = 0$ and $\alpha_1 = 0$ vs. $H_1: \delta_1 \ne 0$ or $\alpha_1 \ne 0$.
  This gives rise to an F-type test $T_\alpha := \{F(\mathbf{X}),\, C_1(\alpha)\}$:
  $F(\mathbf{X}) = \frac{\text{RRSS} - \text{URSS}}{\text{URSS}}\left(\frac{n-3}{2}\right) \overset{H_0}{\approx} F(2, n-3)$, $C_1(\alpha) = \{\mathbf{x} : F(\mathbf{x}) > c_\alpha\}$,
  $\text{URSS} = \sum_{t=1}^{n}(X_t - \hat{\alpha}_0 - \hat{\delta}_1 t - \hat{\alpha}_1 X_{t-1})^2$, $\text{RRSS} = \sum_{t=1}^{n}(X_t - \bar{X})^2$,
  where URSS and RRSS denote the Unrestricted and Restricted Residual Sum of Squares, respectively, and F(2, n-3) denotes the F distribution with 2 and n-3 degrees of freedom.
  Example. For the data in figure 4, the restricted and unrestricted models yielded, respectively (standard errors in parentheses):
  $x_t = \underset{(1.631)}{71.69} + \hat{u}_t$, $s^2 = 185.23$, n = 69,
  $x_t = \underset{(8.034)}{38.156} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{u}_t$, $s^2 = 144.34$, n = 69, (18)
  • 17. where RRSS = 26845 and URSS = 21543, yielding:
  $F(\mathbf{x}_0) = \left(\frac{26845 - 21543}{21543}\right)\left(\frac{67}{2}\right) = 8.245$, $p(\mathbf{x}_0) = .0006$,
  indicating a clear departure from the null ([2] & [4]).
  What is particularly notable about the auxiliary autoregression (18) is that a closer look at the t-ratios indicates that the source of the problem is dependence and not t-heterogeneity. The t-ratio of the coefficient of t is statistically insignificant:
  $\tau(\mathbf{x}_0) = \left(\frac{.055}{.073}\right) = .753$, $p(\mathbf{x}_0) = .226$,
  but the coefficient of $x_{t-1}$ is statistically significant:
  $\tau(\mathbf{x}_0) = \left(\frac{.434}{.107}\right) = 4.056$, $p(\mathbf{x}_0) = .0000$,
  indicating a clear departure from assumption [4], but not from [2]. This information, which enables one to apportion blame, cannot be gleaned from the runs test.
  An alternative, and preferable, way to specify the above auxiliary regressions is in terms of the residuals:
  $\hat{u}_t = (x_t - \bar{x}) = (x_t - 71.69)$, $t = 1, 2, \ldots, n$,
  in the sense that the auxiliary regression:
  $\hat{u}_t = -\underset{(8.034)}{33.534} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{\varepsilon}_t$, $s^2 = 144.34$, n = 69, (19)
  is a mirror image of (18):
  $x_t = \underset{(8.034)}{38.156} + \underset{(.073)}{.055}\, t + \underset{(0.107)}{0.434}\, x_{t-1} + \hat{\varepsilon}_t$, $s^2 = 144.34$, n = 69, (20)
  with identical parameter estimates apart from the constant (-33.534 = 38.156 - 71.69), which is irrelevant for M-S testing purposes.
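  A minimal sketch of the joint F-type test in (17) follows (Python/statsmodels; the data array name and the helper trend_ar1_ms_test are illustrative). The same pattern carries over directly to the squared-residual regressions of section 2.2.3 below:

```python
# Joint M-S test for constant mean [2] and independence [4]:
# regress x_t on a trend and x_{t-1}, then F-test delta1 = alpha1 = 0.
import numpy as np
import statsmodels.api as sm

def trend_ar1_ms_test(x):
    """Return (F statistic, p-value) for H0: delta1 = alpha1 = 0."""
    y = x[1:]
    t = np.arange(2, len(x) + 1)                 # trend term for obs 2..n
    X = sm.add_constant(np.column_stack([t, x[:-1]]))
    fit = sm.OLS(y, X).fit()
    ftest = fit.f_test("x1 = 0, x2 = 0")         # x1 = trend, x2 = lag
    return ftest.fvalue, ftest.pvalue

# e.g. F, p = trend_ar1_ms_test(scores_sitting_order)
# the individual t-ratios in fit.tvalues help apportion blame between [2] and [4]
```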
  • 18. 2.2.3 A parametric M-S test for assumptions [3]-[4]
  In light of the fact that $\sigma^2 = Var(X_t) = E(u_t^2)$, one can test the variance constancy [3] and independence [4] assumptions using the squared residuals in the context of the auxiliary regression:
  $\hat{u}_t^2 = c_0 + \overbrace{c_1 t}^{[3]} + \overbrace{c_2 \hat{u}_{t-1}^2}^{[4]} + \varepsilon_t$, $t = 1, 2, \ldots, n$.
  Using the above data, this gives rise to (standard errors in parentheses):
  $\hat{u}_t^2 = \underset{(89.43)}{295.26} - \underset{(1.353)}{1.035}\, t - \underset{(.014)}{.016}\, \hat{u}_{t-1}^2 + \hat{\varepsilon}_t$. (21)
  The non-significance of the coefficients $c_1$ and $c_2$ indicates no departures from assumptions [3] and [4]. Note that one could test assumption [3] individually using the auxiliary regression:
  $\hat{u}_t^2 = \underset{(55.30)}{243.84} - \underset{(1.354)}{1.728}\, t + \hat{\varepsilon}_{1t}$, (22)
  where the t-test for the coefficient of t yields: $\frac{1.728}{1.354} = 1.276\ [.206]$; the p-value is given in square brackets.
  2.2.4 Extending the above auxiliary regressions
  The auxiliary regression (17), providing the basis of the joint test for assumptions [2]-[4], can easily be extended to include higher-order trends (up to order $\ell \ge 1$) and additional lags ($m \ge 1$):
  $X_t = \alpha_0 + \sum_{i=1}^{\ell}\delta_i t^i + \sum_{j=1}^{m}\alpha_j X_{t-j} + u_t$, t∈N. (23)
  2.2.5 A parametric M-S test for Normality ([1])
  An alternative way to test Normality is to use parametric tests relying on key features of the distribution. An example of this type of test is the Skewness-Kurtosis test.
  • 19. A key feature of the Pearson family is that it is specified using the first four moments. Within this family we can characterize several distributions using the skewness and kurtosis coefficients:
  $\alpha_3 = \frac{E(X - E(X))^3}{\left(\sqrt{Var(X)}\right)^3}$, $\alpha_4 = \frac{E(X - E(X))^4}{\left(\sqrt{Var(X)}\right)^4}$.
  The skewness is the standardized third central moment and provides a measure of the asymmetry of f(x); the kurtosis is the standardized fourth central moment and is a measure of the peakedness of f(x) in relation to its tails. The Normal distribution is characterized within the Pearson family via the restrictions:
  $(\alpha_3 = 0,\ \alpha_4 = 3) \Rightarrow f^*(x) = \phi(x)$, for all x∈R,
  where $f^*(x)$ and $\phi(x)$ denote the true density and the Normal density, respectively. These moments can be used to derive a M-S test for the Normality assumption [1] (table 4), using the hypotheses:
  $H_0: \alpha_3 = 0$ and $\alpha_4 = 3$ vs. $H_1: \alpha_3 \ne 0$ or $\alpha_4 \ne 3$.
  The Skewness-Kurtosis test is given by:
  $SK(\mathbf{X}) = \frac{n}{6}\hat{\alpha}_3^2 + \frac{n}{24}(\hat{\alpha}_4 - 3)^2 \overset{H_0}{\sim} \chi^2(2)$, $P(SK(\mathbf{X}) > SK(\mathbf{x}_0);\, H_0) = p(\mathbf{x}_0)$, (24)
  where $\chi^2(2)$ denotes the chi-square distribution with 2 degrees of freedom, and:
  $\hat{\alpha}_3 = \frac{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^3}{\left(\sqrt{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^2}\right)^3}$, $\hat{\alpha}_4 = \frac{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^4}{\left(\sqrt{\frac{1}{n}\sum_{t=1}^{n}(X_t - \bar{X})^2}\right)^4}$.
  • 20. Example. For the scores data in fig. 3, $\hat{\alpha}_3 = -.03$ and $\hat{\alpha}_4 = 2.62$:
  $SK(\mathbf{x}_0) = \frac{70}{6}(-.03)^2 + \frac{70}{24}(-.38)^2 = .432$, $p(\mathbf{x}_0) = .806$,
  indicating no departure from the Normality assumption [1].
  How is this test different from Kolmogorov's nonparametric test? Depending on whether $\hat{\alpha}_3 \ne 0$ or $\hat{\alpha}_4 \ne 3$, one can conclude whether the underlying distribution f(x) is non-symmetric or leptokurtic, and that information can be useful at the respecification stage.
  2.3 Simple Normal model: a summary of M-S testing
  The first auxiliary regression specifies how departures from different assumptions might affect the mean:
  (i) $\hat{u}_t = \gamma_{10} + \overbrace{\gamma_{11} t + \gamma_{12} t^2}^{[2]} + \overbrace{\gamma_{13} x_{t-1} + \gamma_{14} x_{t-2}}^{[4]} + \varepsilon_{1t}$, $H_0: \gamma_{11} = \gamma_{12} = \gamma_{13} = \gamma_{14} = 0$.
  The second auxiliary regression specifies how departures from different assumptions might affect the variance:
  (ii) $\hat{u}_t^2 = \gamma_{20} + \overbrace{\gamma_{21} t + \gamma_{22} t^2}^{[3]} + \overbrace{\gamma_{23} \hat{u}_{t-1}^2 + \gamma_{24} \hat{u}_{t-2}^2}^{[4]} + \varepsilon_{2t}$, $H_0: \gamma_{21} = \gamma_{22} = \gamma_{23} = \gamma_{24} = 0$.
  When NO departures from assumptions [2]-[4] are detected, one can proceed to test the Normality assumption using tests like the Skewness-Kurtosis, the Kolmogorov or the Anderson-Darling. Otherwise, one uses the residuals from auxiliary regression (i) as the basis for a Normality test.
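  Before turning to the dice example, here is a minimal sketch of the Skewness-Kurtosis test in (24) (Python; the helper name sk_test is illustrative). For the scores data of the example above it should return approximately (.432, .806):

```python
# Skewness-Kurtosis M-S test for Normality [1], as in (24).
import numpy as np
from scipy.stats import chi2

def sk_test(x):
    """Return (SK statistic, p-value) for H0: alpha3 = 0 and alpha4 = 3."""
    n = len(x)
    z = x - x.mean()
    s = np.sqrt(np.mean(z**2))                  # MLE standard deviation
    a3 = np.mean(z**3) / s**3                   # sample skewness
    a4 = np.mean(z**4) / s**4                   # sample kurtosis
    sk = (n / 6) * a3**2 + (n / 24) * (a4 - 3)**2
    return sk, chi2.sf(sk, df=2)                # chi-square(2) tail area

# e.g. sk, p = sk_test(scores)
```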
  • 21. Example. Consider the casting-of-two-dice data in table 5. Evaluating the sample mean, variance, skewness and kurtosis for the dice data yields:
  $\bar{x} = \frac{1}{n}\sum_{t=1}^{n}x_t = 7.080$, $s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(x_t - \bar{x})^2 = 5.993$, $\hat{\alpha}_3 = -.035$, $\hat{\alpha}_4 = 2.362$.

  Table 5 - Observed data on dice casting
  3 10 11 5 6 7 10 8 5 11
  2 9 9 6 8 4 7 6 5 12
  7 8 5 4 6 11 7 10 5 8
  7 5 9 8 10 2 7 3 8 10
  11 8 9 5 7 3 4 9 10 4
  7 4 6 9 7 6 12 8 11 9
  10 3 6 9 7 5 8 6 2 9
  6 4 7 8 10 5 8 7 9 6
  5 7 7 6 12 9 10 4 8 6
  5 4 7 8 6 7 11 7 8 3

  [Fig. 7: t-plot of the dice data.]
  (a) Testing assumptions [2]-[4] using the runs test requires counting the runs:
  + + - + + + - - + - + + - + - + - - + - + - - + + - + - + - - + - + - + - + + + - + - + - + + + - + - + + - - + - + - + - + + - - + - - + - - + + + - + - + - - + + - + - + - + - - - + + - + + - + -
  For n ≥ 40, the type I error probability evaluation is based on:
  $d(\mathbf{X}) = \frac{R - [(2n-1)/3]}{\sqrt{[16n-29]/90}} \overset{[1]\text{-}[4]}{\sim} N(0, 1)$.
  For the above data, n = 100 and R = 72:
  $E(R) = (200-1)/3 = 66.333$, $Var(R) = (16(100)-29)/90 = 17.456$,
  $d(\mathbf{X}) = \frac{72 - 66.333}{\sqrt{17.456}} = 1.356$, $P(|d(\mathbf{X})| > 1.356;\, H_0) = .175$.
  This does not indicate any departure from the IID assumptions.
  (b) Test the independence assumption [4] using the auxiliary regression:
  $x_t = \alpha_0 + \alpha_1 x_{t-1} + u_t$, $t = 1, 2, \ldots, n$:
  • 22. =7856 (759) − 103 (101) −1 + b (2425)  and the t-test for the significance of 1 yields: (x)=103 101 =1021[310] where the p-value in square brackets indicates no clear depar- ture from the Independence assumption; see Spanos (1999), p. 774. (c) Test the identically distributed assumptions [2]-[3] using the auxiliary regression: =0 + 1 +  =1 2   =7193 (496) − 002 (008)  + b (2460)  and the t-test for the significance of 1 yields: (x)=0022 0085 =259[793] where the p-value indicates no departure from the ID assump- tion; see Spanos (1999), p. 774. (d) One can test the IID assumptions [2]-[4] jointly using the auxiliary regression: =0 + 1 + 2−1 +  =1 2   =8100 (877) − 0048 (0086)  − 103 (101) −1 + b (2434)  where the F-test for the joint significance of 1 and 2 i.e. 0: 1=2=0 vs. 1: 16=0 or 26=0 (x)=−  ¡−3  ¢ =5766−568540 568540 ¡96 2 ¢ =680[511] where = P =1( −)2  denote the Restricted Resid- uals Sum of Squares [the sum of squares of the residuals with the restrictions imposed], and = P =1 b2   the Unre- stricted Residuals Sum of Squares [the sum of squares of the residuals without the restrictions], respectively; note that (−) is often called the Explained Sum of Squares (ESS). The p-value in square brackets indicates no departure from the IID assumptions, confirming the previous M-S test- ing results. 22
  • 23. [Fig. 8: histogram of the dice data.]
  (e) Testing the Normality assumption [1] using the SK test yields:
  $SK(\mathbf{x}_0) = \frac{100}{6}(-.035)^2 + \frac{100}{24}(2.362 - 3)^2 = 1.716\ [.424]$.
  The p-value indicates no departure from the Normality assumption but, as shown in Spanos (1999), p. 775, this does not mean that the assumption is valid; the test has very low power. This is to be expected because the data come from a discrete triangular distribution with values from 2 to 12, as shown by the histogram (fig. 8). Using the more powerful Anderson and Darling (1952) test, which for the ordered sample $X_{[1]} \le X_{[2]} \le \cdots \le X_{[n]}$ simplifies to:
  $A\text{-}D(\mathbf{X}) = -n - \frac{1}{n}\sum_{t=1}^{n}\left\{(2t-1)\left[\ln F(X_{[t]}) + \ln\left(1 - F(X_{[n+1-t]})\right)\right]\right\}$,
  however, provides evidence against Normality: $A\text{-}D(\mathbf{x}_0) = .772\ [.041]$.
  In light of the M-S results in (a)-(e), one needs to replace the Normality assumption with a discrete triangular distribution in order to arrive at a more adequate statistical model.
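  The Anderson-Darling check can be run on the table 5 data with scipy, as in the sketch below. One caveat: scipy's anderson returns the statistic together with critical values rather than a p-value, so the .041 p-value quoted above (from a different implementation) cannot be reproduced exactly with this call:

```python
# Anderson-Darling Normality check on the dice data from table 5.
import numpy as np
from scipy.stats import anderson

dice = np.array([
    3,10,11,5,6,7,10,8,5,11,  2,9,9,6,8,4,7,6,5,12,
    7,8,5,4,6,11,7,10,5,8,    7,5,9,8,10,2,7,3,8,10,
    11,8,9,5,7,3,4,9,10,4,    7,4,6,9,7,6,12,8,11,9,
    10,3,6,9,7,5,8,6,2,9,     6,4,7,8,10,5,8,7,9,6,
    5,7,7,6,12,9,10,4,8,6,    5,4,7,8,6,7,11,7,8,3])

res = anderson(dice, dist='norm')
print(res.statistic)                       # compare with the 5% critical value
print(dict(zip(res.significance_level, res.critical_values)))
```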
  • 24. 3 Mis-Specification (M-S) testing: a formalization
  3.1 The nature of M-S testing
  The basic question posed by M-S testing is whether or not the particular data $\mathbf{x}_0 := (x_1, x_2, \ldots, x_n)$ constitute a 'truly typical realization' of the stochastic process $\{X_t,\, t\in\mathbb{N}\}$ underlying the (predesignated) statistical model:
  $\mathcal{M}_\theta(\mathbf{x}) = \{f(\mathbf{x}; \theta),\ \theta\in\Theta\}$, $\mathbf{x}\in\mathbb{R}^n$.
  Remember also that the primary aim of the frequentist approach is to learn from data $\mathbf{x}_0$ about the true statistical Data-Generating Mechanism (DGM) $\mathcal{M}^*(\mathbf{x}) = \{f(\mathbf{x}; \theta^*)\}$, $\mathbf{x}\in\mathbb{R}^n$.
  [Fig. 9: N-P testing within $\mathcal{M}_\theta(\mathbf{x})$. Fig. 10: M-S testing outside $\mathcal{M}_\theta(\mathbf{x})$, within $\mathcal{P}(\mathbf{x})$.]
  Hence, the primary role of M-S testing is to probe, vis-a-vis data $\mathbf{x}_0$, for possible departures from $\mathcal{M}_\theta(\mathbf{x})$ beyond its boundaries, but within $\mathcal{P}(\mathbf{x})$, the set of all possible statistical models that could have given rise to $\mathbf{x}_0$. In this sense, the generic form of M-S testing probes outside $\mathcal{M}_\theta(\mathbf{x})$:
  $H_0: f^*(\mathbf{x})\in\mathcal{M}_\theta(\mathbf{x})$ vs. $\bar{H}_0: f^*(\mathbf{x})\in[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$,
  where $f^*(\mathbf{x}) = f(\mathbf{x}; \theta^*)$ denotes the 'true' distribution of the sample.
  • 25. In contrast, N-P testing is always within the boundaries of $\mathcal{M}_\theta(\mathbf{x})$. It presupposes that $\mathcal{M}_\theta(\mathbf{x})$ is statistically adequate, and its hypotheses are ultimately concerned with learning from data about the 'true' θ, say $\theta^*$, that could have given rise to data $\mathbf{x}_0$. In general, the expression '$\theta^*$ denotes the true value of θ' is a shorthand for saying that 'data $\mathbf{x}_0$ constitute a realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x}; \theta^*)$'.
  By defining the partition of $\Theta = (-\infty, \infty)$ in terms of $\Theta_0 = (-\infty, \mu_0]$ and $\Theta_1 = (\mu_0, \infty)$, and the associated partition of $\mathcal{M}_\theta(\mathbf{x})$ into $\mathcal{M}_0(\mathbf{x}) = \{f(\mathbf{x}; \mu),\ \mu\in\Theta_0\}$ and $\mathcal{M}_1(\mathbf{x}) = \{f(\mathbf{x}; \mu),\ \mu\in\Theta_1\}$, the hypotheses in (1) can be framed equivalently, but more perceptively, as:
  $H_0: f(\mathbf{x}; \mu^*)\in\mathcal{M}_0(\mathbf{x})$ vs. $H_1: f(\mathbf{x}; \mu^*)\in\mathcal{M}_1(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^n$.
  Indeed, the test statistic $d(\mathbf{X}) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma}$ for the optimal (UMP) N-P test is, in essence, the standardized difference between $\mu^*$ and $\mu_0$, with $\mu^*$ replaced by its best estimator $\bar{X}_n$.
  The fact that M-S testing probes $[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$ raises certain technical and conceptual problems pertaining to how one can operationalize such probing. In practice, one needs to replace the broad $\bar{H}_0$ with a more specific operational $H_1$. This operationalization has a very wide scope, extending from vague omnibus (local) to specific directional (broader) alternatives, like the tests based on the auxiliary autoregressions and the Skewness-Kurtosis test. In all cases, however, $H_1$ does not span $\bar{H}_0$, and that raises additional issues, including:
  (a) The higher vulnerability of M-S testing to the fallacy of rejection: (mis)interpreting 'reject $H_0$' [evidence against $H_0$] as evidence for the specific $H_1$. Rejecting the null in a M-S test provides evidence against the original model $\mathcal{M}_\theta(\mathbf{x})$, but that does not imply good evidence for the particular alternative $H_1$.
  • 26. Hence, in practice, one should never accept $H_1$ without further probing, because doing so would be a classic example of the fallacy of rejection.
  (b) In M-S testing the type II error [accepting the null when false] is often the more serious of the two errors. This is because, for the type I error [rejecting the null when true], one will have another chance to correct the error at the respecification stage. When one, after a battery of M-S tests, erroneously concludes that $\mathcal{M}_\theta(\mathbf{x})$ is statistically adequate, one will proceed to draw inferences oblivious to the fact that the actual error probabilities might be very different from the nominal (assumed) ones.
  (c) In M-S testing the objective is to probe $[\mathcal{P}(\mathbf{x}) - \mathcal{M}_\theta(\mathbf{x})]$ as exhaustively as possible, using a combination of omnibus M-S tests, whose probing is broader but which have low power, and directional M-S tests, whose probing is narrower but goes much further and has higher power.
  (d) Applying several M-S tests in probing the validity of one or a combination of assumptions does not necessarily increase the relevant type I error probability, because the framing of the hypotheses of interest renders them different from the multiple hypothesis testing problem as construed in the N-P framework.
  • 27. 4 M-S testing: revisiting methodological issues In this section we discuss some of the key criticisms of M- S testing in order to bring out some of the confusions they conceal. 4.1 Securing the effectiveness/reliability of M-S testing There are a number of strategies designed to enhance the ef- fectiveness/reliability of M-S probing thus render the diagnosis more reliable. ¥ A most efficient way to probe [P(x)−M(x)] is to con- struct M-S tests by modifying the original tripartite parti- tioning that gave rise to Mθ(x) in directions of educated departures gleaned from Exploratory Data Analysis. This gives rise to encompassing models or directions of departure, which enable one to eliminate an infinite number of alternative mod- els at a time; Spanos (1999). This should be contrasted with a most inefficient way to do this, that involves probing [P(x)−M(x)] one model at a time Mϕ (x) =1 2  This is a hopeless task because there is an infinite number of such alternative models to probe for and eliminate. ¥ Judicious combinations of omnibus (non-parametric), directional (parametric) and simulation-based tests, probing as broadly as possible and upholding dissimilar assumptions. The interdependence of the model assumptions, stemming fromM(x) being a parametrization of the process { ∈N} plays a crucial role in the self-correction of M-S testing results. ¥ Astute ordering of M-S tests so as to exploit the in- terrelationship among the model assumptions with a view to ‘correct’ each other’s diagnosis. For instance, the probabilistic assumptions [1]-[3] of the Normal, Linear Regression model (table 8) are interrelated because all three stem from the 27
  • 28. assumption of Normality for the vector process {Z ∈N} where Z:=( ) assumed to be NIID. This information is also useful in narrowing down the possible alternatives. It is important to note that the Normality assumption [1] should be tested last because most of the M-S tests for it assume that the other assumptions are valid, rendering the results questionable when that clause is invalid. ¥ Joint M-S tests (testing several assumptions simul- taneously) designed to avoid ‘erroneous’ diagnoses as well as minimize the maintained assumptions. The above strategies enable one to argue with severity that when no departures from the model assumptions are de- tected, the model provides a reliable basis for inference, in- cluding appraising substantive claims (Mayo & Spanos, 2004). 4.2 The infinite regress and circularity charges The infinite regress charge is often articulated by claiming that each M-S test relies on a set of assumptions, and thus it assesses the assumptions of the model Mθ(x) by invoking the validity of its own assumptions, trading one set of assumptions with another ad infinitum. Indeed, some go as far as to claim that this reasoning is often circular because some M-S tests inadvertently assume the validity of the very assumption they aim to test! A closer look at the reasoning underlying M-S testing re- veals that both charges are misplaced. ¥ First, the scenario used in evaluating the type I error invokes no assumptions beyond those of Mθ(x), since every M-S test is evaluated under: : all the probabilistic assumptions of Mθ(x) are valid. Moreover, when any one (or more) of the model assumptions 28
  • 29. Example. In the context of the simple Normal model (table 6), the runs test is an example of an omnibus M-S test for assumptions [2]-[4]. The original data, or the residuals, are replaced with a '+' when the next data point is an 'up' and with a '-' when it is a 'down'. A run is a sub-sequence of one type (+ or -) immediately preceded and succeeded by an element of the other type. For n ≥ 40, the type I error probability evaluation is based on:
  $d(\mathbf{X}) = \frac{R - [(2n-1)/3]}{\sqrt{[16n-29]/90}} \overset{[1]\text{-}[4]}{\sim} N(0, 1)$.
  It is important to emphasize that the runs test is insensitive to departures from Normality, and thus the effective scenario for deriving the type I error is under assumptions [2]-[4].
  ¥ Second, the power of any M-S test is determined by evaluating the test statistic under certain forms of departures from the assumptions being appraised [no circularity], while retaining the rest of the model assumptions. For the runs test, the evaluation of power is based on:
  $d(\mathbf{X}) \overset{[1]\,\&\,\overline{[2]\text{-}[4]}}{\sim} N(\delta, \tau^2)$, $\delta \ne 0$, $\tau^2 > 0$,
  where $\overline{[2]\text{-}[4]}$ denotes specific departures from these assumptions considered by the test in question. However, since the test is insensitive to departures from [1], the effective scenario does not involve any retained assumptions. One of the advantages of nonparametric tests is that they are insensitive to departures from certain retained assumptions.
  Bottom line: in M-S testing the evaluations under the null and alternative hypotheses invoke only the model assumptions; no additional assumptions are involved. Moreover, the use of joint M-S tests aims to minimize the number of model assumptions retained when evaluating under the alternative.
  • 30. 4.3 The illegitimate double-use of data charge
  In the context of the error statistical approach it is certainly true that the same data $\mathbf{x}_0$ are being used for two different purposes:
  I (a) to test primary hypotheses in terms of the unknown parameter(s) θ, and
  I (b) to assess the validity of the prespecified model $\mathcal{M}_\theta(\mathbf{x})$;
  but does that constitute an illegitimate double-use of data?
  Mayo (1981) answered that question in the negative, arguing that the original data $\mathbf{x}_0$ are commonly remodeled to $\mathbf{r}_0 = G(\mathbf{x}_0)$, $\mathbf{r}_0\in\mathbb{R}^m$, m ≤ n, and thus rendered distinct from $\mathbf{x}_0$ when testing $\mathcal{M}_\theta(\mathbf{x})$'s assumptions:
  "What is relevant for our purposes is that the data used to test the probability of heads [primary hypothesis] is distinct from the data used in the subsequent test of independence [model assumption]. Hence, no illegitimate double use of data is required." (Mayo, 1981, p. 195).
  Hendry (1995), p. 545, interpreted this statement to mean:
  "... following Mayo (1981), diagnostic test information is effectively independent of the sufficient statistics, so 'discounting' for such tests is not necessary."
  Combining these two views offers a more formal answer. First, (a) and (b) pose very different questions to data $\mathbf{x}_0$; second, the probing takes place within vs. outside $\mathcal{M}_\theta(\mathbf{x})$, respectively. Indeed, one can go further and argue that the answers to the questions posed in (a) and (b) rely on distinct information in data $\mathbf{x}_0$.
  • 31. Under certain conditions, the sample can be split into two components, $\mathbf{X} \to (S(\mathbf{X}), R(\mathbf{X}))$, inducing the following reduction in $f(\mathbf{x}; \theta)$:
  $f(\mathbf{x}; \theta) = |J| \cdot f_s(\mathbf{s}; \theta) \cdot f_r(\mathbf{r})$, for all $(\mathbf{s}, \mathbf{r})\in\mathbb{R}^m \times \mathbb{R}^{n-m}$,
  where $|J|$ is the Jacobian of the transformation $\mathbf{X} \to (S(\mathbf{X}), R(\mathbf{X}))$, $S(\mathbf{X}) := (S_1, \ldots, S_m)$ is a complete sufficient statistic, $R(\mathbf{X}) := (R_1, \ldots, R_{n-m})$ is a maximal ancillary statistic, and $S(\mathbf{X})$ and $R(\mathbf{X})$ are independent. What does this reduction mean?
  $f(\mathbf{x}; \theta) = |J| \cdot \overbrace{f_s(\mathbf{s}; \theta)}^{\text{inference}} \cdot \overbrace{f_r(\mathbf{r})}^{\text{model validation}}$ (25)
  I [a] All primary inferences are based exclusively on $f_s(\mathbf{s}; \theta)$, and
  I [b] $f_r(\mathbf{r})$ can be used to validate $\mathcal{M}_\theta(\mathbf{x})$ using error probabilities that are free of θ.
  Example. For the simple Normal model (table 4), (25) holds for $S := (\bar{X}_n, s^2)$, with
  $\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n}X_t$, $s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(X_t - \bar{X}_n)^2$,
  the minimal sufficient statistic, and $R(\mathbf{X}) = (\hat{u}_3, \ldots, \hat{u}_n)$, where:
  $\hat{u}_t = \frac{\sqrt{n}\,(X_t - \bar{X}_n)}{s\sqrt{n-1}} \sim \text{St}(n-1)$, $t = 1, 2, \ldots, n$,
  known as the studentized residuals, the maximal ancillary statistic.
  I This explains why M-S testing is often based on the residuals, and confirms Mayo's (1981) claim that $R(\mathbf{X}) = (\hat{u}_3, \ldots, \hat{u}_n)$ provides information distinct from $S(\mathbf{X})$, upon which the primary inferences are based.
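  A minimal numerical sketch (Python/numpy; the helper name studentized is illustrative) of why the studentized residuals are ancillary: they are invariant to the location and scale of the data, so their distribution cannot depend on θ = (μ, σ²):

```python
# Studentized residuals are free of (mu, sigma): rescaling and shifting
# the data leaves them unchanged.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=20)

def studentized(x):
    """(X_t - Xbar)/s: a location-scale-invariant function of the sample."""
    return (x - x.mean()) / x.std(ddof=1)

y = 3.0 * x - 7.0                       # change mu and sigma arbitrarily
print(np.allclose(studentized(x), studentized(y)))   # True: identical residuals
```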
  • 32. The crucial argument for relying on $f_r(\mathbf{r})$ for model validation purposes is that the probing for departures from $\mathcal{M}_\theta(\mathbf{x})$ is based on error probabilities that do not depend on θ.
  Generality of the result in (25). This result holds for almost all statistical models routinely used in statistical inference, including the simple Normal, the simple Bernoulli, the Linear Regression and related models, and all statistical models based on the (natural) Exponential family of distributions, such as the Normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, Poisson, Wishart, geometric, Laplace, Lévy, log-Normal, Pareto, Weibull, binomial (with fixed number of trials), multinomial (with fixed number of trials) and negative binomial (with fixed number of failures), among many others. Finally, the result in (25) holds 'approximately' in all cases of statistical models whose inference relies on asymptotic Normality.
  5 Summary and Conclusions
  Approximations, limited data and uncertainty lead to the use of statistical models in learning from data about phenomena of interest. All statistical methods (frequentist, Bayesian, nonparametric) rely on a prespecified statistical model $\mathcal{M}_\theta(\mathbf{x})$ as the primary basis of inference. The sound application and the objectivity of these methods turn on the validity of the assumed statistical model for the particular data.
  Fundamental aim: how to specify and validate statistical models.
  Unfortunately, model validation has been a neglected aspect of empirical modeling. At best, one often finds more of a grab-bag of techniques than a systematic account.
  • 33. Error statistics attempts to remedy that by proposing a coherent account of statistical model specification and validation that puts the entire process on a sounder philosophical footing (Spanos, 1986, 1999, 2000, 2010; Mayo and Spanos, 2004).
  Crucial strengths of frequentist error statistical methods in this context:
  I There is a clear goal to achieve: a statistical model that is sufficiently adequate so that the actual error probabilities approximate well the nominal ones.
  I It supplies a trenchant battery of Mis-Specification (M-S) tests for model validation (nonparametric and parametric) with a view to minimizing both types of errors and generating a reliable diagnosis through self-correction.
  I It offers a seamless transition from model validation to subsequent use, in the sense that the same error statistical reasoning is used throughout.
  The focus is on the question: what is the nature and warrant of frequentist error statistical model specification and validation?
  Failing to grasp the correct rationale of M-S testing has led many to think that merely finding a statistical model that 'fits' the data well in some sense is tantamount to showing it is statistically adequate. It is not!
  Minimal Principle of Evidence: if a procedure had no capacity to uncover departures from a hypothesis H, then not finding any is poor evidence for H.
  Failing to satisfy so minimal a principle leads to models which, while acceptable according to one's own self-scrutiny, are in fact inadequate and give rise to untrustworthy evidence.