PHIL 6334 - Probability/Statistics Lecture Notes 3:
Estimation (Point and Interval)
Aris Spanos [Spring 2014]
1 Introduction

In this lecture we will consider point estimation in its simplest form by focusing the discussion on simple statistical models, whose generic form is given in table 1.

Table 1 — Simple (generic) Statistical Model
[i] Probability model:  Φ = {f(x; θ), θ∈Θ, x∈ℝ_X}
[ii] Sampling model:    X:=(X₁,…,Xₙ) is a random (IID) sample.
What makes this type of statistical model ‘simple’ is the notion of random (IID) sample.
1.1 Random sample (IID)

The notion of a random sample is defined in terms of the joint distribution of the sample X:=(X₁,X₂,…,Xₙ), say f(x₁,x₂,…,xₙ; θ), for all x:=(x₁,x₂,…,xₙ)∈ℝⁿ_X, by imposing two probabilistic assumptions:
(I) Independence: the sample X is said to be Independent (I) if, for all x∈ℝⁿ_X, the joint distribution splits up into a product of marginal distributions:
f(x; θ) = f₁(x₁; θ₁)·f₂(x₂; θ₂)· ··· ·fₙ(xₙ; θₙ) := ∏ₖ₌₁ⁿ fₖ(xₖ; θₖ).
(ID) Identically Distributed: the sample X is said to be Identically Distributed (ID) if the marginal distributions are identical:
fₖ(xₖ; θₖ) = f(xₖ; θ) for all k=1,2,…,n.
Note that this means two things: the density functions have the same form and the unknown parameters are common to all of them.
For a better understanding of these two crucial probabilistic assumptions we need to simplify the discussion by focusing first on the two random variable (r.v.) case, which we denote by X and Y to avoid subscripts.
First, let us revisit the notion of a random variable in order to motivate the notions of marginal and joint distributions.
Example 5. Tossing a coin twice and noting the outcomes. In this case S={(HH),(HT),(TH),(TT)}, and let us assume that the events of interest are A={(HH),(HT),(TH)} and B={(TT),(HT),(TH)}. Using these two events we can generate the event space of interest F by applying the set-theoretic operations of union (∪), intersection (∩), and complementation (‾). That is, F={S, ∅, A, B, Ā, B̄, A∩B, …}; convince yourself that this will give rise to the set of all subsets of S. Let us define the real-valued functions X(·) and Y(·) on S as follows:
X(HH) = X(HT) = X(TH) = 1,  X(TT) = 0,
Y(TT) = Y(HT) = Y(TH) = 1,  Y(HH) = 0.
Do these two functions define proper r.v.'s with respect to F? To check that, we define all possible events generated by these functions and check whether they belong to F:
{s: X(s)=0}={(TT)}=Ā∈F,  {s: X(s)=1}=A∈F,
{s: Y(s)=0}={(HH)}=B̄∈F,  {s: Y(s)=1}=B∈F.
Hence, both functions do define proper r.v.'s with respect to F. To derive their distributions we assume that we have a fair coin, i.e. each outcome in S has probability .25 of occurring. Hence:
P({s: X(s)=0}) = P(X=0) = .25,  P({s: Y(s)=0}) = P(Y=0) = .25,
P({s: X(s)=1}) = P(X=1) = .75,  P({s: Y(s)=1}) = P(Y=1) = .75.
Hence, their ‘marginal’ density functions take the form:

  x       0    1          y       0    1
  f_X(x) .25  .75         f_Y(y) .25  .75        (1)

How can one define the joint distribution of these two r.v.s?
To define the joint density function we need to specify all the events:
(X=x, Y=y), x∈ℝ_X, y∈ℝ_Y,
denoting ‘their joint occurrence’, and then attach probabilities to these events. These events belong to F by definition, because F, as a field, is closed under the set-theoretic operations ∪, ∩, ‾, so that:
(X=0 ∩ Y=0) = {(TT)}∩{(HH)} = ∅,     f(x=0, y=0) = 0,
(X=0 ∩ Y=1) = {(TT)},                 f(x=0, y=1) = .25,
(X=1 ∩ Y=0) = {(HH)},                 f(x=1, y=0) = .25,
(X=1 ∩ Y=1) = {(HT),(TH)},            f(x=1, y=1) = .50.
Hence, the joint density is defined by:

  y\x    0     1
   0     0    .25
   1    .25   .50        (2)

How is the joint density (2) connected to the individual (marginal) densities given in (1)? It turns out that if we sum across the rows of the above table, i.e. use Σ_{x∈ℝ_X} f(x,y) = f_Y(y) for each value of y, we will get the marginal distribution of Y: f_Y(y), y∈ℝ_Y; and if we sum down the columns, i.e. use Σ_{y∈ℝ_Y} f(x,y) = f_X(x) for each value of x, we will get the marginal distribution of X: f_X(x), x∈ℝ_X:
  y\x      0     1    f_Y(y)
   0       0    .25    .25
   1      .25   .50    .75
  f_X(x)  .25   .75     1         (3)

Note: ()=0(25)+1(75)=75 = ( )
 ()=(0−75)2(25)+(1−75)2(75)=1875 =  ( )
Armed with the joint distribution we can proceed to define the notions of Independence and Identically Distributed between the r.v.'s X and Y.
Independence. Two r.v.'s X and Y are said to be Independent iff:
f(x,y) = f_X(x)·f_Y(y) for all values (x,y)∈ℝ_X×ℝ_Y.        (4)
That is, to verify that these two r.v.'s are independent we need to confirm that the probability of all possible pairs of values (x,y) satisfies (4).
Example. In the case of the joint distribution in (3) we can show that the r.v.'s are not independent because for (x,y)=(0,0):
f(0,0) = 0 ≠ f_X(0)·f_Y(0) = (.25)(.25).
It is important to emphasize that the above condition of Independence is not equivalent to the two random variables being uncorrelated:
Corr(X,Y)=0 ⇏ f(x,y)=f_X(x)·f_Y(y) for all (x,y)∈ℝ_X×ℝ_Y,
where ‘⇏’ denotes ‘does not imply’. This is because Corr(X,Y) is a measure of linear dependence between X and Y, since it is based on the covariance defined by:
Cov(X,Y) = E[(X−E(X))(Y−E(Y))] = (0)(0−.75)(0−.75) + (.25)(0−.75)(1−.75) +
  + (.25)(1−.75)(0−.75) + (.5)(1−.75)(1−.75) = −.0625.
A standardized covariance yields the correlation:
Corr(X,Y) = Cov(X,Y)/√(Var(X)·Var(Y)) = −.0625/.1875 = −1/3.
The intuition underlying this result is that the correlation involves only the first two moments [mean, variance, covariance] of X and Y, but independence is defined in terms of the density functions; the latter, in principle, involve all moments, not just the first two!
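To make the computations behind (1)-(4) concrete, here is a minimal Python sketch (mine, not part of the original notes; the array layout mirrors table (3) and all names are illustrative) that recovers the marginals, checks condition (4), and reproduces the covariance and correlation values above.

```python
import numpy as np

# Joint density f(x, y) from table (3); rows are y = 0, 1 and columns are x = 0, 1.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
x_vals = np.array([0, 1])
y_vals = np.array([0, 1])

f_x = joint.sum(axis=0)   # marginal of X (column sums): [.25, .75]
f_y = joint.sum(axis=1)   # marginal of Y (row sums):    [.25, .75]

# Condition (4): does f(x, y) = f_X(x) * f_Y(y) hold for every pair (x, y)?
independent = np.allclose(joint, np.outer(f_y, f_x))

EX = (x_vals * f_x).sum()                 # E(X)  = .75
EY = (y_vals * f_y).sum()                 # E(Y)  = .75
VX = ((x_vals - EX) ** 2 * f_x).sum()     # Var(X) = .1875
VY = ((y_vals - EY) ** 2 * f_y).sum()     # Var(Y) = .1875
cov = sum(joint[i, j] * (y - EY) * (x - EX)
          for i, y in enumerate(y_vals)
          for j, x in enumerate(x_vals))  # Cov(X, Y) = -.0625
corr = cov / np.sqrt(VX * VY)             # Corr(X, Y) = -1/3

print(independent, cov, corr)             # False -0.0625 -0.333...
```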
Identically Distributed. Two r.v.'s X and Y are said to be Identically Distributed iff:
f_X(x; θ₁) = f_Y(y; θ₂) for all x=y, where ℝ_X=ℝ_Y.        (5)
Example. In the case of the joint distribution in (3) we can show that the r.v.'s are identically distributed because (5) holds. In particular, both r.v.'s X and Y take the same values with the same probabilities.
To shed further light on the notion of IID, consider the
three bivariate distributions given below.
        (A)                         (B)                         (C)
  y\x      1    2   f_Y(y)    y\x      0    1   f_Y(y)    y\x      0    1   f_Y(y)
   0     .18  .42    .6        0     .18  .42    .6        0     .36  .24    .6
   2     .12  .28    .4        1     .12  .28    .4        1     .24  .16    .4
  f_X(x)  .3   .7     1       f_X(x)  .3   .7     1       f_X(x)  .6   .4     1

(I) X and Y are Independent iff:
f(x,y) = f_X(x)·f_Y(y) for all (x,y)∈ℝ_X×ℝ_Y.        (6)
(ID) X and Y are Identically Distributed iff:
f_X(x) = f_Y(y) for all x=y, and ℝ_X=ℝ_Y.
The random variables X and Y are independent in all three cases since they satisfy (4) (verify!).
The random variables in (A) are not Identically Distributed because ℝ_X≠ℝ_Y and f_X(x)≠f_Y(y) for some (x,y)∈ℝ_X×ℝ_Y.
The random variables in (B) are not Identically Distributed because, even though ℝ_X=ℝ_Y, f_X(x)≠f_Y(y) for some (x,y)∈ℝ_X×ℝ_Y.
Finally, the random variables in (C) are Identically Distributed because ℝ_X=ℝ_Y and f_X(x)=f_Y(y) for all (x,y)∈ℝ_X×ℝ_Y.
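Under the same assumptions, a short sketch (again mine, not from the notes) checks Independence and Identical Distribution for the three bivariate distributions (A)-(C):

```python
import numpy as np

# Each entry: (x values, y values, joint density with rows indexed by y, columns by x).
tables = {
    "A": ([1, 2], [0, 2], np.array([[0.18, 0.42], [0.12, 0.28]])),
    "B": ([0, 1], [0, 1], np.array([[0.18, 0.42], [0.12, 0.28]])),
    "C": ([0, 1], [0, 1], np.array([[0.36, 0.24], [0.24, 0.16]])),
}

for name, (x_vals, y_vals, joint) in tables.items():
    f_y = joint.sum(axis=1)   # marginal of Y (row sums)
    f_x = joint.sum(axis=0)   # marginal of X (column sums)
    indep = np.allclose(joint, np.outer(f_y, f_x))        # condition (6)
    ident = (x_vals == y_vals) and np.allclose(f_x, f_y)  # same range and same marginal
    print(f"({name}) Independent: {indep}, Identically Distributed: {ident}")

# Expected: (A) and (B) are independent but not ID; (C) is both.
```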
2 Point Estimation: an overview

It turns out that all forms of frequentist inference, which include point and interval estimation, hypothesis testing and
prediction, are defined in terms of two sets:
X — sample space:
the set of all possible values of the sample X
Θ — parameter space: the set of all possible values of θ
Note that the sample space X is always a subset of ℝⁿ, and is denoted by ℝⁿ_X.

In estimation the objective is to use the statistical information to infer the ‘true’ value θ* of the unknown parameter, whatever that happens to be, as long as it belongs to Θ.
In general, an estimator θ̂ of θ is a mapping (function) from the sample space to the parameter space:
θ̂(·): X → Θ.        (7)

Example 1. Let the statistical model of interest be the simple Bernoulli model (table 2) and consider the question of estimating the unknown parameter θ, whose parameter space is Θ:=[0,1]. Note that the sample space is X:={0,1}ⁿ.

Table 2 - Simple Bernoulli Model
Statistical GM:         Xₖ = θ + uₖ, k∈ℕ
[1] Bernoulli:          Xₖ ∼ Ber(θ, θ(1−θ)), xₖ=0,1
[2] constant mean:      E(Xₖ) = θ, k∈ℕ
[3] constant variance:  Var(Xₖ) = θ(1−θ)
[4] Independence:       {Xₖ, k∈ℕ} is an independent process
The notation θ̂(X) is used to denote an estimator in order to bring out the fact that it is a function of the sample X, and for different values x∈X it generates the sampling distribution f(θ̂(x); θ). Post-data, θ̂(X) yields an estimate θ̂(x₀), which constitutes the particular value of θ̂(X) corresponding to data x₀. Crucial distinction: θ̂(X) is the estimator (Plato's world), θ̂(x₀) the estimate (real world), and θ an unknown constant (Plato's world); Fisher (1922).
In light of the definition in (7), which of the following mappings constitute potential estimators of θ?

Table 3: Estimators of θ?
[a] θ̂₁(X) = Xₙ
[b] θ̂₂(X) = X₁ − Xₙ
[c] θ̂₃(X) = (X₁ + Xₙ)/2
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ, for some n > 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ

Do the mappings [a]-[e] in table 3 constitute estimators of θ? All five functions [a]-[e] have X as their domain, but is the range of each mapping a subset of Θ:=[0,1]? Mappings [a], [c]-[e] can be possible estimators of θ because their ranges are subsets of [0,1], but [b] cannot, because it can take the value −1 [ensure you understand why!], which lies outside the parameter space of θ.
One can easily think of many more functions from X to Θ that will qualify as possible estimators of θ. Given the plethora of such possible estimators, how does one decide which one is the most appropriate?
To answer that question, let us think about the possibility of an ideal estimator, θ*(·): X → θ*, i.e., θ*(x)=θ* for all values x∈X. That is, θ*(X) pinpoints the true value θ* of θ whatever the data. A moment's reflection reveals that no such estimator could exist, because X is a random vector with its own distribution f(x; θ) for all x∈X. Moreover, in view of the randomness of X, any mapping of the form (7) will be a random variable with its own sampling distribution, f(θ̂(x); θ), which is directly derivable from f(x; θ).
Let us keep track of these distributions and where they come from. The distribution of the sample, f(x; θ) for all x∈X, is given by the assumptions of the statistical model in question.
In the above case of the simple Bernoulli model, we can combine assumptions [2]-[4] to give us:
f(x; θ) = ∏ₖ₌₁ⁿ f(xₖ; θ),
and then use [1]: f(xₖ; θ) = θ^xₖ (1−θ)^(1−xₖ), xₖ=0,1, k=1,2,…,n, to determine f(x; θ):
f(x; θ) = ∏ₖ₌₁ⁿ f(xₖ; θ) = θ^(Σₖ₌₁ⁿ xₖ) (1−θ)^(n−Σₖ₌₁ⁿ xₖ) = θ^y (1−θ)^(n−y),
where y = Σₖ₌₁ⁿ xₖ, and one can show that:
Y = Σₖ₌₁ⁿ Xₖ ∼ Bin(nθ, nθ(1−θ)),        (8)
i.e. Y is Binomially distributed. Note that the means and variances of such linear functions are derived using the two formulae (for independent random variables):
(i) E(a₁X₁ + a₂X₂ + ⋯ + aₙXₙ) = a₁E(X₁) + a₂E(X₂) + ⋯ + aₙE(Xₙ),
(ii) Var(a₁X₁ + a₂X₂ + ⋯ + aₙXₙ) = a₁²Var(X₁) + a₂²Var(X₂) + ⋯ + aₙ²Var(Xₙ).        (9)
To derive the mean and variance of Y:
(i) E(Y) = E(Σₖ₌₁ⁿ Xₖ) = Σₖ₌₁ⁿ E(Xₖ) = Σₖ₌₁ⁿ θ = nθ,
(ii) Var(Y) = Var(Σₖ₌₁ⁿ Xₖ) = Σₖ₌₁ⁿ Var(Xₖ) = Σₖ₌₁ⁿ θ(1−θ) = nθ(1−θ).

The result in (8) is a special case of a general result.
The sampling distribution of any (well-behaved) function of the sample, say h = g(X₁, X₂, …, Xₙ), can be derived from f(x; θ), x∈X, using the formula:
F(h) = P(g(X) ≤ h) = ∫⋯∫_{x: g(x)≤h} f(x; θ) dx, h∈ℝ.        (10)
In the Bernoulli case, all the estimators [a], [c]-[e] are linear functions of (X₁,X₂,…,Xₙ) and thus, by (8), their distribution is Binomial. In particular:
Table 4: Estimators and their sampling distributions
[a] θ̂₁(X) = Xₙ ∼ Ber(θ, θ(1−θ))
[c] θ̂₃(X) = (X₁+Xₙ)/2 ∼ Bin(θ, θ(1−θ)/2)
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ Bin(θ, θ(1−θ)/n), for n > 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ ∼ Bin(nθ/(n+1), nθ(1−θ)/(n+1)²)        (11)

It is important to emphasize at the outset that the sampling distributions [a]-[e] are evaluated under θ=θ*, where θ* is the true value of θ.
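The following sketch (mine, not from the notes; the estimator labels follow table 4) simulates the sampling distributions of [a], [c], [d] and [e] and compares their means and variances with the values implied by (11).

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 10, 0.5, 100_000
X = rng.binomial(1, theta, size=(reps, n))

estimators = {
    "[a] X_n":          X[:, -1],
    "[c] (X_1+X_n)/2":  (X[:, 0] + X[:, -1]) / 2,
    "[d] sample mean":  X.mean(axis=1),
    "[e] sum/(n+1)":    X.sum(axis=1) / (n + 1),
}
for name, draws in estimators.items():
    print(f"{name:16s} mean={draws.mean():.4f}  var={draws.var():.5f}")

# For theta=.5 and n=10, (11) implies means .5, .5, .5, .4545 and
# variances .25, .125, .025, .02066 respectively.
```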
It is clear that none of the sampling distributions of the estimators in table 4 resembles that of the ideal estimator, θ*(X), whose sampling distribution, if it existed, would be of the form:
[i] P(θ*(X) = θ*) = 1.        (12)
In terms of its first two moments, the ideal estimator satisfies:
[ii] E(θ*(X)) = θ* and [iii] Var(θ*(X)) = 0.
In contrast to the (infeasible) ideal estimator in (12), when the estimators in table 4 infer θ using an outcome x, the inference is always subject to some error because the variance is not zero. The sampling distributions of these estimators provide the basis for evaluating such errors.
In the statistics literature the evaluation of inferential
errors in estimation is accomplished in two interconnected
stages.
The objective of the first stage is to narrow down the set of all possible estimators of θ to an optimal subset, where optimality is assessed by how closely the sampling distribution of an estimator approximates that of the ideal estimator in (12); the subject matter of section 3.
The second stage is concerned with using optimal estimators to construct the shortest Confidence Intervals (CIs) for the unknown parameter θ, based on prespecifying the error of covering (encompassing) θ* within a random interval of the form (L(X), U(X)); the subject matter of section 4.
3 Properties of point estimators

As mentioned above, the notion of an optimal estimator can be motivated by how well the sampling distribution of an estimator θ̂(X) approximates that of the ideal estimator in (12). In particular, the three features [i]-[iii] of the ideal estimator motivate the following optimal properties of feasible estimators.
Condition [ii] motivates the property known as:
[I] Unbiasedness: An estimator θ̂(X) is said to be an unbiased estimator of θ if:
E(θ̂(X)) = θ*.        (13)
That is, the mean of the sampling distribution of θ̂(X) coincides with the true value of the unknown parameter θ.
Example. In the case of the simple Bernoulli model, we can see from table 4 that the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ(X) are unbiased, since in all three cases (13) is satisfied. In contrast, the estimator θ̂ₙ₊₁(X) is not unbiased because:
E(θ̂ₙ₊₁(X)) = nθ/(n+1) ≠ θ.
Condition [iii] motivates the property known as:
[II] Full Efficiency: An unbiased estimator θ̂(X) is said to be a fully efficient estimator of θ if its variance is as small as it can be, where the latter is expressed by:
Var(θ̂(X)) = CR(θ) := [E(−∂² ln f(x; θ)/∂θ²)]⁻¹,
where ‘CR(θ)’ stands for the Cramer-Rao lower bound; note that f(x; θ) is given by the assumed model.
Example (the derivations are not important!). In the case of the simple Bernoulli model:
ln f(x; θ) = y ln θ + (n−y) ln(1−θ), where y = Σₖ₌₁ⁿ xₖ and E(Y) = nθ,
∂ ln f(x; θ)/∂θ = y(1/θ) − (n−y)(1/(1−θ)),
∂² ln f(x; θ)/∂θ² = −y(1/θ²) − (n−y)(1/(1−θ))²,
E(−∂² ln f(x; θ)/∂θ²) = (1/θ²)(nθ) + [n − nθ](1/(1−θ))² = n/[θ(1−θ)],
and thus the Cramer-Rao lower bound is:
CR(θ) := [n/(θ(1−θ))]⁻¹ = θ(1−θ)/n.
Looking at the estimators of θ in table 4, it is clear that only one unbiased estimator achieves that bound: θ̂ₙ(X), since Var(θ̂ₙ(X)) = θ(1−θ)/n = CR(θ). Hence, θ̂ₙ(X) is the only estimator of θ which is both unbiased and fully efficient.
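A quick numerical check (a sketch of mine, not from the notes): the simulated variance of θ̂ₙ(X) should match the Cramer-Rao lower bound θ(1−θ)/n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 25, 0.4, 200_000

X = rng.binomial(1, theta, size=(reps, n))
theta_hat_n = X.mean(axis=1)          # estimator [d] of table 4

cr_bound = theta * (1 - theta) / n    # CR(theta) derived above
print("simulated Var(theta_hat_n):", theta_hat_n.var())
print("Cramer-Rao lower bound:    ", cr_bound)
```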
Comparisons between unbiased estimators can be made in terms of relative efficiency:
Var(θ̂₃(X)) < Var(θ̂₁(X)) for n ≥ 2,
asserting that θ̂₃(X) is relatively more efficient than θ̂₁(X). One needs to be careful with such comparisons, however, because they can be very misleading when both estimators are bad, as in the case above; the fact that θ̂₃(X) is relatively more efficient than θ̂₁(X) does not mean that the former is even an adequate estimator. Hence, relative efficiency is not something to write home about!
What renders these two estimators practically useless? An asymptotic property motivated by condition [i] of the ideal estimator, known as consistency.
Intuitively, an estimator θ̂ₙ(X) is consistent when its precision (how close it is to θ*) improves as the sample size n increases. Condition [i] of the ideal estimator motivates the property known as:
[III] Consistency: an estimator θ̂ₙ(X) is consistent if:
Strong: P(limₙ→∞ θ̂ₙ(X) = θ*) = 1,
Weak: limₙ→∞ P(|θ̂ₙ(X) − θ*| ≤ ε) = 1, for any ε > 0.        (14)
That is, an estimator θ̂ₙ(X) is consistent if it approximates (probabilistically) the sampling distribution of the ideal estimator asymptotically, as n → ∞. The difference between strong and weak consistency stems from the form of probabilistic convergence they involve, with the former being stronger than the latter. Both of these properties constitute an extension of the Strong and Weak Law of Large Numbers (LLN), which hold for the sample mean X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ of a process {Xₖ, k=1,2,…,n,…} under certain probabilistic assumptions, the most restrictive being that the process is IID; see Spanos (1999), ch. 8.
[Fig. 1: t-plot of X̄ₙ for a Bernoulli IID realization with n=200.]
[Fig. 2: t-plot of X̄ₙ for a Bernoulli IID realization with n=1000.]
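The two figures can be reproduced with a short sketch along the following lines (mine, not from the notes; matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
theta = 0.5

for n in (200, 1000):
    x = rng.binomial(1, theta, size=n)
    running_mean = np.cumsum(x) / np.arange(1, n + 1)   # sample average after t draws
    plt.plot(running_mean, label=f"n={n}")

plt.axhline(theta, linestyle="--", color="grey")
plt.xlabel("Index")
plt.ylabel("Sample average")
plt.legend()
plt.show()
```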

In practice, it is non-trivial to prove that a particular estimator is or is not consistent by verifying directly the conditions in (14). However, there is often a short-cut for verifying consistency in the case of unbiased estimators, using the sufficient condition:
limₙ→∞ Var(θ̂ₙ(X)) = 0.        (15)
Example. In the case of the simple Bernoulli model, one can verify that the estimators θ̂₁(X) and θ̂₃(X) are inconsistent because:
limₙ→∞ Var(θ̂₁(X)) = θ(1−θ) ≠ 0,  limₙ→∞ Var(θ̂₃(X)) = θ(1−θ)/2 ≠ 0,
i.e. their variances do not decrease to zero as the sample size n goes to infinity.
In contrast, the estimators θ̂ₙ(X) and θ̂ₙ₊₁(X) are consistent because:
limₙ→∞ Var(θ̂ₙ(X)) = limₙ→∞ (θ(1−θ)/n) = 0,  limₙ→∞ MSE(θ̂ₙ₊₁(X)) = 0.
Note that ‘MSE’ denotes the ‘Mean Square Error’, defined by:
MSE(θ̂; θ*) = Var(θ̂) + [B(θ̂; θ*)]²,
where B(θ̂; θ*) = E(θ̂) − θ* is the bias. Hence:
limₙ→∞ MSE(θ̂) = 0 if (a) limₙ→∞ Var(θ̂) = 0 and (b) limₙ→∞ E(θ̂) = θ*,
where (b) is equivalent to limₙ→∞ B(θ̂; θ*) = 0.
Let us take stock of the above properties and how they can be used by the practitioner in deciding which estimator is optimal. The property which defines minimal reliability for an estimator is that of consistency. Intuitively, consistency indicates that as the sample size increases [as n → ∞] the estimator θ̂ₙ(X) approaches θ*, the true value of θ, in some probabilistic sense: convergence almost surely or convergence in probability. Hence, if an estimator θ̂ₙ(X) is not consistent, it is automatically excluded from the subset of potentially optimal estimators, irrespective of any other properties this estimator might enjoy. In particular, an unbiased estimator which is inconsistent is practically useless. On the other hand, just because an estimator θ̂ₙ(X) is consistent does not imply that it's a ‘good’ estimator; it only implies that it's minimally acceptable.
It is important to emphasize that the properties of unbiasedness and full efficiency hold for any sample size n > 1, and thus we call them finite sample properties, but consistency is an asymptotic property because it holds as n → ∞.
Example. In the case of the simple Bernoulli model, if the choice between estimators is confined (artificially) to the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ₊₁(X), the latter estimator should be chosen, despite being biased, because it's a consistent estimator of θ. On the other hand, among the estimators given in table 4, θ̂ₙ(X) is clearly the best (most optimal) because it satisfies all three properties. In particular, θ̂ₙ(X) not only satisfies the minimal property of consistency, but it also has the smallest variance possible, which means that it comes closer to the ideal estimator than any of the others, for any sample size n > 2. The sampling distribution of θ̂ₙ(X), when evaluated under θ=θ*, takes the form:
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ Bin(θ*, θ*(1−θ*)/n),        (16)
whatever the ‘true’ value θ* happens to be.

Additional Asymptotic properties
In addition to the properties of estimators mentioned above, there are certain other properties which are often used in practice to decide on the optimality of an estimator. The most important is given below for completeness.
[V] Asymptotic Normality: an estimator θ̂ₙ(X) is said to be asymptotically Normal if:
√n(θ̂ₙ(X) − θ) ∼ₐ N(0, V∞(θ)), V∞(θ) ≠ 0,        (17)
where ‘∼ₐ’ stands for ‘can be asymptotically approximated by’.
This property is an extension of a well-known result in probability theory: the Central Limit Theorem (CLT). The CLT asserts that, under certain probabilistic assumptions on the process {Xₖ, k=1,2,…,n,…}, the most restrictive being that the process is IID, the sampling distribution of X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ for a ‘large enough’ n can be approximated by the Normal distribution (Spanos, 1999, ch. 8):
(X̄ₙ − E(X̄ₙ))/√Var(X̄ₙ) ∼ₐ N(0, 1).        (18)
Note that the important difference between (17) and (18) is that θ̂ₙ(X) in the former does not have to coincide with X̄ₙ; it can be any well-behaved function g(X) of the sample X.
Example. In the case of the simple Bernoulli model, the sampling distribution of θ̂ₙ(X), which we know is Binomial (see (16)), can also be approximated using (18). In the graphs below we compare the Normal approximation to the Binomial for n=10 and n=20 in the case where θ=.5, and the improvement is clearly noticeable.
[Figure: Normal approximation of the Binomial, f(y; θ=.5, n=10) and f(y; θ=.5, n=20), each overlaid with the approximating Normal density.]
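The comparison in the figure can be reproduced with a sketch like the following (mine, not from the notes; scipy and matplotlib are assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

p = 0.5
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, n in zip(axes, (10, 20)):
    y = np.arange(0, n + 1)
    ax.bar(y, binom.pmf(y, n, p), alpha=0.5, label=f"Bin(n={n}, p={p})")
    grid = np.linspace(0, n, 400)
    ax.plot(grid, norm.pdf(grid, loc=n * p, scale=np.sqrt(n * p * (1 - p))),
            label="Normal approximation")
    ax.legend()
plt.show()
```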
4 Confidence Intervals (CIs): an overview

4.1 An optimal CI begins with an optimal point estimator

Example 2. Let us summarize the discussion concerning point estimation by briefly discussing the simple (one parameter) Normal model, where σ²=1 (table 5).

Table 5 - Simple Normal Model (one unknown parameter)
Statistical GM:         Xₖ = μ + uₖ, k∈ℕ={1,2,…}
[1] Normality:          Xₖ ∼ N(μ, σ²), xₖ∈ℝ
[2] Constant mean:      E(Xₖ) = μ, k∈ℕ
[3] Constant variance:  Var(Xₖ) = σ² (known)
[4] Independence:       {Xₖ, k∈ℕ} is an independent process

In section 3 we discussed the question of choosing among numerous possible estimators of μ, such as [a]-[e] (table 6), using their sampling distributions. These results stem from the following theorem. If X:=(X₁,X₂,…,Xₙ) is a random (IID) sample from the Normal distribution, i.e.
Xₖ ∼ NIID(μ, σ²), k∈ℕ:=(1,2,…,n,…),
then the sampling distribution of any linear function Σₖ₌₁ⁿ aₖXₖ is:
Σₖ₌₁ⁿ aₖXₖ ∼ N(μ Σₖ₌₁ⁿ aₖ, σ² Σₖ₌₁ⁿ aₖ²).        (19)
Among the above estimators the sample mean, for σ²=1:
μ̂ₙ(X) := X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n),
constitutes the optimal point estimator of μ because it is:
[U] Unbiased (E(X̄ₙ) = μ*),
[FE] Fully Efficient (Var(X̄ₙ) = CR(μ)), and
[SC] Strongly Consistent (P(limₙ→∞ X̄ₙ = μ*) = 1).
Table 6: Estimators and their properties
                                                          UN  FE  SC
[a] μ̂₁(X) = Xₙ ∼ N(μ, 1)                                  ✓   ✗   ✗
[b] μ̂₂(X) = X₁ − Xₙ ∼ N(0, 2)                             ✗   ✗   ✗
[c] μ̂₃(X) = (X₁+Xₙ)/2 ∼ N(μ, 1/2)                         ✓   ✗   ✗
[d] μ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n)                     ✓   ✓   ✓
[e] μ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ ∼ N(nμ/(n+1), n/(n+1)²)   ✗   ✗   ✓

Given that any ‘decent’ estimator μ̂(X) of μ is likely to yield any value in the interval (−∞, ∞), can one say something more about its reliability than just that, "on average", its values μ̂(x) for x∈X are more likely to occur around μ* (the true value) than those further away?
4.2 What is a Confidence Interval?

This is what a Confidence Interval (CI) proposes to address. In general, a (1−α) CI for μ takes the generic form:
P(L(X) ≤ μ* ≤ U(X)) = 1−α,
where L(X) and U(X) denote the lower and upper (random) bounds of this CI. The (1−α) is referred to as the confidence level and represents the coverage probability of the CI:
CI(X; α) = (L(X), U(X)),
in the sense that the probability that the random interval CI(X; α) covers (overlays) the true μ* is equal to (1−α).
This is often envisioned in terms of a long-run metaphor of repeating the experiment underlying the statistical model in question in order to get a sequence of outcomes (realizations of X) xᵢ, i=1,2,…,N, each of which will yield an observed CI(xᵢ; α). In the context of this metaphor, (1−α) denotes the relative frequency of the observed CIs that will include (overlay) μ*.
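The long-run metaphor can be illustrated with a small simulation (a sketch of mine, not from the notes): draw many samples from the model of table 5, form the .95 CI in each, and record how often the realized intervals cover the true μ*.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, n, reps = 0.0, 100, 10_000
z = 1.96                                  # c_{alpha/2} for alpha = .05

covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, 1.0, size=n)  # a sample from the model of table 5
    xbar = x.mean()
    lower, upper = xbar - z / np.sqrt(n), xbar + z / np.sqrt(n)
    covered += (lower <= mu_true <= upper)

print("empirical coverage:", covered / reps)   # should be close to .95
```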
Example 2. In the case of the simple (one parameter) Normal model (table 5), let us consider the question of constructing .95 CIs using the different unbiased estimators of μ in table 6:
[a] P(μ̂₁(X) − 1.96 ≤ μ* ≤ μ̂₁(X) + 1.96) = .95,
[c] P(μ̂₃(X) − 1.96(1/√2) ≤ μ* ≤ μ̂₃(X) + 1.96(1/√2)) = .95,
[d] P(X̄ₙ − 1.96(1/√n) ≤ μ* ≤ X̄ₙ + 1.96(1/√n)) = .95.        (20)
How do these CIs differ? The answer is in terms of their precision (accuracy). One way to measure precision for CIs is to evaluate their length:
[a]: 2(1.96) = 3.92,  [c]: 2(1.96(1/√2)) = 2.772,  [d]: 2(1.96(1/√n)) = 3.92/√n.
It is clear from this evaluation that the CI associated with X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ is the shortest for any n > 2; e.g. for n=100 the length of this CI is 3.92/√100 = .392.
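A short sketch (mine, not from the notes) tabulating the three lengths as n grows makes the point numerically: only the CI based on the sample mean shrinks with n.

```python
import numpy as np

z = 1.96
for n in (4, 25, 100, 400):
    length_a = 2 * z                # CI [a], based on a single observation
    length_c = 2 * z / np.sqrt(2)   # CI [c], based on (X_1 + X_n)/2
    length_d = 2 * z / np.sqrt(n)   # CI [d], based on the sample mean
    print(f"n={n:4d}  [a]={length_a:.3f}  [c]={length_c:.3f}  [d]={length_d:.3f}")
```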

20
∗
1.
` − − − −  − − − − a
2.
` − − − − − − − − −− a
3.
` − − − −  − − − − a
4.
` − − − −  − − − − a
5.
` − − − −  − − − − a
6.
` − − − −  − − − − a
7.
` − − − −  − − − − a
8.
` − − − −  − − − − a
9.
` − − − −  − − − − a
10.
` − − − −  − − − − a
11.
` − − − −  − − − − a
12.
` − − − −  − − − − a
13.I ` − − −−− − −− a
14.
` − − − −  − − − − a
15.
` − − − −  − − − − a
16.
` − − − −  − − − − a
17.
` − − − − − − − − −− a
18.
` − − − −  − − − − a
19.
` − − − −  − − − − a
20.
` − − − −  − − − − a
21
4.3 Constructing Confidence Intervals (CIs)

More generally, the sampling distribution of the optimal estimator X̄ₙ gives rise to a pivot (a function of the sample and μ whose distribution is known); evaluated under μ=μ*:
√n(X̄ₙ − μ) ∼ N(0, 1),        (21)
which can be used to construct the shortest CI among all (1−α) CIs for μ:
P(X̄ₙ − c_{α/2}(1/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(1/√n)) = 1−α,        (22)
where c_{α/2} is defined by P(|Z| ≤ c_{α/2}) = 1−α for Z ∼ N(0, 1) (figures 1-2).
Example 3. In the case where σ² is unknown, and we use s² = [1/(n−1)]Σₖ₌₁ⁿ(Xₖ − X̄ₙ)² to estimate it, the pivot in (21) takes the form (evaluated under μ=μ*):
√n(X̄ₙ − μ)/s ∼ St(n−1),        (23)
where St(n−1) denotes the Student's t distribution with (n−1) degrees of freedom.
Step 1. Attach a (1−α) coverage probability using (23):
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = 1−α,
where c_{α/2} is defined by P(|τ| ≤ c_{α/2}) = 1−α for τ ∼ St(n−1).
Step 2. Re-arrange √n(X̄ₙ − μ)/s to isolate μ and derive the CI:
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = P(−c_{α/2}(s/√n) ≤ X̄ₙ − μ ≤ c_{α/2}(s/√n)) =
= P(−X̄ₙ − c_{α/2}(s/√n) ≤ −μ ≤ −X̄ₙ + c_{α/2}(s/√n)) =
= P(X̄ₙ − c_{α/2}(s/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(s/√n)) = 1−α.
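The Student's t CI derived in Step 2 can be computed for a concrete sample with a sketch like the following (mine, not from the notes; scipy supplies the quantile c_{α/2}):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(5)
n, mu_true, sigma, alpha = 20, 0.0, 1.0, 0.05

x = rng.normal(mu_true, sigma, size=n)
xbar = x.mean()
s = x.std(ddof=1)                       # s^2 = [1/(n-1)] * sum (x_k - xbar)^2
c = t.ppf(1 - alpha / 2, df=n - 1)      # c_{alpha/2} for St(n-1)

lower, upper = xbar - c * s / np.sqrt(n), xbar + c * s / np.sqrt(n)
print(f"observed {100 * (1 - alpha):.0f}% CI: ({lower:.3f}, {upper:.3f})")
```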

In figures 1-2 the underlying distribution is Normal and in figures 3-4 it is Student's t with 19 degrees of freedom. One can see that, while the tail areas are the same for each α, the threshold values c_{α/2} for the Normal are smaller than the corresponding values c*_{α/2} for the Student's t, because the latter has heavier tails due to the randomness of s².
[Fig. 1: thresholds ±1.96 for Z ∼ N(0,1), P(|Z| ≤ 1.96) = .95.  Fig. 2: thresholds ±1.64 for Z ∼ N(0,1), P(|Z| ≤ 1.64) = .90.
Fig. 3: thresholds ±2.09 for τ ∼ St(19), P(|τ| ≤ 2.09) = .95.  Fig. 4: thresholds ±1.72 for τ ∼ St(19), P(|τ| ≤ 1.72) = .90.]
5 Summary and conclusions

The primary objective in frequentist estimation is to learn about θ*, the true value of the unknown parameter θ of interest, using its sampling distribution f(θ̂; θ*) associated with a particular sample size n. The finite sample properties are defined directly in terms of f(θ̂; θ*), and the asymptotic properties are defined in terms of the asymptotic sampling distribution f∞(θ̂; θ*), aiming to approximate f(θ̂; θ*) at the limit as n → ∞.
The question that needs to be considered at this stage is:
what combination of the above mentioned
properties specifies an ‘optimal’ estimator?
A necessary but minimal property for an estimator is consistency (preferably strong). By itself, however, consistency does not secure learning from data for a given n; it's a promissory note for potential learning. Hence, for actual learning one needs to supplement consistency with certain finite sample properties, like unbiasedness and efficiency, to ensure that learning can take place with the particular data x₀:=(x₁,x₂,…,xₙ) of sample size n.
Among finite sample properties, full efficiency is clearly the most important because it secures the highest degree of learning for a given n, since it offers the best possible precision.
Relative efficiency, although desirable, needs to be investigated further to find out how large the class of estimators being compared is before passing judgement. Being the best econometrician in my family, although worthy of something, does not make me a good econometrician!!
Unbiasedness, although desirable, is not considered indispensable by itself. Indeed, as shown above, an unbiased
but inconsistent estimator is practically useless, and a consistent but biased estimator is always preferable.
Hence, a consistent, unbiased and fully efficient estimator sets the gold standard in estimation.
In conclusion, it is important to emphasize that point estimation is often considered inadequate for the purposes of scientific inquiry because a ‘good’ point estimator θ̂(X), by itself, does not provide any measure of the reliability and precision associated with the estimate θ̂(x₀); one would be wrong to assume that θ̂(x₀) ≃ θ*. This is the reason why θ̂(x₀) is often accompanied by its standard error [the estimated standard deviation √Var(θ̂(X))] or the p-value of some test of significance associated with the generic hypothesis θ=0.
Interval estimation rectifies this weakness of point estimation by providing the relevant error probabilities associated with inferences pertaining to ‘covering’ the true value θ* of θ.


More Related Content

PDF
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
PDF
6334 Day 3 slides: Spanos-lecture-2
PDF
Spurious correlation (updated)
PDF
Probability/Statistics Lecture Notes 4: Hypothesis Testing
PDF
An Introduction to Mis-Specification (M-S) Testing
PDF
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
PDF
A. spanos slides ch14-2013 (4)
PPTX
Diagnostic methods for Building the regression model
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
6334 Day 3 slides: Spanos-lecture-2
Spurious correlation (updated)
Probability/Statistics Lecture Notes 4: Hypothesis Testing
An Introduction to Mis-Specification (M-S) Testing
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
A. spanos slides ch14-2013 (4)
Diagnostic methods for Building the regression model

What's hot (20)

PPTX
Chap09 hypothesis testing
PPT
Chapter11
PPT
Chapter13
PPT
PDF
Mayo Slides: Part I Meeting #2 (Phil 6334/Econ 6614)
PPT
PPTX
Statistical computing 1
PPTX
Statistical computing2
PPTX
Chap08 estimation additional topics
PPT
Chapter11
PPT
Chapter14
PPT
Chapter15
PPT
Chapter14
PDF
Sample sample distribution
PDF
Applied Business Statistics ,ken black , ch 6
PDF
Testing as estimation: the demise of the Bayes factor
PDF
Applied Business Statistics ,ken black , ch 3 part 2
PPTX
Mean, variance, and standard deviation of a Discrete Random Variable
PDF
2 random variables notes 2p3
PDF
HW1 MIT Fall 2005
Chap09 hypothesis testing
Chapter11
Chapter13
Mayo Slides: Part I Meeting #2 (Phil 6334/Econ 6614)
Statistical computing 1
Statistical computing2
Chap08 estimation additional topics
Chapter11
Chapter14
Chapter15
Chapter14
Sample sample distribution
Applied Business Statistics ,ken black , ch 6
Testing as estimation: the demise of the Bayes factor
Applied Business Statistics ,ken black , ch 3 part 2
Mean, variance, and standard deviation of a Discrete Random Variable
2 random variables notes 2p3
HW1 MIT Fall 2005
Ad

Similar to Spanos lecture+3-6334-estimation (20)

PDF
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
PDF
Fuzzy Group Ideals and Rings
PDF
Statistical Hydrology for Engineering.pdf
PDF
ISI MSQE Entrance Question Paper (2006)
PDF
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
PDF
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
PDF
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
PDF
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
PDF
The Weak Solution of Black-Scholes Option Pricing Model with Transaction Cost
PDF
Nbhm m. a. and m.sc. scholarship test 2006
PDF
Estimation rs
PDF
Ichimura 1993: Semiparametric Least Squares (non-technical)
PPTX
Probability Distribution
PDF
A PROBABILISTIC ALGORITHM FOR COMPUTATION OF POLYNOMIAL GREATEST COMMON WITH ...
PDF
A PROBABILISTIC ALGORITHM FOR COMPUTATION OF POLYNOMIAL GREATEST COMMON WITH ...
PDF
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...
PDF
Engr 371 final exam april 1996
PDF
ISI MSQE Entrance Question Paper (2008)
PDF
chap2.pdf
PDF
Litv_Denmark_Weak_Supervised_Learning.pdf
Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard ...
Fuzzy Group Ideals and Rings
Statistical Hydrology for Engineering.pdf
ISI MSQE Entrance Question Paper (2006)
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
THE WEAK SOLUTION OF BLACK-SCHOLE’S OPTION PRICING MODEL WITH TRANSACTION COST
The Weak Solution of Black-Scholes Option Pricing Model with Transaction Cost
Nbhm m. a. and m.sc. scholarship test 2006
Estimation rs
Ichimura 1993: Semiparametric Least Squares (non-technical)
Probability Distribution
A PROBABILISTIC ALGORITHM FOR COMPUTATION OF POLYNOMIAL GREATEST COMMON WITH ...
A PROBABILISTIC ALGORITHM FOR COMPUTATION OF POLYNOMIAL GREATEST COMMON WITH ...
A Probabilistic Algorithm for Computation of Polynomial Greatest Common with ...
Engr 371 final exam april 1996
ISI MSQE Entrance Question Paper (2008)
chap2.pdf
Litv_Denmark_Weak_Supervised_Learning.pdf
Ad

More from jemille6 (20)

PDF
What is the Philosophy of Statistics? (and how I was drawn to it)
PDF
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
PDF
Severity as a basic concept in philosophy of statistics
PDF
“The importance of philosophy of science for statistical science and vice versa”
PDF
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
PDF
D. Mayo JSM slides v2.pdf
PDF
reid-postJSM-DRC.pdf
PDF
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
PDF
Causal inference is not statistical inference
PDF
What are questionable research practices?
PDF
What's the question?
PDF
The neglected importance of complexity in statistics and Metascience
PDF
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
PDF
On Severity, the Weight of Evidence, and the Relationship Between the Two
PDF
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
PDF
Comparing Frequentists and Bayesian Control of Multiple Testing
PPTX
Good Data Dredging
PDF
The Duality of Parameters and the Duality of Probability
PDF
Error Control and Severity
PDF
The Statistics Wars and Their Causalities (refs)
What is the Philosophy of Statistics? (and how I was drawn to it)
Mayo, DG March 8-Emory AI Systems and society conference slides.pdf
Severity as a basic concept in philosophy of statistics
“The importance of philosophy of science for statistical science and vice versa”
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
D. Mayo JSM slides v2.pdf
reid-postJSM-DRC.pdf
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Causal inference is not statistical inference
What are questionable research practices?
What's the question?
The neglected importance of complexity in statistics and Metascience
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
On Severity, the Weight of Evidence, and the Relationship Between the Two
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Comparing Frequentists and Bayesian Control of Multiple Testing
Good Data Dredging
The Duality of Parameters and the Duality of Probability
Error Control and Severity
The Statistics Wars and Their Causalities (refs)

Recently uploaded (20)

PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Complications of Minimal Access Surgery at WLH
PDF
RMMM.pdf make it easy to upload and study
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
01-Introduction-to-Information-Management.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
master seminar digital applications in india
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
A systematic review of self-coping strategies used by university students to ...
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
Microbial disease of the cardiovascular and lymphatic systems
human mycosis Human fungal infections are called human mycosis..pptx
Complications of Minimal Access Surgery at WLH
RMMM.pdf make it easy to upload and study
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Anesthesia in Laparoscopic Surgery in India
Final Presentation General Medicine 03-08-2024.pptx
01-Introduction-to-Information-Management.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
master seminar digital applications in india
Microbial diseases, their pathogenesis and prophylaxis
O5-L3 Freight Transport Ops (International) V1.pdf
A systematic review of self-coping strategies used by university students to ...
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape

Spanos lecture+3-6334-estimation

  • 1. PHIL 6334 - Probability/Statistics Lecture Notes 3: Estimation (Point and Interval) Aris Spanos [Spring 2014] 1 Introduction In this lecture we will consider point estimation in its simplest form by focusing the discussion on simple statistical models, whose generic form is give in table 1. Table 1 — Simple (generic) Statistical Model [i] Probability model: Φ={ (; θ) θ∈Θ ∈R }  [ii] Sampling model: X:=(1  ) is a random (IID) sample. What makes this type of statistical model ‘simple’ is the notion of random (IID) sample. 1.1 Random sample (IID) The notion of a random sample is defined in terms of the joint distribution of the sample X:=(1 2  ) say  (1 2  ; θ) for all x:=(1 2  )∈R  by imposing two probabilistic  assumptions: (I) Independence: the sample X is said to be Independent (I) if, for all x∈R  the joint distribution splits up into  a product of marginal distributions: Q  (x; θ)=1(1; θ1)·2(2; θ2)· · · · (; θ):= =1  ( ; θ ) (ID) Identically Distributed: the sample X is said to be Identically Distributed (ID) if the marginal distributions 1
  • 2. are identical:  ( ; θ )= ( ; θ) for all =1 2   Note that this means two things, the density functions have the same form and the unknown parameters are common to all of them. For a better understanding of these two crucial probabilistic assumptions we need to simplify the discussion by focusing first on the two r.v. variable case, which we denote by  and  to avoid subscripts. First, let us revisit the notion of a random variable in order to motive the notions of marginal and joint distributions. Example 5. Tossing a coin twice and noting the outcome. In this case ={( ) (  ) ( ) (  )} and let us assume that the events of interest are ={() ( ) ( )} and ={(  ) ( ) ( )} Using these two events we can generate the event space of interest F by applying the set theoretic operations of union (∪), intersection (∩), and complementation (−). That is, F={ ∅      ∩  }; convince yourself that this will give rise to the set of all subsets of . Let us define the real-valued functions () and  () on  as follows: ()=( )=( )=1 (  )=0  ( ) = ( )= (  )=1  () =0 Do these two functions define proper r.v.s with respect to F To check that we define all possible events generated by these functions and check whether they belong to F: {:()=0}={(  )}=∈F {:()=1}=∈F 2
  • 3. {: ()=0}={()}=∈F {: ()=1}=∈F Hence, both functions do define proper r.v’s with respect to F To derive their distributions we assume that we have a fair coin, i.e. each event in  has probability .25 of occurring. Hence, both functions do define proper r.v’s with respect to F To derive their distributions we assume that we have a fair coin, i.e. each event in  has probability .25 of occurring. {:()=0}=  (=0)=25 {: ()=0}=  (=0)=25 {:()=1}=  (=1)=75 {: ()=1}=  (=1)=75 Hence, their ‘marginal’ density functions take the form:  0 1  () 25 75  0 1  () 25 75 (1) How can one define the joint distribution of these two r.v.s? To define the joint density function we need to specify all the events: (=  =) ∈R  ∈R  denoting ‘their joint occurrence’, and then attach probabilities to these events. These events belong to F by definition because as a field is closed under the set theoretic operations ∪ ∩  so that: (=0  =0)= (=0  =1)= (=1  =0)= (=1  =1)= {}=∅ {(  )} {()} {( ) ( )} 3  (=0 =0)=  (=0 =1)=  (=1 =0)=  (=1 =1)= 0 25 25 50
  • 4. Hence, the joint density is defined by: Â 0 0 1 0 25 1 25 50 (2) How is the joint density (2) connected to the individual (marginal) densities given in (1)? It turns out that if we sum over the rows of the above table for each value of , i.e. use P ∈R  ( )= () we will get the marginal distribution of  :  () ∈R  and if we sum over the columns for each P value of  , i.e. use ∈R  ( )= () we will get the marginal distribution of :  () ∈R : Â 0 0 1  () 0 25 .25 1 25 50  () .25 .75 .75 1 (3) Note: ()=0(25)+1(75)=75 = ( )  ()=(0−75)2(25)+(1−75)2(75)=1875 =  ( ) Armed with the joint distribution we can proceed to define the notions of Independence and Identically Distributed between the r.v’s  and  . Independence. Two r.v’s  and  are said to be Independent iff:  ( )= ()· () for all values ( )∈R × R  (4) That is, to verify that these two r.v’s are independent we need to confirm that the probability of all possible pairs of values ( ) satisfies (4). 4
  • 5. Example. In the case of the joint distribution in (3) we can show that the r.v’s are not independent because for ( )=(0 0):  (0 0)=0 6=  (0)· (0)=(25)(25) It is important to emphasize that the above condition of Independence is not equivalent to the two random variables being uncorrelated: (  )=0 9  ( )= ()· () for all ( )∈R ×R  where ‘9’ denotes ‘does not imply’. This is because (  ) is a measure of linear dependence between  and  since it is based on the covariance defined by: (  ) =[(-())( -( ))=2(0)(0-75) + (25)(0-75)(1-75)+ +(25)(0−75)(1−75) + (5)(1−75)(1−75) = −0625 A standardized covariance yields the correlation: ( ) = −0625 = 1875  ()· ( ) (  )= √ −1 3 The intuition underlying this result is that the correlation involves only the first two moments [mean, variance, covariance] of  and  but independence is defined in terms of the density functions; the latter, in principle, involves all moments, not just the first two! Identically Distributed. Two r.v’s  and  are said to be Identically Distributed iff:  (; )= (; ) for all values ( )∈R × R  (5) Example. In the case of the joint distribution in (3) we can show that the r.v’s are identically distributed because (5) 5
  • 6. holds. In particular, both r.v’s  and  take the same values with the same probabilities. To shed further light on the notion of IID, consider the three bivariate distributions given below. Â 1 2  () Â 0 1  () Â 0 1  () 0 018 042 06 0 018 042 06 0 036 024 06 2 012 028 04 1 012 028 04 1 024 016 04 1  () 1  ()  () 03 07 (A) 03 07 (B) 06 04 (C) (I)  and  are Independent iff:  ( )= ()· () for all ( )∈R × R  (6) (ID)  and  are Identically Distributed iff:  () =  () for all ( )∈R × R  = and R =R  The random variables  and  are independent in all three cases since they satisfy (4) (verify!). The random variables in (A) are not Identically Distributed because R 6=R  and  ()6= () for some ( )∈R ×R  The random variables in (B) are not Identically Distributed because even though R =R   ()6= () for some ( )∈R × R  Finally, the random variables in (C) are Identically Distributed because R =R  and  ()= () for all ( )∈R × R  6 1
  • 7. 2 Point Estimation: an overview It turns out that all forms of frequentist inference, which include point and interval estimation, hypothesis testing and prediction, are defined in terms of two sets: X — sample space: the set of all possible values of the sample X Θ — parameter space: the set of all possible values of θ Note that the sample space X is always a subset of R and denoted by R   In estimation the objective is to use the statistical information to infer the ‘true’ value ∗ of the unknown parameter, whatever that happens to be, as along as it belongs to Θ In general, an estimator b of  is a mapping (function)  from the sample space to the parameter space: b X → Θ (): (7) Example 1. Let the statistical model of interest be the simple Bernoulli model (table 2) and consider the question of estimating the unknown parameter  whose parameter space is Θ:=[0 1] Note that the sample space is: X:={0 1} Table 2 - Simple Bernoulli Model Statistical GM:  = +   ∈N. ⎫ [1] Bernoulli:  v Ber( )  =0 1 ⎬ [2] constant mean: ( )= ∈N. ⎭ [3] constant variance:  ( )=(1−) [4] Independence: {  ∈N} is an independent process 7
  • 8. The notation b (X) is used to denote an estimator in order to bring out the fact that it is a function of the sample X and for different values it generates the sampling distribution  (b (x); ) for x∈X. Post-data b (X) yields an estimate b 0) which constitutes a particular value of b (x (X) corresponding to data x0 Crucial distinction: b (X)-estimator (Plato’s world), b 0)-estimate (real world), and -unknown (x constant (Plato’s world); Fisher (1922). In light of the definition in (7), which of the following mappings constitute potential estimators of ? Table 3: Estimators of ? [a] b1(X)=  [b] b2(X)=1 −   [c] b3(X)=(1 + )2  ¡ ¢ b (X)= 1 P  for some   3 [d]  =1 ¡ ¢ b+1(X)= 1 P  [e]  =1 +1 Do the mappings [a]-[e] in table 3 constitute estimators of ? All five functions [a]-[e] have X as their domain, but is the range of each mapping a subset of Θ:=[0 1]? Mapping [a], [c]-[e] can be possible estimators of  because their ranges are subsets of [0 1], but [b] cannot not because it can take the value −1 [ensure you understand why!] which lies outside the parameter space of  One can easily think of many more functions from X to Θ that will qualify as possible estimators of  Given the plethora of such possible estimators, how does one decide which one is the most appropriate? 8
  • 9. To answer that question let us think about the possibility of an ideal estimator, ∗():X → ∗ i.e., ∗(x)=∗ for all values x∈X . That is, ∗(X) pinpoints the true value ∗ of  whatever the data. A moment’s reflection reveals that no such estimator could exist because X is a random vector with its own distribution  (x; ) for all x∈X. Moreover, in view of the randomness of X, any mapping of the form (7) will be a random variable with its own sampling distribution,  (b (x); ) which is directly derivable from  (x; ). Let us take stock of these distributions. Let us keep track of these distributions and where they come from. The distribution of the sample  (x; ) for all x∈X is given by the assumptions of statistical model in question. I In the above case of the simple Bernoulli model, we can combine assumptions [2]-[4] to give us: [2]-[4] Y  (x; ) =  ( ; ) =1 and then use [1]:  ( ; )=(1 − )1− =1 2   to determine  (x; ): P [2]-[4] Y [1]-[4] P  =1  (x; ) =  ( ; ) =  (1−) =1 1− = (1−)−  =1 P where = =1  , and one can show that : P  =   v Bin( (1 − )) (8) =1 i.e.  is Binomially distributed. note that the means and variances are derived using the two formulae: (i) (1 + 2 + )=(1) + (2) +  2 2 (ii)  (1 + 2 + )=  (1) +   (2) 9 (9)
  • 10. To derive the mean and variance of  : P P (i) P ( ) =  (  ) =  ()=  = =1 =1 =1 P P (ii) P  ( )=  (  ) =   ()=  (1−)=(1−) =1 =1 =1 The result in (8) is a special case of a general result. ¥ The sampling distribution of any (well-behaved) function of the sample, say =(1 2  ) can be derived from  (x; ) x∈X using the formula: R R ()=P( ≤ )= ··· {x: (x)≤}  (x; θ)x ∈R (10) In the Bernoulli case, all the estimators [a], [c]-[e] are linear functions of (1 2  ) and thus, by (8), their distribution is Binomial. In particular, Table 4: Estimators and their sampling distributions [a] b1(X)= v Ber( (1−))  ³ ´ b3(X)=(1 + )2 v Bin  [ (1−) ] [c]  2 ³ ´ ¡ 1 ¢ P (1−) [d] b(X)=   =1  v Bin  [  ]  for   3 ³ ´ ¡ 1 ¢ P (1−)  [e] b+1(X)= +1  =1  v Bin +1  [ (+1)2 ] (11) It is important to emphasize at the outset that the sampling distributions [a]-[e] are evaluated under =∗ where ∗ is the true value of  It is clear that none of the sampling distributions of the estimators in table 4 resembles that of the ideal estimator, ∗(X), whose sampling distribution, if it exists, would be of the form: (12) [i] P(∗(X)=∗)=1 10
  • 11. In terms of its first two moments, the ideal estimator satisfies [ii] (∗(X))=∗and [iii]  (∗(X))=0 In contrast to the (infeasible) ideal estimator in (12), when the estimators in table 4 infer  using an outcome x, the inference is always subject to some error because the variance is not zero. The sampling distributions of these estimators provide the basis for evaluating such errors. In the statistics literature the evaluation of inferential errors in estimation is accomplished in two interconnected stages. The objective of the first stage is to narrow down the set of all possible estimators of  to an optimal subset, where optimality is assessed by how closely the sampling distribution of an estimator approximates that of the ideal estimator in (12); the subject matter of section 3. The second stage is concerned with using optimal estimators to construct the shortest Confidence Intervals (CI) for the unknown parameter  based on prespecifying the error of covering (encompassing) ∗ within a random interval of the form ((X) (X)); the subject matter of section 4. 3 Properties of point estimators As mentioned above, the notion of an optimal estimator can be motivated by how well the sampling distribution of an estimator b(X) approximates that of the ideal estimator in (12).  In particular, the three features of the ideal estimator [i]-[iii] motivate the following optimal properties of feasible estimators. 11
  • 12. Condition [ii] motivates the property known as: [I] Unbiasedness: An estimator b (X) is said to be an unbiased for  if: (13) (b (X))=∗ That is, the mean of the sampling distribution of b (X) coincides with the true value of the unknown parameter  Example. In the case of the simple Bernoulli model, we can see from table 4 that the estimators b1(X) b3(X)   and b(X) are unbiased since in all three cases (13) is satis fied. In contrast, estimator b+1(X) is not unbiased because  ´ ³   b+1(X) = +1  6=   Condition [iii] motivates the property known as: [II] Full Efficiency: An unbiased estimator b(X) is said  to be a fully efficient estimator of  if its variance is as small as it can be, where the latter is expressed by: ´i−1 h ³ (x;) b(X))=():=  − 2 ln  2  (   where ‘()’ stands for the Cramer-Rao lower bound; note that  (x; ) is given by the assumed model. Example (the derivations are not important!). In the case of the simple Bernoulli model: P ln  (x; )= ln +(− ) ln(1−) where  = =1   ( )=   ln  (x;) 1 = ( )( 1 ) − ( −  )( 1− )   2 ln  (x;)  1 =− ( 12 )−(− )( 1− )2 2  ´ ³ 2 (x;) 1 =( 12 )( ) + [ − ( )]( 1− )2 −  ln2 and thus the Cramer-Rao lower bound is: ():= (1−)   12  = (1−) 
  • 13. Looking at the estimators of  in (12) it is clear that only one unbiased estimator achieves that bound, b(X) Hence,  b(X) is the only estimator of  which is both unbiased and  fully efficient. Comparisons between unbiased estimators can be made in terms of relative efficiency:  (b1(X))   (b2(X)) for   2    asserting that b2(X) is relatively more efficient than b1(X)  but one needs to be careful with such comparisons because they can be very misleading when both estimators are bad, as in the case above; the fact that b2(X) is relatively more  efficient than b1(X) does not mean that the former is even  an adequate estimator. Hence, relative efficiency is not something to write home about! What renders these two estimators practically useless? An asymptotic property motivated by condition [i] of the ideal estimator, known as consistency. Intuitively, an estimator b(X) is consistent when its preci sion (how close to ∗ is) improves as the sample size increases. Condition [i] of the ideal estimator motivates the property known as: [III] Consistency: an estimator b(X) is consistent if:   Strong: P(lim→∞ b(X)=∗)=1 (14) ¯ ³¯ ´ ¯b ∗¯ Weak: lim→∞ P ¯(X) −  ¯ ≤  =1 That is, an estimator b(X) is consistent if it approximates  (probabilistically) the sampling distribution of the ideal es13
  • 14. timator asymptotically; as  → ∞ The difference between strong and weak consistency stems from the form of probabilistic convergence they involve, with the former being stronger than the latter. Both of these properties constitute an extension of the Strong and Weak Law of Large Numbers (LLN) P 1 which hold for the sample mean  =    of a process =1 {  =1 2   } under certain probabilistic assumptions, the most restrictive being that the process is IID; see Spanos (1999), ch. 8. 0.65 0.60 0.60 Sample average 0.70 0.65 Sample Average 0.70 0.55 0.50 0.55 0.50 0.45 0.45 0.40 0.40 1 20 40 60 80 100 120 140 160 180 1 200 100 200 300 400 500 600 700 800 900 1000 Inde x Inde x Fig. 2: t-plot of  for a BerIID realization with =1000 Fig. 1: t-plot of  for a BerIID realization with =200 In practice, it is no-trivial to prove that a particular estimator is consistent or not by verifying directly the conditions in (14). However, there is often a short-cut for verifying consistency in the case of unbiased estimators using the sufficient ³ ´ condition: lim   b(X) =0  (15) →∞ Example. In the case of the simple Bernoulli model, one can verify that the estimators b1(X) and b3(X) are inconsis  tent because: ³ ´ ³ ´ b1(X) =(1−)6=0 lim   b1(X) = (1−) 6=0 lim     2 →∞ →∞ 14
• 15. In contrast, the estimators $\hat{\theta}_{n}(\mathbf{X})$ and $\hat{\theta}_{n+1}(\mathbf{X})$ are consistent because:
$$\lim_{n\to\infty}\mathrm{Var}\big(\hat{\theta}_{n}(\mathbf{X})\big)=\lim_{n\to\infty}\Big(\tfrac{\theta(1-\theta)}{n}\Big)=0,\qquad \lim_{n\to\infty}\mathrm{MSE}\big(\hat{\theta}_{n+1}(\mathbf{X})\big)=0.$$
Note that 'MSE' denotes the Mean Square Error, defined by:
$$\mathrm{MSE}(\hat{\theta};\theta^{*})=\mathrm{Var}(\hat{\theta})+\big[B(\hat{\theta};\theta^{*})\big]^{2},\quad\text{where } B(\hat{\theta};\theta^{*})=E(\hat{\theta})-\theta^{*}.$$
Hence:
$$\lim_{n\to\infty}\mathrm{MSE}(\hat{\theta})=0\ \ \text{if (a) } \lim_{n\to\infty}\mathrm{Var}(\hat{\theta})=0\ \text{ and (b) } \lim_{n\to\infty}E(\hat{\theta})=\theta^{*},$$
where (b) is equivalent to $\lim_{n\to\infty}B(\hat{\theta};\theta^{*})=0$.

Let us take stock of the above properties and how they can be used by the practitioner in deciding which estimator is optimal. The property which defines minimal reliability for an estimator is that of consistency. Intuitively, consistency indicates that as the sample size increases [as $n\to\infty$] the estimator $\hat{\theta}_{n}(\mathbf{X})$ approaches $\theta^{*}$, the true value of $\theta$, in some probabilistic sense: convergence almost surely or convergence in probability. Hence, if an estimator $\hat{\theta}_{n}(\mathbf{X})$ is not consistent, it is automatically excluded from the subset of potentially optimal estimators, irrespective of any other properties this estimator might enjoy. In particular, an unbiased estimator which is inconsistent is practically useless. On the other hand, just because an estimator $\hat{\theta}_{n}(\mathbf{X})$ is consistent does not imply that it's a 'good' estimator; it only implies that it's minimally acceptable.
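The following is a minimal simulation sketch (not part of the original notes), assuming Python with NumPy, the illustrative value $\theta^{*}=0.6$, and reading $\hat{\theta}_{n+1}(\mathbf{X})=\frac{1}{n+1}\sum_{i=1}^{n}X_{i}$ (consistent with its mean $\frac{n}{n+1}\theta$ stated earlier): it shows $\mathrm{Var}(\hat{\theta}_{n})$ and $\mathrm{MSE}(\hat{\theta}_{n+1})$ both shrinking towards zero as $n$ grows.

```python
# Simulation sketch (not from the notes): as n grows, Var(theta_hat_n) for the sample
# mean and MSE(theta_hat_{n+1}) for the biased estimator (1/(n+1))*sum(X_i) both
# approach 0, illustrating condition (15) and the MSE decomposition.
# theta_star = 0.6 is an arbitrary illustrative value.
import numpy as np

rng = np.random.default_rng(1)
theta_star, reps = 0.6, 50_000

for n in (10, 100, 1000):
    y = rng.binomial(n, theta_star, size=reps)   # y = sum of n Bernoulli(theta*) trials
    t_n = y / n                                  # theta_hat_n (sample mean)
    t_np1 = y / (n + 1)                          # theta_hat_{n+1}
    bias = t_np1.mean() - theta_star
    mse = t_np1.var() + bias ** 2                # MSE = Var + bias^2
    print(f"n={n:5d}  Var(theta_hat_n)={t_n.var():.6f}  MSE(theta_hat_n+1)={mse:.6f}")
```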
• 16. It is important to emphasize that the properties of unbiasedness and full efficiency hold for any sample size $n>1$, and thus we call them finite sample properties, whereas consistency is an asymptotic property because it holds as $n\to\infty$.

Example. In the case of the simple Bernoulli model, if the choice between estimators is confined (artificially) to the estimators $\hat{\theta}_{1}(\mathbf{X})$, $\hat{\theta}_{3}(\mathbf{X})$ and $\hat{\theta}_{n+1}(\mathbf{X})$, the latter estimator should be chosen, despite being biased, because it is a consistent estimator of $\theta$. On the other hand, among the estimators given in table 4, $\hat{\theta}_{n}(\mathbf{X})$ is clearly the best because it satisfies all three properties. In particular, $\hat{\theta}_{n}(\mathbf{X})$ not only satisfies the minimal property of consistency, but it also has the smallest variance possible, which means that it comes closer to the ideal estimator than any of the others, for any sample size $n>2$. The sampling distribution of $\hat{\theta}_{n}(\mathbf{X})$, when evaluated under $\theta=\theta^{*}$, takes the form:
$$\hat{\theta}_{n}(\mathbf{X})=\tfrac{1}{n}\textstyle\sum_{i=1}^{n}X_{i}\ \overset{\theta=\theta^{*}}{\sim}\ \mathrm{Bin}\big(\theta^{*},\ \tfrac{\theta^{*}(1-\theta^{*})}{n}\big), \qquad (16)$$
whatever the 'true' value $\theta^{*}$ happens to be.

Additional asymptotic properties

In addition to the properties of estimators mentioned above, there are certain other properties which are often used in practice to decide on the optimality of an estimator. The most important is given below for completeness.

[V] Asymptotic Normality: an estimator $\hat{\theta}_{n}(\mathbf{X})$ is said to be asymptotically Normal if:
$$\sqrt{n}\big(\hat{\theta}_{n}(\mathbf{X})-\theta\big)\ \overset{a}{\sim}\ \mathrm{N}\big(0,\ V_{\infty}(\theta)\big),\quad V_{\infty}(\theta)\neq 0, \qquad (17)$$
where '$\overset{a}{\sim}$' stands for 'can be asymptotically approximated by'.
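As a quick check of (17) in the Bernoulli case, here is a simulation sketch (not part of the original notes), assuming Python with NumPy and the illustrative values $\theta^{*}=0.5$, $n=100$: the quantity $\sqrt{n}(\hat{\theta}_{n}-\theta^{*})$ should have variance close to $V_{\infty}(\theta^{*})=\theta^{*}(1-\theta^{*})$, with roughly 95% of its realizations within $\pm 1.96\sqrt{V_{\infty}(\theta^{*})}$.

```python
# Sketch (not from the notes) illustrating asymptotic Normality (17) for the
# Bernoulli sample mean: sqrt(n)*(theta_hat_n - theta*) should behave like
# N(0, theta*(1-theta*)). theta*=0.5 and n=100 are illustrative choices.
import numpy as np

rng = np.random.default_rng(7)
theta_star, n, reps = 0.5, 100, 100_000

y = rng.binomial(n, theta_star, size=reps)
z = np.sqrt(n) * (y / n - theta_star)            # sqrt(n)*(theta_hat_n - theta*)

v_inf = theta_star * (1 - theta_star)            # asymptotic variance V_inf(theta*)
inside = np.mean(np.abs(z) <= 1.96 * np.sqrt(v_inf))

print("empirical Var of sqrt(n)(theta_hat - theta*):", z.var())
print("asymptotic variance theta*(1-theta*):        ", v_inf)
print("share within +/- 1.96*sd (should be near .95):", inside)
```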
• 17. The property of asymptotic Normality is an extension of a well-known result in probability theory: the Central Limit Theorem (CLT). The CLT asserts that, under certain probabilistic assumptions on the process $\{X_{k},\ k=1,2,\ldots\}$, the most restrictive being that the process is IID, the sampling distribution of $\bar{X}_{n}=\frac{1}{n}\sum_{k=1}^{n}X_{k}$ for a 'large enough' $n$ can be approximated by the Normal distribution (Spanos, 1999, ch. 8):
$$\frac{\bar{X}_{n}-E(\bar{X}_{n})}{\sqrt{\mathrm{Var}(\bar{X}_{n})}}\ \overset{a}{\sim}\ \mathrm{N}(0,1). \qquad (18)$$
Note that the important difference between (17) and (18) is that $\hat{\theta}_{n}(\mathbf{X})$ in the former does not have to coincide with $\bar{X}_{n}$; it can be any well-behaved function $g(\mathbf{X})$ of the sample $\mathbf{X}$.

Example. In the case of the simple Bernoulli model, the sampling distribution of $\hat{\theta}_{n}(\mathbf{X})$, which we know is Binomial (see (16)), can also be approximated using (18). In the graph below we compare the Normal approximation to the Binomial for $n=10$ and $n=20$ in the case where $\theta=.5$, and the improvement is clearly noticeable.

[Fig.: Normal approximation of the Binomial, $f(x;\theta=.5,n=10)$ (left panel) and $f(x;\theta=.5,n=20)$ (right panel).]
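The comparison in the figure can be reproduced numerically; the sketch below (not part of the original notes) assumes Python with NumPy and SciPy and simply tabulates the largest discrepancy between the $\mathrm{Bin}(n,\theta=.5)$ probabilities and the $\mathrm{N}\big(n\theta,\ n\theta(1-\theta)\big)$ density at the integers, for $n=10$ and $n=20$.

```python
# Sketch (not from the notes) of the Normal approximation to the Binomial shown in
# the figures: compare the Bin(n, 0.5) pmf with the N(n*theta, n*theta*(1-theta))
# density at the integers, for n=10 and n=20.
import numpy as np
from scipy.stats import binom, norm

theta = 0.5
for n in (10, 20):
    k = np.arange(n + 1)
    exact = binom.pmf(k, n, theta)
    approx = norm.pdf(k, loc=n * theta, scale=np.sqrt(n * theta * (1 - theta)))
    print(f"n={n}: max |Binomial pmf - Normal density| = {np.max(np.abs(exact - approx)):.4f}")
```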
• 18. 4 Confidence Intervals (CIs): an overview

4.1 An optimal CI begins with an optimal point estimator

Example 2. Let us summarize the discussion concerning point estimation by briefly discussing the simple (one parameter) Normal model, where $\sigma^{2}=1$ (table 5).

Table 5 - Simple Normal Model (one unknown parameter)
Statistical GM: $X_{k}=\mu+u_{k},\ k\in\mathbb{N}:=\{1,2,\ldots\}$
[1] Normality: $X_{k}\sim\mathrm{N}(\cdot,\cdot),\ x_{k}\in\mathbb{R}$
[2] Constant mean: $E(X_{k})=\mu$, for all $k\in\mathbb{N}$
[3] Constant variance: $\mathrm{Var}(X_{k})=\sigma^{2}$ (known)
[4] Independence: $\{X_{k},\ k\in\mathbb{N}\}$ is an independent process

In section 3 we discussed the question of choosing among numerous possible estimators of $\mu$, such as [a]-[e] (table 6), using their sampling distributions. These results stem from the following theorem. If $\mathbf{X}:=(X_{1},X_{2},\ldots,X_{n})$ is a random (IID) sample from the Normal distribution, i.e. $X_{k}\sim\mathrm{NIID}(\mu,\sigma^{2}),\ k\in\mathbb{N}:=(1,2,\ldots,n)$, then the sampling distribution of $\sum_{k=1}^{n}a_{k}X_{k}$, for any constants $a_{1},\ldots,a_{n}$, is:
$$\textstyle\sum_{k=1}^{n}a_{k}X_{k}\ \sim\ \mathrm{N}\big(\mu\sum_{k=1}^{n}a_{k},\ \sigma^{2}\sum_{k=1}^{n}a_{k}^{2}\big). \qquad (19)$$
Among the above estimators the sample mean, for $\sigma^{2}=1$:
$$\hat{\mu}_{n}(\mathbf{X}):=\bar{X}_{n}=\tfrac{1}{n}\textstyle\sum_{k=1}^{n}X_{k}\ \sim\ \mathrm{N}\big(\mu,\tfrac{1}{n}\big)$$
constitutes the optimal point estimator of $\mu$ because it is:
[U] Unbiased ($E(\bar{X}_{n})=\mu^{*}$), [FE] Fully Efficient ($\mathrm{Var}(\bar{X}_{n})=\mathrm{CR}(\mu)$), and [SC] Strongly Consistent ($P(\lim_{n\to\infty}\bar{X}_{n}=\mu^{*})=1$).
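To illustrate why the sample mean dominates the alternatives, here is a minimal simulation sketch (not part of the original notes), assuming Python with NumPy and the illustrative values $\mu=2$, $n=50$: it estimates the mean and variance of three of the estimators listed in table 6 below under $X_{k}\sim\mathrm{NIID}(\mu,1)$.

```python
# Simulation sketch (not from the notes): empirical means and variances of three of
# the estimators of mu listed in table 6, under X_k ~ NIID(mu, 1) with the
# illustrative choices mu=2, n=50. The sample mean should show the smallest variance.
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 2.0, 50, 100_000

x = rng.normal(mu, 1.0, size=(reps, n))
mu1 = x[:, 0]                        # [a] X_1
mu3 = (x[:, 0] + x[:, -1]) / 2       # [c] (X_1 + X_n)/2
mubar = x.mean(axis=1)               # [d] sample mean X_bar_n

for name, est in [("[a] X_1", mu1), ("[c] (X_1+X_n)/2", mu3), ("[d] X_bar_n", mubar)]:
    print(f"{name:17s} mean={est.mean():.3f}  var={est.var():.4f}")
```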
• 19. Table 6: Estimators of $\mu$ (UN = unbiased, FE = fully efficient, SC = strongly consistent)
[a] $\hat{\mu}_{1}(\mathbf{X})=X_{1}\sim\mathrm{N}(\mu,1)$: UN ✓, FE ×, SC ×
[b] $\hat{\mu}_{2}(\mathbf{X})=X_{1}-X_{n}\sim\mathrm{N}(0,2)$: UN ×, FE ×, SC ×
[c] $\hat{\mu}_{3}(\mathbf{X})=\tfrac{1}{2}(X_{1}+X_{n})\sim\mathrm{N}(\mu,\tfrac{1}{2})$: UN ✓, FE ×, SC ×
[d] $\hat{\mu}_{n}(\mathbf{X})=\bar{X}_{n}=\tfrac{1}{n}\sum_{k=1}^{n}X_{k}\sim\mathrm{N}(\mu,\tfrac{1}{n})$: UN ✓, FE ✓, SC ✓
[e] $\hat{\mu}_{n+1}(\mathbf{X})=\tfrac{1}{n+1}\sum_{k=1}^{n}X_{k}\sim\mathrm{N}\big(\tfrac{n\mu}{n+1},\tfrac{n}{(n+1)^{2}}\big)$: UN ×, FE ×, SC ✓

Given that any 'decent' estimator $\hat{\mu}(\mathbf{X})$ of $\mu$ is likely to yield any value in the interval $(-\infty,\infty)$, can one say something more about its reliability than just that "on average" its values $\hat{\mu}(\mathbf{x})$, for $\mathbf{x}\in\mathbb{R}^{n}_{X}$, are more likely to occur around $\mu^{*}$ (the true value) than those further away?

4.2 What is a Confidence Interval?

This is what a Confidence Interval (CI) proposes to address. In general, a $(1-\alpha)$ CI for $\mu$ takes the generic form:
$$P\big(L(\mathbf{X})\leq\mu^{*}\leq U(\mathbf{X})\big)=1-\alpha,$$
where $L(\mathbf{X})$ and $U(\mathbf{X})$ denote the lower and upper (random) bounds of this CI. The $(1-\alpha)$ is referred to as the confidence level and represents the coverage probability of the CI:
$$CI(\mathbf{X};\alpha)=\big(L(\mathbf{X}),\ U(\mathbf{X})\big),$$
in the sense that the probability that the random interval $CI(\mathbf{X})$ covers (overlays) the true $\mu^{*}$ is equal to $(1-\alpha)$. This is often envisioned in terms of a long-run metaphor of repeating the experiment underlying the statistical model in question in order to get a sequence of outcomes (realizations of $\mathbf{X}$).
• 20. Each sample realization $\mathbf{x}_{i}$, $i=1,2,\ldots$, will yield an observed interval $CI(\mathbf{x}_{i};\alpha)$. In the context of this metaphor, $(1-\alpha)$ denotes the relative frequency of the observed CIs that will include (overlay) $\mu^{*}$.

Example 2. In the case of the simple (one parameter) Normal model (table 5), let us consider the question of constructing .95 CIs using the different unbiased estimators of $\mu$ in table 6:
$$\text{[a]}\ \ P\big(\hat{\mu}_{1}(\mathbf{X})-1.96\leq\mu^{*}\leq\hat{\mu}_{1}(\mathbf{X})+1.96\big)=.95,$$
$$\text{[c]}\ \ P\big(\hat{\mu}_{3}(\mathbf{X})-1.96(\tfrac{1}{\sqrt{2}})\leq\mu^{*}\leq\hat{\mu}_{3}(\mathbf{X})+1.96(\tfrac{1}{\sqrt{2}})\big)=.95,$$
$$\text{[d]}\ \ P\big(\bar{X}_{n}-1.96(\tfrac{1}{\sqrt{n}})\leq\mu^{*}\leq\bar{X}_{n}+1.96(\tfrac{1}{\sqrt{n}})\big)=.95. \qquad (20)$$
How do these CIs differ? The answer is in terms of their precision (accuracy). One way to measure precision for CIs is to evaluate their length:
$$\text{[a]: } 2(1.96)=3.92,\qquad \text{[c]: } 2\big(1.96\tfrac{1}{\sqrt{2}}\big)=2.772,\qquad \text{[d]: } 2\big(1.96\tfrac{1}{\sqrt{n}}\big)=\tfrac{3.92}{\sqrt{n}}.$$
It is clear from this evaluation that the CI associated with $\bar{X}_{n}=\frac{1}{n}\sum_{k=1}^{n}X_{k}$ is the shortest for any $n>2$; e.g. for $n=100$ the length of this CI is $\frac{3.92}{\sqrt{100}}=.392$.
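The long-run metaphor and the length calculation can both be checked by simulation; the sketch below (not part of the original notes) assumes Python with NumPy and the illustrative choices $\mu^{*}=0$, $n=100$, repeatedly forming the interval [d] of (20) and recording how often it covers $\mu^{*}$.

```python
# Sketch (not from the notes) of the long-run metaphor: repeatedly draw samples from
# N(mu*, 1), form the interval X_bar_n +/- 1.96/sqrt(n) from (20)[d], and record how
# often it covers mu*. mu*=0 and n=100 are illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
mu_star, n, reps = 0.0, 100, 20_000
half_width = 1.96 / np.sqrt(n)                 # half the length 3.92/sqrt(n)

x = rng.normal(mu_star, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
covered = np.mean((xbar - half_width <= mu_star) & (mu_star <= xbar + half_width))

print("interval length:", 2 * half_width)      # .392 for n=100
print("empirical coverage (should be near .95):", covered)
```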
  • 21. ∗ 1. ` − − − −  − − − − a 2. ` − − − − − − − − −− a 3. ` − − − −  − − − − a 4. ` − − − −  − − − − a 5. ` − − − −  − − − − a 6. ` − − − −  − − − − a 7. ` − − − −  − − − − a 8. ` − − − −  − − − − a 9. ` − − − −  − − − − a 10. ` − − − −  − − − − a 11. ` − − − −  − − − − a 12. ` − − − −  − − − − a 13.I ` − − −−− − −− a 14. ` − − − −  − − − − a 15. ` − − − −  − − − − a 16. ` − − − −  − − − − a 17. ` − − − − − − − − −− a 18. ` − − − −  − − − − a 19. ` − − − −  − − − − a 20. ` − − − −  − − − − a 21
• 22. 4.3 Constructing Confidence Intervals (CIs)

More generally, the sampling distribution of the optimal estimator $\bar{X}_{n}$ gives rise to a pivot (a function of the sample and $\mu$ whose distribution is known):
$$\sqrt{n}\big(\bar{X}_{n}-\mu\big)\ \overset{\mu=\mu^{*}}{\sim}\ \mathrm{N}(0,1), \qquad (21)$$
which can be used to construct the shortest CI among all $(1-\alpha)$ CIs for $\mu$:
$$P\big(\bar{X}_{n}-c_{\frac{\alpha}{2}}(\tfrac{1}{\sqrt{n}})\leq\mu^{*}\leq\bar{X}_{n}+c_{\frac{\alpha}{2}}(\tfrac{1}{\sqrt{n}})\big)=(1-\alpha), \qquad (22)$$
where $c_{\frac{\alpha}{2}}$ is defined by $P\big(|Z|\leq c_{\frac{\alpha}{2}}\big)=1-\alpha$ (each tail has probability $\frac{\alpha}{2}$) for $Z\sim\mathrm{N}(0,1)$; see figures 1-2.

Example 3. In the case where $\sigma^{2}$ is unknown, and we use $s^{2}=\frac{1}{n-1}\sum_{k=1}^{n}(X_{k}-\bar{X}_{n})^{2}$ to estimate it, the pivot in (21) takes the form:
$$\frac{\sqrt{n}\big(\bar{X}_{n}-\mu\big)}{s}\ \overset{\mu=\mu^{*}}{\sim}\ \mathrm{St}(n-1), \qquad (23)$$
where $\mathrm{St}(n-1)$ denotes the Student's t distribution with $(n-1)$ degrees of freedom.

Step 1. Attach a $(1-\alpha)$ coverage probability using (23):
$$P\Big(-c_{\frac{\alpha}{2}}\leq\frac{\sqrt{n}(\bar{X}_{n}-\mu)}{s}\leq c_{\frac{\alpha}{2}}\Big)=(1-\alpha),$$
where $c_{\frac{\alpha}{2}}$ is now defined by $P\big(|\tau|\leq c_{\frac{\alpha}{2}}\big)=1-\alpha$ for $\tau\sim\mathrm{St}(n-1)$.

Step 2. Re-arrange $\frac{\sqrt{n}(\bar{X}_{n}-\mu)}{s}$ to isolate $\mu$ and derive the CI:
$$P\Big(-c_{\frac{\alpha}{2}}\leq\frac{\sqrt{n}(\bar{X}_{n}-\mu)}{s}\leq c_{\frac{\alpha}{2}}\Big)=P\big(-c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\leq\bar{X}_{n}-\mu\leq c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\big)=$$
$$=P\big(-\bar{X}_{n}-c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\leq-\mu\leq-\bar{X}_{n}+c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\big)=$$
$$=P\big(\bar{X}_{n}-c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\leq\mu^{*}\leq\bar{X}_{n}+c_{\frac{\alpha}{2}}(\tfrac{s}{\sqrt{n}})\big)=(1-\alpha).$$
In figures 1-2 the underlying distribution is Normal and in figures 3-4 it is Student's t with 19 degrees of freedom.
• 23. One can see that, while the tail areas are the same for each $\alpha$, the threshold values $c_{\frac{\alpha}{2}}$ for the Normal are smaller than the corresponding values $c^{*}_{\frac{\alpha}{2}}$ for the Student's t, because the latter has heavier tails due to the randomness of $s^{2}$.

[Fig. 1: $P(|Z|\leq 1.96)=.95$ for $Z\sim\mathrm{N}(0,1)$. Fig. 2: $P(|Z|\leq 1.64)=.90$ for $Z\sim\mathrm{N}(0,1)$.
Fig. 3: $P(|\tau|\leq 2.09)=.95$ for $\tau\sim\mathrm{St}(19)$. Fig. 4: $P(|\tau|\leq 1.72)=.90$ for $\tau\sim\mathrm{St}(19)$.]
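The sketch below (not part of the original notes) assumes Python with NumPy and SciPy and the illustrative choices $\mu^{*}=5$, $\sigma=2$, $n=20$: it computes the Student's t threshold $c_{\frac{\alpha}{2}}$ from $\mathrm{St}(n-1)$, compares it with the Normal 1.96, and forms the resulting .95 CI for $\mu$ from a single simulated sample.

```python
# Sketch (not from the notes): a .95 CI for mu when sigma^2 is unknown, using the
# Student's t pivot (23). Compare the t threshold c_{alpha/2} with the Normal 1.96,
# as in figures 1-4. mu*=5, sigma=2, n=20 are illustrative choices.
import numpy as np
from scipy.stats import t, norm

rng = np.random.default_rng(0)
mu_star, sigma, n, alpha = 5.0, 2.0, 20, 0.05

x = rng.normal(mu_star, sigma, size=n)
xbar, s = x.mean(), x.std(ddof=1)              # s^2 = (1/(n-1)) * sum (x_k - xbar)^2

c_t = t.ppf(1 - alpha / 2, df=n - 1)           # about 2.093 for n=20
c_z = norm.ppf(1 - alpha / 2)                  # about 1.960

print("t threshold (St(n-1)):", round(c_t, 3), " Normal threshold:", round(c_z, 3))
print("0.95 CI for mu:", (xbar - c_t * s / np.sqrt(n), xbar + c_t * s / np.sqrt(n)))
```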
• 24. 5 Summary and conclusions

The primary objective in frequentist estimation is to learn about $\theta^{*}$, the true value of the unknown parameter $\theta$ of interest, using its sampling distribution $f(\hat{\theta}_{n};\theta^{*})$ associated with a particular sample size $n$. The finite sample properties are defined directly in terms of $f(\hat{\theta}_{n};\theta^{*})$, and the asymptotic properties are defined in terms of the asymptotic sampling distribution $f_{\infty}(\hat{\theta}_{n};\theta^{*})$, which aims to approximate $f(\hat{\theta}_{n};\theta^{*})$ at the limit as $n\to\infty$.

The question that needs to be considered at this stage is: what combination of the above mentioned properties specifies an 'optimal' estimator?

A necessary but minimal property for an estimator is consistency (preferably strong). By itself, however, consistency does not secure learning from data for a given $n$; it is a promissory note for potential learning. Hence, for actual learning one needs to supplement consistency with certain finite sample properties, like unbiasedness and efficiency, to ensure that learning can take place with the particular data $\mathbf{x}_{0}:=(x_{1},x_{2},\ldots,x_{n})$ of sample size $n$.

Among finite sample properties, full efficiency is clearly the most important because it secures the highest degree of learning for a given $n$, since it offers the best possible precision. Relative efficiency, although desirable, needs to be investigated further to find out how large the class of estimators being compared is before passing judgement. Being the best econometrician in my family, although worthy of something, does not make me a good econometrician!
• 25. Unbiasedness, although desirable, is not considered indispensable by itself. Indeed, as shown above, an unbiased but inconsistent estimator is practically useless, and a consistent but biased estimator is always preferable. Hence, a consistent, unbiased and fully efficient estimator sets the gold standard in estimation.

In conclusion, it is important to emphasize that point estimation is often considered inadequate for the purposes of scientific inquiry because a 'good' point estimator $\hat{\theta}(\mathbf{X})$, by itself, does not provide any measure of the reliability and precision associated with the estimate $\hat{\theta}(\mathbf{x}_{0})$; one would be wrong to assume that $\hat{\theta}(\mathbf{x}_{0})\simeq\theta^{*}$. This is the reason why $\hat{\theta}(\mathbf{x}_{0})$ is often accompanied by its standard error [the estimated standard deviation $\sqrt{\mathrm{Var}(\hat{\theta}(\mathbf{X}))}$] or the p-value of some test of significance associated with the generic hypothesis $\theta=0$. Interval estimation rectifies this weakness of point estimation by providing the relevant error probabilities associated with inferences pertaining to 'covering' the true value $\theta^{*}$ of $\theta$.