Chapter 2
Estimation
2.1 Linear Model
Let’s start by defining what is meant by a linear model. Suppose we want to model
the response Y in terms of three predictors, X1, X2 and X3. One very general form for
the model would be:
Y = f(X1,X2,X3)+ε
where f is some unknown function and ε is the error in this representation. ε is
additive in this instance, but could enter in some even more general form. Still, if we
assume that f is a smooth, continuous function, that still leaves a very wide range
of possibilities. Even with just three predictors, we typically will not have enough
data to try to estimate f directly. So we usually have to assume that it has some more
restricted form, perhaps linear as in:
Y = β0 +β1X1 +β2X2 +β3X3 +ε
where βi, i = 0,1,2,3 are unknown parameters. Unfortunately this term is subject to
some confusion as engineers often use the term parameter for what statisticians call
the variables, Y, X1 and so on. β0 is called the intercept term.
Thus the problem is reduced to the estimation of four parameters rather than
the infinite dimensional f. In a linear model the parameters enter linearly — the
predictors themselves do not have to be linear. For example:
Y = β0 +β1X1 +β2 logX2 +β3X1X2 +ε
is a linear model, but:
Y = β0 + β1X1^β2 + ε
is not. Some relationships can be transformed to linearity — for example, y = β0x^β1 ε
can be linearized by taking logs. Linear models seem rather restrictive, but because
the predictors can be transformed and combined in any way, they are actually very
flexible. The term linear is often used in everyday speech as almost a synonym for
simplicity. This gives the casual observer the impression that linear models can only
handle small simple datasets. This is far from the truth — linear models can easily
be expanded and modified to handle complex datasets. Linear is also used to refer to
straight lines, but linear models can be curved. Truly nonlinear models are rarely ab-
solutely necessary and most often arise from a theory about the relationships between
the variables, rather than an empirical investigation.
Where do models come from? We distinguish several different sources:
1. Physical theory may suggest a model. For example, Hooke’s law says that the
extension of a spring is proportional to the weight attached. Models like these
usually arise in the physical sciences and engineering.
2. Experience with past data. Similar data used in the past were modeled in a partic-
ular way. It is natural to see whether the same model will work with the current
data. Models like these usually arise in the social sciences.
3. No prior idea exists — the model comes from an exploration of the data. We use
skill and judgment to pick a model. Sometimes it does not work and we have to
try again.
Models that derive directly from physical theory are relatively uncommon so that
usually the linear model can only be regarded as an approximation to a complex real-
ity. We hope it predicts well or explains relationships usefully but usually we do not
believe it is exactly true. A good model is like a map that guides us to our destination.
For the rest of this chapter, we will stay in the special world of Mathematics where
all models are true.
2.2 Matrix Representation
We want a general solution to estimating the parameters of a linear model. We can
find simple formulae for some special cases but to devise a method that will work in
all cases, we need to use matrix algebra. Let’s see how this can be done.
We start with some data where we have a response Y and, say, three predictors,
X1, X2 and X3. The data might be presented in tabular form like this:
y1 x11 x12 x13
y2 x21 x22 x23
... ...
yn xn1 xn2 xn3
where n is the number of observations, or cases, in the dataset.
Given the actual data values, we may write the model as:
yi = β0 +β1xi1 +β2xi2 +β3xi3 +εi i = 1,...,n.
It will be more convenient to put this in a matrix/vector representation. The regression
equation is then written as:
y = Xβ+ε
where y = (y1,...,yn)T , ε = (ε1,...,εn)T , β = (β0,...,β3)T and:
X = [ 1  x11  x12  x13 ]
    [ 1  x21  x22  x23 ]
    [ ...               ]
    [ 1  xn1  xn2  xn3 ]
The column of ones incorporates the intercept term. One simple example is the null
model where there is no predictor and just a mean y = µ+ε:
[ y1  ]   [ 1  ]     [ ε1  ]
[ ... ] = [ ... ] µ + [ ... ]
[ yn  ]   [ 1  ]     [ εn  ]
We can assume that Eε = 0 since if this were not so, we could simply absorb the
nonzero expectation for the error into the mean µ to get a zero expectation.
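As a small aside, the model.matrix() function (used again in Section 2.6) builds X in exactly this form, adding the column of ones automatically. Here is a minimal sketch using a made-up data frame:
# a toy data frame invented purely for illustration
dat <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0, 1, 1, 0), x3 = c(2, 2, 5, 1))
model.matrix(~ x1 + x2 + x3, dat)   # first column, labelled (Intercept), is the column of ones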
2.3 Estimating β
The regression model, y = Xβ+ε, partitions the response into a systematic compo-
nent Xβ and a random component ε. We would like to choose β so that the system-
atic part explains as much of the response as possible. Geometrically speaking, the
response lies in an n-dimensional space, that is, y ∈ IRn
while β ∈ IRp where p is the
number of parameters. If we include the intercept then p is the number of predictors
plus one. It is easy to get confused as to whether p is the number of predictors or
parameters, as different authors use different conventions, so be careful.
The problem is to find β so that Xβ is as close to Y as possible. The best choice, the estimate β̂, is apparent in the geometrical representation seen in Figure 2.1. β̂ is, in this sense, the best estimate of β within the model space. The β̂ values are sometimes called the regression coefficients. The response predicted by the model is ŷ = Xβ̂ or Hy where H is an orthogonal projection matrix. The ŷ are called predicted or fitted values. The difference between the actual response and the predicted response is denoted by ε̂ and is called the residual.
Figure 2.1 Geometrical representation of the estimation of β. The data vector Y is projected orthogonally onto the model space spanned by X. The fit is represented by the projection ŷ = Xβ̂ with the difference between the fit and the data represented by the residual vector ε̂.
The conceptual purpose of the model is to represent, as accurately as possible,
something complex, y, which is n-dimensional, in terms of something much simpler,
the model, which is p-dimensional. Thus if our model is successful, the structure in
the data should be captured in those p dimensions, leaving just random variation in
the residuals which lie in an (n− p)-dimensional space. We have:
Data = Systematic Structure + Random Variation
n dimensions = p dimensions + (n− p) dimensions
2.4 Least Squares Estimation
The estimation of β can also be considered from a nongeometric point of view. We
define the best estimate of β as the one which minimizes the sum of the squared
errors:
∑ε2i = εT ε = (y−Xβ)T (y−Xβ)
Differentiating with respect to β and setting to zero, we find that β̂ satisfies:
XT Xβ̂ = XT y
These are called the normal equations. We can derive the same result using the geometric approach. Now provided XT X is invertible:
β̂ = (XT X)−1XT y
Xβ̂ = X(XT X)−1XT y
ŷ = Hy
H = X(XT X)−1XT is called the hat matrix and is the orthogonal projection of y onto
the space spanned by X. H is useful for theoretical manipulations, but you usually do
not want to compute it explicitly, as it is an n×n matrix which could be uncomfort-
ably large for some datasets. The following useful quantities can now be represented
using H.
The predicted or fitted values are ŷ = Hy = Xβ̂ while the residuals are ε̂ = y − Xβ̂ = y − ŷ = (I −H)y. The residual sum of squares (RSS) is ε̂T ε̂ = yT (I −H)T (I −H)y = yT (I −H)y.
Later, we will show that the least squares estimate is the best possible estimate of β when the errors ε are uncorrelated and have equal variance, i.e., var ε = σ2I. β̂ is unbiased and has variance (XT X)−1σ2 provided var ε = σ2I. Since β̂ is a vector, its variance is a matrix.
We also need to estimate σ2. We find that Eε̂T ε̂ = σ2(n− p), which suggests the estimator:
σ̂2 = ε̂T ε̂/(n − p) = RSS/(n − p)
as an unbiased estimate of σ2. n − p is called the degrees of freedom of the model. Sometimes you need the standard error for a particular component of β̂ which can be picked out as se(β̂i−1) = √[(XT X)−1ii] σ̂.
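These identities are easy to verify numerically. Here is a minimal sketch using simulated (made-up) data, checking that H is idempotent and that the two expressions for the RSS agree:
set.seed(1)
n <- 20; p <- 3
X <- cbind(1, runif(n), runif(n))            # design matrix with an intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat matrix
all.equal(H %*% H, H)                        # idempotent: HH = H
betahat <- solve(t(X) %*% X, t(X) %*% y)
c(sum((y - X %*% betahat)^2),                # RSS from the residuals
  c(t(y) %*% (diag(n) - H) %*% y))           # RSS as y'(I-H)y -- should match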
2.5 Examples of Calculating β̂
In a few simple models, it is possible to derive explicit formulae for β̂:
1. When y = µ+ε, X = 1 and β = µ hence XT X = 1T 1 = n so:
β̂ = (XT X)−1XT y = (1/n)1T y = ȳ
2. Simple linear regression (one predictor):
yi = β0 + β1xi + εi
[ y1  ]   [ 1  x1 ]            [ ε1  ]
[ ... ] = [ ...   ]  [ β0 ]  + [ ... ]
[ yn  ]   [ 1  xn ]  [ β1 ]    [ εn  ]
We can now apply the formula but a simpler approach is to rewrite the equation
as:
yi = β0′ + β1(xi − x̄) + εi,   where β0′ = β0 + β1x̄
so now:
X = [ 1   x1 − x̄ ]        XT X = [ n             0           ]
    [ ...         ]               [ 0   ∑ni=1 (xi − x̄)2      ]
    [ 1   xn − x̄ ]
Next work through the rest of the calculation to reconstruct the familiar estimate, that
is:
β̂1 = ∑(xi − x̄)yi / ∑(xi − x̄)2
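A quick check with simulated (made-up) data confirms that this closed form matches what lm() produces:
set.seed(2)
x <- runif(30)
y <- 1 + 2*x + rnorm(30)
beta1 <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # closed-form slope
c(mean(y) - beta1 * mean(x), beta1)                      # intercept and slope
coef(lm(y ~ x))                                          # should agree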
In higher dimensions, it is usually not possible to find such explicit formulae for the
parameter estimates unless XT X happens to be a simple form. So typically we need
computers to fit such models. Regression has a long history, so in the time before
computers became readily available, fitting even quite simple models was a tedious
time-consuming task. When computing was expensive, data analysis was limited: it was designed to keep calculations to a minimum and to restrict the number of plots.
This mindset remained in statistical practice for some time even after computing
became widely and cheaply available. Now it is a simple matter to fit a multitude of
models and make more plots than one could reasonably study. The challenge now for
the analyst is to choose among these intelligently to extract the crucial information
in the data.
2.6 Example
Now let’s look at an example concerning the number of species found on the various
Galápagos Islands. There are 30 cases (Islands) and seven variables in the dataset.
We start by reading the data into R and examining it:
data(gala, package="faraway")
 head(gala[,-2])
Species Area Elevation Nearest Scruz Adjacent
Baltra 58 25.09 346 0.6 0.6 1.84
Bartolome 31 1.24 109 0.6 26.3 572.33
Caldwell 3 0.21 114 2.8 58.7 0.78
Champion 25 0.10 46 1.9 47.4 0.18
Coamano 2 0.05 77 1.9 1.9 903.82
Daphne.Major 18 0.34 119 8.0 8.0 1.84
The variables are Species — the number of species found on the island, Area
— the area of the island (km2), Elevation — the highest elevation of the island (m),
Nearest — the distance from the nearest island (km), Scruz — the distance from
Santa Cruz Island (km), Adjacent — the area of the adjacent island (km2). We have
omitted the second column (which has the number of endemic species) because we
shall not use this alternative response variable in this analysis.
The data were presented by Johnson and Raven (1973) and also appear in Weis-
berg (1985). I have filled in some missing values for simplicity (see Chapter 13 for
how this can be done). Fitting a linear model in R is done using the lm() com-
mand. Notice the syntax for specifying the predictors in the model. This is part of the
Wilkinson–Rogers notation. In this case, since all the variables are in the gala data
frame, we must use the data= argument:
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data=gala)
 summary(lmod)
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz +
Adjacent , data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.68 -34.90 -7.86 33.46 182.58
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.06822 19.15420 0.37 0.7154
Area -0.02394 0.02242 -1.07 0.2963
Elevation 0.31946 0.05366 5.95 0.0000038 ***
Nearest 0.00914 1.05414 0.01 0.9932
Scruz -0.24052 0.21540 -1.12 0.2752
Adjacent -0.07480 0.01770 -4.23 0.0003 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 61 on 24 degrees of freedom
Multiple R-squared: 0.766, Adjusted R-squared: 0.717
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.84e-07
For my tastes, this output contains rather too much information. I have written an
alternative called sumary which produces a shorter version of this. Since we will be
looking at a lot of regression output, the use of this version makes this book several
pages shorter. Of course, if you prefer the above, feel free to add the extra “m” in the
function call. You will need to install and load my package if you want to use my
version:
 require(faraway)
 sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.06822 19.15420 0.37 0.7154
Area -0.02394 0.02242 -1.07 0.2963
Elevation 0.31946 0.05366 5.95 0.0000038
Nearest 0.00914 1.05414 0.01 0.9932
Scruz -0.24052 0.21540 -1.12 0.2752
Adjacent -0.07480 0.01770 -4.23 0.0003
n = 30, p = 6, Residual SE = 60.975, R-Squared = 0.77
We can identify several useful quantities in this output. Other statistical packages
tend to produce output quite similar to this. One useful feature of R is that it is
possible to directly calculate quantities of interest. Of course, it is not necessary here
because the lm() function does the job, but it is very useful when the statistic you
want is not part of the prepackaged functions. First, we extract the X-matrix:
x <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent,
gala)
and here is the response y:
y <- gala$Species
Now let’s construct (XT X)−1. t() does transpose and %*% does matrix multipli-
cation. solve(A) computes A−1 while solve(A,b) solves Ax = b:
xtxi <- solve(t(x) %*% x)
We can get β̂ directly, using (XT X)−1XT y:
 xtxi %*% t(x) %*% y
[,1]
1 7.068221
Area -0.023938
Elevation 0.319465
Nearest 0.009144
Scruz -0.240524
Adjacent -0.074805
This is a very bad way to compute β̂. It is inefficient and can be very inaccurate
when the predictors are strongly correlated. Such problems are exacerbated by large
datasets. A better, but not perfect, way is:
 solve(crossprod(x,x),crossprod(x,y))
[,1]
1 7.068221
Area -0.023938
Elevation 0.319465
Nearest 0.009144
Scruz -0.240524
Adjacent -0.074805
where crossprod(x,y) computes xT y. Here we get the same result as lm() be-
cause the data are well-behaved. In the long run, you are advised to use carefully
programmed code such as found in lm() which uses the QR decomposition. To see
more details, consult a text such as Thisted (1988) or read Section 2.7.
We can extract the regression quantities we need from the model object. Com-
monly used are residuals(), fitted(), df.residual() which gives the degrees
of freedom, deviance() which gives the RSS and coef() which gives the β̂. You
β. You
can also extract other needed quantities by examining the model object and its sum-
mary:
 names(lmod)
[1] coefficients residuals effects rank
[5] fitted.values assign qr df.residual
[9] xlevels call terms model
lmodsum <- summary(lmod)
 names(lmodsum)
[1] call terms residuals coefficients
[5] aliased sigma df r.squared
[9] adj.r.squared fstatistic cov.unscaled
We can estimate σ using the formula in the text above or extract it from the summary
object:
 sqrt(deviance(lmod)/df.residual(lmod))
[1] 60.975
 lmodsum$sigma
[1] 60.975
We can also extract (XT X)−1 and use it to compute the standard errors for the coef-
ficients. (diag() returns the diagonal of a matrix):
xtxi <- lmodsum$cov.unscaled
 sqrt(diag(xtxi))*60.975
(Intercept) Area Elevation Nearest Scruz
19.154139 0.022422 0.053663 1.054133 0.215402
Adjacent
0.017700
or get them from the summary object:
 lmodsum$coef[,2]
(Intercept) Area Elevation Nearest Scruz
19.154198 0.022422 0.053663 1.054136 0.215402
Adjacent
0.017700
2.7 QR Decomposition
This section might be skipped unless you are interested in the actual calculation of β̂ and related quantities. Any design matrix X can be written as:
X = Q [ R ] = Qf R
      [ 0 ]
where Q is an n×n orthogonal matrix, that is QT Q = QQT = I and R is a p× p upper
triangular matrix (Rij = 0 for i > j). The 0 is an (n − p)× p matrix of zeroes while
Qf is the first p columns of Q.
The RSS = (y − Xβ)T (y − Xβ) = ‖y − Xβ‖2 where ‖ · ‖ is the Euclidean length of a vector. The matrix Q represents a rotation and does not change length. Hence:
RSS = ‖QT y − QT Xβ‖2 = ‖ (f, r)T − (Rβ, 0)T ‖2
where (f, r)T = QT y, with f a vector of length p and r a vector of length n − p. From this we see:
RSS = ‖f − Rβ‖2 + ‖r‖2
which can be minimized by setting β so that Rβ = f.
Let’s see how this works for the Galápagos data. First we compute the QR de-
composition:
qrx <- qr(x)
The components of the decomposition must be extracted by other functions. For
example, we can extract the Q matrix using qr.Q:
 dim(qr.Q(qrx))
[1] 30 6
Notice that we do not need the whole n × n matrix for the computation. The first p
columns suffice so qr.Q returns what we call Qf. We can compute f:
(f <- t(qr.Q(qrx)) %*% y)
[,1]
[1,] -466.8422
[2,] 381.4056
[3,] 256.2505
[4,] 5.4076
[5,] -119.4983
[6,] 257.6944
Solving Rβ = f is easy because of the triangular form of R. We use the method of
backsubstitution:
 backsolve(qr.R(qrx),f)
[,1]
[1,] 7.068221
[2,] -0.023938
[3,] 0.319465
[4,] 0.009144
[5,] -0.240524
[6,] -0.074805
where the results match those seen previously.
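As a brief aside, the back-substitution step can also be left to qr.coef(), which computes the least squares coefficients directly from the decomposition, much as lm() does internally. Continuing with qrx and y as defined above:
qr.coef(qrx, y)   # same estimates again, obtained from the QR decomposition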
2.8 Gauss–Markov Theorem
β̂ is a plausible estimator, but there are alternatives. Nonetheless, there are three good
reasons to use least squares:
1. It results from an orthogonal projection onto the model space. It makes sense
geometrically.
2. If the errors are independent and identically normally distributed, it is the maxi-
mum likelihood estimator. Loosely put, the maximum likelihood estimate is the
value of β that maximizes the probability of the data that was observed.
3. The Gauss–Markov theorem states that β̂ is the best linear unbiased estimate (BLUE).
To understand the Gauss–Markov theorem we first need to understand the concept of
an estimable function. A linear combination of the parameters ψ = cT β is estimable
if and only if there exists a linear combination aT y such that:
EaT y = cT β   ∀β
Estimable functions include predictions of future observations, which explains why
they are well worth considering. If X is of full rank, then all linear combinations are
estimable.
Suppose Eε = 0 and var ε = σ2I. Suppose also that the structural part of the
model, EY = Xβ is correct. (Clearly these are big assumptions and so we will address
the implications of this later.) Let ψ = cT β be an estimable function; then the Gauss–
Markov theorem states that in the class of all unbiased linear estimates of ψ, ψ̂ = cT β̂ has the minimum variance and is unique.
We prove this theorem. Suppose aT y is some unbiased estimate of cT β so that:
EaT y = cT β   ∀β
aT Xβ = cT β   ∀β
which means that aT X = cT . This implies that c must be in the range space of XT which in turn implies that c is also in the range space of XT X which means there exists a λ such that c = XT Xλ so:
cT β̂ = λT XT Xβ̂ = λT XT y
Now we can show that the least squares estimator has the minimum variance — pick an arbitrary estimate aT y and compute its variance:
var(aT y) = var(aT y − cT β̂ + cT β̂)
          = var(aT y − λT XT y + cT β̂)
          = var(aT y − λT XT y) + var(cT β̂) + 2 cov(aT y − λT XT y, λT XT y)
but
cov(aT y − λT XT y, λT XT y) = (aT − λT XT )σ2IXλ
                             = (aT X − λT XT X)σ2Iλ
                             = (cT − cT )σ2Iλ = 0
so
var(aT y) = var(aT y − λT XT y) + var(cT β̂)
Now since variances cannot be negative, we see that:
var(aT y) ≥ var(cT β̂)
In other words, cT β̂ has minimum variance. It now remains to show that it is unique. There will be equality in the above relationship if var(aT y − λT XT y) = 0, which would require that aT − λT XT = 0, which means that aT y = λT XT y = cT β̂. So equality occurs only if aT y = cT β̂, so the estimator is unique. This completes the proof.
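A small simulation sketch illustrates the theorem in the simple regression setting: the estimator using only the first and last observations is also linear and unbiased for the slope, but its variance is larger than that of the least squares slope:
set.seed(3)
x <- seq(0, 1, length = 20)
nreps <- 1000
ols <- ends <- numeric(nreps)
for (i in 1:nreps) {
  y <- 1 + 2*x + rnorm(20, sd = 0.5)
  ols[i] <- coef(lm(y ~ x))[2]                 # least squares slope
  ends[i] <- (y[20] - y[1]) / (x[20] - x[1])   # another linear unbiased estimate of the slope
}
c(mean(ols), mean(ends))    # both near the true slope of 2
c(var(ols), var(ends))      # the least squares slope has the smaller variance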
The Gauss–Markov theorem shows that the least squares estimate β̂ is a good
choice, but it does require that the errors are uncorrelated and have equal variance.
Even if the errors behave, but are nonnormal, then nonlinear or biased estimates may
work better. So this theorem does not tell one to use least squares all the time; it just
strongly suggests it unless there is some strong reason to do otherwise. Situations
where estimators other than ordinary least squares should be considered are:
1. When the errors are correlated or have unequal variance, generalized least squares
should be used. See Section 8.1.
2. When the error distribution is long-tailed, then robust estimates might be used.
Robust estimates are typically not linear in y. See Section 8.4.
3. When the predictors are highly correlated (collinear), then biased estimators such
as ridge regression might be preferable. See Chapter 11.
2.9 Goodness of Fit
It is useful to have some measure of how well the model fits the data. One common
choice is R2, the so-called coefficient of determination or percentage of variance
explained:
R2 = 1 − ∑(ŷi − yi)2 / ∑(yi − ȳ)2 = 1 − RSS / (Total SS, corrected for the mean)
Its range is 0 ≤ R2 ≤ 1 — values closer to 1 indicating better fits. For simple linear regression R2 = r2 where r is the correlation between x and y. An equivalent definition is:
R2 = ∑(ŷi − ȳ)2 / ∑(yi − ȳ)2
or
R2 = cor2(ŷ, y)
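These definitions are easy to verify numerically. Here is a minimal sketch for the Galápagos model, assuming the gala data and model from Section 2.6:
data(gala, package="faraway")
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
1 - deviance(lmod)/sum((gala$Species - mean(gala$Species))^2)   # 1 - RSS/TSS definition
cor(fitted(lmod), gala$Species)^2                               # squared correlation definition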
The graphical intuition behind R2 is seen in Figure 2.2. Suppose you want to
predict y. If you do not know x, then your best prediction is ȳ, but the variabil-
ity in this prediction is high. If you do know x, then your prediction will be given
by the regression fit. This prediction will be less variable provided there is some
relationship between x and y. R2 is one minus the ratio of the sum of squares for these two predictions. Thus for perfect predictions the ratio will be zero and R2 will be one.
Figure 2.2 When x is not known, the best predictor of y is ȳ and the variation is denoted by the dotted line. When x is known, we can predict y more accurately by the solid line. R2 is related to the ratio of these two variances.
Some care is necessary if there is no intercept in your model. The denominator
in the first definition of R2 has a null model with an intercept in mind when the
sum of squares is calculated. Unfortunately, R uses this definition and will give a
misleadingly high R2. If you must have an R2 use the cor2(ŷ,y) definition when there
is no intercept.
What is a good value of R2? It depends on the area of application. In the biological
and social sciences, variables tend to be more weakly correlated and there is a lot of
noise. We would expect lower values for R2 in these areas — a value of, say, 0.6
might be considered good. In physics and engineering, where most data come from
closely controlled experiments, we typically expect to get much higher R2s and a
value of 0.6 would be considered low. Some experience with the particular area is
necessary for you to judge your R2s well.
It is a mistake to rely on R2 as a sole measure of fit. In Figure 2.3, we see some
simulated datasets where the R2 is around 0.65 for a linear fit in all four cases. The
first plot on the upper left shows an ordinary relationship while on the right the varia-
tion in x is smaller as is the residual variation. However, predictions (within the range
of x) would have less variation in the second case. On the lower left, the fit looks
good except for two outliers demonstrating how sensitive R2 is to a few extreme
values. On the lower right, we see that the true relationship is somewhat quadratic
which shows us that R2 doesn’t tell us much about whether we have the right
model.
An alternative measure of fit is σ̂. This quantity is directly related to the standard
errors of estimates of β and predictions. The advantage is that σ̂ is measured in the
units of the response and so may be directly interpreted in the context of the particular dataset. This may also be a disadvantage in that one must understand the practical significance of this measure whereas R2, being unitless, is easy to understand. The R regression summary returns both values and it is worth paying attention to both of them.
Figure 2.3 Four simulated datasets where R2 is about 0.65. The plot on the upper left is well-behaved for R2. In the plot on the upper right, the residual variation is smaller than the first plot but the variation in x is also smaller so R2 is about the same. In the plot on the lower left, the fit looks strong except for a couple of outliers while on the lower right, the relationship is quadratic.
2.10 Identifiability
The least squares estimate is the solution to the normal equations:
XT Xβ̂ = XT y
where X is an n × p matrix. If XT X is singular and cannot be inverted, then there will be infinitely many solutions to the normal equations and β̂ is at least partially unidentifiable. Unidentifiability will occur when X is not of full rank — that is, when its columns are linearly dependent. With observational data, unidentifiability is usually caused by some oversight. Here are some examples:
1. A person’s weight is measured both in pounds and kilos and both variables are
entered into the model. One variable is just a multiple of the other.
2. For each individual we record the number of years of preuniversity education,
the number of years of university education and also the total number of years
of education and put all three variables into the model. There is an exact linear
relation among the variables.
3. We have more variables than cases, that is, p > n. When p = n, we may perhaps estimate all the parameters, but with no degrees of freedom left to estimate any standard errors or do any testing. Such a model is called saturated. When p > n,
then the model is sometimes called supersaturated. Such models are considered
in large-scale screening experiments used in product design and manufacture and
in bioinformatics where there are more genes than individuals tested, but there
is no hope of uniquely estimating all the parameters in such a model. Different
approaches are necessary.
Such problems can be avoided by paying attention. Identifiability is more of an issue
in designed experiments. Consider a simple two-sample experiment, where the treat-
ment observations are y1,...,yn and the controls are yn+1,...,ym+n. Suppose we try
to model the response by an overall mean µ and group effects α1 and α2:
yj = µ + αi + εj   i = 1,2   j = 1,...,m+n
[ y1   ]   [ 1  1  0 ]            [ ε1   ]
[ ...  ]   [ ...     ]            [ ...  ]
[ yn   ] = [ 1  1  0 ]  [ µ  ]  + [ ...  ]
[ yn+1 ]   [ 1  0  1 ]  [ α1 ]    [ ...  ]
[ ...  ]   [ ...     ]  [ α2 ]    [ ...  ]
[ ym+n ]   [ 1  0  1 ]            [ εm+n ]
Now although X has three columns, it has only rank two — (µ,α1,α2) are not
identifiable and the normal equations have infinitely many solutions. We can solve
this problem by imposing some constraints, µ = 0 or α1 +α2 = 0, for example.
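We can see the rank deficiency directly. Here is a minimal sketch, taking, say, five observations in each group:
n <- m <- 5                                                   # five treatment and five control cases
X <- cbind(1, rep(c(1, 0), c(n, m)), rep(c(0, 1), c(n, m)))   # columns for mu, alpha1, alpha2
qr(X)$rank                                                    # rank is 2 although X has three columns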
Statistics packages handle nonidentifiability differently. In the regression case
above, some may return error messages and some may fit models because rounding
error may remove the exact identifiability. In other cases, constraints may be applied
but these may be different from what you expect. By default, R fits the largest iden-
tifiable model by removing variables in the reverse order of appearance in the model
formula.
Here is an example. Suppose we create a new variable for the Galápagos dataset
— the difference in area between the island and its nearest neighbor:
gala$Adiff <- gala$Area - gala$Adjacent
and add that to the model:
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff,
gala)
 sumary(lmod)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.06822 19.15420 0.37 0.7154
Area -0.02394 0.02242 -1.07 0.2963
Elevation 0.31946 0.05366 5.95 0.0000038
Nearest 0.00914 1.05414 0.01 0.9932
Scruz -0.24052 0.21540 -1.12 0.2752
Adjacent -0.07480 0.01770 -4.23 0.0003
n = 30, p = 6, Residual SE = 60.975, R-Squared = 0.77
We get a message about a singularity because the rank of the design matrix X
is six, which is less than its seven columns. In most cases, the cause of unidentifiability can be revealed with some thought about the variables, but, failing that, an
eigendecomposition of XT X will reveal the linear combination(s) that gave rise to
the unidentifiability — see Section 11.1.
Lack of identifiability is obviously a problem, but it is usually easy to identify and
work around. More problematic are cases where we are close to unidentifiability. To
demonstrate this, suppose we add a small random perturbation to the third decimal
place of Adiff by adding a random variate from U[−0.005,0.005] where U denotes
the uniform distribution. Random numbers are by nature random so results are not
exactly reproducible. However, you can make the numbers come out the same every
time by setting the seed on the random number generator using set.seed(). I have
done this here so you will not wonder why your answers are not exactly the same as
mine, but it is not strictly necessary.
 set.seed(123)
Adiffe <- gala$Adiff + 0.001*(runif(30)-0.5)
and now refit the model:
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe,
gala)
 sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2964 19.4341 0.17 0.87
Area -45122.9865 42583.3393 -1.06 0.30
Elevation 0.3130 0.0539 5.81 0.0000064
Nearest 0.3827 1.1090 0.35 0.73
Scruz -0.2620 0.2158 -1.21 0.24
Adjacent 45122.8891 42583.3406 1.06 0.30
Adiffe 45122.9613 42583.3381 1.06 0.30
n = 30, p = 7, Residual SE = 60.820, R-Squared = 0.78
Notice that now all parameters are estimated, but the standard errors are very large
because we cannot estimate them in a stable way. We set up this problem so we know
the cause but in general we need to be able to identify such situations. We do this in
Section 7.3.
2.11 Orthogonality
Orthogonality is a useful property because it allows us to more easily interpret the
effect of one predictor without regard to another. Suppose we can partition X in two,
X = [X1 | X2] such that X1T X2 = 0. So now:
Y = Xβ + ε = X1β1 + X2β2 + ε
and
XT X = [ X1T X1   X1T X2 ]   [ X1T X1      0      ]
       [ X2T X1   X2T X2 ] = [    0     X2T X2    ]
which means:
β̂1 = (X1T X1)−1X1T y      β̂2 = (X2T X2)−1X2T y
Notice that β̂1 will be the same regardless of whether X2 is in the model or not (and vice versa). So we can interpret the effect of X1 without a concern for X2. Unfortunately, the decoupling is not perfect. Suppose we wish to test H0 : β1 = 0. We have RSS/df = σ̂2 that will be different depending on whether X2 is included in the model or not, but the difference in F is not liable to be as large as in nonorthogonal cases.
If the covariance between vectors x1 and x2 is zero, then ∑j(xj1 −x̄1)(xj2 −x̄2) =
0. This means that if we center the predictors, a covariance of zero implies orthog-
onality. As can be seen in the second example in Section 2.5, we can center the
predictors without essentially changing the model provided we have an intercept
term.
Orthogonality is a desirable property, but will only occur when X is chosen by the
experimenter. It is a feature of a good design. In observational data, we do not have
direct control over X and this is the source of many of the interpretational difficulties
associated with nonexperimental data.
Here is an example of an experiment to determine the effects of column tem-
perature, gas/liquid ratio and packing height in reducing the unpleasant odor of a
chemical product that was sold for household use. Read the data in and display:
data(odor, package="faraway")
 odor
odor temp gas pack
1 66 -1 -1 0
2 39 1 -1 0
3 43 -1 1 0
4 49 1 1 0
5 58 -1 0 -1
6 17 1 0 -1
7 -5 -1 0 1
8 -40 1 0 1
9 65 0 -1 -1
10 7 0 1 -1
11 43 0 -1 1
12 -22 0 1 1
13 -31 0 0 0
14 -35 0 0 0
15 -26 0 0 0
The three predictors have been transformed from their original scale of measurement,
for example, temp = (Fahrenheit-80)/40 so the original values of the predictor were
40, 80 and 120. The data is presented in John (1971) and gives an example of a central
composite design. We compute the covariance of the predictors:
 cov(odor[,-1])
temp gas pack
temp 0.57143 0.00000 0.00000
gas 0.00000 0.57143 0.00000
pack 0.00000 0.00000 0.57143
The matrix is diagonal. Even if temp was measured in the original Fahrenheit scale,
the matrix would still be diagonal, but the entry in the matrix corresponding to temp
would change. Now fit a model, while asking for the correlation of the coefficients:
lmod <- lm(odor ~ temp + gas + pack, odor)
 summary(lmod,cor=T)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.2 9.3 1.63 0.13
temp -12.1 12.7 -0.95 0.36
gas -17.0 12.7 -1.34 0.21
pack -21.4 12.7 -1.68 0.12
Residual standard error: 36 on 11 degrees of freedom
Multiple R-Squared: 0.334, Adjusted R-squared: 0.152
F-statistic: 1.84 on 3 and 11 DF, p-value: 0.199
Correlation of Coefficients:
(Intercept) temp gas
temp 0.00
gas 0.00 0.00
pack 0.00 0.00 0.00
We see that, as expected, the pairwise correlation of all the coefficients is zero. Notice
that the SEs for the coefficients are equal due to the balanced design. Now drop one
of the variables:
lmod <- lm(odor ~ gas + pack, odor)
 summary(lmod)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.20 9.26 1.64 0.13
gas -17.00 12.68 -1.34 0.20
pack -21.37 12.68 -1.69 0.12
Residual standard error: 35.9 on 12 degrees of freedom
Multiple R-Squared: 0.279, Adjusted R-squared: 0.159
F-statistic: 2.32 on 2 and 12 DF, p-value: 0.141
The coefficients themselves do not change, but the residual SE does change slightly,
which causes small changes in the SEs of the coefficients, t-statistics and p-values,
but nowhere near enough to change our qualitative conclusions.
Exercises
1. The dataset teengamb concerns a study of teenage gambling in Britain. Fit a re-
gression model with the expenditure on gambling as the response and the sex,
status, income and verbal score as predictors. Present the output.
(a) What percentage of variation in the response is explained by these predictors?
(b) Which observation has the largest (positive) residual? Give the case number.
(c) Compute the mean and median of the residuals.
(d) Compute the correlation of the residuals with the fitted values.
(e) Compute the correlation of the residuals with the income.
(f) For all other predictors held constant, what would be the difference in predicted
expenditure on gambling for a male compared to a female?
2. The dataset uswages is drawn as a sample from the Current Population Survey in
1988. Fit a model with weekly wages as the response and years of education and
experience as predictors. Report and give a simple interpretation to the regression
coefficient for years of education. Now fit the same model but with logged weekly
wages. Give an interpretation to the regression coefficient for years of education.
Which interpretation is more natural?
3. In this question, we investigate the relative merits of methods for computing the
coefficients. Generate some artificial data by:
x <- 1:20
y <- x + rnorm(20)
Fit a polynomial in x for predicting y. Compute β̂ in two ways — by lm() and by using the direct calculation described in the chapter. At what degree of polynomial does the direct calculation method fail? (Note the need for the I() function in fitting the polynomial, that is, lm(y ~ x + I(x^2)).)
4. The dataset prostate comes from a study on 97 men with prostate cancer who
were due to receive a radical prostatectomy. Fit a model with lpsa as the response
and lcavol as the predictor. Record the residual standard error and the R2. Now
add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a
time. For each model record the residual standard error and the R2. Plot the trends
in these two statistics.
5. Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa
on lcavol and lcavol on lpsa. Display both regression lines on the plot. At
what point do the two lines intersect?
6. Thirty samples of cheddar cheese were analyzed for their content of acetic acid,
hydrogen sulfide and lactic acid. Each sample was tasted and scored by a panel of
judges and the average taste score produced. Use the cheddar data to answer the
following:
(a) Fit a regression model with taste as the response and the three chemical con-
tents as predictors. Report the values of the regression coefficients.
(b) Compute the correlation between the fitted values and the response. Square it.
Identify where this value appears in the regression output.
(c) Fit the same regression model but without an intercept term. What is the value
of R2 reported in the output? Compute a more reasonable measure of the good-
ness of fit for this example.
(d) Compute the regression coefficients from the original fit using the QR decom-
position showing your R code.
7. An experiment was conducted to determine the effect of four factors on the re-
sistivity of a semiconductor wafer. The data is found in wafer where each of the
four factors is coded as − or + depending on whether the low or the high setting
for that factor was used. Fit the linear model resist ∼ x1 + x2 + x3 + x4.
(a) Extract the X matrix using the model.matrix function. Examine this to deter-
mine how the low and high levels have been coded in the model.
(b) Compute the correlation in the X matrix. Why are there some missing values
in the matrix?
(c) What difference in resistance is expected when moving from the low to the
high level of x1?
(d) Refit the model without x4 and examine the regression coefficients and standard errors. What stayed the same as the original fit and what changed?
(e) Explain how the change in the regression coefficients is related to the correla-
tion matrix of X.
8. An experiment was conducted to examine factors that might affect the height of
leaf springs in the suspension of trucks. The data may be found in truck. The five
factors in the experiment are set to − and + but it will be more convenient for us
to use −1 and +1. This can be achieved for the first factor by:
truck$B <- sapply(truck$B, function(x) ifelse(x=="-", -1, 1))
Repeat for the other four factors.
(a) Fit a linear model for the height in terms of the five factors. Report on the value
of the regression coefficients.
(b) Fit a linear model using just factors B, C, D and E and report the coefficients.
How do these compare to the previous question? Show how we could have
anticipated this result by examining the X matrix.
(c) Construct a new predictor called A which is set to B+C+D+E. Fit a linear model
with the predictors A, B, C, D, E and O. Do coefficients for all six predictors
appear in the regression summary? Explain.
(d) Extract the model matrix X from the previous model. Attempt to compute β̂ from (XT X)−1XT y. What went wrong and why?
(e) Use the QR decomposition method as seen in Section 2.7 to compute β̂. Are the results satisfactory?
(f) Use the function qr.coef to correctly compute β̂.
Faraway, Julian J.. Linear Models with R, CRC Press LLC, 2014. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/bibliouocsp-ebooks/detail.action?docID=1640577.
Created from bibliouocsp-ebooks on 2023-01-07 10:27:00.
Copyright
©
2014.
CRC
Press
LLC.
All
rights
reserved.
This page intentionally left blank
This page intentionally left blank
Faraway, Julian J.. Linear Models with R, CRC Press LLC, 2014. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/bibliouocsp-ebooks/detail.action?docID=1640577.
Created from bibliouocsp-ebooks on 2023-01-07 10:27:00.
Copyright
©
2014.
CRC
Press
LLC.
All
rights
reserved.

More Related Content

PPTX
REGRESSION ANALYSIS THEORY EXPLAINED HERE
PPTX
Line of best fit lesson
PDF
Lesson 26
PDF
AI Lesson 26
PPTX
Chapter two 1 econometrics lecture note.pptx
PPTX
Artificial intelligence.pptx
PPTX
Artificial intelligence
PPTX
Artificial intelligence
REGRESSION ANALYSIS THEORY EXPLAINED HERE
Line of best fit lesson
Lesson 26
AI Lesson 26
Chapter two 1 econometrics lecture note.pptx
Artificial intelligence.pptx
Artificial intelligence
Artificial intelligence

Similar to Linear_Models_with_R_----_(2._Estimation).pdf (20)

PPTX
PRML Chapter 3
PDF
Bayesian Inference: An Introduction to Principles and ...
DOCX
For this assignment, use the aschooltest.sav dataset.The d
PDF
Statistics And Exploratory Data Analysis
PDF
Materi_Business_Intelligence_1.pdf
PDF
Deep VI with_beta_likelihood
PDF
Chapter 14 Part I
PPT
Lesson 2 2
PDF
A Systematic Approach To Probabilistic Pointer Analysis
PDF
Machine learning (4)
PPTX
Heteroscedasticity Remedial Measures.pptx
DOCX
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
PPT
Get Multiple Regression Assignment Help
PPT
Class9_PCA_final.ppt
PDF
Temporal disaggregation methods
DOCX
Mc0079 computer based optimization methods--phpapp02
PDF
Linear regression model in econometrics undergraduate
PPTX
PRML Chapter 1
PDF
Considerations on the genetic equilibrium law
PPTX
Heteroscedasticity Remedial Measures.pptx
PRML Chapter 3
Bayesian Inference: An Introduction to Principles and ...
For this assignment, use the aschooltest.sav dataset.The d
Statistics And Exploratory Data Analysis
Materi_Business_Intelligence_1.pdf
Deep VI with_beta_likelihood
Chapter 14 Part I
Lesson 2 2
A Systematic Approach To Probabilistic Pointer Analysis
Machine learning (4)
Heteroscedasticity Remedial Measures.pptx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
Get Multiple Regression Assignment Help
Class9_PCA_final.ppt
Temporal disaggregation methods
Mc0079 computer based optimization methods--phpapp02
Linear regression model in econometrics undergraduate
PRML Chapter 1
Considerations on the genetic equilibrium law
Heteroscedasticity Remedial Measures.pptx
Ad

Recently uploaded (20)

PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Institutional Correction lecture only . . .
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
GDM (1) (1).pptx small presentation for students
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Complications of Minimal Access Surgery at WLH
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
Pre independence Education in Inndia.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
Cell Types and Its function , kingdom of life
Abdominal Access Techniques with Prof. Dr. R K Mishra
Renaissance Architecture: A Journey from Faith to Humanism
Sports Quiz easy sports quiz sports quiz
Institutional Correction lecture only . . .
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPH.pptx obstetrics and gynecology in nursing
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
TR - Agricultural Crops Production NC III.pdf
GDM (1) (1).pptx small presentation for students
102 student loan defaulters named and shamed – Is someone you know on the list?
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
Complications of Minimal Access Surgery at WLH
STATICS OF THE RIGID BODIES Hibbelers.pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Pre independence Education in Inndia.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
O7-L3 Supply Chain Operations - ICLT Program
Cell Types and Its function , kingdom of life
Ad

Linear_Models_with_R_----_(2._Estimation).pdf

  • 1. Chapter 2 Estimation 2.1 Linear Model Let’s start by defining what is meant by a linear model. Suppose we want to model the response Y in terms of three predictors, X1, X2 and X3. One very general form for the model would be: Y = f(X1,X2,X3)+ε where f is some unknown function and ε is the error in this representation. ε is additive in this instance, but could enter in some even more general form. Still, if we assume that f is a smooth, continuous function, that still leaves a very wide range of possibilities. Even with just three predictors, we typically will not have enough data to try to estimate f directly. So we usually have to assume that it has some more restricted form, perhaps linear as in: Y = β0 +β1X1 +β2X2 +β3X3 +ε where βi, i = 0,1,2,3 are unknown parameters. Unfortunately this term is subject to some confusion as engineers often use the term parameter for what statisticians call the variables, Y, X1 and so on. β0 is called the intercept term. Thus the problem is reduced to the estimation of four parameters rather than the infinite dimensional f. In a linear model the parameters enter linearly — the predictors themselves do not have to be linear. For example: Y = β0 +β1X1 +β2 logX2 +β3X1X2 +ε is a linear model, but: Y = β0 +β1X β2 1 +ε is not. Some relationships can be transformed to linearity — for example, y = β0x β 1ε can be linearized by taking logs. Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. The term linear is often used in everyday speech as almost a synonym for simplicity. This gives the casual observer the impression that linear models can only handle small simple datasets. This is far from the truth — linear models can easily be expanded and modified to handle complex datasets. Linear is also used to refer to straight lines, but linear models can be curved. Truly nonlinear models are rarely ab- solutely necessary and most often arise from a theory about the relationships between the variables, rather than an empirical investigation. Where do models come from? We distinguish several different sources: 13 Faraway, Julian J.. Linear Models with R, CRC Press LLC, 2014. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliouocsp-ebooks/detail.action?docID=1640577. Created from bibliouocsp-ebooks on 2023-01-07 10:27:00. Copyright © 2014. CRC Press LLC. All rights reserved.
  • 2. 14 ESTIMATION 1. Physical theory may suggest a model. For example, Hooke’s law says that the extension of a spring is proportional to the weight attached. Models like these usually arise in the physical sciences and engineering. 2. Experience with past data. Similar data used in the past were modeled in a partic- ular way. It is natural to see whether the same model will work with the current data. Models like these usually arise in the social sciences. 3. No prior idea exists — the model comes from an exploration of the data. We use skill and judgment to pick a model. Sometimes it does not work and we have to try again. Models that derive directly from physical theory are relatively uncommon so that usually the linear model can only be regarded as an approximation to a complex real- ity. We hope it predicts well or explains relationships usefully but usually we do not believe it is exactly true. A good model is like a map that guides us to our destination. For the rest of this chapter, we will stay in the special world of Mathematics where all models are true. 2.2 Matrix Representation We want a general solution to estimating the parameters of a linear model. We can find simple formulae for some special cases but to devise a method that will work in all cases, we need to use matrix algebra. Let’s see how this can be done. We start with some data where we have a response Y and, say, three predictors, X1, X2 and X3. The data might be presented in tabular form like this: y1 x11 x12 x13 y2 x21 x22 x23 ... ... yn xn1 xn2 xn3 where n is the number of observations, or cases, in the dataset. Given the actual data values, we may write the model as: yi = β0 +β1xi1 +β2xi2 +β3xi3 +εi i = 1,...,n. It will be more convenient to put this in a matrix/vector representation. The regression equation is then written as: y = Xβ+ε where y = (y1,...,yn)T , ε = (ε1,...,εn)T , β = (β0,...,β3)T and: X =     1 x11 x12 x13 1 x21 x22 x23 ... ... 1 xn1 xn2 xn3     Faraway, Julian J.. Linear Models with R, CRC Press LLC, 2014. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliouocsp-ebooks/detail.action?docID=1640577. Created from bibliouocsp-ebooks on 2023-01-07 10:27:00. Copyright © 2014. CRC Press LLC. All rights reserved.
  • 3. ESTIMATING β 15 The column of ones incorporates the intercept term. One simple example is the null model where there is no predictor and just a mean y = µ+ε:   y1 ... yn   =   1 ... 1  µ+   ε1 ... εn   We can assume that Eε = 0 since if this were not so, we could simply absorb the nonzero expectation for the error into the mean µ to get a zero expectation. 2.3 Estimating β The regression model, y = Xβ+ε, partitions the response into a systematic compo- nent Xβ and a random component ε. We would like to choose β so that the system- atic part explains as much of the response as possible. Geometrically speaking, the response lies in an n-dimensional space, that is, y ∈ IRn while β ∈ IRp where p is the number of parameters. If we include the intercept then p is the number of predictors plus one. It is easy to get confused as to whether p is the number of predictors or parameters, as different authors use different conventions, so be careful. The problem is to find β so that Xβ is as close to Y as possible. The best choice, the estimate ˆ β, is apparent in the geometrical representation seen in Figure 2.1. ˆ β is, in this sense, the best estimate of β within the model space. The ˆ β values are sometimes called the regression coefficients. The response predicted by the model is ŷ = X ˆ β or Hy where H is an orthogonal projection matrix. The ŷ are called predicted or fitted values. The difference between the actual response and the predicted response is denoted by ε̂ and is called the residual. . ŷ y ε̂ Space spanned by X Figure 2.1 Geometrical representation of the estimation β. The data vector Y is projected orthogonally onto the model space spanned by X. The fit is represented by projection ŷ = X ˆ β with the difference between the fit and the data represented by the residual vector ε̂. The conceptual purpose of the model is to represent, as accurately as possible, something complex, y, which is n-dimensional, in terms of something much simpler, the model, which is p-dimensional. Thus if our model is successful, the structure in the data should be captured in those p dimensions, leaving just random variation in the residuals which lie in an (n− p)-dimensional space. We have: Faraway, Julian J.. Linear Models with R, CRC Press LLC, 2014. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliouocsp-ebooks/detail.action?docID=1640577. Created from bibliouocsp-ebooks on 2023-01-07 10:27:00. Copyright © 2014. CRC Press LLC. All rights reserved.
  Data          =  Systematic Structure  +  Random Variation
  n dimensions  =  p dimensions          +  (n − p) dimensions

2.4 Least Squares Estimation

The estimation of β can also be considered from a nongeometric point of view. We define the best estimate of β as the one which minimizes the sum of the squared errors:

  Σ εi² = ε^T ε = (y − Xβ)^T (y − Xβ)

Differentiating with respect to β and setting to zero, we find that β̂ satisfies:

  X^T X β̂ = X^T y

These are called the normal equations. We can derive the same result using the geometric approach. Now provided X^T X is invertible:

  β̂ = (X^T X)^{-1} X^T y
  Xβ̂ = X (X^T X)^{-1} X^T y
  ŷ = Hy

H = X(X^T X)^{-1} X^T is called the hat matrix and is the orthogonal projection of y onto the space spanned by X. H is useful for theoretical manipulations, but you usually do not want to compute it explicitly, as it is an n × n matrix which could be uncomfortably large for some datasets. The following useful quantities can now be represented using H.

The predicted or fitted values are ŷ = Hy = Xβ̂ while the residuals are ε̂ = y − Xβ̂ = y − ŷ = (I − H)y. The residual sum of squares (RSS) is ε̂^T ε̂ = y^T (I − H)^T (I − H) y = y^T (I − H) y.

Later, we will show that the least squares estimate is the best possible estimate of β when the errors ε are uncorrelated and have equal variance, i.e., var ε = σ²I. β̂ is unbiased and has variance (X^T X)^{-1} σ² provided var ε = σ²I. Since β̂ is a vector, its variance is a matrix.

We also need to estimate σ². We find that E ε̂^T ε̂ = σ²(n − p), which suggests the estimator:

  σ̂² = ε̂^T ε̂ / (n − p) = RSS / (n − p)

as an unbiased estimate of σ². n − p is called the degrees of freedom of the model. Sometimes you need the standard error for a particular component of β̂, which can be picked out as se(β̂_(i−1)) = √[(X^T X)^{-1}_(ii)] σ̂.
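As a quick illustration of these formulas (an added sketch on simulated data, not part of the text's example), β̂, ŷ, ε̂, σ̂² and the standard errors can all be computed directly:

# a minimal sketch: n = 50 simulated cases, two predictors plus an intercept
set.seed(1)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 - x2 + rnorm(n)
X <- cbind(1, x1, x2)                 # design matrix with an intercept column
p <- ncol(X)
xtxi <- solve(t(X) %*% X)             # (X^T X)^{-1}; direct inversion, fine for illustration
betahat <- xtxi %*% t(X) %*% y        # least squares estimate
yhat <- X %*% betahat                 # fitted values
ehat <- y - yhat                      # residuals
sigma2hat <- sum(ehat^2) / (n - p)    # RSS/(n - p)
se <- sqrt(diag(xtxi) * sigma2hat)    # standard errors of the coefficients
cbind(betahat, se)
# these should agree with summary(lm(y ~ x1 + x2)), which uses a more stable method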
2.5 Examples of Calculating β̂

In a few simple models, it is possible to derive explicit formulae for β̂:

1. When y = µ + ε, X = 1 and β = µ, hence X^T X = 1^T 1 = n so:

     β̂ = (X^T X)^{-1} X^T y = (1/n) 1^T y = ȳ

2. Simple linear regression (one predictor):

     yi = β0 + β1 xi + εi

     | y1 |   | 1  x1 |  | β0 |   | ε1 |
     | .. | = | .   . |  | β1 | + | .. |
     | yn |   | 1  xn |           | εn |

   We can now apply the formula, but a simpler approach is to rewrite the equation as:

     yi = β0' + β1 (xi − x̄) + εi    where β0' = β0 + β1 x̄

   so now:

     X = | 1  x1 − x̄ |        X^T X = | n        0         |
         | .  ...    |                | 0   Σ (xi − x̄)²    |
         | 1  xn − x̄ |

   Next work through the rest of the calculation to reconstruct the familiar estimate, that is:

     β̂1 = Σ (xi − x̄) yi / Σ (xi − x̄)²

   (A short numerical check of this formula appears at the end of this section.)

In higher dimensions, it is usually not possible to find such explicit formulae for the parameter estimates unless X^T X happens to have a simple form. So typically we need computers to fit such models. Regression has a long history, so in the time before computers became readily available, fitting even quite simple models was a tedious, time-consuming task. When computing was expensive, data analysis was limited: it was designed to keep calculations to a minimum and to restrict the number of plots. This mindset remained in statistical practice for some time even after computing became widely and cheaply available. Now it is a simple matter to fit a multitude of models and make more plots than one could reasonably study. The challenge now for the analyst is to choose among these intelligently to extract the crucial information in the data.
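Here is the promised check — a small added simulation, not from the text, confirming that the explicit simple-regression formula matches what lm() computes:

# verify the explicit formula for the slope against lm()
set.seed(2)
x <- runif(30)
y <- 3 + 1.5*x + rnorm(30, sd = 0.2)
b1 <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # explicit formula for the slope
b0 <- mean(y) - b1 * mean(x)                          # intercept recovered from the centered form
c(b0, b1)
coef(lm(y ~ x))                                       # should agree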
2.6 Example

Now let's look at an example concerning the number of species found on the various Galápagos Islands. There are 30 cases (Islands) and seven variables in the dataset. We start by reading the data into R and examining it:

data(gala, package="faraway")
head(gala[,-2])
             Species  Area Elevation Nearest Scruz Adjacent
Baltra            58 25.09       346     0.6   0.6     1.84
Bartolome         31  1.24       109     0.6  26.3   572.33
Caldwell           3  0.21       114     2.8  58.7     0.78
Champion          25  0.10        46     1.9  47.4     0.18
Coamano            2  0.05        77     1.9   1.9   903.82
Daphne.Major      18  0.34       119     8.0   8.0     1.84

The variables are Species — the number of species found on the island, Area — the area of the island (km²), Elevation — the highest elevation of the island (m), Nearest — the distance from the nearest island (km), Scruz — the distance from Santa Cruz Island (km), Adjacent — the area of the adjacent island (km²). We have omitted the second column (which has the number of endemic species) because we shall not use this alternative response variable in this analysis.

The data were presented by Johnson and Raven (1973) and also appear in Weisberg (1985). I have filled in some missing values for simplicity (see Chapter 13 for how this can be done). Fitting a linear model in R is done using the lm() command. Notice the syntax for specifying the predictors in the model. This is part of the Wilkinson–Rogers notation. In this case, since all the variables are in the gala data frame, we must use the data= argument:

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
summary(lmod)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-111.68  -34.90   -7.86   33.46  182.58

Coefficients:
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.06822   19.15420    0.37    0.7154
Area         -0.02394    0.02242   -1.07    0.2963
Elevation     0.31946    0.05366    5.95 0.0000038 ***
Nearest       0.00914    1.05414    0.01    0.9932
Scruz        -0.24052    0.21540   -1.12    0.2752
Adjacent     -0.07480    0.01770   -4.23    0.0003 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 61 on 24 degrees of freedom
Multiple R-squared: 0.766,  Adjusted R-squared: 0.717
F-statistic: 15.7 on 5 and 24 DF,  p-value: 6.84e-07

For my tastes, this output contains rather too much information. I have written an alternative called sumary which produces a shorter version of this. Since we will be looking at a lot of regression output, the use of this version makes this book several pages shorter. Of course, if you prefer the above, feel free to add the extra "m" in the function call.
You will need to install and load my package if you want to use my version:

require(faraway)
sumary(lmod)
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.06822   19.15420    0.37    0.7154
Area         -0.02394    0.02242   -1.07    0.2963
Elevation     0.31946    0.05366    5.95 0.0000038
Nearest       0.00914    1.05414    0.01    0.9932
Scruz        -0.24052    0.21540   -1.12    0.2752
Adjacent     -0.07480    0.01770   -4.23    0.0003

n = 30, p = 6, Residual SE = 60.975, R-Squared = 0.77

We can identify several useful quantities in this output. Other statistical packages tend to produce output quite similar to this. One useful feature of R is that it is possible to directly calculate quantities of interest. Of course, it is not necessary here because the lm() function does the job, but it is very useful when the statistic you want is not part of the prepackaged functions. First, we extract the X-matrix:

x <- model.matrix( ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)

and here is the response y:

y <- gala$Species

Now let's construct (X^T X)^{-1}. t() does transpose and %*% does matrix multiplication. solve(A) computes A^{-1} while solve(A,b) solves Ax = b:

xtxi <- solve(t(x) %*% x)

We can get β̂ directly, using (X^T X)^{-1} X^T y:

xtxi %*% t(x) %*% y
                [,1]
1           7.068221
Area       -0.023938
Elevation   0.319465
Nearest     0.009144
Scruz      -0.240524
Adjacent   -0.074805

This is a very bad way to compute β̂. It is inefficient and can be very inaccurate when the predictors are strongly correlated. Such problems are exacerbated by large datasets. A better, but not perfect, way is:

solve(crossprod(x,x), crossprod(x,y))
                [,1]
1           7.068221
Area       -0.023938
Elevation   0.319465
Nearest     0.009144
Scruz      -0.240524
Adjacent   -0.074805
where crossprod(x,y) computes x^T y. Here we get the same result as lm() because the data are well-behaved. In the long run, you are advised to use carefully programmed code such as found in lm(), which uses the QR decomposition. To see more details, consult a text such as Thisted (1988) or read Section 2.7.

We can extract the regression quantities we need from the model object. Commonly used are residuals(), fitted(), df.residual() which gives the degrees of freedom, deviance() which gives the RSS and coef() which gives the β̂. You can also extract other needed quantities by examining the model object and its summary:

names(lmod)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

lmodsum <- summary(lmod)
names(lmodsum)
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

We can estimate σ using the formula in the text above or extract it from the summary object:

sqrt(deviance(lmod)/df.residual(lmod))
[1] 60.975
lmodsum$sigma
[1] 60.975

We can also extract (X^T X)^{-1} and use it to compute the standard errors for the coefficients. (diag() returns the diagonal of a matrix):

xtxi <- lmodsum$cov.unscaled
sqrt(diag(xtxi))*60.975
(Intercept)        Area   Elevation     Nearest       Scruz
  19.154139    0.022422    0.053663    1.054133    0.215402
   Adjacent
   0.017700

or get them from the summary object:

lmodsum$coef[,2]
(Intercept)        Area   Elevation     Nearest       Scruz
  19.154198    0.022422    0.053663    1.054136    0.215402
   Adjacent
   0.017700
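Equivalently (a small addition, not in the original text), the built-in vcov() function returns σ̂²(X^T X)^{-1} for a fitted lm object, so the same standard errors can be obtained in one line:

sqrt(diag(vcov(lmod)))    # square roots of the diagonal of the coefficient variance matrix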
2.7 QR Decomposition

This section might be skipped unless you are interested in the actual calculation of β̂ and related quantities. Any design matrix X can be written as:

  X = Q | R |  =  Qf R
        | 0 |

where Q is an n × n orthogonal matrix, that is Q^T Q = QQ^T = I, and R is a p × p upper triangular matrix (Rij = 0 for i > j). The 0 is an (n − p) × p matrix of zeroes while Qf is the first p columns of Q.

The RSS = (y − Xβ)^T (y − Xβ) = ‖y − Xβ‖², where ‖·‖ is the Euclidean length of a vector. The matrix Q represents a rotation and does not change length. Hence:

  RSS = ‖Q^T y − Q^T Xβ‖²

where Q^T y partitions into a vector f of length p stacked above a vector r of length n − p, and Q^T X is R stacked above 0. From this we see:

  RSS = ‖f − Rβ‖² + ‖r‖²

which can be minimized by setting β so that Rβ = f.

Let's see how this works for the Galápagos data. First we compute the QR decomposition:

qrx <- qr(x)

The components of the decomposition must be extracted by other functions. For example, we can extract the Q matrix using qr.Q():

dim(qr.Q(qrx))
[1] 30  6

Notice that we do not need the whole n × n matrix for the computation. The first p columns suffice, so qr.Q() returns what we call Qf. We can compute f:

(f <- t(qr.Q(qrx)) %*% y)
          [,1]
[1,] -466.8422
[2,]  381.4056
[3,]  256.2505
[4,]    5.4076
[5,] -119.4983
[6,]  257.6944

Solving Rβ = f is easy because of the triangular form of R. We use the method of backsubstitution:

backsolve(qr.R(qrx), f)
          [,1]
[1,]  7.068221
[2,] -0.023938
[3,]  0.319465
[4,]  0.009144
[5,] -0.240524
[6,] -0.074805

where the results match those seen previously.
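As a brief addition to the worked example above, the same two steps are packaged in the qr.coef() function, which takes the decomposition and the response and returns the coefficients directly:

qr.coef(qrx, y)    # combines the Q^T y rotation and the backsolve in one call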
2.8 Gauss–Markov Theorem

β̂ is a plausible estimator, but there are alternatives. Nonetheless, there are three good reasons to use least squares:

1. It results from an orthogonal projection onto the model space. It makes sense geometrically.

2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator. Loosely put, the maximum likelihood estimate is the value of β that maximizes the probability of the data that was observed.

3. The Gauss–Markov theorem states that β̂ is the best linear unbiased estimate (BLUE).

To understand the Gauss–Markov theorem we first need to understand the concept of an estimable function. A linear combination of the parameters ψ = c^T β is estimable if and only if there exists a linear combination a^T y such that:

  E a^T y = c^T β    ∀β

Estimable functions include predictions of future observations, which explains why they are well worth considering. If X is of full rank, then all linear combinations are estimable.

Suppose Eε = 0 and var ε = σ²I. Suppose also that the structural part of the model, EY = Xβ, is correct. (Clearly these are big assumptions and so we will address the implications of this later.) Let ψ = c^T β be an estimable function; then the Gauss–Markov theorem states that in the class of all unbiased linear estimates of ψ, ψ̂ = c^T β̂ has the minimum variance and is unique.

We prove this theorem. Suppose a^T y is some unbiased estimate of c^T β so that:

  E a^T y = c^T β    ∀β
  a^T Xβ = c^T β     ∀β

which means that a^T X = c^T. This implies that c must be in the range space of X^T, which in turn implies that c is also in the range space of X^T X, which means there exists a λ such that c = X^T X λ so:

  c^T β̂ = λ^T X^T X β̂ = λ^T X^T y

Now we can show that the least squares estimator has the minimum variance — pick an arbitrary estimate a^T y and compute its variance:

  var(a^T y) = var(a^T y − c^T β̂ + c^T β̂)
             = var(a^T y − λ^T X^T y + c^T β̂)
             = var(a^T y − λ^T X^T y) + var(c^T β̂) + 2 cov(a^T y − λ^T X^T y, λ^T X^T y)

but

  cov(a^T y − λ^T X^T y, λ^T X^T y) = (a^T − λ^T X^T) σ²I X λ
                                    = (a^T X − λ^T X^T X) σ²I λ
                                    = (c^T − c^T) σ²I λ
                                    = 0
so

  var(a^T y) = var(a^T y − λ^T X^T y) + var(c^T β̂)

Now since variances cannot be negative, we see that:

  var(a^T y) ≥ var(c^T β̂)

In other words, c^T β̂ has minimum variance. It now remains to show that it is unique. There will be equality in the above relationship if var(a^T y − λ^T X^T y) = 0, which would require that a^T − λ^T X^T = 0, which means that a^T y = λ^T X^T y = c^T β̂. So equality occurs only if a^T y = c^T β̂, so the estimator is unique. This completes the proof.

The Gauss–Markov theorem shows that the least squares estimate β̂ is a good choice, but it does require that the errors are uncorrelated and have equal variance. Even if the errors behave, but are nonnormal, then nonlinear or biased estimates may work better. So this theorem does not tell one to use least squares all the time; it just strongly suggests it unless there is some strong reason to do otherwise. Situations where estimators other than ordinary least squares should be considered are:

1. When the errors are correlated or have unequal variance, generalized least squares should be used. See Section 8.1.

2. When the error distribution is long-tailed, then robust estimates might be used. Robust estimates are typically not linear in y. See Section 8.4.

3. When the predictors are highly correlated (collinear), then biased estimators such as ridge regression might be preferable. See Chapter 11.

2.9 Goodness of Fit

It is useful to have some measure of how well the model fits the data. One common choice is R², the so-called coefficient of determination or percentage of variance explained:

  R² = 1 − Σ(ŷi − yi)² / Σ(yi − ȳ)² = 1 − RSS / (Total SS, corrected for the mean)

Its range is 0 ≤ R² ≤ 1 — values closer to 1 indicating better fits. For simple linear regression R² = r² where r is the correlation between x and y. An equivalent definition is:

  R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)²    or    R² = cor²(ŷ, y)
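As a quick illustration (an added sketch using the Galápagos fit lmod from Section 2.6), all three forms can be computed directly and compared with the value reported by the summary:

# three equivalent computations of R^2 for the gala model fitted earlier
rss <- deviance(lmod)
tss <- sum((gala$Species - mean(gala$Species))^2)
1 - rss/tss
sum((fitted(lmod) - mean(gala$Species))^2) / tss
cor(fitted(lmod), gala$Species)^2
# all should agree with the R-Squared value in the sumary() output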
The graphical intuition behind R² is seen in Figure 2.2. Suppose you want to predict y. If you do not know x, then your best prediction is ȳ, but the variability in this prediction is high. If you do know x, then your prediction will be given by the regression fit. This prediction will be less variable provided there is some relationship between x and y. R² is one minus the ratio of the sum of squares for these two predictions. Thus for perfect predictions the ratio will be zero and R² will be one.

[Figure 2.2: When x is not known, the best predictor of y is ȳ and the variation is denoted by the dotted line. When x is known, we can predict y more accurately by the solid line. R² is related to the ratio of these two variances.]

Some care is necessary if there is no intercept in your model. The denominator in the first definition of R² has a null model with an intercept in mind when the sum of squares is calculated. Unfortunately, R uses this definition and will give a misleadingly high R². If you must have an R², use the cor²(ŷ, y) definition when there is no intercept.

What is a good value of R²? It depends on the area of application. In the biological and social sciences, variables tend to be more weakly correlated and there is a lot of noise. We would expect lower values for R² in these areas — a value of, say, 0.6 might be considered good. In physics and engineering, where most data come from closely controlled experiments, we typically expect to get much higher R²s and a value of 0.6 would be considered low. Some experience with the particular area is necessary for you to judge your R²s well.

It is a mistake to rely on R² as a sole measure of fit. In Figure 2.3, we see some simulated datasets where the R² is around 0.65 for a linear fit in all four cases. The first plot on the upper left shows an ordinary relationship while on the right the variation in x is smaller as is the residual variation. However, predictions (within the range of x) would have less variation in the second case. On the lower left, the fit looks good except for two outliers demonstrating how sensitive R² is to a few extreme values. On the lower right, we see that the true relationship is somewhat quadratic which shows us that R² doesn't tell us much about whether we have the right model.
[Figure 2.3: Four simulated datasets where R² is about 0.65. The plot on the upper left is well-behaved for R². In the plot on the upper right, the residual variation is smaller than the first plot but the variation in x is also smaller so R² is about the same. In the plot on the lower left, the fit looks strong except for a couple of outliers while on the lower right, the relationship is quadratic.]

An alternative measure of fit is σ̂. This quantity is directly related to the standard errors of estimates of β and predictions. The advantage is that σ̂ is measured in the units of the response and so may be directly interpreted in the context of the particular dataset. This may also be a disadvantage in that one must understand the practical significance of this measure whereas R², being unitless, is easy to understand. The R regression summary returns both values and it is worth paying attention to both of them.
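Both quantities are also easy to pull out of the summary object created earlier (a small added illustration):

c(sigma = lmodsum$sigma, r.squared = lmodsum$r.squared)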
2.10 Identifiability

The least squares estimate is the solution to the normal equations:

  X^T X β̂ = X^T y

where X is an n × p matrix. If X^T X is singular and cannot be inverted, then there will be infinitely many solutions to the normal equations and β̂ is at least partially unidentifiable. Unidentifiability will occur when X is not of full rank — that is, when its columns are linearly dependent. With observational data, unidentifiability is usually caused by some oversight. Here are some examples:

1. A person's weight is measured both in pounds and kilos and both variables are entered into the model. One variable is just a multiple of the other.

2. For each individual we record the number of years of preuniversity education, the number of years of university education and also the total number of years of education and put all three variables into the model. There is an exact linear relation among the variables.

3. We have more variables than cases, that is, p > n. When p = n, we may perhaps estimate all the parameters, but with no degrees of freedom left to estimate any standard errors or do any testing. Such a model is called saturated. When p > n, then the model is sometimes called supersaturated. Such models are considered in large-scale screening experiments used in product design and manufacture and in bioinformatics where there are more genes than individuals tested, but there is no hope of uniquely estimating all the parameters in such a model. Different approaches are necessary.

Such problems can be avoided by paying attention. Identifiability is more of an issue in designed experiments. Consider a simple two-sample experiment, where the treatment observations are y1,...,yn and the controls are yn+1,...,ym+n. Suppose we try to model the response by an overall mean µ and group effects α1 and α2:

  yj = µ + αi + εj    i = 1,2    j = 1,...,m+n

  | y1   |   | 1  1  0 |           | ε1   |
  | ...  |   | .  .  . |  | µ  |   | ...  |
  | yn   | = | 1  1  0 |  | α1 | + | ...  |
  | yn+1 |   | 1  0  1 |  | α2 |   | ...  |
  | ...  |   | .  .  . |           | ...  |
  | ym+n |   | 1  0  1 |           | εm+n |

Now although X has three columns, it has only rank two — (µ, α1, α2) are not identifiable and the normal equations have infinitely many solutions. We can solve this problem by imposing some constraints, µ = 0 or α1 + α2 = 0, for example.

Statistics packages handle nonidentifiability differently. In the regression case above, some may return error messages and some may fit models because rounding error may remove the exact linear dependence. In other cases, constraints may be applied but these may be different from what you expect. By default, R fits the largest identifiable model by removing variables in the reverse order of appearance in the model formula.
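To see this default behavior concretely (an added sketch, separate from the text's Galápagos example that follows), fit the two-group model above with explicit dummy variables; R reports NA for the last, redundant one:

# a small sketch of the two-group model with a redundant dummy column
set.seed(3)
grp <- rep(c("treatment", "control"), each = 5)
d1 <- as.numeric(grp == "treatment")
d2 <- as.numeric(grp == "control")     # d1 + d2 equals the intercept column
y  <- 10 + 2*d1 + rnorm(10)
coef(lm(y ~ d1 + d2))                  # d2, last in the formula, is reported as NA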
Here is an example. Suppose we create a new variable for the Galápagos dataset — the difference in area between the island and its nearest neighbor:

gala$Adiff <- gala$Area - gala$Adjacent

and add that to the model:

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
sumary(lmod)
Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.06822   19.15420    0.37    0.7154
Area         -0.02394    0.02242   -1.07    0.2963
Elevation     0.31946    0.05366    5.95 0.0000038
Nearest       0.00914    1.05414    0.01    0.9932
Scruz        -0.24052    0.21540   -1.12    0.2752
Adjacent     -0.07480    0.01770   -4.23    0.0003

n = 30, p = 6, Residual SE = 60.975, R-Squared = 0.77

We get a message about a singularity because the rank of the design matrix X is six, which is less than its seven columns. In most cases, the cause of unidentifiability can be revealed with some thought about the variables, but, failing that, an eigendecomposition of X^T X will reveal the linear combination(s) that gave rise to the unidentifiability — see Section 11.1.

Lack of identifiability is obviously a problem, but it is usually easy to identify and work around. More problematic are cases where we are close to unidentifiability. To demonstrate this, suppose we add a small random perturbation to the third decimal place of Adiff by adding a random variate from U[−0.005, 0.005] where U denotes the uniform distribution. Random numbers are by nature random, so results are not exactly reproducible. However, you can make the numbers come out the same every time by setting the seed on the random number generator using set.seed(). I have done this here so you will not wonder why your answers are not exactly the same as mine, but it is not strictly necessary.

set.seed(123)
Adiffe <- gala$Adiff + 0.001*(runif(30) - 0.5)

and now refit the model:

lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
sumary(lmod)
               Estimate  Std. Error t value   Pr(>|t|)
(Intercept)      3.2964     19.4341    0.17       0.87
Area        -45122.9865  42583.3393   -1.06       0.30
Elevation        0.3130      0.0539    5.81  0.0000064
Nearest          0.3827      1.1090    0.35       0.73
Scruz           -0.2620      0.2158   -1.21       0.24
Adjacent     45122.8891  42583.3406    1.06       0.30
Adiffe       45122.9613  42583.3381    1.06       0.30

n = 30, p = 7, Residual SE = 60.820, R-Squared = 0.78
Notice that now all parameters are estimated, but the standard errors are very large because we cannot estimate them in a stable way. We set up this problem so we know the cause, but in general we need to be able to identify such situations. We do this in Section 7.3.

2.11 Orthogonality

Orthogonality is a useful property because it allows us to more easily interpret the effect of one predictor without regard to another. Suppose we can partition X in two, X = [X1 | X2], such that X1^T X2 = 0. So now:

  Y = Xβ + ε = X1 β1 + X2 β2 + ε

and

  X^T X = | X1^T X1   X1^T X2 |  =  | X1^T X1      0      |
          | X2^T X1   X2^T X2 |     |    0      X2^T X2   |

which means:

  β̂1 = (X1^T X1)^{-1} X1^T y        β̂2 = (X2^T X2)^{-1} X2^T y

Notice that β̂1 will be the same regardless of whether X2 is in the model or not (and vice versa). So we can interpret the effect of X1 without a concern for X2. Unfortunately, the decoupling is not perfect. Suppose we wish to test H0: β1 = 0. We have RSS/df = σ̂² that will be different depending on whether X2 is included in the model or not, but the difference in F is not liable to be as large as in nonorthogonal cases.

If the covariance between vectors x1 and x2 is zero, then Σj (xj1 − x̄1)(xj2 − x̄2) = 0. This means that if we center the predictors, a covariance of zero implies orthogonality. As can be seen in the second example in Section 2.5, we can center the predictors without essentially changing the model provided we have an intercept term.

Orthogonality is a desirable property, but will only occur when X is chosen by the experimenter. It is a feature of a good design. In observational data, we do not have direct control over X and this is the source of many of the interpretational difficulties associated with nonexperimental data.

Here is an example of an experiment to determine the effects of column temperature, gas/liquid ratio and packing height in reducing the unpleasant odor of a chemical product that was sold for household use. Read the data in and display:

data(odor, package="faraway")
odor
   odor temp gas pack
1    66   -1  -1    0
2    39    1  -1    0
3    43   -1   1    0
4    49    1   1    0
5    58   -1   0   -1
6    17    1   0   -1
7    -5   -1   0    1
8   -40    1   0    1
9    65    0  -1   -1
10    7    0   1   -1
11   43    0  -1    1
12  -22    0   1    1
13  -31    0   0    0
14  -35    0   0    0
15  -26    0   0    0

The three predictors have been transformed from their original scale of measurement, for example, temp = (Fahrenheit − 80)/40, so the original values of the predictor were 40, 80 and 120. The data are presented in John (1971) and give an example of a central composite design. We compute the covariance of the predictors:

cov(odor[,-1])
        temp     gas    pack
temp 0.57143 0.00000 0.00000
gas  0.00000 0.57143 0.00000
pack 0.00000 0.00000 0.57143

The matrix is diagonal. Even if temp was measured in the original Fahrenheit scale, the matrix would still be diagonal, but the entry in the matrix corresponding to temp would change. Now fit a model, while asking for the correlation of the coefficients:

lmod <- lm(odor ~ temp + gas + pack, odor)
summary(lmod, cor=T)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     15.2        9.3    1.63     0.13
temp           -12.1       12.7   -0.95     0.36
gas            -17.0       12.7   -1.34     0.21
pack           -21.4       12.7   -1.68     0.12

Residual standard error: 36 on 11 degrees of freedom
Multiple R-Squared: 0.334,  Adjusted R-squared: 0.152
F-statistic: 1.84 on 3 and 11 DF,  p-value: 0.199

Correlation of Coefficients:
     (Intercept) temp gas
temp 0.00
gas  0.00        0.00
pack 0.00        0.00 0.00

We see that, as expected, the pairwise correlation of all the coefficients is zero. Notice that the SEs for the coefficients are equal due to the balanced design. Now drop one of the variables:

lmod <- lm(odor ~ gas + pack, odor)
summary(lmod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    15.20       9.26    1.64     0.13
gas           -17.00      12.68   -1.34     0.20
pack          -21.37      12.68   -1.69     0.12

Residual standard error: 35.9 on 12 degrees of freedom
Multiple R-Squared: 0.279,  Adjusted R-squared: 0.159
F-statistic: 2.32 on 2 and 12 DF,  p-value: 0.141
The coefficients themselves do not change, but the residual SE does change slightly, which causes small changes in the SEs of the coefficients, t-statistics and p-values, but nowhere near enough to change our qualitative conclusions.

Exercises

1. The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex, status, income and verbal score as predictors. Present the output.
   (a) What percentage of variation in the response is explained by these predictors?
   (b) Which observation has the largest (positive) residual? Give the case number.
   (c) Compute the mean and median of the residuals.
   (d) Compute the correlation of the residuals with the fitted values.
   (e) Compute the correlation of the residuals with the income.
   (f) For all other predictors held constant, what would be the difference in predicted expenditure on gambling for a male compared to a female?

2. The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Fit a model with weekly wages as the response and years of education and experience as predictors. Report and give a simple interpretation to the regression coefficient for years of education. Now fit the same model but with logged weekly wages. Give an interpretation to the regression coefficient for years of education. Which interpretation is more natural?

3. In this question, we investigate the relative merits of methods for computing the coefficients. Generate some artificial data by:

   x <- 1:20
   y <- x + rnorm(20)

   Fit a polynomial in x for predicting y. Compute β̂ in two ways — by lm() and by using the direct calculation described in the chapter. At what degree of polynomial does the direct calculation method fail? (Note the need for the I() function in fitting the polynomial, that is, lm(y ~ x + I(x^2)).)

4. The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the R². Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a time. For each model record the residual standard error and the R². Plot the trends in these two statistics.

5. Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect?

6. Thirty samples of cheddar cheese were analyzed for their content of acetic acid, hydrogen sulfide and lactic acid. Each sample was tasted and scored by a panel of judges and the average taste score produced. Use the cheddar data to answer the following:
   (a) Fit a regression model with taste as the response and the three chemical contents as predictors. Report the values of the regression coefficients.
   (b) Compute the correlation between the fitted values and the response. Square it. Identify where this value appears in the regression output.
   (c) Fit the same regression model but without an intercept term. What is the value of R² reported in the output? Compute a more reasonable measure of the goodness of fit for this example.
   (d) Compute the regression coefficients from the original fit using the QR decomposition, showing your R code.

7. An experiment was conducted to determine the effect of four factors on the resistivity of a semiconductor wafer. The data is found in wafer where each of the four factors is coded as − or + depending on whether the low or the high setting for that factor was used. Fit the linear model resist ~ x1 + x2 + x3 + x4.
   (a) Extract the X matrix using the model.matrix function. Examine this to determine how the low and high levels have been coded in the model.
   (b) Compute the correlation in the X matrix. Why are there some missing values in the matrix?
   (c) What difference in resistance is expected when moving from the low to the high level of x1?
   (d) Refit the model without x4 and examine the regression coefficients and standard errors. What stayed the same as the original fit and what changed?
   (e) Explain how the change in the regression coefficients is related to the correlation matrix of X.

8. An experiment was conducted to examine factors that might affect the height of leaf springs in the suspension of trucks. The data may be found in truck. The five factors in the experiment are set to − and + but it will be more convenient for us to use −1 and +1. This can be achieved for the first factor by:

   truck$B <- sapply(truck$B, function(x) ifelse(x=="-", -1, 1))

   Repeat for the other four factors.
   (a) Fit a linear model for the height in terms of the five factors. Report on the value of the regression coefficients.
   (b) Fit a linear model using just factors B, C, D and E and report the coefficients. How do these compare to the previous question? Show how we could have anticipated this result by examining the X matrix.
   (c) Construct a new predictor called A which is set to B+C+D+E. Fit a linear model with the predictors A, B, C, D, E and O. Do coefficients for all six predictors appear in the regression summary? Explain.
   (d) Extract the model matrix X from the previous model. Attempt to compute β̂ from (X^T X)^{-1} X^T y. What went wrong and why?
   (e) Use the QR decomposition method as seen in Section 2.7 to compute β̂. Are the results satisfactory?
   (f) Use the function qr.coef to correctly compute β̂.