Chapter 14 Part II

     ISDS 2001
      Matt Levy
Using the Estimated Regression Equation
for Estimation and Prediction
Recall that when we use simple linear regression, we are making an
assumption about the relationship between x and y.

Once we have established a good fit, through r2 and
other measures, we can use the estimated regression equation for
estimation and prediction.

We do this by developing the following:
  Point Estimates
  Interval Estimates
  Confidence Intervals for a mean value of y
  Prediction Intervals for individual values of y
Estimation
We are interested in developing two distinct estimates and intervals
for those estimates:

  An estimate of the mean value of y for a specific x.
  An estimate of an individual value of y.

For a mean value, we develop a confidence interval.
For an individual value, we develop a prediction interval.

This distinction is based on more than simply the number (x) input into the
regression equation; it is based on what the number represents.

For example, the x input and resulting ŷ may be the same in both cases, but
if we are predicting an individual y (vs. the mean of y) there will be a wider
margin of error for the interval.
Developing the Confidence Interval for
the Mean Value of y
To do this, let's first define some terms:

  xp = the value of the independent variable (usually given)

  yp = the actual value of the dependent variable when x = xp

  E(yp) = the expected value of y, given xp.

  ŷp = b0 + b1xp = the point estimate of E(yp) when x = xp.
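As a minimal sketch of the definitions above, using hypothetical toy data (not the textbook's example), the point estimate ŷp = b0 + b1xp can be computed directly from the least-squares formulas:

```python
# Hypothetical toy data (not from the textbook): 5 (x, y) observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

x_p = 3.5                  # the given value of the independent variable
y_hat_p = b0 + b1 * x_p    # point estimate of E(yp) when x = xp
```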
Developing the Confidence Interval for
the Mean Value of y
In general, we cannot expect ŷp to equal E(yp) exactly.

So to make an inference about how close they are, we need the standard
deviation of ŷp (sŷp).

Equations 14.22 and 14.23 derive the variance and standard deviation,
respectively.
Consequently, the confidence interval for E(yp) is as follows:

   ŷp ± tα/2 · sŷp, based on a t-distribution with n − 2 degrees of freedom.
Note that the estimate of y is most precise when xp is equal or very close to the mean of x,
meaning we will have a tighter confidence interval. Figure 14.8 and the equation below it illustrate this concept.
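One way to sketch the confidence-interval computation, again with hypothetical toy data (the 95% t critical value is hard-coded from a t table rather than computed, and sŷp uses the standard-deviation form from Equation 14.23):

```python
import math

# Hypothetical toy data (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: s = sqrt(SSE / (n - 2))
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

x_p = 3.5
y_hat_p = b0 + b1 * x_p

# Standard deviation of y_hat_p: s * sqrt(1/n + (x_p - x_bar)^2 / Sxx)
# Note the (x_p - x_bar)^2 term: the interval is tightest when x_p = x_bar.
s_yhat = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / s_xx)

t_crit = 3.182  # t_{0.025, n-2} = t_{0.025, 3}, taken from a t table
lower = y_hat_p - t_crit * s_yhat
upper = y_hat_p + t_crit * s_yhat
```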
Developing the Prediction Interval for
an Individual Value of y
Applicable when we are trying to predict an individual value of y.

The prediction interval will be considerably wider than the confidence interval.

To develop the prediction interval, we need the 2 components that comprise the
variance for the prediction interval (s2ind):

1. The variance of individual y values about the mean E(yp), given by s2
2. The variance associated with using ŷp to estimate E(yp), given by s2ŷp

So that s2ind = s2 + s2ŷp, and sind = √(s2ind).


And the prediction interval is given by:
  ŷp ± tα/2 · sind, based on a t-distribution with n − 2 degrees of freedom.
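Continuing the same hypothetical sketch, the prediction interval changes only the standard error: sind = √(s2 + s2ŷp), which is why it is always wider than the confidence interval at the same xp:

```python
import math

# Hypothetical toy data (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))          # standard error of the estimate

x_p = 3.5
y_hat_p = b0 + b1 * x_p

# Standard error for the mean value (confidence interval)...
s_yhat = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / s_xx)
# ...and for an individual value (prediction interval): the extra s**2 term.
s_ind = math.sqrt(s ** 2 + s_yhat ** 2)

t_crit = 3.182  # t_{0.025, n-2} = t_{0.025, 3}, from a t table
ci = (y_hat_p - t_crit * s_yhat, y_hat_p + t_crit * s_yhat)
pi = (y_hat_p - t_crit * s_ind, y_hat_p + t_crit * s_ind)
```

The prediction interval `pi` always contains the confidence interval `ci`, since s2ind adds the full variance s2 of individual y values on top of s2ŷp.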
Residual Analysis
The residual is the difference between the observed value of the dependent
variable and its estimated value (yi − ŷi).

We use residual analysis to analyze the validity of our assumptions. For most
models, we make the following assumptions:
 1. E(ε) = 0
 2. The variance of ε, denoted by σ2, is the same for all values of x.
 3. The values of ε are independent.
 4. The error term ε has a normal distribution.

These assumptions are the theoretical basis for the t-test and F-test. Therefore
it is important that the residuals be analyzed to further support model
validity.

We analyze residuals using graphical plots of the following:
 1. Residuals against the independent variable (x)
 2. Residuals against the predicted values (ŷi).
 3. Standardized Residual Plot.
 4. Normal Probability Plot
Residual Analysis
Residual Plot Against x
   x is on the horizontal axis, (yi - ŷi) on the vertical axis.
   Should loosely resemble a horizontal band.

Residual Plot against ŷ
   ŷ is on the horizontal axis, (yi - ŷi) on the vertical axis.
   Should loosely resemble a horizontal band.

Standardized Residual Plot
    Divide each residual by its standard deviation:
        s(yi − ŷi) = s√(1 − hi), where hi is the leverage of observation i.
        See Equations 14.30 and 14.31.
    When looking at the plot, roughly 95% of the values should fall between −2 and +2.

Normal Probability Plot
   Uses the concept of normal scores, see Figure 14.15

For the residual plots we are visually looking for points scattered within a
horizontal band; for the normal probability plot, points should fall close to a
straight line. Other patterns may indicate violated assumptions.
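The standardized residuals described above can be sketched numerically with the same hypothetical toy data; in simple linear regression the leverage has the closed form hi = 1/n + (xi − x̄)2/Sxx:

```python
import math

# Hypothetical toy data (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of observation i: h_i = 1/n + (x_i - x_bar)^2 / Sxx
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Standardized residual: e_i / (s * sqrt(1 - h_i))
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
```

Plotting `std_resid` against x (or ŷ) gives the standardized residual plot; in a well-behaved model roughly 95% of the values land between −2 and +2.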
Residual Analysis:  Outliers and
Influential Observations
An outlier is a data point that does not fit the trend when shown
visually.

It may represent erroneous data, or something that may warrant more
careful examination.

Outliers have the potential to heavily influence our predictive ability in
regression.

For simple linear regression, we can simply use a scatter diagram to
detect outliers. For multiple regression, we must use the standardized
residuals.

Refer to Figure 14.20 for what it looks like to have a very influential
observation.
Residual Analysis:  Outliers and
Influential Observations
Observations with extreme values for the independent
variable are called high leverage points.

For these troublesome data points we introduce a new
measure called the leverage of observation (hi).

See Equation 14.33 for the formula for hi.

Using leverage alone usually works when the residual is small.

For large residuals with high leverage another measure known
as Cook's D statistic is used. We discuss this in Chapter 15.
End of Chapter 14

Let me know if you have any questions.

Please read the chapter.

Please do your homework.
