BIOL209: Regression 2
Paul Gardner
May 9, 2017
Paul Gardner BIOL209: Regression 2
You now know: The Famous Five (parametric statistics)
1. Σx
2. Σy
3. Σx²
4. Σy²
5. Σxy
The famous five values (and n) can be used to compute:
SSx = Σx² − (Σx)²/n
SSy = Σy² − (Σy)²/n
SSxy = Σxy − (Σx)(Σy)/n
These are used to compute the mean (Σx/n), variance (SSx/(n−1)), covariance (SSxy/(n−1)) and correlation (SSxy/√(SSx×SSy)), and to carry out linear regression: b = SSxy/SSx, a = Σy/n − b·Σx/n!
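The whole famous-five pipeline above is only a few lines of code. A minimal Python sketch with made-up data (the course examples themselves use R):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

# The famous five
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Corrected sums of squares
SSx = sum_x2 - sum_x**2 / n
SSy = sum_y2 - sum_y**2 / n
SSxy = sum_xy - sum_x * sum_y / n

# Correlation and regression coefficients
r = SSxy / math.sqrt(SSx * SSy)
b = SSxy / SSx                  # slope
a = sum_y / n - b * sum_x / n   # intercept
```

For this toy dataset the slope works out to b = 1.96 and the intercept to a = 0.14.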
What else do we need to know about regression?
What is the explanatory power of our model?
We can take the total variation in y, SSy, & partition it into components that tell us about the explanatory power of our model
SSy = SSR + SSE
The variation that is explained by the model is called the
regression sum of squares (SSR)
What is the explanatory power of our model?
We could compute SSy and SSE using:
SSy = Σy² − (Σy)²/n, SSE = Σ(y − a − bx)²
But, since we know a and b, we can use a shortcut:
SSR = b·SSxy = SSxy²/SSx
The explained variation is the regression sum of squares
(SSR)
The unexplained variation is the error sum of squares (SSE)
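The partition SSy = SSR + SSE can be checked numerically. A small Python sketch with made-up data:

```python
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.5, 7.1, 8.2, 11.0, 12.4]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSy = sum(v**2 for v in y) - sum(y)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = SSxy / SSx
a = sum(y) / n - b * sum(x) / n

# Shortcut: SSR = b * SSxy = SSxy^2 / SSx
SSR = SSxy**2 / SSx
# Direct: SSE = sum of squared residuals
SSE = sum((yi - a - b * xi)**2 for xi, yi in zip(x, y))

# The partition SSy = SSR + SSE holds (up to rounding)
assert abs(SSy - (SSR + SSE)) < 1e-9
```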
Recall: SE_x̄ = √(s²/n)
Similarly, we want to know SEb and SEa
We need a “variance” measure for regression parameters a & b
The variance of x is given by: s² = Σ(x−x̄)²/(n−1) = SSx/df
Similar justification can be used to show that the error variance for regression is:
s² = SSE/(n − 2)
Standard error for the slope (b)
SEb = √(s²/SSx)
[Figure: two scatterplots of dependent vs independent variable, annotated with SSx — “Less reliable estimate of b” (x values tightly clustered, small SSx) and “More reliable estimate of b” (x values widely spread, large SSx).]
Standard error for the intercept (a)
SEa = √(s²/SSx × Σx²/n)
[Figure: two scatterplots of dependent vs independent variable, annotated with Σx² — “Less reliable estimate of a” and “More reliable estimate of a”.]
NB. Small errors in the estimate of b have a larger impact on a,
the further your x values are from 0.
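The error variance and the two standard-error formulas can be sketched directly in code (a minimal Python sketch with made-up data):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 3.1, 5.9, 6.4, 9.3, 10.1, 13.0, 13.8]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = SSxy / SSx
a = sum(y) / n - b * sum(x) / n

# Error variance: s^2 = SSE / (n - 2)
SSE = sum((yi - a - b * xi)**2 for xi, yi in zip(x, y))
s2 = SSE / (n - 2)

# Standard errors of the slope and the intercept
SE_b = math.sqrt(s2 / SSx)
SE_a = math.sqrt(s2 / SSx * sum(v**2 for v in x) / n)
```

Note that SE_a is just SE_b scaled by √(Σx²/n), which is why x values far from 0 inflate the intercept's standard error.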
Coefficient of determination (goodness of fit): r²
Recall: r = SSxy/√(SSx×SSy)
r² = SSxy²/(SSx × SSy) = SSR/SSy
r² is very important!
the fraction of variation in y that is explained by x
https://en.wikipedia.org/wiki/Coefficient_of_determination
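The two routes to r² — squaring the correlation coefficient, or dividing explained by total variation — agree, which a short Python sketch with made-up data confirms:

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [1.8, 4.2, 5.9, 8.3, 9.7, 12.5]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSy = sum(v**2 for v in y) - sum(y)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# r^2 as the squared correlation coefficient ...
r = SSxy / math.sqrt(SSx * SSy)
r2_from_r = r**2

# ... and as explained variation over total variation
SSR = SSxy**2 / SSx
r2_from_SS = SSR / SSy

assert abs(r2_from_r - r2_from_SS) < 1e-12
```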
Summary
Error variance for regression:
s² = SSE/(n − 2)
Standard errors of a and b:
SEb = √(s²/SSx)
SEa = √(s²/SSx × Σx²/n)
Fraction of the total variation in y explained by x:
r² = SSxy²/(SSx × SSy)
Example
[Figure: two scatterplots of dependent variable (y) vs independent variable (x), each with the fitted line lm(y~x) and the true line — “High r squared” (points tight around the line) and “Low r squared” (points widely scattered).]
Example (high r²)
[Figure: the “High r squared” and “Low r squared” scatterplots, repeated.]
> cor(x,yh)^2
[1] 0.999939
> summary(lm(yh ~ x))
Call:
lm(formula = yh ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.55650 -0.13005 0.01291 0.09565 0.71169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.908630 0.079617 187.3 <2e-16 ***
x 2.003788 0.002958 677.4 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.2529 on 28 degrees of freedom
Multiple R-squared: 0.9999,Adjusted R-squared: 0.9999
F-statistic: 4.588e+05 on 1 and 28 DF, p-value: < 2.2e-16
Example (low r²)
[Figure: the “High r squared” and “Low r squared” scatterplots, repeated.]
> cor(x,yl)^2
[1] 0.5832612
> summary(lm(yl ~ x))
Call:
lm(formula = yl ~ x)
Residuals:
Min 1Q Median 3Q Max
-53.653 -11.984 -0.706 8.400 81.706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.7987 9.1658 1.178 0.249
x 2.1320 0.3406 6.260 9.12e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 29.12 on 28 degrees of freedom
Multiple R-squared: 0.5833,Adjusted R-squared: 0.5684
F-statistic: 39.19 on 1 and 28 DF, p-value: 9.116e-07
Example Questions
Given the output below, and SEb = √(s²/SSx) & SEa = √(s²/SSx × Σx²/n), answer the following:
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-48.695 -7.629 0.869 10.036 53.315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.1583 7.4235 1.773 0.0872 .
x 2.0293 0.2874 7.060 1.11e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 22.08 on 28 degrees of freedom
Multiple R-squared: 0.6403,Adjusted R-squared: 0.6275
F-statistic: 49.84 on 1 and 28 DF, p-value: 1.114e-07
Example Questions
Given the output of lm(y~x), write down the formula for the
regression line.
Given the following statistics, compute the standard errors for
a and b. (n = 30, error variance: s² = 94.29961,
SSxy = 11,976.55, SSx = 5,901.755, SSy = 37,957.72,
Σx² = 209,971.6)
What percentage of the variation in y is explained by x?
Is this a significant association?
Model checking: very important
model<-lm(y~x)
par(mfrow=c(2,2), cex=2.5)
plot(model)
[Figure: the four diagnostic plots produced by plot(model): Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance contours at 0.5); points 13, 15, 22 and 28 are flagged as notable.]
Model checking:
1. Residuals: should not have lots of “structure” (uniform
scatter)
2. Q-Q plot should be a straight line
3. √(standardized residuals) vs fitted values, similar to plot 1. Ideally scatter
should not increase.
4. Highlighting influential points: leverage and “Cook’s
distance”
Leverage is a measure of how far an x value is from other
observations in a dataset
High leverage and a high residual is bad!
Cook’s distance is a measure of the impact of each observation
upon a regression model
Cook’s distances more than 1 (> 1) are particularly influential
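For simple linear regression, leverage and Cook's distance have closed forms: leverage hᵢ = 1/n + (xᵢ − x̄)²/SSx, and Cook's distance Dᵢ = eᵢ²/(p·s²) × hᵢ/(1 − hᵢ)², where p = 2 is the number of fitted parameters. A minimal Python sketch with made-up data (in R, plot(model) computes these for you):

```python
x = [1, 2, 3, 4, 5, 20]   # the last x value is far from the rest: high leverage
y = [2.0, 4.1, 5.8, 8.2, 9.9, 41.0]
n, p = len(x), 2          # p = number of fitted parameters (a and b)

xbar = sum(x) / n
SSx = sum((xi - xbar)**2 for xi in x)
SSxy = sum((xi - xbar) * yi for xi, yi in zip(x, y))
b = SSxy / SSx
a = sum(y) / n - b * xbar

resid = [yi - a - b * xi for xi, yi in zip(x, y)]
s2 = sum(e**2 for e in resid) / (n - p)   # error variance

# Leverage: how far each x value is from the other observations
h = [1 / n + (xi - xbar)**2 / SSx for xi in x]

# Cook's distance: the influence of each observation on the model
cook = [(e**2 / (p * s2)) * (hi / (1 - hi)**2)
        for e, hi in zip(resid, h)]
```

A useful sanity check: the leverages always sum to p, and any point whose Cook's distance exceeds 1 deserves a close look.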
Transformation: do we need to use linear models?
[Figure: scatterplot of y vs x with two fitted lines: lm(y~x) (straight) and lm(y~log(x)) (curved).]
Transformation: a + bx
[Figure: the same scatterplot of y vs x with the fitted lines lm(y~x) and lm(y~log(x)).]
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7199 -0.4498 0.2708 0.5185 0.7030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.033888 0.273499 65.94 < 2e-16 ***
x 0.115540 0.009666 11.95 1.63e-12 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.736 on 28 degrees of freedom
Multiple R-squared: 0.8361,Adjusted R-squared: 0.8303
F-statistic: 142.9 on 1 and 28 DF, p-value: 1.633e-12
Transformation: a + b log(x)
[Figure: the same scatterplot of y vs x with the fitted lines lm(y~x) and lm(y~log(x)).]
> summary(lm(y~log(x)))
Call:
lm(formula = y ~ log(x))
Residuals:
Min 1Q Median 3Q Max
-0.077534 -0.019527 0.005513 0.015164 0.049287
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.993292 0.018889 793.8 <2e-16 ***
log(x) 2.005402 0.006165 325.3 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.02957 on 28 degrees of freedom
Multiple R-squared: 0.9997,Adjusted R-squared: 0.9997
F-statistic: 1.058e+05 on 1 and 28 DF, p-value: < 2.2e-16
Other transformations: log(y) & a + b/x
Log-transform y:
y = exp(a + bx)
lm(log(y)~x)
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and lm(log(y)~x).]
Reciprocal:
y = a + b/x
xx <- 1/x
lm(y~xx)
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and lm(y~1/x).]
Other transformations
Asymptotic:
y = ax/(1 + bx)
nls(y~a*x/(1+b*x),
start=list(a=50,b=5))
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and nls(y~a*x/(1+b*x)).]
Power law:
y = axᵇ
???????
[Figure: scatterplot of y vs x for the power-law data (x from −4 to 4).]
Quiz
1. What models should be fit to the below datasets?
[Figure: four scatterplots of y1, y2, y3 and y4 against x (0 to 5).]
ANOVA Tables
https://onlinecourses.science.psu.edu/stat414/node/218
Course surveys: we value your feedback!
Try to be constructive
What worked well, and where could further improvements be
made?
Further reading
Chapter 7 of Crawley (2015) Statistics: An introduction using
R.
The End