BIOL209: Regression 2
Paul Gardner
May 9, 2017
Paul Gardner BIOL209: Regression 2
You now know: The Famous Five (parametric statistics)
1. Σx
2. Σy
3. Σx²
4. Σy²
5. Σxy
The famous five values (and n) can be used to compute:
SSx = Σx² − (Σx)²/n
SSy = Σy² − (Σy)²/n
SSxy = Σxy − (Σx)(Σy)/n
These are used to compute the mean (Σx/n), variance (SSx/(n−1)), covariance (SSxy/(n−1)) and correlation (SSxy/√(SSx×SSy)), and to carry out linear regression: b = SSxy/SSx, a = Σy/n − b·Σx/n!
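The whole famous-five pipeline above is only a few lines of code. A minimal Python sketch with made-up data (the course examples themselves use R):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

# The famous five
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Corrected sums of squares
SSx = sum_x2 - sum_x**2 / n
SSy = sum_y2 - sum_y**2 / n
SSxy = sum_xy - sum_x * sum_y / n

# Correlation and regression coefficients
r = SSxy / math.sqrt(SSx * SSy)
b = SSxy / SSx                  # slope
a = sum_y / n - b * sum_x / n   # intercept
```

For this toy dataset the slope works out to b = 1.96 and the intercept to a = 0.14.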
What else do we need to know about regression?
What is the explanatory power of our model?
We can take the total variation in y, SSy, & partition it into components that tell us about the explanatory power of our model
SSy = SSR + SSE
The variation that is explained by the model is called the
regression sum of squares (SSR)
What is the explanatory power of our model?
We could compute SSy and SSE using:
SSy = Σy² − (Σy)²/n, SSE = Σ(y − a − bx)²
But, since we know a and b, we can use a shortcut:
SSR = b·SSxy = SSxy²/SSx
The explained variation is the regression sum of squares
(SSR)
The unexplained variation is the error sum of squares (SSE)
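The partition SSy = SSR + SSE can be checked numerically. A small Python sketch with made-up data:

```python
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.5, 7.1, 8.2, 11.0, 12.4]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSy = sum(v**2 for v in y) - sum(y)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = SSxy / SSx
a = sum(y) / n - b * sum(x) / n

# Shortcut: SSR = b * SSxy = SSxy^2 / SSx
SSR = SSxy**2 / SSx
# Direct: SSE = sum of squared residuals
SSE = sum((yi - a - b * xi)**2 for xi, yi in zip(x, y))

# The partition SSy = SSR + SSE holds (up to rounding)
assert abs(SSy - (SSR + SSE)) < 1e-9
```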
Recall: SE_x̄ = √(s²/n)
Similarly, we want to know SEb and SEa
We need a “variance” measure for regression parameters a & b
The variance of x is given by: s² = Σ(x−x̄)²/(n−1) = SSx/df
Similar justification can be used to show that the error variance for regression is:
s² = SSE/(n − 2)
Standard error for the slope (b)
SEb = √(s²/SSx)
[Figure: two scatterplots of dependent vs independent variable, annotated with SSx — “Less reliable estimate of b” (x values tightly clustered, small SSx) and “More reliable estimate of b” (x values widely spread, large SSx).]
Standard error for the intercept (a)
SEa = √(s²/SSx × Σx²/n)
[Figure: two scatterplots of dependent vs independent variable, annotated with Σx² — “Less reliable estimate of a” and “More reliable estimate of a”.]
NB. Small errors in the estimate of b have a larger impact on a,
the further your x values are from 0.
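The error variance and the two standard-error formulas can be sketched directly in code (a minimal Python sketch with made-up data):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 3.1, 5.9, 6.4, 9.3, 10.1, 13.0, 13.8]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = SSxy / SSx
a = sum(y) / n - b * sum(x) / n

# Error variance: s^2 = SSE / (n - 2)
SSE = sum((yi - a - b * xi)**2 for xi, yi in zip(x, y))
s2 = SSE / (n - 2)

# Standard errors of the slope and the intercept
SE_b = math.sqrt(s2 / SSx)
SE_a = math.sqrt(s2 / SSx * sum(v**2 for v in x) / n)
```

Note that SE_a is just SE_b scaled by √(Σx²/n), which is why x values far from 0 inflate the intercept's standard error.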
Coefficient of determination (goodness of fit): r²
Recall: r = SSxy/√(SSx×SSy)
r² = SSxy²/(SSx × SSy) = SSR/SSy
r² is very important!
the fraction of variation in y that is explained by x
https://en.wikipedia.org/wiki/Coefficient_of_determination
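The two routes to r² — squaring the correlation coefficient, or dividing explained by total variation — agree, which a short Python sketch with made-up data confirms:

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [1.8, 4.2, 5.9, 8.3, 9.7, 12.5]
n = len(x)

SSx = sum(v**2 for v in x) - sum(x)**2 / n
SSy = sum(v**2 for v in y) - sum(y)**2 / n
SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# r^2 as the squared correlation coefficient ...
r = SSxy / math.sqrt(SSx * SSy)
r2_from_r = r**2

# ... and as explained variation over total variation
SSR = SSxy**2 / SSx
r2_from_SS = SSR / SSy

assert abs(r2_from_r - r2_from_SS) < 1e-12
```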
Summary
Error variance for regression:
s² = SSE/(n − 2)
Standard errors of a and b:
SEb = √(s²/SSx)
SEa = √(s²/SSx × Σx²/n)
Fraction of the total variation in y explained by x:
r² = SSxy²/(SSx × SSy)
Example
[Figure: two scatterplots of dependent variable (y) vs independent variable (x), each with the fitted line lm(y~x) and the true line — “High r squared” (points tight around the line) and “Low r squared” (points widely scattered).]
Example (high r²)
[Figure: the “High r squared” and “Low r squared” scatterplots, repeated.]
> cor(x,yh)^2
[1] 0.999939
> summary(lm(yh ~ x))
Call:
lm(formula = yh ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.55650 -0.13005 0.01291 0.09565 0.71169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.908630 0.079617 187.3 <2e-16 ***
x 2.003788 0.002958 677.4 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.2529 on 28 degrees of freedom
Multiple R-squared: 0.9999,Adjusted R-squared: 0.9999
F-statistic: 4.588e+05 on 1 and 28 DF, p-value: < 2.2e-16
Example (low r²)
[Figure: the “High r squared” and “Low r squared” scatterplots, repeated.]
> cor(x,yl)^2
[1] 0.5832612
> summary(lm(yl ~ x))
Call:
lm(formula = yl ~ x)
Residuals:
Min 1Q Median 3Q Max
-53.653 -11.984 -0.706 8.400 81.706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.7987 9.1658 1.178 0.249
x 2.1320 0.3406 6.260 9.12e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 29.12 on 28 degrees of freedom
Multiple R-squared: 0.5833,Adjusted R-squared: 0.5684
F-statistic: 39.19 on 1 and 28 DF, p-value: 9.116e-07
Example Questions
Given the output below, and SEb = √(s²/SSx) & SEa = √(s²/SSx × Σx²/n), answer the following:
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-48.695 -7.629 0.869 10.036 53.315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.1583 7.4235 1.773 0.0872 .
x 2.0293 0.2874 7.060 1.11e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 22.08 on 28 degrees of freedom
Multiple R-squared: 0.6403,Adjusted R-squared: 0.6275
F-statistic: 49.84 on 1 and 28 DF, p-value: 1.114e-07
Example Questions
Given the output of lm(y~x), write down the formula for the
regression line.
Given the following statistics, compute the standard errors for
a and b. (n = 30, error variance: s² = 94.29961,
SSxy = 11,976.55, SSx = 5,901.755, SSy = 37,957.72,
Σx² = 209,971.6)
What percentage of the variation in y is explained by x?
Is this a significant association?
Model checking: very important
model<-lm(y~x)
par(mfrow=c(2,2), cex=2.5)
plot(model)
[Figure: the four diagnostic plots produced by plot(model): Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance contours at 0.5); points 13, 15, 22 and 28 are flagged as notable.]
Model checking:
1. Residuals: should not have lots of “structure” (uniform
scatter)
2. Q-Q plot should be a straight line
3. √(standardized residuals) vs fitted values, similar to plot 1. Ideally scatter
should not increase.
4. Highlighting influential points: leverage and “Cook’s
distance”
Leverage is a measure of how far an x value is from other
observations in a dataset
High leverage and a high residual is bad!
Cook’s distance is a measure of the impact of each observation
upon a regression model
Cook’s distances more than 1 (> 1) are particularly influential
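For simple linear regression, leverage and Cook's distance have closed forms: leverage hᵢ = 1/n + (xᵢ − x̄)²/SSx, and Cook's distance Dᵢ = eᵢ²/(p·s²) × hᵢ/(1 − hᵢ)², where p = 2 is the number of fitted parameters. A minimal Python sketch with made-up data (in R, plot(model) computes these for you):

```python
x = [1, 2, 3, 4, 5, 20]   # the last x value is far from the rest: high leverage
y = [2.0, 4.1, 5.8, 8.2, 9.9, 41.0]
n, p = len(x), 2          # p = number of fitted parameters (a and b)

xbar = sum(x) / n
SSx = sum((xi - xbar)**2 for xi in x)
SSxy = sum((xi - xbar) * yi for xi, yi in zip(x, y))
b = SSxy / SSx
a = sum(y) / n - b * xbar

resid = [yi - a - b * xi for xi, yi in zip(x, y)]
s2 = sum(e**2 for e in resid) / (n - p)   # error variance

# Leverage: how far each x value is from the other observations
h = [1 / n + (xi - xbar)**2 / SSx for xi in x]

# Cook's distance: the influence of each observation on the model
cook = [(e**2 / (p * s2)) * (hi / (1 - hi)**2)
        for e, hi in zip(resid, h)]
```

A useful sanity check: the leverages always sum to p, and any point whose Cook's distance exceeds 1 deserves a close look.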
Transformation: do we need to use linear models?
[Figure: scatterplot of y vs x with two fitted lines: lm(y~x) (straight) and lm(y~log(x)) (curved).]
Transformation: a + bx
[Figure: the same scatterplot of y vs x with the fitted lines lm(y~x) and lm(y~log(x)).]
> summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7199 -0.4498 0.2708 0.5185 0.7030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.033888 0.273499 65.94 < 2e-16 ***
x 0.115540 0.009666 11.95 1.63e-12 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.736 on 28 degrees of freedom
Multiple R-squared: 0.8361,Adjusted R-squared: 0.8303
F-statistic: 142.9 on 1 and 28 DF, p-value: 1.633e-12
Transformation: a + b log(x)
[Figure: the same scatterplot of y vs x with the fitted lines lm(y~x) and lm(y~log(x)).]
> summary(lm(y~log(x)))
Call:
lm(formula = y ~ log(x))
Residuals:
Min 1Q Median 3Q Max
-0.077534 -0.019527 0.005513 0.015164 0.049287
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.993292 0.018889 793.8 <2e-16 ***
log(x) 2.005402 0.006165 325.3 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.02957 on 28 degrees of freedom
Multiple R-squared: 0.9997,Adjusted R-squared: 0.9997
F-statistic: 1.058e+05 on 1 and 28 DF, p-value: < 2.2e-16
Other transformations: log(y) & a + b/x
Log-transform y:
y = exp(a + bx)
lm(log(y)~x)
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and lm(log(y)~x).]
Reciprocal:
y = a + b/x
xx <- 1/x
lm(y~xx)
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and lm(y~1/x).]
Other transformations
Asymptotic:
y = ax/(1 + bx)
nls(y~a*x/(1+b*x),
start=list(a=50,b=5))
[Figure: scatterplot of y vs x with fitted lines lm(y~x) and nls(y~a*x/(1+b*x)).]
Power law:
y = axᵇ
???????
[Figure: scatterplot of y vs x for the power-law data (x from −4 to 4).]
Quiz
1. What models should be fit to the below datasets?
[Figure: four scatterplots of y1, y2, y3 and y4 against x (0 to 5).]
ANOVA Tables
https://onlinecourses.science.psu.edu/stat414/node/218
Course surveys: we value your feedback!
Try to be constructive
What worked well, and where could further improvements be
made?
Further reading
Chapter 7 of Crawley (2015) Statistics: An introduction using
R.
The End