1. Linear Regression and Correlation
• Explanatory and Response Variables are Numeric
• Relationship between the mean of the response
variable and the level of the explanatory variable
assumed to be approximately linear (straight line)
• Model: Y = β₀ + β₁x + ε, where ε ~ N(0, σ)
• β₁ > 0: Positive Association
• β₁ < 0: Negative Association
• β₁ = 0: No Association
2. Least Squares Estimation of β₀, β₁
• β₀ — Mean response when x = 0 (y-intercept)
• β₁ — Change in mean response when x increases by 1 unit (slope)
• β₀, β₁ are unknown parameters (like μ)
• β₀ + β₁x — Mean response when the explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight line:
ŷ = β̂₀ + β̂₁x     SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ [yᵢ − (β̂₀ + β̂₁xᵢ)]²
3. Example - Pharmacodynamics of LSD
Score (y) LSD Conc (x)
78.93 1.17
58.20 2.97
67.47 3.26
37.47 4.69
45.65 5.83
32.92 6.00
29.97 6.41
• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:
[Scatterplot: SCORE (20-80) vs LSD_CONC (1-7)]
Source: Wagner, et al (1968)
4. Least Squares Computations
Sxx = Σ(xᵢ − x̄)²    Sxy = Σ(xᵢ − x̄)(yᵢ − ȳ)    Syy = Σ(yᵢ − ȳ)²

β̂₁ = Sxy / Sxx    β̂₀ = ȳ − β̂₁x̄

SSE = Σ(yᵢ − ŷᵢ)² = Syy − β̂₁Sxy    s² = SSE/(n − 2)    s = √s²
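The computations above can be checked numerically on the LSD data from slide 3. A minimal pure-Python sketch (not the SPSS output itself), assuming the seven (x, y) pairs as listed:

```python
# Least squares computations for the LSD example, using the
# Sxx, Sxy, Syy formulas above (no external libraries).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]        # LSD tissue concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97] # mean math score

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

b1 = Sxy / Sxx              # slope estimate
b0 = ybar - b1 * xbar       # intercept estimate
SSE = Syy - b1 * Sxy        # error sum of squares
s = (SSE / (n - 2)) ** 0.5  # residual standard deviation

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, Sxx = {Sxx:.3f}")
# -> b0 = 89.12, b1 = -9.01, Sxx = 22.475 (matches the SPSS output)
```
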
6. SPSS Output and Plot of Equation
Coefficients(a)

                Unstandardized Coefficients   Standardized
Model 1         B          Std. Error         Beta           t         Sig.
(Constant)      89.124     7.048                             12.646    .000
LSD_CONC        -9.009     1.503              -.937          -5.994    .002

a. Dependent Variable: SCORE
[Scatterplot with fitted line: Math Score vs LSD Concentration (SPSS)]
score = 89.12 − 9.01 · lsd_conc    R-Square = 0.88
7. Inference Concerning the Slope (β₁)
• Parameter: Slope in the population model (β₁)
• Estimator: Least squares estimate: β̂₁ = Sxy/Sxx
• Estimated standard error: SE(β̂₁) = s/√Sxx
• Methods of making inference regarding population:
– Hypothesis tests (2-sided or 1-sided)
– Confidence Intervals
8. Hypothesis Test for β₁
• 2-Sided Test
– H₀: β₁ = 0
– H_A: β₁ ≠ 0
• 1-Sided Tests
– H₀: β₁ = 0
– H_A⁺: β₁ > 0 or H_A⁻: β₁ < 0

2-Sided Test:
T.S.: t_obs = β̂₁ / SE(β̂₁) = β̂₁ / (s/√Sxx)
R.R.: |t_obs| ≥ t_{α/2, n−2}
P-val: 2P(t ≥ |t_obs|)

1-Sided Tests (same T.S.):
Upper-tail (H_A⁺):  R.R.: t_obs ≥ t_{α, n−2}    P-val: P(t ≥ t_obs)
Lower-tail (H_A⁻):  R.R.: t_obs ≤ −t_{α, n−2}   P-val: P(t ≤ t_obs)
9. (1 − α)100% Confidence Interval for β₁

β̂₁ ± t_{α/2, n−2} SE(β̂₁) = β̂₁ ± t_{α/2, n−2} (s/√Sxx)
• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
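As a sketch, the interval for the LSD example can be computed directly from the slide 10 values; the critical value t_{.025,5} = 2.571 is taken from the t table.

```python
# 95% CI for the slope in the LSD example:
# b1 = -9.01, SE(b1) = 1.50, t_{.025,5} = 2.571
b1, se_b1 = -9.01, 1.50
t_crit = 2.571  # t table value for alpha/2 = .025, df = n - 2 = 5

lo = b1 - t_crit * se_b1
hi = b1 + t_crit * se_b1
print(f"95% CI for beta1: ({lo:.2f}, {hi:.2f})")  # (-12.87, -5.15)
```

Since the entire interval lies below 0, we conclude a negative association.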
10. Example - Pharmacodynamics of LSD

n = 7   β̂₁ = −9.01   Sxx = 22.475   s = 7.12
SE(β̂₁) = s/√Sxx = 7.12/√22.475 = 1.50

• Testing H₀: β₁ = 0 vs H_A: β₁ ≠ 0
T.S.: t_obs = β̂₁/SE(β̂₁) = −9.01/1.50 = −6.01
R.R.: |t_obs| ≥ t_{.025, 5} = 2.571
• 95% Confidence Interval for β₁:
−9.01 ± 2.571(1.50) = −9.01 ± 3.86 = (−12.87, −5.15)
11. Correlation Coefficient
• Measures the strength of the linear association
between two variables
• Takes on the same sign as the slope estimate from
the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and
independent variable (e.g. height and weight)
• Population Parameter - ρ (rho)
• Pearson’s Correlation Coefficient:

r = Sxy / √(Sxx·Syy)     −1 ≤ r ≤ 1
12. Correlation Coefficient
• Values close to 1 in absolute value strong linear
association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal),
Spearman’s coefficient of correlation can be
computed based on the ranks of the x and y values
• Test of H₀: ρ = 0 is equivalent to test of H₀: β₁ = 0
• Coefficient of Determination (r²) - Proportion of
variation in y “explained” by the regression on x:

r² = (Syy − SSE)/Syy     0 ≤ r² ≤ 1
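A short sketch computing r, r², and Spearman's coefficient for the LSD data; the `pearson` and `ranks` helpers are illustrative names, not library functions.

```python
# Pearson's r, r^2, and Spearman's rank correlation for the LSD data,
# computed from the formulas on slides 11-12 (no external libraries).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]

def pearson(u, v):
    ub, vb = sum(u) / len(u), sum(v) / len(v)
    Suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    Suu = sum((a - ub) ** 2 for a in u)
    Svv = sum((b - vb) ** 2 for b in v)
    return Suv / (Suu * Svv) ** 0.5

def ranks(v):
    # rank 1 = smallest value (this data set has no ties)
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

r = pearson(x, y)
rho = pearson(ranks(x), ranks(y))  # Spearman = Pearson applied to the ranks
print(f"r = {r:.3f}, r^2 = {r*r:.2f}, Spearman = {rho:.3f}")
# -> r = -0.937, r^2 = 0.88, Spearman = -0.929 (matches the SPSS output)
```
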
14. Example - SPSS Output
Pearson’s and Spearman’s Measures

Pearson Correlations (N = 7)
            SCORE     LSD_CONC
SCORE       1         -.937**    Sig. (2-tailed) = .002
LSD_CONC    -.937**   1
**. Correlation is significant at the 0.01 level (2-tailed).

Spearman's rho (N = 7)
            SCORE     LSD_CONC
SCORE       1.000     -.929**    Sig. (2-tailed) = .003
LSD_CONC    -.929**   1.000
**. Correlation is significant at the 0.01 level (2-tailed).
15. Analysis of Variance in Regression
• Goal: Partition the total variation in y into variation
“explained” by x and random variation
Σ(yᵢ − ȳ)² = Σ(yᵢ − ŷᵢ)² + Σ(ŷᵢ − ȳ)²
(Total variation = Error variation + Model variation)

• These three sums of squares and degrees of freedom are:
– Total (Syy): dfTotal = n − 1
– Error (SSE): dfError = n − 2
– Model (SSR): dfModel = 1
16. Analysis of Variance in Regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F
Model                 SSR              1                    MSR = SSR/1       F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 Syy              n-1
• Analysis of Variance - F-test
• H₀: β₁ = 0   H_A: β₁ ≠ 0

T.S.: F_obs = MSR/MSE
R.R.: F_obs ≥ F_{α, 1, n−2}
P-val: P(F ≥ F_obs)
17. Example - Pharmacodynamics of LSD
• Total Sum of Squares: Syy = Σ(yᵢ − ȳ)² = 2078.183   dfTotal = 7 − 1 = 6
• Error Sum of Squares: SSE = Σ(yᵢ − ŷᵢ)² = 253.890   dfError = 7 − 2 = 5
• Model Sum of Squares: SSR = Σ(ŷᵢ − ȳ)² = 2078.183 − 253.890 = 1824.293   dfModel = 1
18. Example - Pharmacodynamics of LSD
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Model                 1824.293         1                    1824.293      35.93
Error                 253.890          5                    50.778
Total                 2078.183         6
• Analysis of Variance - F-test
• H₀: β₁ = 0   H_A: β₁ ≠ 0

T.S.: F_obs = MSR/MSE = 1824.293/50.778 = 35.93
R.R.: F_obs ≥ F_{.05, 1, 5} = 6.61
P-val: P(F ≥ 35.93)
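The partition and F statistic can be reproduced from the raw data and the fitted SPSS coefficients; a sketch (tiny rounding differences from the slide values are expected since the coefficients are rounded):

```python
# ANOVA partition and F-test for the LSD example, rebuilt from the
# fitted line yhat = 89.124 - 9.009*x (SPSS estimates).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
ybar = sum(y) / n

yhat = [89.124 - 9.009 * xi for xi in x]             # fitted values
Syy = sum((yi - ybar) ** 2 for yi in y)              # total SS, df = n-1
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) # error SS, df = n-2
SSR = Syy - SSE                                      # model SS, df = 1

F = (SSR / 1) / (SSE / (n - 2))                      # MSR / MSE
print(f"Syy={Syy:.1f}, SSE={SSE:.1f}, SSR={SSR:.1f}, F={F:.1f}")
# -> Syy=2078.2, SSE=253.9, SSR=1824.3, F=35.9
```
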
19. Example - SPSS Output
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   1824.302         1    1824.302      35.928   .002(a)
Residual     253.881          5    50.776
Total        2078.183         6

a. Predictors: (Constant), LSD_CONC
b. Dependent Variable: SCORE
20. Multiple Regression
• Numeric Response variable (Y)
• p Numeric predictor variables
• Model: Y = β₀ + β₁x₁ + ··· + βₚxₚ + ε
• Partial Regression Coefficients: βᵢ is the effect (on
the mean response) of increasing the i-th predictor
variable by 1 unit, holding all other predictors
constant
21. Example - Effect of Birth weight on
Body Size in Early Adolescence
• Response: Height at Early adolescence (n =250 cases)
• Predictors (p=6 explanatory variables)
• Adolescent Age (x1, in years -- 11-14)
• Tanner stage (x2, units not given)
• Gender (x3=1 if male, 0 if female)
• Gestational age (x4, in weeks at birth)
• Birth length (x5, units not given)
• Birthweight Group (x6 = 1,...,6: <1500g (1), 1500-1999g (2),
2000-2499g (3), 2500-2999g (4), 3000-3499g (5), >3500g (6))
Source: Falkner, et al (2004)
22. Least Squares Estimation
• Population Model for mean response:
E(Y) = β₀ + β₁x₁ + ··· + βₚxₚ

• Least Squares fitted (predicted) equation, minimizing SSE:

Ŷ = β̂₀ + β̂₁x₁ + ··· + β̂ₚxₚ     SSE = Σ(Y − Ŷ)²
• All statistical software packages/spreadsheets can
compute least squares estimates and their standard errors
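A minimal NumPy sketch of multiple-regression least squares; the data are synthetic (not the birth-weight study), constructed with an exact linear response so the coefficients are recovered exactly.

```python
# Multiple-regression least squares: minimize SSE = sum (Y - Yhat)^2
# over the coefficients, via the design-matrix formulation.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))                 # two numeric predictors
beta_true = np.array([3.0, 1.5, -2.0])      # intercept, beta1, beta2
y = beta_true[0] + X @ beta_true[1:]        # exact linear response (no error term)

# design matrix with a leading column of 1s for the intercept
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)  # recovers approximately [3.0, 1.5, -2.0]
```

In practice the statistical package also reports a standard error for each estimate, as in the SPSS output shown earlier.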
23. Analysis of Variance
• Direct extension to ANOVA based on simple linear
regression
• Only adjustments are to degrees of freedom:
– dfModel = p dfError = n-p-1
Source of
Variation
Sum of
Squares
Degrees of
Freedom
Mean
Square F
Model SSR p MSR = SSR/p F = MSR/MSE
Error SSE n-p-1 MSE = SSE/(n-p-1)
Total Syy n-1
R² = SSR/Syy = 1 − SSE/Syy
24. Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are
associated with the response
• H₀: β₁ = ··· = βₚ = 0 (None of the x’s associated with y)
• H_A: Not all βᵢ = 0

T.S.: F_obs = MSR/MSE = [R²/p] / [(1 − R²)/(n − p − 1)]
R.R.: F_obs ≥ F_{α, p, n−p−1}
P-val: P(F ≥ F_obs)
25. Example - Effect of Birth weight on
Body Size in Early Adolescence
• Authors did not print ANOVA, but did provide following:
• n = 250   p = 6   R² = 0.26
• H₀: β₁ = ··· = β₆ = 0
• H_A: Not all βᵢ = 0

T.S.: F_obs = [R²/p] / [(1 − R²)/(n − p − 1)]
      = [0.26/6] / [(1 − 0.26)/(250 − 6 − 1)] = .0433/.0030 = 14.2
R.R.: F_obs ≥ F_{.05, 6, 243} = 2.13
P-val: P(F ≥ 14.2)
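The arithmetic above fits in a couple of lines; a sketch:

```python
# Overall F statistic computed directly from R^2 (slide 25:
# n = 250 subjects, p = 6 predictors, R^2 = 0.26 from the paper).
n, p, R2 = 250, 6, 0.26
F = (R2 / p) / ((1 - R2) / (n - p - 1))
print(f"F = {F:.1f} on ({p}, {n - p - 1}) df")  # -> F = 14.2 on (6, 243) df
```
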
26. Testing Individual Partial Coefficients - t-tests
• Wish to determine whether the response is
associated with a single explanatory variable, after
controlling for the others
• H₀: βᵢ = 0   H_A: βᵢ ≠ 0 (2-sided alternative)

T.S.: t_obs = β̂ᵢ / SE(β̂ᵢ)
R.R.: |t_obs| ≥ t_{α/2, n−p−1}
P-val: 2P(t ≥ |t_obs|)
27. Example - Effect of Birth weight on
Body Size in Early Adolescence
Variable b sb t=b/sb P-val (z)
Adolescent Age 2.86 0.99 2.89 .0038
Tanner Stage 3.41 0.89 3.83 <.001
Male 0.08 1.26 0.06 .9522
Gestational Age -0.11 0.21 -0.52 .6030
Birth Length 0.44 0.19 2.32 .0204
Birth Wt Grp -0.78 0.64 -1.22 .2224
Controlling for all other predictors, adolescent age,
Tanner stage, and Birth length are associated with
adolescent height measurement
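The t statistics and normal-approximation P-values in the table can be reproduced from the reported b and s_b columns; a sketch for two of the rows (the `two_sided_p_normal` helper is an illustrative name, not a library function):

```python
# Reproducing t = b/s_b and the 2-sided normal-approximation P-value
# for rows of the slide 27 table.
from math import erf, sqrt

def two_sided_p_normal(t):
    # 2 * P(Z >= |t|) using the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))

coefs = {                      # variable: (b, s_b) as reported
    "Adolescent Age": (2.86, 0.99),
    "Birth Length":   (0.44, 0.19),
}
for name, (b, sb) in coefs.items():
    t = b / sb
    print(f"{name}: t = {t:.2f}, P = {two_sided_p_normal(t):.4f}")
```
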
28. Models with Dummy Variables
• Some models have both numeric and categorical
explanatory variables (Recall gender in example)
• If a categorical variable has k levels, need to create
k-1 dummy variables that take on the values 1 if
the level of interest is present, 0 otherwise.
• The baseline level of the categorical variable is the
one for which all k-1 dummy variables are set to 0
• The regression coefficient corresponding to a
dummy variable is the difference between the
mean for that level and the mean for baseline
group, controlling for all numeric predictors
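A minimal sketch of the k-1 dummy coding described above; the `dummies` helper and the example levels are illustrative:

```python
# Build k-1 dummy (0/1) variables for a k-level categorical predictor,
# taking the first listed level as the baseline.
def dummies(values, levels):
    others = levels[1:]  # levels[0] is the baseline: all dummies 0
    # one 0/1 column per non-baseline level
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in others}

levels = ["new", "1st Qtr", "full"]      # k = 3 levels -> 2 dummies
obs = ["new", "full", "1st Qtr", "new"]  # observed category per subject
print(dummies(obs, levels))
# -> {'1st Qtr': [0, 0, 1, 0], 'full': [0, 1, 0, 0]}
```

A baseline observation ("new" here) gets 0 in every dummy column, so each dummy coefficient is that level's mean difference from baseline.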
29. Example - Deep Cervical Infections
• Subjects - Patients with deep neck infections
• Response (Y) - Length of Stay in hospital
• Predictors: (One numeric, 11 Dichotomous)
– Age (x1)
– Gender (x2=1 if female, 0 if male)
– Fever (x3=1 if Body Temp > 38C, 0 if not)
– Neck swelling (x4=1 if Present, 0 if absent)
– Neck Pain (x5=1 if Present, 0 if absent)
– Trismus (x6=1 if Present, 0 if absent)
– Underlying Disease (x7=1 if Present, 0 if absent)
– Respiration Difficulty (x8=1 if Present, 0 if absent)
– Complication (x9=1 if Present, 0 if absent)
– WBC > 15000/mm³ (x10=1 if Present, 0 if absent)
– CRP > 100g/ml (x11=1 if Present, 0 if absent)
Source: Wang, et al (2003)
30. Example - Weather and Spinal Patients
• Subjects - Visitors to National Spinal Network in 23 cities
Completing SF-36 Form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
– Patient’s age (x1)
– Gender (x2=1 if female, 0 if male)
– High temperature on day of visit (x3)
– Low temperature on day of visit (x4)
– Dew point (x5)
– Wet bulb (x6)
– Total precipitation (x7)
– Barometric Pressure (x8)
– Length of sunlight (x9)
– Moon Phase (new, waxing crescent, 1st Qtr, waxing gibbous, full moon,
waning gibbous, last Qtr, waning crescent; presumably coded as 8-1 = 7
dummy variables)
Source: Glaser, et al (2004)
31. Analysis of Covariance
• Combination of 1-Way ANOVA and Linear
Regression
• Goal: Comparing numeric responses among k
groups, adjusting for numeric concomitant
variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is Post-Trt
score, covariate is Pre-Trt score
• Epidemiological applications: Outcomes
compared across exposure conditions, adjusted for
other risk factors (age, smoking status, sex,...)