2. In most applications, we need more than one variable to explain
the variation of the dependent variable.
Examples:
- Dependent variable: Profitability of a Starbucks store
  Independent variables: Population, college, income levels, businesses nearby, competition, …
- Dependent variable: Montreal house prices
  Independent variables: Area, age, # bedrooms, …

Q1: Which model should we hypothesize?
Q2: Which variables should we include in the model?

Multiple regression
2
3. Business Reality → Hypothesized relationships → Estimation and Analysis → Interpretation
(Add or drop variables; change the model)

Need for location analysis: Profit = β0 + β1XPop + β2XCollege + …
Estimated: Y = b0 + b1XPop + b2XCollege + …

Iterative process by nature (adding or dropping independent variables).
Independent variables may interact with each other and form a new independent variable called an interaction variable.
May include categorical variables.
Once again, use the process!

3
4. Estimated multiple regression model:
ŷi = b0 + b1x1i + b2x2i + ⋯ + bpxpi
where ŷi is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bp are the estimated slope coefficients.

Population model:
yi = β0 + β1x1i + β2x2i + ⋯ + βpxpi + εi
where β0 is the y-intercept, β1, …, βp are the population slopes, and εi is the random error.

Assumptions:
- All explanatory variables are statistically independent.
- Homoskedasticity and normality: εi is independent of x and follows N(0, σ²) for all x.
- No autocorrelation: Cov(εi, εj) = 0 for i ≠ j.

Multiple regression
4
6. Example: Two-variable model

Population regression model: yi = β0 + β1x1i + β2x2i + εi
Estimated regression line: ŷi = b0 + b1x1i + b2x2i

For a sample observation (x1i, x2i, yi), the estimate is ŷi and the residual is ei = yi − ŷi.
The best-fit equation, ŷ, is found by minimizing the sum of squared errors, Σei².

[Figure: observations (x1i, x2i, yi) plotted around the estimated regression plane, showing yi, ŷi, and the residual ei]

6
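The least-squares idea above can be sketched numerically. This is not from the slides: the data and variable names (`x1`, `x2`) are made up for illustration, and `numpy.linalg.lstsq` is used to find the coefficients that minimize Σei².

```python
import numpy as np

# Hypothetical data generated from y = 10 + 2*x1 - 3*x2 plus noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 10 + 2 * x1 - 3 * x2 + rng.normal(0, 0.5, 50)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the b that minimizes the sum of squared errors
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b

y_hat = X @ b      # estimates ŷi
e = y - y_hat      # residuals ei = yi - ŷi
print(b0, b1, b2)  # close to the true 10, 2, -3
```

A useful check of the fit: OLS residuals are orthogonal to every column of the design matrix, so Σei and Σei·x1i are both (numerically) zero.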
7. Want to explain the house price (Y) with 3 independent variables:
Xsqft: the size of the house (in sq. ft)
Xage: the age of the house (in yrs.)
Xbed: the number of bedrooms in the house

Population regression model:
Yi = β0 + β1Xsqft,i + β2Xage,i + β3Xbed,i + εi

Sample regression line:
Ŷi = b0 + b1Xsqft,i + b2Xage,i + b3Xbed,i

Tools/Data Analysis/Regression

Montreal house price example: Part II
7
8. Tools/Data Analysis/Regression
- Input Y Range: range of Y (include label)
- Input X Range: range of the Xj's (include label)
- Check the Labels box if the ranges include labels
- Output options: choose an output range

Montreal house price example: Part II
8
9. SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9751
R Square            0.9509
Adjusted R Square   0.9403
Standard Error      396.7315
Observations        18

ANOVA
            df      SS          MS         F          Significance F
Regression   3   42643276    14214425   90.31003   2.1252E-09
Residual    14   2203542.1   157395.9
Total       17   44846818

            Coefficients  Standard Error  t Stat   P-value   Lower 95%   Upper 95%
Intercept     2036.48       1164.9000     1.7482   0.1023    -461.9861   4534.9379
Sq. Ft           1.15          0.1793     6.4259   0.0000       0.7677      1.5369
Age            -58.70         25.4548    -2.3060   0.0369    -113.2931     -4.1029
Bedrooms        54.36        197.9985     0.2746   0.7877    -370.3003    479.0290

Regression summary statistics
ANOVA: overall significance of the model
bj's (b0, b1, b2, b3)

Montreal house price example: Regression results
9
10. The same regression output gives the estimated multiple regression model:

ŷ = 2036.48 + 1.15 Xsqft − 58.70 Xage + 54.36 Xbed

Montreal house price example: Regression results
10
11. Estimated Multiple Regression Model

ŷ = 2036.48 + 1.15 xsqft − 58.70 xage + 54.36 xbed

b0 (intercept): the estimated avg. y value when all independent variables are zero, or the estimated avg. y value not explained by the model.
b1 (slope of xsqft): the estimated avg. change in house price (in $'00s) per sq. ft when xage and xbed remain constant.
b2 (slope of xage): if a house is 1 yr older and nothing else changes, its price decreases by 58.7 ($'00s), or $5,870.
b3 (slope of xbed): if a house has 1 more bedroom and nothing else changes, its price increases by 54.36 ($'00s), or $5,436.

Montreal house example: Estimated regression
11
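The ceteris paribus reading of the slopes can be verified directly from the fitted equation. A minimal sketch using the coefficients above (the specific house values are made up for illustration):

```python
# Fitted Montreal model; prices are in $'00s
def price_hat(sqft, age, bed):
    return 2036.48 + 1.15 * sqft - 58.70 * age + 54.36 * bed

base = price_hat(2500, 5, 4)
one_more_bed = price_hat(2500, 5, 5)   # everything else held constant
one_yr_older = price_hat(2500, 6, 4)   # everything else held constant

print(round(one_more_bed - base, 2))   # 54.36 ($'00s), i.e. +$5,436
print(round(one_yr_older - base, 2))   # -58.7 ($'00s), i.e. -$5,870
```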
13. Sample regression line:
ŷ = b0 + b1xsqft + b2xage + b3xbed
is a point estimate for the population regression line.

Confidence Interval for the avg. y value for a given x = (x1, x2, …, xp)
(CI for the population regression line at x = (x1, x2, …, xp))

Prediction Interval for a particular y value for a given x = (x1, x2, …, xp)

If xsqft = 2500, xage = 5 yrs, and xbed = 4:
95% CI provides: an interval for the price of avg. 2500 sqft, 5-yr-old, 4-bed houses.
95% PI provides: an interval for the price of a specific 2500 sqft, 5-yr-old, 4-bed house.

Estimation of the mean value E(y|x) and prediction for an individual value (yi)
13
14. Point estimate: ŷ = b0 + b1xsqft + b2xage + b3xbed
Critical value: tα/2 with df = n − p − 1
The standard error of the estimate: sε

Approximated 100(1−α)% CI: ŷx ± tα/2 · sε/√n
Approximated 100(1−α)% PI: ŷx ± tα/2 · sε√(1 + 1/n)

MTL House example II: CI and PI for the y value at x = (x1, x2, …, xp)
14
15. Sample regression line:
ŷ = 2036.48 + 1.15xsqft − 58.7xage + 54.36xbed

Example: Construct a 95% CI for the avg. price of 5-year-old, 4-bedroom houses with 2500 sq. ft. area.

Point estimate: 2036.48 + 1.15×2500 − 58.7×5 + 54.36×4 = 4835.42 ($'00)
Critical value: tα/2 with df = 14; crit. val. = 2.1448

Approximated 95% CI: ŷx ± tα/2 · sε/√n
= 4835.42 ± 2.1448 × 396.7315/√18
= [4634.86, 5035.98] ($'00)

MTL House example II: CI for the avg. y value at x = (x1, x2, …, xp)
15
16. Sample regression line:
ŷ = 2036.48 + 1.15xsqft − 58.7xage + 54.36xbed

Construct a 95% PI for the price of a 5-year-old, 4-bedroom house with 2500 sq. ft. area.

Point estimate: 2036.48 + 1.15×2500 − 58.7×5 + 54.36×4 = 4835.42 ($'00)
Critical value: tα/2 with df = 14; critical value = 2.1448

Approximated 95% PI: ŷx ± tα/2 · sε√(1 + 1/n)
= 4835.42 ± 2.1448 × 396.7315 × √(1 + 1/18)
= 4835.42 ± 2.1448 × 407.60
= [3961.20, 5709.64] ($'00)

MTL house example II: PI for a predicted y value at x = (x1, x2, …, xp)
16
18. Hypothesized population regression model:
y = β0 + β1xsqft + β2xage + β3xbed + ε

R²: fraction of the total variation in Y about its mean (SST) that is explained by the regression (SSR):

R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
   = 42,643,276/44,846,818
   = 1 − 2,203,542.11/44,846,818.44
   = 0.9509

Evaluating the overall regression model: Coefficient of Determination (R²)
18
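A quick check of the R² computation, using the sums of squares from the ANOVA table above:

```python
# Sums of squares from the Montreal ANOVA table
SSR = 42_643_276
SSE = 2_203_542.11
SST = 44_846_818.44

r2_from_ssr = SSR / SST        # R^2 = SSR/SST
r2_from_sse = 1 - SSE / SST    # equivalently 1 - SSE/SST
print(round(r2_from_ssr, 4))   # 0.9509
print(round(r2_from_sse, 4))   # 0.9509
```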
19. R² will always increase when you add more independent variables (even if they are not significant). It is a mistake to use R² in isolation as a criterion for choosing among competing models.

Adjusted R² weighs the cost of adding independent variables against the increased explanatory power:

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST = 1 − (1 − R²)(n − 1)/(n − p − 1)

Adjusted R² declines if we add a non-significant variable. It is better to compare models via adjusted R² rather than R², but neither is sufficient in isolation. Use logical judgment and consider other factors (residuals).

Evaluating the overall regression model: R² and adjusted R²
19
20. Population regression model of Montreal house prices:
y = β0 + β1xsqft + β2xage + β3xbed + ε

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST
 = 1 − (2,203,542/14) / (44,846,818/17)
 = 1 − 157,395.86/2,638,048.14
 = 0.9403

Evaluating the overall regression model: Adjusted R²
20
21. New population regression model:
y = β0 + β1xsqft + β2xage + ε

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST
 = 1 − (2,215,407.93/15) / (44,846,818.44/17)
 = 1 − 147,693.86/2,638,048.14
 = 0.9440

If we evaluate the two models using adjusted R² in isolation, which model is better? The second model is slightly better.

Evaluating the overall regression model: Adjusted R²
21
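The two-model comparison can be reproduced from the sums of squares. A sketch using the figures from the slides:

```python
# Adjusted R^2 = 1 - (SSE/(n-p-1)) / (SST/(n-1))
def adj_r2(sse, sst, n, p):
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

SST, n = 44_846_818.44, 18
full  = adj_r2(2_203_542.11, SST, n, p=3)   # sqft + age + bedrooms
small = adj_r2(2_215_407.93, SST, n, p=2)   # sqft + age only
print(round(full, 4), round(small, 4))      # 0.9403 0.944
```

Even though the smaller model has a larger SSE, its adjusted R² is higher because it spends one fewer degree of freedom.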
24. Compare adjusted R², the standard error of the estimates, and t-test results.

Business Reality → Hypothesized relationships → Estimation and Analysis → Interpretation
(Add or drop variables; change the model)

Choosing the right model
24
25. Step 1: Choose the independent variable that results in the largest value of adjusted R² when included in the model.
Step 2: Continue adding new independent variables one at a time as long as adding the variable increases adjusted R².
Step 3: Stop adding variables when adding any remaining independent variable does not increase adjusted R².

Stepwise Model Building (based on adjusted R²)
25
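The three steps can be sketched as a small forward-selection loop. This is not from the slides: the data are synthetic (y depends on `x1` and `x2`; `x3` is pure noise), and `numpy.linalg.lstsq` does the fitting:

```python
import numpy as np

def adj_r2(y, cols):
    """Adjusted R^2 of an OLS fit of y on the given columns (plus intercept)."""
    n, p = len(y), len(cols)
    X = np.column_stack([np.ones(n)] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def forward_stepwise(y, candidates):
    """Add variables one at a time while adjusted R^2 keeps increasing."""
    chosen, best = [], -np.inf
    remaining = dict(candidates)
    while remaining:
        # Steps 1-2: score each remaining variable added to the current model
        scores = {name: adj_r2(y, [remaining[name]] + [candidates[c] for c in chosen])
                  for name in remaining}
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break                      # Step 3: no improvement -> stop
        chosen.append(name)
        best = score
        del remaining[name]
    return chosen, best

# Hypothetical data: y depends on x1 and x2; x3 is irrelevant noise
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 100))
y = 5 + 2 * x1 - 1 * x2 + rng.normal(0, 0.5, 100)
chosen, best = forward_stepwise(y, {"x1": x1, "x2": x2, "x3": x3})
print(chosen)   # x1 first (largest adjusted R^2 alone), then x2
```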
26. Step 1: Initially, we have three independent variables:
X1: Area (sq. ft)
X2: Age (years)
X3: Number of bedrooms

Question: Which independent variable should be selected first?
Answer: Adding X1 results in the largest value of adjusted R², which is 0.9214. So, we start by adding X1.

Example: Montreal House Dataset
26

Independent Variable   Adjusted R²
X1                     0.9214
X2                     0.7917
X3                     0.3816
27. Step 2: We have two remaining independent variables:
X2: Age (years)
X3: Number of bedrooms

Question: Which one results in an increase in adjusted R²?
Answer: Adding either X2 or X3 increases adjusted R².
Question: Which one results in the larger increase?
Answer: Adding X2 results in a larger increase in adjusted R² than adding X3 does. So, we add X2.

Example: Montreal House Dataset
27

Independent Variables   Adjusted R²
X1                      0.9214
X1, X2                  0.9440
X1, X3                  0.9231
28. Step 3: We have only one remaining independent variable:
X3: Number of bedrooms

Question: Does adding X3 increase adjusted R²?
Answer: No, adding X3 (i.e., number of bedrooms) does not increase adjusted R². So, we don't add X3 and stop.
Our final regression model contains only two independent variables: X1 and X2.

Example: Montreal House Dataset
28

Independent Variables   Adjusted R²
X1, X2                  0.9440
X1, X2, X3              0.9403
30. Suppose that we want to predict the price of a house with X1 = 3000 sq ft and X2 = 10 years of age. Construct a 95% Prediction Interval.

Model:
Ŷ = 2274.085 + 1.155 X1 − 61.533 X2

Step 1: Use the model to find the predicted price:
Ŷx = 2274.085 + 1.155 × 3000 − 61.533 × 10 ≅ 5123.755

Example: Constructing Prediction Intervals
30
31. Step 2: Next, calculate the Margin of Error for the 95% prediction interval:

Margin of Error = sε × √(1 + 1/n) × tα/2

Standard deviation of residuals: sε = 384.31
Degrees of freedom: 18 − 2 − 1 = 15. Critical value: t(0.025, df = 15) = 2.131 (from the t-table)

Margin of Error = 384.31 × √(1 + 1/18) × 2.131 ≅ 841.406

Example: Constructing Prediction Intervals
31
32. Step 3: Calculate the 95% LPI and UPI (lower and upper prediction limits):
LPI = Ŷx − ME = 5123.755 − 841.406 ≅ 4282.349
UPI = Ŷx + ME = 5123.755 + 841.406 ≅ 5965.161

Example: Constructing Prediction Intervals
32
35. Using the coefficient table from the regression output:

            Coefficients  Standard Error  t Stat   P-value
Intercept     2036.48       1164.9000     1.7482   0.1023
Sq. Ft           1.15          0.1793     6.4259   0.0000
Age            -58.70         25.4548    -2.3060   0.0369
Bedrooms        54.36        197.9985     0.2746   0.7877

100(1−α)% CI for βi:
Point Estimate ± (Crit. value)(Std. Err): bi ± tα/2 · sbi, where df = n − p − 1

95% CI for β1 (sq. ft): 1.15 ± 2.1448 × 0.1793 = [0.7654, 1.5346] (in $'00s)

Montreal house example: Estimating population parameters: CI's
35
36. Using the same coefficient table:

100(1−α)% CI for βi: bi ± tα/2 · sbi, where df = n − p − 1

90% CI for β3 (bed): 54.36 ± 1.7613 × 198.00 = [−294.38, 403.10] (in $'00s)

Estimating population regression parameters: CI's
36
37. Using the same coefficient table, Excel's default test for each regression parameter (βi):

H0: βi = 0
HA: βi ≠ 0

For β1 (sq. ft), at α = 0.05:
t-stat: t = (b1 − 0)/sb1 = 6.4259
p-value: ≈ 0
df = n − p − 1 = 18 − 3 − 1 = 14
Verdict: Reject H0. Conclude that the coefficient of square footage is not equal to zero.

Test for regression parameter (βi): Excel default test
37
38. In this model, does the # of bedrooms help explain house prices effectively?

c) H0: β3 = 0, HA: β3 ≠ 0, at α = 0.05
t-stat: t = (b3 − 0)/sb3 = 0.2746
p-value: = T.DIST.2T(0.2746, 14) = 0.7877
df = n − p − 1 = 14
Critical t = T.INV(0.975, 14) = 2.145
Verdict: Fail to reject the null hypothesis that β3 = 0. No, bedrooms are not statistically significant in this model.

Test for regression parameter (βi): What if βi is not statistically significant?
38
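The two t-tests can be sketched from the printed coefficients and standard errors (the slides' t-stats use unrounded coefficients, so the last digit can differ slightly):

```python
# Excel's default two-sided test: H0: beta_i = 0 vs HA: beta_i != 0
t_crit = 2.145                      # T.INV(0.975, 14), as on the slide

def t_test(b, s_b):
    t = b / s_b                     # t-stat = (b_i - 0) / s_{b_i}
    return t, abs(t) > t_crit       # reject H0 when |t| exceeds the critical value

t_sqft, rej_sqft = t_test(1.15, 0.1793)
t_bed, rej_bed = t_test(54.36, 197.9985)
print(rej_sqft, rej_bed)  # True False -> sq. ft is significant, bedrooms is not
```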
39. [Figure: Sq. Ft and Age residual plots, with residuals ranging from −800 to 800 plotted against Sq. Ft (0 to 6000) and Age (0 to 40)]

Any violations of regression assumptions?
- Homoskedasticity:
- Normality:
- No correlation among residuals:

Montreal house: Residual analysis
39
41. Note that price seems to depend on age in a nonlinear fashion. How can we model this nonlinear relation between Price and Age?
Nonlinear relation between Dependent and Independent Variables
41
42. Nonlinear relation between Dependent and Independent Variables
42
Solution: We can add a nonlinear term to the regression model.
Revised Model:
Price = b0 + b1 × Age + b2 × Age²
43. Nonlinear relation between Dependent and Independent Variables
43
Question: How can we estimate this by using Excel?
Solution: We can calculate the square of the age variable for each house and add Age² as a new independent variable. The revised model will be as follows:
Price = b0 + b1 × Age + b2 × Age²
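The same "add a squared column" trick, sketched outside Excel. This is not the Montreal data: the ages, prices, and curvature below are made up for illustration, and `numpy.linalg.lstsq` does the fitting:

```python
import numpy as np

# Hypothetical data with genuine curvature in age
rng = np.random.default_rng(2)
age = rng.uniform(0, 40, 60)
price = 5000 - 80 * age + 1.5 * age**2 + rng.normal(0, 50, 60)

# Add Age^2 as a new column, exactly as you would add a squared column in Excel
X = np.column_stack([np.ones_like(age), age, age**2])
b, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = b
print(b2)  # close to the made-up curvature of 1.5
```

The model stays linear in the coefficients b0, b1, b2, so ordinary least squares still applies; only the columns of the design matrix are nonlinear in Age.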
46. Categorical Variables
- Categories: labels, flavors, or colors
- A single numerical scale doesn't make sense

Q: How do we assign numerical values to categorical data?
A: Use dummy or indicator variables.

Dummy variable (a.k.a. indicator, binary, categorical, 0-1): a 0-1 variable that indicates the presence or absence of an attribute. Why is it called "dummy"? The numeric value only indicates whether the observation has the attribute or not.

Example: Coffee drinker vs. not a coffee drinker
ICoffee = 1 if coffee drinker, 0 if not a coffee drinker

Example: Male vs. Female
IFemale = 1 if female, 0 if male

Categorical data and dummy variables
46
47. For 2 categories we need (1) dummy variable. Why? The "base" case is represented by I = 0.
Gender has two categories (n = 2): choose one (e.g., male) as the base case and define a "0 or 1" dummy variable for the other (e.g., IFemale).

Q: How many dummy variables do we need to model "n" mutually exclusive and collectively exhaustive categories?
Ij = 1 if the obs. belongs to category j, 0 otherwise, for j = 1, 2, …, (?)
Answer: n − 1 dummy variables.

Categorical data and dummy variables: How many variables do we need?
47
48. Example

Categories: First class, Business class, Economy class
Base case: Economy

IF = 1 for a first-class passenger, 0 otherwise
IB = 1 for a business-class passenger, 0 otherwise

a) How do we model the i-th passenger being a first-class passenger? IF,i = 1 and IB,i = 0
b) How do we model the i-th passenger being a business-class passenger? IF,i = 0 and IB,i = 1
c) How do we model the i-th passenger being an economy passenger? IF,i = 0 and IB,i = 0

Categorical data and dummy variables: How many variables do we need?
48
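The n − 1 encoding can be sketched with a small helper. The function `dummies` below is hypothetical (not from the slides), but it implements exactly the scheme above: one 0-1 indicator per non-base category, with the base case encoded as all zeros:

```python
# n categories -> n-1 dummy variables; the base case is all zeros
def dummies(value, categories, base):
    """Hypothetical helper: 0/1 indicators for every category except the base."""
    return {f"I_{c}": int(value == c) for c in categories if c != base}

cats = ["First", "Business", "Economy"]
print(dummies("First", cats, base="Economy"))    # {'I_First': 1, 'I_Business': 0}
print(dummies("Economy", cats, base="Economy"))  # {'I_First': 0, 'I_Business': 0}
```

Using n dummies instead of n − 1 would make the columns sum to the intercept column, so the regression could not be estimated (perfect multicollinearity).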
50. Interaction: the effect of one explanatory variable on Y depends on the value of other explanatory variable(s).

Example: Sales volume of Dell PCs (Y); Price (XPrice); Advertising (XAdv)
Q: When does advertising have a large effect on sales volume? (When the price is high, or when it is low?)
Interaction term: Price × Advertising

If there is an (omitted) interaction term, our assumption that all explanatory variables are independent is violated.
Trade-off: technical assumptions vs. modeling business reality.

Interactions among explanatory variables
50
51. Interaction Example: Internet, Computer games and Fatigue

[Figure: scatter plot of fatigue index (0-100) vs. hours spent (0-30), with separate Male and Female series]

Y: fatigue index, on a 0-100 scale
X: hrs spent on the internet and computer games, per week

Do males and females appear to respond differently?
Can we model both in a single regression?

51
52. Proposed population regression model:

yi = β0 + βhrs xhrs,i + βfem xfem,i + βint xint,i + εi
   = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi

Interaction Example: Internet, Computer games and Fatigue
52
53. Proposed population regression model:

yi = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi

males (xfem = 0):   yi = β0 + βhrs xhrs,i + εi   (intercept β0, slope βhrs)
females (xfem = 1): yi = (β0 + βfem) + (βhrs + βint) xhrs,i + εi   (slope βhrs + βint)

Interpretation:
βfem: difference in intercept between females and males; the avg. difference in fatigue index between females and males when hrs = 0.
βint: difference in slopes between females and males; the avg. difference between females and males in the impact of 1 hr of computer games on the fatigue index.

Interaction Example: Internet, Computer games and Fatigue
53
54. Regression Statistics
Multiple R          0.9680
R Square            0.9369
Adjusted R Square   0.9251
Standard Error      7.5872
Observations        20

ANOVA
            df      SS             MS         F          Significance F
Regression   3   13682.21831    4560.739   79.22681   8.12E-10
Residual    16     921.0496881    57.56561
Total       19   14603.268

                                  Coefficients  Standard Error   t Stat    P-value   Lower 95%  Upper 95%
Intercept                           -29.644        7.5663        -3.9180   0.0012    -45.6844   -13.6046
Hrs spent on Internet and Games       4.738        0.3464        13.6774   0.0000
Female                               50.748       10.6102         4.7829   0.0002     28.2550    73.2404
Hr*Female                            -3.233        0.5137        -6.2942   0.0000     -4.3221    -2.1442

a) Sample regression line: ŷ = −29.644 + 4.738 xhrs + 50.748 xfem − 3.233 (xhrs × xfem)
b) R² = 0.9369, Adj. R² = 0.9251
c) Standard error of the estimate (s) = 7.5872

Interaction Example: Internet, Computer games and Fatigue
54
55. d) Sample regression line: ŷ = −29.644 + 4.74 xhrs + 50.75 xfem − 3.23 (xhrs × xfem)

Sample regression line for male students:   ŷ = −29.644 + 4.74 xhrs   (intercept −29.644, slope 4.74)
Sample regression line for female students: ŷ = 21.11 + 1.51 xhrs     (intercept 21.11, slope 1.51)

Interpretation:
b0: intercept for males
bhrs: slope for males
bfem: intercept difference between females and males
bint: slope difference between females and males

Interaction Example: Internet, Computer games and Fatigue
55
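The group-specific lines can be derived directly from the interaction model's coefficients. A sketch using the regression output above (the slides' 21.11 and 1.51 come from more heavily rounded coefficients):

```python
# Coefficients from the fatigue regression output
b0, b_hrs, b_fem, b_int = -29.644, 4.738, 50.748, -3.233

def fatigue_hat(hrs, female):
    return b0 + b_hrs * hrs + b_fem * female + b_int * hrs * female

# Setting female = 1 collapses the model to an intercept and a slope:
fem_intercept = b0 + b_fem   # ≈ 21.10
fem_slope = b_hrs + b_int    # ≈ 1.51
print(round(fem_intercept, 2), round(fem_slope, 3))
```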
56. Key learning points
Estimating Multiple Regression
Assessing a Multiple Regression Model
Stepwise Model Building
Optional Material:
Testing significance of regression coefficients
Modeling nonlinear relations
Encoding Categorical Variables
Modeling Interaction terms
56