2. In most applications, we need more than one variable to explain
the variation of the dependent variable.
Examples:
- Dependent variable: Profitability of a Starbucks store
  Independent variables: Population, college, income levels, businesses nearby, competition, …
- Dependent variable: Montreal house prices
  Independent variables: Area, age, # bedrooms, …

Q1: Which model should we hypothesize?
Q2: Which variables should we include in the model?

Multiple regression
2
3. Business Reality → Hypothesized relationships → Estimation and Analysis → Interpretation
(Add or drop variables; change the model)

Need for location analysis: Profit = β0 + β1XPop + β2XCollege + …
Estimated: Y = b0 + b1XPop + b2XCollege + …

Iterative process by nature (adding or dropping independent variables).
Independent variables may interact with each other and form a new independent variable called an interaction variable.
May include categorical variables.
Once again, use the process!

3
4. Estimated multiple regression model:
ŷi = b0 + b1x1i + b2x2i + ⋯ + bpxpi
where ŷi is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bp are the estimated slope coefficients.

Population model:
yi = β0 + β1x1i + β2x2i + ⋯ + βpxpi + εi
where β0 is the y-intercept, β1, …, βp are the population slopes, and εi is the random error.

Assumptions:
- All explanatory variables are statistically independent.
- Homoskedasticity and normality: εi is independent of x and follows N(0, σ²) for all x.
- No autocorrelation: Cov(εi, εj) = 0 for i ≠ j.

Multiple regression
4
6. Example: Two-variable model

Population regression model: yi = β0 + β1x1i + β2x2i + εi
Estimated regression line: ŷi = b0 + b1x1i + b2x2i

For a sample observation (x1i, x2i, yi), the estimate is ŷi and the residual is ei = yi − ŷi.
The best-fit equation, ŷ, is found by minimizing the sum of squared errors, Σei².

[Figure: observations (x1i, x2i, yi) plotted around the estimated regression plane, showing yi, ŷi, and the residual ei]

6
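The least-squares idea above can be sketched numerically. This is not from the slides: the data and variable names (`x1`, `x2`) are made up for illustration, and `numpy.linalg.lstsq` is used to find the coefficients that minimize Σei².

```python
import numpy as np

# Hypothetical data generated from y = 10 + 2*x1 - 3*x2 plus noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 10 + 2 * x1 - 3 * x2 + rng.normal(0, 0.5, 50)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the b that minimizes the sum of squared errors
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b

y_hat = X @ b      # estimates ŷi
e = y - y_hat      # residuals ei = yi - ŷi
print(b0, b1, b2)  # close to the true 10, 2, -3
```

A useful check of the fit: OLS residuals are orthogonal to every column of the design matrix, so Σei and Σei·x1i are both (numerically) zero.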
7. Want to explain the house price (Y) with 3 independent variables:
Xsqft: the size of the house (in sq. ft)
Xage: the age of the house (in yrs.)
Xbed: the number of bedrooms in the house

Population regression model:
Yi = β0 + β1Xsqft,i + β2Xage,i + β3Xbed,i + εi

Sample regression line:
Ŷi = b0 + b1Xsqft,i + b2Xage,i + b3Xbed,i

Tools/Data Analysis/Regression

Montreal house price example: Part II
7
8. Tools/Data Analysis/Regression
- Input Y Range: range of Y (include label)
- Input X Range: range of the Xj's (include label)
- Check the Labels box if the ranges include labels
- Output options: choose an output range

Montreal house price example: Part II
8
9. SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9751
R Square            0.9509
Adjusted R Square   0.9403
Standard Error      396.7315
Observations        18

ANOVA
            df      SS          MS         F          Significance F
Regression   3   42643276    14214425   90.31003   2.1252E-09
Residual    14   2203542.1   157395.9
Total       17   44846818

            Coefficients  Standard Error  t Stat   P-value   Lower 95%   Upper 95%
Intercept     2036.48       1164.9000     1.7482   0.1023    -461.9861   4534.9379
Sq. Ft           1.15          0.1793     6.4259   0.0000       0.7677      1.5369
Age            -58.70         25.4548    -2.3060   0.0369    -113.2931     -4.1029
Bedrooms        54.36        197.9985     0.2746   0.7877    -370.3003    479.0290

Regression summary statistics
ANOVA: overall significance of the model
bj's (b0, b1, b2, b3)

Montreal house price example: Regression results
9
10. The same regression output gives the estimated multiple regression model:

ŷ = 2036.48 + 1.15 Xsqft − 58.70 Xage + 54.36 Xbed

Montreal house price example: Regression results
10
11. Estimated Multiple Regression Model

ŷ = 2036.48 + 1.15 xsqft − 58.70 xage + 54.36 xbed

b0 (intercept): the estimated avg. y value when all independent variables are zero, or the estimated avg. y value not explained by the model.
b1 (slope of xsqft): the estimated avg. change in house price (in $'00s) per sq. ft when xage and xbed remain constant.
b2 (slope of xage): if a house is 1 yr older and nothing else changes, its price decreases by 58.7 ($'00s), or $5,870.
b3 (slope of xbed): if a house has 1 more bedroom and nothing else changes, its price increases by 54.36 ($'00s), or $5,436.

Montreal house example: Estimated regression
11
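The ceteris paribus reading of the slopes can be verified directly from the fitted equation. A minimal sketch using the coefficients above (the specific house values are made up for illustration):

```python
# Fitted Montreal model; prices are in $'00s
def price_hat(sqft, age, bed):
    return 2036.48 + 1.15 * sqft - 58.70 * age + 54.36 * bed

base = price_hat(2500, 5, 4)
one_more_bed = price_hat(2500, 5, 5)   # everything else held constant
one_yr_older = price_hat(2500, 6, 4)   # everything else held constant

print(round(one_more_bed - base, 2))   # 54.36 ($'00s), i.e. +$5,436
print(round(one_yr_older - base, 2))   # -58.7 ($'00s), i.e. -$5,870
```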
13. Sample regression line:
ŷ = b0 + b1xsqft + b2xage + b3xbed
is a point estimate for the population regression line.

Confidence Interval for the avg. y value for a given x = (x1, x2, …, xp)
(CI for the population regression line at x = (x1, x2, …, xp))

Prediction Interval for a particular y value for a given x = (x1, x2, …, xp)

If xsqft = 2500, xage = 5 yrs, and xbed = 4:
95% CI provides: an interval for the price of avg. 2500 sqft, 5-yr-old, 4-bed houses.
95% PI provides: an interval for the price of a specific 2500 sqft, 5-yr-old, 4-bed house.

Estimation of the mean value E(y|x) and prediction for an individual value (yi)
13
14. Point estimate: ŷ = b0 + b1xsqft + b2xage + b3xbed
Critical value: tα/2 with df = n − p − 1
The standard error of the estimate: sε

Approximated 100(1−α)% CI: ŷx ± tα/2 · sε/√n
Approximated 100(1−α)% PI: ŷx ± tα/2 · sε√(1 + 1/n)

MTL House example II: CI and PI for the y value at x = (x1, x2, …, xp)
14
15. Sample regression line:
ŷ = 2036.48 + 1.15xsqft − 58.7xage + 54.36xbed

Example: Construct a 95% CI for the avg. price of 5-year-old, 4-bedroom houses with 2500 sq. ft. area.

Point estimate: 2036.48 + 1.15×2500 − 58.7×5 + 54.36×4 = 4835.42 ($'00)
Critical value: tα/2 with df = 14; crit. val. = 2.1448

Approximated 95% CI: ŷx ± tα/2 · sε/√n
= 4835.42 ± 2.1448 × 396.7315/√18
= [4634.86, 5035.98] ($'00)

MTL House example II: CI for the avg. y value at x = (x1, x2, …, xp)
15
16. Sample regression line:
ŷ = 2036.48 + 1.15xsqft − 58.7xage + 54.36xbed

Construct a 95% PI for the price of a 5-year-old, 4-bedroom house with 2500 sq. ft. area.

Point estimate: 2036.48 + 1.15×2500 − 58.7×5 + 54.36×4 = 4835.42 ($'00)
Critical value: tα/2 with df = 14; critical value = 2.1448

Approximated 95% PI: ŷx ± tα/2 · sε√(1 + 1/n)
= 4835.42 ± 2.1448 × 396.7315 × √(1 + 1/18)
= 4835.42 ± 2.1448 × 407.60
= [3961.20, 5709.64] ($'00)

MTL house example II: PI for a predicted y value at x = (x1, x2, …, xp)
16
18. Hypothesized population regression model:
y = β0 + β1xsqft + β2xage + β3xbed + ε

R²: fraction of the total variation in Y about its mean (SST) that is explained by the regression (SSR):

R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
   = 42,643,276/44,846,818
   = 1 − 2,203,542.11/44,846,818.44
   = 0.9509

Evaluating the overall regression model: Coefficient of Determination (R²)
18
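A quick check of the R² computation, using the sums of squares from the ANOVA table above:

```python
# Sums of squares from the Montreal ANOVA table
SSR = 42_643_276
SSE = 2_203_542.11
SST = 44_846_818.44

r2_from_ssr = SSR / SST        # R^2 = SSR/SST
r2_from_sse = 1 - SSE / SST    # equivalently 1 - SSE/SST
print(round(r2_from_ssr, 4))   # 0.9509
print(round(r2_from_sse, 4))   # 0.9509
```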
19. R² will always increase when you add more independent variables (even if they are not significant). It is a mistake to use R² in isolation as a criterion for choosing among competing models.

Adjusted R² weighs the cost of adding independent variables against the increased explanatory power:

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST = 1 − (1 − R²)(n − 1)/(n − p − 1)

Adjusted R² declines if we add a non-significant variable. It is better to compare models via adjusted R² rather than R², but neither is sufficient in isolation. Use logical judgment and consider other factors (residuals).

Evaluating the overall regression model: R² and adjusted R²
19
20. Population regression model of Montreal house prices:
y = β0 + β1xsqft + β2xage + β3xbed + ε

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST
 = 1 − (2,203,542/14) / (44,846,818/17)
 = 1 − 157,395.86/2,638,048.14
 = 0.9403

Evaluating the overall regression model: Adjusted R²
20
21. New population regression model:
y = β0 + β1xsqft + β2xage + ε

Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST
 = 1 − (2,215,407.93/15) / (44,846,818.44/17)
 = 1 − 147,693.86/2,638,048.14
 = 0.9440

If we evaluate the two models using adjusted R² in isolation, which model is better? The second model is slightly better.

Evaluating the overall regression model: Adjusted R²
21
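The two-model comparison can be reproduced from the sums of squares. A sketch using the figures from the slides:

```python
# Adjusted R^2 = 1 - (SSE/(n-p-1)) / (SST/(n-1))
def adj_r2(sse, sst, n, p):
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

SST, n = 44_846_818.44, 18
full  = adj_r2(2_203_542.11, SST, n, p=3)   # sqft + age + bedrooms
small = adj_r2(2_215_407.93, SST, n, p=2)   # sqft + age only
print(round(full, 4), round(small, 4))      # 0.9403 0.944
```

Even though the smaller model has a larger SSE, its adjusted R² is higher because it spends one fewer degree of freedom.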
24. Compare adjusted R², the standard error of the estimates, and t-test results.

Business Reality → Hypothesized relationships → Estimation and Analysis → Interpretation
(Add or drop variables; change the model)

Choosing the right model
24
25. Step 1: Choose the independent variable that results in the largest value of adjusted R² when included in the model.
Step 2: Continue adding new independent variables one at a time as long as adding the variable increases adjusted R².
Step 3: Stop adding variables when adding any remaining independent variable does not increase adjusted R².

Stepwise Model Building (based on adjusted R²)
25
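The three steps can be sketched as a small forward-selection loop. This is not from the slides: the data are synthetic (y depends on `x1` and `x2`; `x3` is pure noise), and `numpy.linalg.lstsq` does the fitting:

```python
import numpy as np

def adj_r2(y, cols):
    """Adjusted R^2 of an OLS fit of y on the given columns (plus intercept)."""
    n, p = len(y), len(cols)
    X = np.column_stack([np.ones(n)] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def forward_stepwise(y, candidates):
    """Add variables one at a time while adjusted R^2 keeps increasing."""
    chosen, best = [], -np.inf
    remaining = dict(candidates)
    while remaining:
        # Steps 1-2: score each remaining variable added to the current model
        scores = {name: adj_r2(y, [remaining[name]] + [candidates[c] for c in chosen])
                  for name in remaining}
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break                      # Step 3: no improvement -> stop
        chosen.append(name)
        best = score
        del remaining[name]
    return chosen, best

# Hypothetical data: y depends on x1 and x2; x3 is irrelevant noise
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 100))
y = 5 + 2 * x1 - 1 * x2 + rng.normal(0, 0.5, 100)
chosen, best = forward_stepwise(y, {"x1": x1, "x2": x2, "x3": x3})
print(chosen)   # x1 first (largest adjusted R^2 alone), then x2
```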
26. Step 1: Initially, we have three independent variables:
X1: Area (sq. ft)
X2: Age (years)
X3: Number of bedrooms

Question: Which independent variable should be selected first?
Answer: Adding X1 results in the largest value of adjusted R², which is 0.9214. So, we start by adding X1.

Example: Montreal House Dataset
26

Independent Variable   Adjusted R²
X1                     0.9214
X2                     0.7917
X3                     0.3816
27. Step 2: We have two remaining independent variables:
X2: Age (years)
X3: Number of bedrooms

Question: Which one results in an increase in adjusted R²?
Answer: Adding either X2 or X3 increases adjusted R².
Question: Which one results in the larger increase?
Answer: Adding X2 results in a larger increase in adjusted R² than adding X3 does. So, we add X2.

Example: Montreal House Dataset
27

Independent Variables   Adjusted R²
X1                      0.9214
X1, X2                  0.9440
X1, X3                  0.9231
28. Step 3: We have only one remaining independent variable:
X3: Number of bedrooms

Question: Does adding X3 increase adjusted R²?
Answer: No, adding X3 (i.e., number of bedrooms) does not increase adjusted R². So, we don't add X3 and stop.
Our final regression model contains only two independent variables: X1 and X2.

Example: Montreal House Dataset
28

Independent Variables   Adjusted R²
X1, X2                  0.9440
X1, X2, X3              0.9403
30. Suppose that we want to predict the price of a house with X1 = 3000 sq ft and X2 = 10 years of age. Construct a 95% Prediction Interval.

Model:
Ŷ = 2274.085 + 1.155 X1 − 61.533 X2

Step 1: Use the model to find the predicted price:
Ŷx = 2274.085 + 1.155 × 3000 − 61.533 × 10 ≅ 5123.755

Example: Constructing Prediction Intervals
30
31. Step 2: Next, calculate the Margin of Error for the 95% prediction interval:

Margin of Error = sε × √(1 + 1/n) × tα/2

Standard deviation of residuals: sε = 384.31
Degrees of freedom: 18 − 2 − 1 = 15. Critical value: t(0.025, df = 15) = 2.131 (from the t-table)

Margin of Error = 384.31 × √(1 + 1/18) × 2.131 ≅ 841.406

Example: Constructing Prediction Intervals
31
32. Step 3: Calculate the 95% LPI and UPI (lower and upper prediction limits):
LPI = Ŷx − ME = 5123.755 − 841.406 ≅ 4282.349
UPI = Ŷx + ME = 5123.755 + 841.406 ≅ 5965.161

Example: Constructing Prediction Intervals
32
35. Using the coefficient table from the regression output:

            Coefficients  Standard Error  t Stat   P-value
Intercept     2036.48       1164.9000     1.7482   0.1023
Sq. Ft           1.15          0.1793     6.4259   0.0000
Age            -58.70         25.4548    -2.3060   0.0369
Bedrooms        54.36        197.9985     0.2746   0.7877

100(1−α)% CI for βi:
Point Estimate ± (Crit. value)(Std. Err): bi ± tα/2 · sbi, where df = n − p − 1

95% CI for β1 (sq. ft): 1.15 ± 2.1448 × 0.1793 = [0.7654, 1.5346] (in $'00s)

Montreal house example: Estimating population parameters: CI's
35
36. Using the same coefficient table:

100(1−α)% CI for βi: bi ± tα/2 · sbi, where df = n − p − 1

90% CI for β3 (bed): 54.36 ± 1.7613 × 198.00 = [−294.38, 403.10] (in $'00s)

Estimating population regression parameters: CI's
36
37. Using the same coefficient table, Excel's default test for each regression parameter (βi):

H0: βi = 0
HA: βi ≠ 0

For β1 (sq. ft), at α = 0.05:
t-stat: t = (b1 − 0)/sb1 = 6.4259
p-value: ≈ 0
df = n − p − 1 = 18 − 3 − 1 = 14
Verdict: Reject H0. Conclude that the coefficient of square footage is not equal to zero.

Test for regression parameter (βi): Excel default test
37
38. In this model, does the # of bedrooms help explain house prices effectively?

c) H0: β3 = 0, HA: β3 ≠ 0, at α = 0.05
t-stat: t = (b3 − 0)/sb3 = 0.2746
p-value: = T.DIST.2T(0.2746, 14) = 0.7877
df = n − p − 1 = 14
Critical t = T.INV(0.975, 14) = 2.145
Verdict: Fail to reject the null hypothesis that β3 = 0. No, bedrooms are not statistically significant in this model.

Test for regression parameter (βi): What if βi is not statistically significant?
38
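The two t-tests can be sketched from the printed coefficients and standard errors (the slides' t-stats use unrounded coefficients, so the last digit can differ slightly):

```python
# Excel's default two-sided test: H0: beta_i = 0 vs HA: beta_i != 0
t_crit = 2.145                      # T.INV(0.975, 14), as on the slide

def t_test(b, s_b):
    t = b / s_b                     # t-stat = (b_i - 0) / s_{b_i}
    return t, abs(t) > t_crit       # reject H0 when |t| exceeds the critical value

t_sqft, rej_sqft = t_test(1.15, 0.1793)
t_bed, rej_bed = t_test(54.36, 197.9985)
print(rej_sqft, rej_bed)  # True False -> sq. ft is significant, bedrooms is not
```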
39. [Figure: Sq. Ft and Age residual plots, with residuals ranging from −800 to 800 plotted against Sq. Ft (0 to 6000) and Age (0 to 40)]

Any violations of regression assumptions?
- Homoskedasticity:
- Normality:
- No correlation among residuals:

Montreal house: Residual analysis
39
41. Note that price seems to depend on age in a nonlinear fashion. How can we model this nonlinear relation between Price and Age?
Nonlinear relation between Dependent and Independent Variables
41
42. Nonlinear relation between Dependent and Independent Variables
42
Solution: We can add a nonlinear term to the regression model.
Revised Model:
Price = b0 + b1 × Age + b2 × Age²
43. Nonlinear relation between Dependent and Independent Variables
43
Question: How can we estimate this by using Excel?
Solution: We can calculate the square of the age variable for each house and add Age² as a new independent variable. The revised model will be as follows:
Price = b0 + b1 × Age + b2 × Age²
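The same "add a squared column" trick, sketched outside Excel. This is not the Montreal data: the ages, prices, and curvature below are made up for illustration, and `numpy.linalg.lstsq` does the fitting:

```python
import numpy as np

# Hypothetical data with genuine curvature in age
rng = np.random.default_rng(2)
age = rng.uniform(0, 40, 60)
price = 5000 - 80 * age + 1.5 * age**2 + rng.normal(0, 50, 60)

# Add Age^2 as a new column, exactly as you would add a squared column in Excel
X = np.column_stack([np.ones_like(age), age, age**2])
b, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = b
print(b2)  # close to the made-up curvature of 1.5
```

The model stays linear in the coefficients b0, b1, b2, so ordinary least squares still applies; only the columns of the design matrix are nonlinear in Age.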
46. Categorical Variables
- Categories: labels, flavors, or colors
- A single numerical scale doesn't make sense

Q: How do we assign numerical values to categorical data?
A: Use dummy or indicator variables.

Dummy variable (a.k.a. indicator, binary, categorical, 0-1): a 0-1 variable that indicates the presence or absence of an attribute. Why is it called "dummy"? The numeric value only indicates whether the observation has the attribute or not.

Example: Coffee drinker vs. not a coffee drinker
ICoffee = 1 if coffee drinker, 0 if not a coffee drinker

Example: Male vs. Female
IFemale = 1 if female, 0 if male

Categorical data and dummy variables
46
47. For 2 categories we need (1) dummy variable. Why? The "base" case is represented by I = 0.
Gender has two categories (n = 2): choose one (e.g., male) as the base case and define a "0 or 1" dummy variable for the other (e.g., IFemale).

Q: How many dummy variables do we need to model "n" mutually exclusive and collectively exhaustive categories?
Ij = 1 if the obs. belongs to category j, 0 otherwise, for j = 1, 2, …, (?)
Answer: n − 1 dummy variables.

Categorical data and dummy variables: How many variables do we need?
47
48. Example

Categories: First class, Business class, Economy class
Base case: Economy

IF = 1 for a first-class passenger, 0 otherwise
IB = 1 for a business-class passenger, 0 otherwise

a) How do we model the i-th passenger being a first-class passenger? IF,i = 1 and IB,i = 0
b) How do we model the i-th passenger being a business-class passenger? IF,i = 0 and IB,i = 1
c) How do we model the i-th passenger being an economy passenger? IF,i = 0 and IB,i = 0

Categorical data and dummy variables: How many variables do we need?
48
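The n − 1 encoding can be sketched with a small helper. The function `dummies` below is hypothetical (not from the slides), but it implements exactly the scheme above: one 0-1 indicator per non-base category, with the base case encoded as all zeros:

```python
# n categories -> n-1 dummy variables; the base case is all zeros
def dummies(value, categories, base):
    """Hypothetical helper: 0/1 indicators for every category except the base."""
    return {f"I_{c}": int(value == c) for c in categories if c != base}

cats = ["First", "Business", "Economy"]
print(dummies("First", cats, base="Economy"))    # {'I_First': 1, 'I_Business': 0}
print(dummies("Economy", cats, base="Economy"))  # {'I_First': 0, 'I_Business': 0}
```

Using n dummies instead of n − 1 would make the columns sum to the intercept column, so the regression could not be estimated (perfect multicollinearity).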
50. Interaction: the effect of one explanatory variable on Y depends on the value of other explanatory variable(s).

Example: Sales volume of Dell PCs (Y); Price (XPrice); Advertising (XAdv)
Q: When does advertising have a large effect on sales volume? (When the price is high, or when it is low?)
Interaction term: Price × Advertising

If there is an (omitted) interaction term, our assumption that all explanatory variables are independent is violated.
Trade-off: technical assumptions vs. modeling business reality.

Interactions among explanatory variables
50
51. Interaction Example: Internet, Computer games and Fatigue

[Figure: scatter plot of fatigue index (0-100) vs. hours spent (0-30), with separate Male and Female series]

Y: fatigue index, on a 0-100 scale
X: hrs spent on the internet and computer games, per week

Do males and females appear to respond differently?
Can we model both in a single regression?

51
52. Proposed population regression model:

yi = β0 + βhrs xhrs,i + βfem xfem,i + βint xint,i + εi
   = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi

Interaction Example: Internet, Computer games and Fatigue
52
53. Proposed population regression model:

yi = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi

males (xfem = 0):   yi = β0 + βhrs xhrs,i + εi   (intercept β0, slope βhrs)
females (xfem = 1): yi = (β0 + βfem) + (βhrs + βint) xhrs,i + εi   (slope βhrs + βint)

Interpretation:
βfem: difference in intercept between females and males; the avg. difference in fatigue index between females and males when hrs = 0.
βint: difference in slopes between females and males; the avg. difference between females and males in the impact of 1 hr of computer games on the fatigue index.

Interaction Example: Internet, Computer games and Fatigue
53
54. Regression Statistics
Multiple R          0.9680
R Square            0.9369
Adjusted R Square   0.9251
Standard Error      7.5872
Observations        20

ANOVA
            df      SS             MS         F          Significance F
Regression   3   13682.21831    4560.739   79.22681   8.12E-10
Residual    16     921.0496881    57.56561
Total       19   14603.268

                                  Coefficients  Standard Error   t Stat    P-value   Lower 95%  Upper 95%
Intercept                           -29.644        7.5663        -3.9180   0.0012    -45.6844   -13.6046
Hrs spent on Internet and Games       4.738        0.3464        13.6774   0.0000
Female                               50.748       10.6102         4.7829   0.0002     28.2550    73.2404
Hr*Female                            -3.233        0.5137        -6.2942   0.0000     -4.3221    -2.1442

a) Sample regression line: ŷ = −29.644 + 4.738 xhrs + 50.748 xfem − 3.233 (xhrs × xfem)
b) R² = 0.9369, Adj. R² = 0.9251
c) Standard error of the estimate (s) = 7.5872

Interaction Example: Internet, Computer games and Fatigue
54
55. d) Sample regression line: ŷ = −29.644 + 4.74 xhrs + 50.75 xfem − 3.23 (xhrs × xfem)

Sample regression line for male students:   ŷ = −29.644 + 4.74 xhrs   (intercept −29.644, slope 4.74)
Sample regression line for female students: ŷ = 21.11 + 1.51 xhrs     (intercept 21.11, slope 1.51)

Interpretation:
b0: intercept for males
bhrs: slope for males
bfem: intercept difference between females and males
bint: slope difference between females and males

Interaction Example: Internet, Computer games and Fatigue
55
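The group-specific lines can be derived directly from the interaction model's coefficients. A sketch using the regression output above (the slides' 21.11 and 1.51 come from more heavily rounded coefficients):

```python
# Coefficients from the fatigue regression output
b0, b_hrs, b_fem, b_int = -29.644, 4.738, 50.748, -3.233

def fatigue_hat(hrs, female):
    return b0 + b_hrs * hrs + b_fem * female + b_int * hrs * female

# Setting female = 1 collapses the model to an intercept and a slope:
fem_intercept = b0 + b_fem   # ≈ 21.10
fem_slope = b_hrs + b_int    # ≈ 1.51
print(round(fem_intercept, 2), round(fem_slope, 3))
```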
56. Key learning points
Estimating Multiple Regression
Assessing a Multiple Regression Model
Stepwise Model Building
Optional Material:
Testing significance of regression coefficients
Modeling nonlinear relations
Encoding Categorical Variables
Modeling Interaction terms
56