© Faculty of Management
Regression
LESSON 2: Multiple Regression
In most applications, we need more than one variable to explain
the variation of the dependent variable.
Examples
 Dependent variable: Profitability of a Starbucks store. Independent variables: population, college, income levels, businesses nearby, competition, …
 Dependent variable: Montreal house prices. Independent variables: area, age, # bedrooms, …
Q1: Which model should we hypothesize?
Q2: Which variables should we include in the model?
Multiple regression
2
Business Reality
Hypothesized
relationships
Estimation and
Analysis
Need for location analysis Profit = β0 + β1XPop + β2XCollege…
Y = b0 + b1Xpop + b2XCollege …
Add or drop variables
Change the model
 Iterative process by nature… (adding or dropping independent variables)
 Independent variables may interact with each other and form a new
independent variable called interaction variable
 May include categorical variables
Once again, use the process!
Interpretation
3
Estimated multiple regression model:
ŷi = b0 + b1x1i + b2x2i + ⋯ + bpxpi
where ŷi is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bp are the estimated slope coefficients.
 All explanatory variables are statistically independent
 Homoskedasticity and normality: εi is independent of x and follows N(0, σ) for all x.
 No autocorrelation: Cov(εi, εj) = 0
Multiple regression
Population model:
yi = β0 + β1x1i + β2x2i + ⋯ + βpxpi + εi
where β0 is the Y-intercept, β1, …, βp are the population slopes, and εi is the random error.
4
Estimating Multiple
Regression
Example: Two variable model
y
x1
x2
i
i
i x
b
x
b
b
y 2
2
1
1
0
ˆ +
+
=
yi
ŷi
ei = (yi – ŷi)
x2i
x1i
The best fit equation, ŷ,
is found by minimizing the
sum of squared errors, e2
Sample
observation
estimate
Estimated regression line
i
i
i
i x
x
y 


 +
+
+
= 2
2
1
1
0
6
Want to explain the house price (Y) with 3 independent variables
Xsq. ft: the size of the house (in sq. ft)
Xage: the age of the house (in yrs.)
Xbed: the number of bedrooms in the house
Population regression model:
Yi = β0 + β1Xsqft,i + β2Xage,i + β3Xbed,i + εi
Sample regression line:
Ŷi = b0 + b1Xsqft,i + b2Xage,i + b3Xbed,i
Tools/Data Analysis/Regression
Montreal house price example: Part II
7
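The least-squares fit that Tools/Data Analysis/Regression performs can be sketched outside Excel as well. A minimal Python illustration on made-up data (the ranges and "true" coefficients below are invented for the sketch, not the course dataset):

```python
import numpy as np

# Invented data roughly shaped like the house example: price explained by
# square footage, age, and bedrooms (these numbers are NOT the course data).
rng = np.random.default_rng(0)
n = 18
X = np.column_stack([
    rng.uniform(1000, 4000, n),   # x_sqft
    rng.uniform(0, 40, n),        # x_age
    rng.integers(1, 6, n),        # x_bed
])
true_b = np.array([2000.0, 1.2, -60.0, 50.0])   # assumed "true" model
y = true_b[0] + X @ true_b[1:] + rng.normal(0, 300, n)

# Sample regression line: solve min ||y - A b||^2 with an intercept column.
A = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ b
```

With an intercept in the model the residuals sum to zero, which is a quick sanity check on the fit.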
Tools/Data Analysis/Regression
Range of Y (include label)
Range of Xj’s (include label)
Check if you have labels
output range
output options
Montreal house price example: Part II
8
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9751
R Square 0.9509
Adjusted R Square 0.9403
Standard Error 396.7315
Observations 18
ANOVA
df SS MS F Significance F
Regression 3 42643276 14214425 90.31003 2.1252E-09
Residual 14 2203542.1 157395.9
Total 17 44846818
Coefficients
Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 2036.48 1164.9000 1.7482 0.1023 -461.9861 4534.9379
Sq. Ft 1.15 0.1793 6.4259 0.0000 0.7677 1.5369
Age -58.70 25.4548 -2.3060 0.0369 -113.2931 -4.1029
Bedrooms 54.36 197.9985 0.2746 0.7877 -370.3003 479.0290
Regression summary statistics
ANOVA: overall significance of the model
bj’s (b0 , b1, b2, b3)
Montreal house price example: Regression results
9
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9751
R Square 0.9509
Adjusted R Square 0.9403
Standard Error 396.7315
Observations 18
ANOVA
df SS MS F Significance F
Regression 3 42643276 14214425 90.31003 2.1252E-09
Residual 14 2203542.1 157395.9
Total 17 44846818
Coefficients
Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 2036.48 1164.9000 1.7482 0.1023 -461.9861 4534.9379
Sq. Ft 1.15 0.1793 6.4259 0.0000 0.7677 1.5369
Age -58.70 25.4548 -2.3060 0.0369 -113.2931 -4.1029
Bedrooms 54.36 197.9985 0.2746 0.7877 -370.3003 479.0290
Regression summary statistics
ANOVA: overall significance of the model
bj’s (b0 , b1, b2, b3)
Estimated multiple regression model
ŷ = 2036.48 + 1.15 Xsqft – 58.70 Xage + 54.36 Xbed
Montreal house price example: Regression results
10
Estimated Multiple Regression Model

ŷ = 2036.48 + 1.15 xsq.ft – 58.7 xage + 54.36 xbed

b0 (intercept): the estimated avg. y value when all ind. variables are zero, or the estimated avg. y value not explained by the model.
b1 (slope of xsq.ft): the estimated avg. change in house price (in ’00 $) per sq. ft when xage and xbed remain constant.
b2 (slope of xage): if the house is 1 yr older and nothing else changes, its price decreases by 58.7 ($’00s), or $5,870.
b3 (slope of xbed): if the house has 1 more bedroom and nothing else changes, its price increases by 54.36 ($’00s), or $5,436.
Montreal house example: Estimated regression
11
Confidence and
Prediction Intervals
Sample regression line: ŷ = b0 + b1xsq.ft + b2xage + b3xbed
ŷ is a point estimate for the pop. regression line.

Confidence Interval for the avg. y value at given x=(x1, x2,…,xp)
(CI for the population regression line at x=(x1, x2,…,xp))
Prediction Interval for a particular y value at given x=(x1, x2,…,xp)

If xsq.ft = 2500, xage = 5 yrs and xbed = 4:
95% CI provides: interval for the price of avg 2500 sqft, 5 yr old, 4 bed houses.
95% PI provides: interval for the price of a specific 2500 sqft, 5 yr old, 4 bed house.
Estimation of mean value (y|x) and prediction for an individual value (yi)
13
Point estimate: ŷx = b0 + b1xsq.ft + b2xage + b3xbed
Critical value: tα/2 with df = n − p − 1
The standard error of the estimate: s

Approximated 100(1−α)% CI: ŷx ± tα/2 · s/√n
Approximated 100(1−α)% PI: ŷx ± tα/2 · s·√(1 + 1/n)
MTL House example II: CI and PI for the y value at x=(x1, x2,…,xp)
14
Sample regression line: ŷ = 2036.48 + 1.15xsq.ft − 58.7xage + 54.36xbed

Example: Construct a 95% CI for the avg. price of 5-year old, 4 bedroom houses with 2500 sq.ft. area.
Point estimate: 2036.48 + 1.15*2500 − 58.7*5 + 54.36*4 = 4835.42 ($’00)
Critical value: tα/2 with df = 14, crit val = 2.1448
Approximated 95% CI: ŷx ± tα/2 · s/√n
= 4,835.42 ± 2.1448 * 396.7315 / √18
= [4,634.86, 5,035.98] ($’00)
MTL House example II: CI for the avg. y value at x=(x1, x2,…,xp)
15
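The CI arithmetic above can be checked in a few lines. A sketch using the slide's own figures, with the t-table value hard-coded (all amounts in $'00):

```python
from math import sqrt

# 95% CI for the average price of 2500 sq.ft, 5-year-old, 4-bedroom houses.
b0, b1, b2, b3 = 2036.48, 1.15, -58.70, 54.36
y_hat = b0 + b1 * 2500 + b2 * 5 + b3 * 4   # point estimate = 4835.42
s, n = 396.7315, 18                        # std. error of the estimate, sample size
t_crit = 2.1448                            # t_{0.025}, df = n - p - 1 = 14 (table value)
margin = t_crit * s / sqrt(n)              # CI half-width
ci = (y_hat - margin, y_hat + margin)      # approx [4634.86, 5035.98]
```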
Sample regression line: ŷ = 2036.48 + 1.15xsq.ft − 58.7xage + 54.36xbed

Construct a 95% PI for the price of a 5-year old, 4 bedroom house with 2500 sq. ft. area.
Point estimate: 2036.48 + 1.15*2500 − 58.7*5 + 54.36*4 = 4835.42 ($’00)
Critical value: tα/2 with df = 14, critical value = 2.1448
Approximated 95% PI: ŷx ± tα/2 · s·√(1 + 1/n)
= 4,835.42 ± 2.1448 * 396.7315 * √(1+1/18)
= 4,835.42 ± 2.1448 * 407.60
= [3,961.20, 5,709.64] ($’00)
MTL house example II: PI for a predicted y value at xp=(x1p, x2p,…,xkp)
16
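The PI is the same computation with √(1 + 1/n) in place of 1/√n, which is why it is wider. A sketch with the slide's figures (t-table value hard-coded, amounts in $'00):

```python
from math import sqrt

# 95% PI for the price of one specific 2500 sq.ft, 5-year-old, 4-bedroom house.
y_hat = 2036.48 + 1.15 * 2500 - 58.70 * 5 + 54.36 * 4   # point estimate = 4835.42
s, n = 396.7315, 18
t_crit = 2.1448                              # t_{0.025}, df = 14 (table value)
margin = t_crit * s * sqrt(1 + 1 / n)        # PI half-width, wider than the CI's
pi = (y_hat - margin, y_hat + margin)        # approx [3961.20, 5709.64]
```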
Evaluating Multiple
Regression Model
R2: fraction of the total variation in Y about its mean (SST) that is
explained by the regression (SSR)
Hypothesized population regression model:
y = β0 + β1xsq.ft + β2xage + β3xbed + ε
R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
= 42,643,276 / 44,846,818 = 1 − 2,203,542.11 / 44,846,818.44 = 0.9509
Evaluating the overall regression model: Coefficient of Determination (R2)
18
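The two R² identities can be checked directly against the ANOVA table's sums of squares:

```python
# Coefficient of determination from the regression output's sums of squares.
ssr = 42_643_276.0        # regression sum of squares
sse = 2_203_542.11        # residual (error) sum of squares
sst = 44_846_818.44       # total sum of squares
r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst   # same value: SST = SSR + SSE (up to rounding)
```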
 R2 will always increase when you add more independent
variables (even if they are not significant.)
 It is a mistake to use R2 in isolation as a criterion for choosing
among competing models.
Evaluating the overall regression model: R2 and adjusted R2
Adjusted R²:
R̄² = 1 − (SSE/(n−p−1)) / (SST/(n−1)) = 1 − MSE/MST = 1 − (1 − R²)(n−1)/(n−p−1)
 It is better to compare models via “Adjusted R2” rather than R2, but
neither is sufficient in isolation.
 Use logical judgment and consider other factors (residuals)
Adjusted R2: takes into account the cost of adding independent variables and
increased explanatory power.
Adjusted R2 declines if we add a non-significant variable.
19
Population regression model of Montreal house prices:
y = β0 + β1xsq.ft + β2xage + β3xbed + ε
R̄² = 1 − (SSE/(n−p−1)) / (SST/(n−1)) = 1 − MSE/MST
R̄² = 1 − (2,203,542/14) / (44,846,818/17) = 1 − 157,395.86/2,638,048.14 = 0.9403
Evaluating the overall regression model: Adjusted-R2
20
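The adjusted R² computation, done both ways (the MSE/MST ratio and the (1 − R²) identity agree algebraically):

```python
# Adjusted R^2 for the three-variable Montreal model.
sse, sst = 2_203_542.0, 44_846_818.0
n, p = 18, 3
mse = sse / (n - p - 1)        # 2,203,542 / 14
mst = sst / (n - 1)            # 44,846,818 / 17
adj_r2 = 1 - mse / mst

# Equivalent identity written in terms of R^2 itself:
r2 = 1 - sse / sst
adj_r2_alt = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```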
New population regression model:
y = β0 + β1xsq.ft + β2xage + ε
Adjusted R²:
R̄² = 1 − (2,215,407.93/15) / (44,846,818.44/17) = 1 − 147,693.86/2,638,048.14 = 0.9440
If we evaluate the two models using adjusted R² in isolation, which model is better?
Second model is slightly better
Evaluating the overall regression model: Adjusted-R2
21
Adjusted R²: R̄² = 1 − (SSE/(n−p−1)) / (SST/(n−1)) = 1 − MSE/MST
Building a Multiple
Regression Model
Stepwise Model
Building
 Compare adjusted R2, std. error of the estimates, t-test results
Business Reality
Hypothesized
relationships
Estimation and
Analysis
Interpretation
Add or drop variables
Change the model
Choosing the right model
24
 Step 1: Choose the independent variable that results in the largest value of adjusted R² when included in the model.
 Step 2: Continue adding new independent variables one at a time, as long as the added variable increases adjusted R².
 Step 3: Stop adding variables when no remaining independent variable increases adjusted R².
Stepwise Model Building (based on adjusted 𝑅2)
25
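The three steps can be sketched as a small search loop. Here `fit_adj_r2` is a hypothetical stand-in for actually refitting the regression on a subset of columns; it simply looks up the adjusted R² values reported for the Montreal dataset on the following slides:

```python
# Forward stepwise selection driven by adjusted R^2.
ADJ_R2 = {
    ("X1",): 0.9214, ("X2",): 0.7917, ("X3",): 0.3816,
    ("X1", "X2"): 0.9440, ("X1", "X3"): 0.9231,
    ("X1", "X2", "X3"): 0.9403,
}

def fit_adj_r2(subset):
    # Stand-in for refitting the regression on these columns.
    return ADJ_R2[tuple(sorted(subset))]

def forward_stepwise(candidates):
    chosen, best = [], float("-inf")
    while True:
        # Try each remaining variable; keep the one with the best score.
        trials = [(fit_adj_r2(chosen + [v]), v)
                  for v in candidates if v not in chosen]
        if not trials:
            break
        score, var = max(trials)
        if score <= best:      # no variable increases adjusted R^2 -> stop
            break
        chosen.append(var)
        best = score
    return chosen, best

model, score = forward_stepwise(["X1", "X2", "X3"])
```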
 Step 1: Initially, we have three independent variables:
 𝑋1: Area (Square-ft)
 𝑋2: Age (Years)
 𝑋3: Number of Bedrooms
Question: Which independent variable should be initially selected?
Answer: Adding X1 results in the largest value of adjusted R², which is 0.9214. So, we start by adding X1.
Example: Montreal House Dataset
26
Independent Variable Adjusted R²
𝑋1 0.9214
𝑋2 0.7917
𝑋3 0.3816
 Step 2:
We have two remaining independent variables:
 𝑋2: Age (Years)
 𝑋3: Number of Bedrooms
Question: Which one results in an increase in adjusted R²?
Answer: Adding either X2 or X3 results in an increase in adjusted R².
Question: Which one results in the larger increase in adjusted R²?
Answer: Adding X2 results in a larger increase in adjusted R² than adding X3 does. So, we add X2.
So, we add 𝑋2.
Example: Montreal House Dataset
27
Independent Variable Adjusted R²
𝑋1 0.9214
𝑋1, 𝑋2 0.9440
𝑋1, 𝑋3 0.9231
 Step 3: We have only one remaining independent variable:
 X3: Number of Bedrooms
Question: Does adding X3 result in an increase in adjusted R²?
Answer: No, adding X3 (i.e., number of bedrooms) does not result in an increase in adjusted R². So, we don’t add X3 and stop.
Our final regression model contains only two independent variables: X1 and X2.
Example: Montreal House Dataset
28
Independent Variable Adjusted R²
𝑋1, 𝑋2 0.9440
𝑋1, 𝑋2, 𝑋3 0.9403
Estimated Regression Model:
Ŷ = 2274.085 + 1.155 X1 − 61.533 X2
Example: Montreal House Dataset
29
Suppose that we want to predict the price of a house with X1 = 3000 sq ft and X2 = 10 years of age. Construct a 95% Prediction Interval.
Model: Ŷ = 2274.085 + 1.155 X1 − 61.533 X2
Step 1: Use the model above to find the predicted price:
Ŷx = 2274.085 + 1.155 × 3000 − 61.533 × 10 ≅ 5123.755
Example: Constructing Prediction Intervals
30
Step 2: Next, calculate the Margin of Error for the 95% prediction interval:
Margin of Error = sε × √(1 + 1/n) × tα/2
Standard deviation of residuals: sε = 384.31
df (degrees of freedom) = 18 − 2 − 1 = 15. Critical value: t(0.025, df = 15) = 2.131 (read from the t-table)
Margin of Error = 384.31 × √(1 + 1/18) × 2.131 ≅ 841.406
Example: Constructing Prediction Intervals
31
Suppose that we want to predict the price of a house with X1 = 3000 sq ft and X2 = 10 years of age. Construct a 95% Prediction Interval.
Step 3: Calculate the 95% LPI and UPI:
LPI = Ŷx − ME = 5123.755 − 841.406 ≅ 4282.349
UPI = Ŷx + ME = 5123.755 + 841.406 ≅ 5965.161
Example: Constructing Prediction Intervals
32
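The three steps of this worked prediction interval can be reproduced in a few lines (t-table value hard-coded, amounts in $'00):

```python
from math import sqrt

# 95% PI for a 3000 sq.ft, 10-year-old house under the two-variable model.
b0, b1, b2 = 2274.085, 1.155, -61.533
y_hat = b0 + b1 * 3000 + b2 * 10          # Step 1: predicted price ~ 5123.755
s_eps, n = 384.31, 18                     # residual std. dev., sample size
t_crit = 2.131                            # t_{0.025}, df = 18 - 2 - 1 = 15
me = s_eps * sqrt(1 + 1 / n) * t_crit     # Step 2: margin of error ~ 841.406
lpi, upi = y_hat - me, y_hat + me         # Step 3: ~ [4282.349, 5965.161]
```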
Optional Materials
Testing significance of
regression coefficients
Coefficients Standard Error t Stat P-value
Intercept 2036.48 1164.9000 1.7482 0.1023
Sq. Ft 1.15 0.1793 6.4259 0.0000
Age -58.70 25.4548 -2.3060 0.0369
Bedrooms 54.36 197.9985 0.2746 0.7877
100(1−α)% CI for βi:
Point Estimate ± (Crit. value)(Std. Err): bi ± tα/2 sbi, where df = n − p − 1
95% CI for β1 (sq. ft): 1.15 ± 2.1448 * 0.1793 = [0.7654, 1.5346] (in $00’s)
Montreal house example: Estimating population parameters: CI’s
35
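The interval above can be verified in two lines, using the slide's table value for t:

```python
# 95% CI for the Sq. Ft slope from the Excel output's coefficient row.
b1, se_b1 = 1.15, 0.1793
t_crit = 2.1448                    # t_{0.025}, df = n - p - 1 = 14
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # ~ [0.7654, 1.5346]
```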
Coefficients Standard Error t Stat P-value
Intercept 2036.48 1164.9000 1.7482 0.1023
Sq. Ft 1.15 0.1793 6.4259 0.0000
Age -58.70 25.4548 -2.3060 0.0369
Bedrooms 54.36 197.9985 0.2746 0.7877
100(1−α)% CI for βi:
Point Estimate ± (Crit. value)(Std. Err): bi ± tα/2 sbi, where df = n − p − 1
90% CI for β3 (bed): 54.36 ± 1.7613 * 198.00 = [−294.38, 403.10] (in $00’s)
Estimating population regression parameters: CI’s
36
Coefficients Standard Error t Stat P-value
Intercept 2036.48 1164.9000 1.7482 0.1023
Sq. Ft 1.15 0.1793 6.4259 0.0000
Age -58.70 25.4548 -2.3060 0.0369
Bedrooms 54.36 197.9985 0.2746 0.7877
H0: βi = 0
HA: βi ≠ 0 (Excel default test)
Test H0: β1 = 0 vs. HA: β1 ≠ 0 at α = 0.05
t-stat: t = (b1 − 0)/sb1 = 6.4259
df = n − p − 1 = 18 − 3 − 1 = 14
p-value: ≈ 0
Verdict: Reject H0. Conclude that the coefficient of square footage is not equal to zero.
Test for regression parameter (βi): Excel default test
37
Coefficients Standard Error t Stat P-value
Intercept 2036.48 1164.9000 1.7482 0.1023
Sq. Ft 1.15 0.1793 6.4259 0.0000
Age -58.70 25.4548 -2.3060 0.0369
Bedrooms 54.36 197.9985 0.2746 0.7877
c) Test H0: β3 = 0 vs. HA: β3 ≠ 0 at α = 0.05
t-stat: t = (b3 − 0)/sb3 = 0.2746
p-value: = T.DIST.2T(0.2746, 14) = 0.7877
df = n − p − 1 = 14
Critical t = T.INV(0.975, 14) = 2.145
Verdict: Fail to reject the null hypothesis that β3 = 0.
In this model, does the # of bedrooms help in explaining house prices effectively? No!
Test for regression parameter (βi): What if βi is not statistically significant?
38
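A sketch of the same default test, using the reported coefficient and standard error for Bedrooms:

```python
# Excel-style default test H0: beta_bed = 0 vs HA: beta_bed != 0.
b3, se_b3 = 54.36, 197.9985
t_stat = (b3 - 0) / se_b3            # ~ 0.2746
t_crit = 2.145                       # T.INV(0.975, 14), table value
reject_h0 = abs(t_stat) > t_crit     # False: bedrooms not significant here
```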
(Residual plots: Sq. Ft Residual Plot, residuals vs. Sq. Ft; Age Residual Plot, residuals vs. Age)
Any violations of Regression assumptions?
- homoskedasticity:
- normality:
- no correlation among residuals:
Montreal house: Residual analysis
39
Modeling Nonlinear
Relations
Note that price seems to depend on age in nonlinear fashion. How can we model this nonlinear
relation between Price and Age?
Nonlinear relation between Dependent and Independent Variables
41
Nonlinear relation between Dependent and Independent Variables
42
Solution: We can add a nonlinear term to the regression model.
Revised Model:
Price = b0 + b1 × Age + b2 × Age²
Nonlinear relation between Dependent and Independent Variables
43
Question: How can we estimate this by using Excel?
Solution: We can calculate the square of the age variable for each house and add Age² as a new independent variable. The revised model will be as follows:
Price = b0 + b1 × Age + b2 × Age²
Nonlinear relation between Dependent and Independent Variables
44
Estimated Model:
Price = 8374.55 − 414.633 × Age + 7.578 × Age²
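Such a curve can be recovered with a plain least-squares fit once Age² is added as a column. A Python sketch on synthetic data generated from the slide's estimated curve plus noise (an illustration, not the course dataset):

```python
import numpy as np

# Synthetic data drawn from the slide's estimated curve plus noise.
rng = np.random.default_rng(1)
age = rng.uniform(0, 40, 30)
price = 8374.55 - 414.633 * age + 7.578 * age**2 + rng.normal(0, 100, 30)

# Add Age^2 as a new column, exactly as one would in a spreadsheet.
A = np.column_stack([np.ones_like(age), age, age**2])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
b0_hat, b1_hat, b2_hat = coef
```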
Encoding
categorical variables
Categorical Variables
Categories: Labels, Flavors, or Colors
Single numerical scale doesn’t make sense
Q: How do we assign numerical values to categorical data?
Use dummy or indicator variables.
Dummy variable (a.k.a. Indicator, Binary, Categorical, 0-1)
 A 0-1 variable that indicates the presence or absence of an attribute
 Why is it called “dummy”? Numeric value indicates whether the obs. has
the attribute or not.
Example: Coffee drinker vs. not a coffee drinker
ICoffee = 1 if coffee drinker, 0 if not coffee drinker
Example: Male vs. Female
IFemale = 1 if Female, 0 if Male
Categorical data and dummy variables
46
Q: How many dummy variables do we need to model “n” mutually exclusive
and collectively exhaustive categories?
For 2 categories we need one (1) dummy variable. Why? The “base” case is represented by I = 0.
Gender has two categories (n = 2): choose one (e.g. male) as the base case, then define a “0 or 1” dummy variable for the other (e.g. IFemale).
In general, Ij = 1 if the obs. belongs to category j, 0 otherwise, for j = 1, 2, …, n−1.
Answer: n−1 dummy variables
Categorical data and dummy variables: How many variables do we need?
47
Example
Categories: First class, Business class, Economy class. Base case: Economy.
IF = 1 for a first-class passenger, 0 otherwise
IB = 1 for a business-class passenger, 0 otherwise
a) How do we model that the i-th passenger is a first-class passenger? IF,i = 1 and IB,i = 0
b) How do we model that the i-th passenger is a business-class passenger? IF,i = 0 and IB,i = 1
c) How do we model that the i-th passenger is an economy passenger? IF,i = 0 and IB,i = 0
Categorical data and dummy variables: How many variables do we need?
48
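The n−1 rule for the three ticket classes can be sketched directly:

```python
# Three categories -> two dummies; "Economy" is the omitted base case.
def encode(travel_class):
    # Returns (I_F, I_B): first-class and business-class indicators.
    return (1 if travel_class == "First" else 0,
            1 if travel_class == "Business" else 0)

rows = [encode(c) for c in ["First", "Business", "Economy"]]
```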
Modeling Interaction
between Independent
Variables
Interaction: the effect of one explanatory variable on Y depends on the
value of other explanatory variable(s).
Example: Sales Volume of Dell PC’s (Y); Price (XPrice); Advertisement (XAdv)
Q: When does advertisement have a large effect on sales volume? (When price is high, or when price is low?)
Interaction term: Price * Advertising
If there is an (omitted) interaction term, our assumption that all explanatory variables are independent is violated.
Trade off: Technical assumptions vs. modeling business reality
Interactions among explanatory variables
50
Interaction Example: Internet, Computer games and Fatigue
(Scatter plot: fatigue index, 0–100 scale, vs. hrs spent, with separate Male and Female series)
Y: Fatigue index, on 0-100 scale
X: Hrs spent on the internet and computer games, per week
Do males and females appear to respond differently?
Can we model both in a single regression?
51
Proposed population regression model:
Interaction Example: Internet, Computer games and Fatigue
yi = β0 + βhrs xhrs,i + βfem xfem,i + βint xint,i + εi
   = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi
where the interaction variable is xint,i = xhrs,i × xfem,i.
52
Proposed population regression model:
yi = β0 + βhrs xhrs,i + βfem xfem,i + βint (xhrs,i × xfem,i) + εi
males: fatigue = β0 + βhrs xhrs,i + εi (intercept β0, slope βhrs)
females: fatigue = (β0 + βfem) + (βhrs + βint) xhrs,i + εi (intercept β0 + βfem, slope βhrs + βint)
Interpretation
βfem: difference in intercept between females and males, i.e., the avg diff in fatigue index between females and males when hrs = 0.
βint: difference in slopes between females and males, i.e., the avg diff between females and males in the “impact of 1 hr of computer games on fatigue index”.
Interaction Example: Internet, Computer games and Fatigue
53
Regression Statistics
Multiple R 0.9680
R Square 0.9369
Adjusted R Square 0.9251
Standard Error 7.5872
Observations 20
ANOVA
df SS MS F Significance F
Regression 3 13682.21831 4560.739 79.22681 8.12E-10
Residual 16 921.0496881 57.56561
Total 19 14603.268
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -29.644 7.5663 -3.9180 0.0012 -45.6844 -13.6046
Hrs spent on Internet and Games 4.738 0.3464 13.6774 0.0000
Female 50.748 10.6102 4.7829 0.0002 28.2550 73.2404
Hr*Female -3.233 0.5137 -6.2942 0.0000 -4.3221 -2.1442
a) Sample regression line: ŷ = −29.644 + 4.738 xhrs + 50.748 xfem − 3.233 xhrs xfem
b) R2 = .9369, Adj. R2 = .9251
c) Standard error estimate (s) = 7.5872
Interaction Example: Internet, Computer games and Fatigue
54
d) Sample regression line: ŷ = −29.644 + 4.74 xhrs + 50.75 xfem − 3.23 xhrs xfem
Interpretation
b0 = −29.644: intercept for males
bhrs = 4.74: slope for males
bfem = 50.75: intercept difference btw females and males
bint = −3.23: slope difference btw females and males
Sample regression line for male students: ŷ = −29.644 + 4.74 xhrs (slope = 4.74)
Sample regression line for female students: ŷ = 21.11 + 1.51 xhrs (slope = 1.51)
Interaction Example: Internet, Computer games and Fatigue
55
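The group-specific lines follow mechanically from the fitted coefficients:

```python
# Male and female lines implied by the interaction model
# yhat = b0 + b_hrs*x_hrs + b_fem*x_fem + b_int*(x_hrs * x_fem).
b0, b_hrs, b_fem, b_int = -29.644, 4.738, 50.748, -3.233

male_intercept, male_slope = b0, b_hrs   # males: x_fem = 0
female_intercept = b0 + b_fem            # ~ 21.11
female_slope = b_hrs + b_int             # ~ 1.51
```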
Key learning points
 Estimating Multiple Regression
 Assessing a Multiple Regression Model
 Stepwise Model Building
 Optional Materials:
 Testing significance of regression coefficients
 Modeling nonlinear relations
 Encoding Categorical Variables
 Modeling Interaction terms
56
Module 3 Course Slides Lesson 2 McGill University

  • 1. © Faculty of Management Regression LESSON 2: Multiple Regression
  • 2. In most applications, we need more than one variable to explain the variation of the dependent variable. Examples Dependent variable Independent variables Profitability of a Starbucks store Montreal house prices Q1: Which model should we hypothesize? Q2: Which variables should we include in the model? Population, college, income levels, businesses nearby, competition,… Area, age, # bedrooms, … Multiple regression 2
  • 3. Business Reality Hypothesized relationships Estimation and Analysis Need for location analysis Profit= 0 + 1XPop + 2XCollege… Y = b0 + b1Xpop + b2XCollege … Add or drop variables Change the model  Iterative process by nature… (adding or dropping independent variable)  Independent variables may interact with each other and form a new independent variable called interaction variable  May include categorical variables Once again, use the process! Interpretation 3
  • 4. ො yi = b0 + b1x1i + b2x2i + ⋯ + bkxpi Estimated (or predicted) value of y Estimated slope coefficients Estimated multiple regression model: Estimated intercept  All explanatory variables are statistically independent  Homoskedasticity and normality: i is independent of x and follows N(0, ) for all x.  No autocorrelation: Cov(i, j)=0 Multiple regression yi = β0 + β1x1i + β2x2i + ⋯ + βkxpi + εi Y-intercept Population model: Population slopes Random Error 4
  • 5. © Faculty of Management Estimating Multiple Regression
  • 6. Example: Two variable model y x1 x2 i i i x b x b b y 2 2 1 1 0 ˆ + + = yi ŷi ei = (yi – ŷi) x2i x1i The best fit equation, ŷ, is found by minimizing the sum of squared errors, e2 Sample observation estimate Estimated regression line i i i i x x y     + + + = 2 2 1 1 0 6
  • 7. Want to explain the house price (Y) with 3 independent variables Xsq. ft: the size of the house (in sq. ft) Xage: the age of the house (in yrs.) Xbed: the number of bedrooms in the house Population regression model: Sample regression line: Tools/Data Analysis/Regression i i bed i age i sqft i X X X Y      + + + + = , 3 , 2 , 1 0 i bed i age i sqft i X b X b X b b Y , 3 , 2 , 1 0 ˆ + + + = Montreal house price example: Part II 7
  • 8. Tools/Data Analysis/Regression Range of Y (include label) Range of Xj’s (include label) Check if you have labels output range output options Montreal house price example: Part II 8
  • 9. SUMMARY OUTPUT Regression Statistics Multiple R 0.9751 R Square 0.9509 Adjusted R Square 0.9403 Standard Error 396.7315 Observations 18 ANOVA df SS MS F Significance F Regression 3 42643276 14214425 90.31003 2.1252E-09 Residual 14 2203542.1 157395.9 Total 17 44846818 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept 2036.48 1164.9000 1.7482 0.1023 -461.9861 4534.9379 Sq. Ft 1.15 0.1793 6.4259 0.0000 0.7677 1.5369 Age -58.70 25.4548 -2.3060 0.0369 -113.2931 -4.1029 Bedrooms 54.36 197.9985 0.2746 0.7877 -370.3003 479.0290 Regression summary statistics ANOVA: overall significance of the model bj’s (b0 , b1, b2, b3) Montreal house price example: Regression results 9
  • 10. SUMMARY OUTPUT Regression Statistics Multiple R 0.9751 R Square 0.9509 Adjusted R Square 0.9403 Standard Error 396.7315 Observations 18 ANOVA df SS MS F Significance F Regression 3 42643276 14214425 90.31003 2.1252E-09 Residual 14 2203542.1 157395.9 Total 17 44846818 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept 2036.48 1164.9000 1.7482 0.1023 -461.9861 4534.9379 Sq. Ft 1.15 0.1793 6.4259 0.0000 0.7677 1.5369 Age -58.70 25.4548 -2.3060 0.0369 -113.2931 -4.1029 Bedrooms 54.36 197.9985 0.2746 0.7877 -370.3003 479.0290 Regression summary statistics ANOVA: overall significance of the model bj’s (b0 , b1, b2, b3) Estimated multiple regression model = ŷ 2036.48 + 1.15 Xsqft – 58.70 Xage + 54.36 Xbed Ann Arbor house price example: Regression results Montreal house price example: Regression results 10
  • 11. Estimated Multiple Regression Model bed age ft sq x x x y 36 . 54 7 . 58 15 . 1 48 . 2036 ˆ . + − + = b0 (intercept): b1 (slope of xsq ft.): b2 (slope of xage.): b3 (slope of xbed.): If house is 1yr older and nothing else changes, its price decreases by 58.7 ($’00 ‘s), or $5,870. If house has 1 more bedroom and nothing else changes, its price increases by 54.36 ($’00’s), or $5,436. Montreal house example: Estimated regression the estimated avg. y value when all ind. variables are zero or the estimated avg. y value not explained by the model the estimated avg. change in house price (in ’00 $) per sq. ft when xage and xbed remain constant. 11
  • 12. © Faculty of Management Confidence and Prediction Intervals
  • 13. ഥ 𝒚 = 𝑏0 + 𝑏1𝑥𝑠𝑞.𝑓𝑡 + 𝑏2𝑥𝑎𝑔𝑒 + 𝑏3𝑥𝑏𝑒𝑑 Sample regression line Prediction Interval for a particular y value for given x=(x1, x2,…,xp) (Prediction interval for a particular y value at x=(x1, x2,…,xp)) is a point estimate for the pop. regression line. Confidence Interval for the avg. y value for given x=(x1, x2,…,xp) (CI for the population regression line at x=(x1, x2,…,xp)) If xsq ft = 2500, xage = 5 yrs and xbed = 4 95% CI provides: 95% PI provides: Interval for price of avg 2500 sqft, 5 yr old, 4 bed houses Interval for price of a specific 2500 sqft, 5 yr, 4 bed house. Estimation of mean value (y|x) and prediction for an individual value (yi) 13
  • 14. Point estimate: ഥ 𝒚 = 𝑏0 + 𝑏1𝑥𝑠𝑞.𝑓𝑡 + 𝑏2𝑥𝑎𝑔𝑒 + 𝑏3𝑥𝑏𝑒𝑑 Critical value: t/2 with df = n-p-1 n s t y p x 1 2 /    The standard error of the estimate, s Approximated 100(1-)% CI: Approximated 100(1-)% PI: n s t y p x 1 1 2 / +    MTL House example II: CI and PI for the y value at x=(x1, x2,…,xp) 14
  • 15. Sample regression line ഥ 𝒚 = 2036.48 + 1.15𝑥𝑠𝑞.𝑓𝑡 − 58.7𝑥𝑎𝑔𝑒 + 54.36𝑏𝑒𝑑 Example Construct a 95% CI for the avg. price of 5-year old, 4 bedroom houses with 2500 sq.ft. area: Point estimate: Critical value: t/2 with df= n s t y p x 1 2 /    Approximated 95% CI 2036.48 + 1.15*2500 - 58.7*5 + 54.36*4 = 4835.42 ($’00) 14 crit val = 2.1448 4,835.42 ± 2.1448 * 396.7315 / √18 = [4,634.86, 5,035.98] ($’00) MTL House example II: CI for the avg. y value at x=(x1, x2,…,xp) 15
  • 16. Sample regression line ഥ 𝒚 = 2036.48 + 1.15𝑥𝑠𝑞.𝑓𝑡 − 58.7𝑥𝑎𝑔𝑒 + 54.36𝑏𝑒𝑑 Construct a 95% PI for the price of a 5-year old, 4 bedroom house with 2500 sq. ft. area Point estimate: Critical value: t/2 with df= Approximated 95% PI n s t y p x 1 1 2 / +    2036.48 + 1.15*2500 - 58.7*5 + 54.36*4 = 4835.42 ($’00) 14 critical value = 2.1448 4,835.42 ± 2.1448 * 396.7315 * √(1+1/18) = 4,835.42 ± 2.1448* 407.60 = [3,961.20, 5,709.64] ($’00) MTL house example II: PI for a predicted y value at xp=(x1p, x2p,…,xkp) 16
  • 17. © Faculty of Management Evaluating Multiple Regression Model
  • 18. R2: fraction of the total variation in Y about its mean (SST) that is explained by the regression (SSR)      + + + + = bed age ft sq x x x y 3 2 . 1 0 Hypothesized population regression model: 𝑅2 = 𝑆𝑆𝑅 𝑆𝑆𝑇 = 𝑆𝑆𝑇 − 𝑆𝑆𝐸 𝑆𝑆𝑇 = 1 − 𝑆𝑆𝐸 𝑆𝑆𝑇 = 42,643,276 44,846,818 = 1 − 2,203,542.11 44,846,818.44 = 0.9509 Evaluating the overall regression model: Coefficient of Determination (R2) 18
  • 19.  R2 will always increase when you add more independent variables (even if they are not significant.)  It is a mistake to use R2 in isolation as a criterion for choosing among competing models. Adjusted R2 Evaluating the overall regression model: R2 and adjusted R2 = 1 − 𝑆𝑆𝐸 𝑛 − 𝑝 − 1 𝑆𝑆𝑇 𝑛 − 1 = 1 − 𝑀𝑆𝐸 𝑀𝑆𝑇 = 1 − (1 − 𝑅2 ) 𝑛 − 1 𝑛 − 𝑝 − 1  It is better to compare models via “Adjusted R2” rather than R2, but neither is sufficient in isolation.  Use logical judgment and consider other factors (residuals) Adjusted R2: takes into account the cost of adding independent variables and increased explanatory power. Adjusted R2 declines if we add a non-significant variable. 19
  • 20.      + + + + = bed age ft sq x x x y 3 2 . 1 0 Population regression model of Montreal house prices: ሜ 𝑅2 = 1 − 𝑆𝑆𝐸 𝑛 − 𝑝 − 1 𝑆𝑆𝑇 𝑛 − 1 = 1 − 𝑀𝑆𝐸 𝑀𝑆𝑇 ሜ 𝑅2 = 1 − 2,203,542 14 44,846,818 17 = 1 − 157,395.86 2,638,048.14 = 0.9403 Evaluating the overall regression model: Adjusted-R2 20
  • 21. Evaluating the overall regression model: Adjusted R²
New population regression model: y = β0 + β1 x_sq.ft + β2 x_age + ε
Adjusted R² = 1 − (SSE/(n − p − 1)) / (SST/(n − 1)) = 1 − MSE/MST
Adjusted R² = 1 − (2,215,407.93/15) / (44,846,818.44/17) = 1 − 147,693.86/2,638,048.14 = 0.9440
If we evaluate the two models using adjusted R² in isolation, which model is better? The second model is slightly better.
  • 22. Building a Multiple Regression Model
  • 23. Stepwise Model Building
  • 24. Choosing the right model
Business Reality → Hypothesized relationships → Estimation and Analysis → Interpretation (add or drop variables; change the model)
• Compare adjusted R², standard error of the estimate, and t-test results
  • 25. Stepwise Model Building (based on adjusted R²)
• Step 1: Choose the independent variable that results in the largest value of adjusted R² when included in the model.
• Step 2: Continue adding independent variables one at a time, as long as adding the variable increases adjusted R².
• Step 3: Stop adding variables when no remaining independent variable increases adjusted R².
  • 26. Example: Montreal House Dataset
• Step 1: Initially, we have three independent variables:
  • X1: Area (sq. ft)
  • X2: Age (years)
  • X3: Number of bedrooms
Question: Which independent variable should be initially selected?
Answer: Adding X1 results in the largest value of adjusted R², which is 0.9214. So, we start by adding X1.

Independent Variable   Adjusted R²
X1                     0.9214
X2                     0.7917
X3                     0.3816
  • 27. Example: Montreal House Dataset
• Step 2: We have two remaining independent variables:
  • X2: Age (years)
  • X3: Number of bedrooms
Question: Which one results in an increase in adjusted R²? Answer: Adding either X2 or X3 increases adjusted R².
Question: Which one results in the larger increase? Answer: Adding X2 results in a larger increase in adjusted R² than adding X3 does. So, we add X2.

Independent Variable   Adjusted R²
X1                     0.9214
X1, X2                 0.9440
X1, X3                 0.9231
  • 28. Example: Montreal House Dataset
• Step 3: We have only one remaining independent variable:
  • X3: Number of bedrooms
Question: Does adding X3 result in an increase in adjusted R²?
Answer: No, adding X3 (the number of bedrooms) does not increase adjusted R². So, we don't add X3 and stop. Our final regression model contains only two independent variables: X1 and X2.

Independent Variable   Adjusted R²
X1, X2                 0.9440
X1, X2, X3             0.9403
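The three-step greedy procedure above can be sketched as code. This is an illustrative sketch, not the course's tooling: the `stepwise` helper is a name I introduce here, and the lookup table simply stores the adjusted-R² values reported in the slides for the Montreal house example.

```python
# Adjusted R^2 for each candidate variable set, taken from the slides.
ADJ_R2 = {
    frozenset({"X1"}): 0.9214,
    frozenset({"X2"}): 0.7917,
    frozenset({"X3"}): 0.3816,
    frozenset({"X1", "X2"}): 0.9440,
    frozenset({"X1", "X3"}): 0.9231,
    frozenset({"X1", "X2", "X3"}): 0.9403,
}

def stepwise(candidates):
    """Greedy forward selection: add the variable giving the largest
    adjusted R^2; stop when no addition improves it."""
    chosen, best = set(), float("-inf")
    while True:
        trials = [(ADJ_R2[frozenset(chosen | {v})], v)
                  for v in candidates - chosen
                  if frozenset(chosen | {v}) in ADJ_R2]
        if not trials:
            return chosen, best
        score, var = max(trials)       # best single addition this round
        if score <= best:              # no improvement -> stop
            return chosen, best
        chosen.add(var)
        best = score

model, adj = stepwise({"X1", "X2", "X3"})
print(sorted(model), adj)  # ['X1', 'X2'] 0.944
```

Running it reproduces the slides' conclusion: X1 enters first (0.9214), X2 second (0.9440), and X3 is rejected because the full model's 0.9403 is a decrease.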
  • 29. Example: Montreal House Dataset
Estimated Regression Model: ŷ = 2274.085 + 1.155 X1 − 61.533 X2
  • 30. Example: Constructing Prediction Intervals
Suppose that we want to predict the price of a house with X1 = 3000 sq. ft and X2 = 10 years of age. Construct a 95% prediction interval.
Model: ŷ = 2274.085 + 1.155 X1 − 61.533 X2
Step 1: Use the model above to find the predicted price: ŷ_x = 2274.085 + 1.155 × 3000 − 61.533 × 10 ≅ 5123.755
  • 31. Example: Constructing Prediction Intervals
Step 2: Next, calculate the margin of error for the 95% prediction interval:
Margin of Error = s_ε × √(1 + 1/n) × t_α/2
Standard deviation of residuals: s_ε = 384.31
df (degrees of freedom) = 18 − 2 − 1 = 15
Critical value: t(0.025, df = 15) = 2.131 (read from t-table)
Margin of Error = 384.31 × √(1 + 1/18) × 2.131 ≅ 841.406
  • 32. Example: Constructing Prediction Intervals
Suppose that we want to predict the price of a house with X1 = 3000 sq. ft and X2 = 10 years of age. Construct a 95% prediction interval.
Step 3: Calculate the 95% LPI and UPI:
LPI = ŷ_x − ME = 5123.755 − 841.406 ≅ 4282.349
UPI = ŷ_x + ME = 5123.755 + 841.406 ≅ 5965.161
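The three steps above can be chained in a short numerical sketch, using only values stated in the slides (s_ε = 384.31, n = 18, t critical = 2.131 at df = 15):

```python
import math

x1, x2 = 3000, 10  # sq. ft and age of the house to predict

# Step 1: point estimate from the two-variable model
y_hat = 2274.085 + 1.155 * x1 - 61.533 * x2          # 5123.755

# Step 2: margin of error for a 95% prediction interval
me = 384.31 * math.sqrt(1 + 1 / 18) * 2.131          # ≈ 841.41

# Step 3: lower and upper prediction limits
lpi, upi = y_hat - me, y_hat + me                    # ≈ [4282.35, 5965.16]

print(round(y_hat, 3), round(me, 3))
print(round(lpi, 3), round(upi, 3))
```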
  • 33. Optional Materials
  • 34. Testing significance of regression coefficients
  • 35. Montreal house example: Estimating population parameters: CIs

            Coefficients  Standard Error  t Stat   P-value
Intercept   2036.48       1164.9000       1.7482   0.1023
Sq. Ft      1.15          0.1793          6.4259   0.0000
Age         -58.70        25.4548         -2.3060  0.0369
Bedrooms    54.36         197.9985        0.2746   0.7877

100(1 − α)% CI for βi: point estimate ± (critical value)(std. error), i.e., bi ± t_α/2 · s_bi, where df = n − p − 1
95% CI for β1 (sq. ft): 1.15 ± 2.1448 × 0.1793 = [0.7654, 1.5346] (in $'00)
  • 36. Estimating population regression parameters: CIs

            Coefficients  Standard Error  t Stat   P-value
Intercept   2036.48       1164.9000       1.7482   0.1023
Sq. Ft      1.15          0.1793          6.4259   0.0000
Age         -58.70        25.4548         -2.3060  0.0369
Bedrooms    54.36         197.9985        0.2746   0.7877

100(1 − α)% CI for βi: bi ± t_α/2 · s_bi, where df = n − p − 1
90% CI for β3 (bedrooms): 54.36 ± 1.7613 × 198.00 = [−294.38, 403.10] (in $'00)
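The coefficient-CI formula bi ± t_α/2 · s_bi is simple enough to script. A sketch with a hypothetical helper `coef_ci` (a name introduced here), fed the table's values and the critical values quoted in the slides:

```python
def coef_ci(b, se, t_crit):
    """CI for a regression coefficient: b +/- t_crit * standard error."""
    half = t_crit * se
    return (round(b - half, 4), round(b + half, 4))

# 95% CI for beta_1 (sq. ft): t(0.025, df=14) = 2.1448
print(coef_ci(1.15, 0.1793, 2.1448))   # (0.7654, 1.5346)

# 90% CI for beta_3 (bedrooms): t(0.05, df=14) = 1.7613
print(coef_ci(54.36, 198.00, 1.7613))  # (-294.3774, 403.0974)
```

Note that the β3 interval straddles zero, which previews the significance test on the next slides: the data cannot rule out that bedrooms have no effect once size and age are in the model.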
  • 37. Test for regression parameter (βi): Excel default test

            Coefficients  Standard Error  t Stat   P-value
Intercept   2036.48       1164.9000       1.7482   0.1023
Sq. Ft      1.15          0.1793          6.4259   0.0000
Age         -58.70        25.4548         -2.3060  0.0369
Bedrooms    54.36         197.9985        0.2746   0.7877

Default test: H0: βi = 0 vs. HA: βi ≠ 0
Example: H0: β1 = 0 vs. HA: β1 ≠ 0 at α = 0.05
t-stat: t = (b1 − 0)/s_b1 = 6.4259
p-value ≈ 0, with df = n − p − 1 = 18 − 3 − 1 = 14
Verdict: Reject H0. Conclude that the coefficient of square footage is not equal to zero.
  • 38. Test for regression parameter (βi): What if βi is not statistically significant?

            Coefficients  Standard Error  t Stat   P-value
Intercept   2036.48       1164.9000       1.7482   0.1023
Sq. Ft      1.15          0.1793          6.4259   0.0000
Age         -58.70        25.4548         -2.3060  0.0369
Bedrooms    54.36         197.9985        0.2746   0.7877

In this model, does the # of bedrooms help in explaining house prices effectively?
H0: β3 = 0 vs. HA: β3 ≠ 0 at α = 0.05
t-stat: t = (b3 − 0)/s_b3 = 0.2746
p-value = T.DIST.2T(0.2746, 14) = 0.7877, with df = n − p − 1 = 14
Critical t = T.INV(0.975, 14) = 2.145
Verdict: No! Fail to reject the null hypothesis that β3 = 0.
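The default test on each coefficient can be sketched by recomputing t = bi / s_bi from the table and comparing |t| against the critical value 2.145 quoted above. One caveat: because the table's coefficients are rounded, the recomputed t-stats differ slightly from Excel's (e.g., 6.4138 here vs. 6.4259 for Sq. Ft); the verdicts are unchanged.

```python
t_crit = 2.145  # t(0.975, df = 14), from the slide

table = {  # name: (b_i, s_b_i) from the Excel output
    "Sq. Ft":   (1.15, 0.1793),
    "Age":      (-58.70, 25.4548),
    "Bedrooms": (54.36, 197.9985),
}

# t-stat for H0: beta_i = 0 is (b_i - 0) / s_b_i
t_stats = {name: (b - 0) / se for name, (b, se) in table.items()}

for name, t in t_stats.items():
    verdict = "reject H0" if abs(t) > t_crit else "fail to reject H0"
    print(f"{name}: t = {t:.4f} -> {verdict}")
```

Only Bedrooms fails the test, matching the slide's conclusion and the stepwise result earlier in the lesson.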
  • 39. Montreal house: Residual analysis
[Residual plots: residuals vs. Sq. Ft and residuals vs. Age]
Any violations of the regression assumptions?
- Homoskedasticity
- Normality
- No correlation among residuals
  • 40. Modeling Nonlinear Relations
  • 41. Nonlinear relation between Dependent and Independent Variables
Note that price seems to depend on age in a nonlinear fashion. How can we model this nonlinear relation between Price and Age?
  • 42. Nonlinear relation between Dependent and Independent Variables
Solution: We can add a nonlinear term to the regression model.
Revised Model: Price = b0 + b1 × Age + b2 × Age²
  • 43. Nonlinear relation between Dependent and Independent Variables
Question: How can we estimate this using Excel?
Solution: We can calculate the square of the age variable for each house and add Age² as a new independent variable. The revised model will be as follows:
Price = b0 + b1 × Age + b2 × Age²
  • 44. Nonlinear relation between Dependent and Independent Variables
Estimated Model: Price = 8374.55 − 414.633 × Age + 7.578 × Age²
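Outside Excel, the same "add a squared column" trick works in any least-squares tool. A hedged sketch: the ages below are hypothetical illustration data, and the prices are generated (noise-free) from the estimated model above, so the fit simply recovers those coefficients.

```python
import numpy as np

# Hypothetical ages; prices generated from the slide's estimated model
age = np.array([2.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
price = 8374.55 - 414.633 * age + 7.578 * age**2

# Design matrix with an intercept column, Age, and the new Age^2 column
X = np.column_stack([np.ones_like(age), age, age**2])

# Ordinary least squares via lstsq: solves min ||X b - price||^2
b, *_ = np.linalg.lstsq(X, price, rcond=None)
print(np.round(b, 3))  # ≈ [8374.55, -414.633, 7.578]
```

The key point is that the model stays linear in the coefficients b0, b1, b2; only the column Age² is a nonlinear transform of the data, so ordinary multiple regression machinery still applies.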
  • 45. Encoding categorical variables
  • 46. Categorical data and dummy variables
Categorical variables take categories as values: labels, flavors, or colors. A single numerical scale doesn't make sense.
Q: How do we assign numerical values to categorical data? Use dummy or indicator variables.
Dummy variable (a.k.a. indicator, binary, categorical, 0-1 variable):
• A 0-1 variable that indicates the presence or absence of an attribute
• Why is it called a "dummy"? The numeric value only indicates whether the observation has the attribute or not.
Example: coffee drinker vs. not a coffee drinker: I_Coffee = 1 if coffee drinker, 0 if not.
Example: female vs. male: I_Female = 1 if female, 0 if male.
  • 47. Categorical data and dummy variables: How many variables do we need?
For 2 categories, we need 1 dummy variable. Why? The "base" case is represented by I = 0.
Gender has two categories (n = 2): choose one (e.g., male) as the base case, and define a 0-1 dummy variable for the other (e.g., I_Female).
Q: How many dummy variables do we need to model n mutually exclusive and collectively exhaustive categories?
Ij = 1 if the obs. belongs to category j, 0 otherwise, for j = 1, 2, …, (?)
Answer: n − 1 dummy variables.
  • 48. Categorical data and dummy variables: How many variables do we need?
Example categories: first class, business class, economy class. Base case: economy.
I_F = 1 for a first-class passenger, 0 otherwise. I_B = 1 for a business-class passenger, 0 otherwise.
a) How do we model that the i-th passenger is a first-class passenger? I_F,i = 1 and I_B,i = 0
b) How do we model that the i-th passenger is a business-class passenger? I_F,i = 0 and I_B,i = 1
c) How do we model that the i-th passenger is an economy passenger? I_F,i = 0 and I_B,i = 0
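The n − 1 encoding for the cabin-class example can be sketched as a tiny function (`encode_class` is a name introduced here, not from the slides): three categories, economy as the base case, so exactly two indicators.

```python
def encode_class(cabin):
    """Return the dummy pair (I_F, I_B) for one passenger.
    Economy is the base case, encoded as (0, 0)."""
    return (1 if cabin == "first" else 0,
            1 if cabin == "business" else 0)

print(encode_class("first"))     # (1, 0)
print(encode_class("business"))  # (0, 1)
print(encode_class("economy"))   # (0, 0)
```

Using n dummies instead of n − 1 would make the columns sum to the intercept column, a perfect linear dependence (the "dummy variable trap"), which is why the base case is left implicit.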
  • 49. Modeling Interaction between Independent Variables
  • 50. Interactions among explanatory variables
Interaction: the effect of one explanatory variable on Y depends on the value of other explanatory variable(s).
Example: sales volume of Dell PCs (Y), with price (X_Price) and advertisement (X_Adv).
Q: When does advertisement have a large effect on sales volume, when price is high or when price is low?
Interaction term: Price × Advertising.
If there is an (omitted) interaction term, our assumption that all explanatory variables are independent is violated.
Trade-off: technical assumptions vs. modeling business reality.
  • 51. Interaction Example: Internet, Computer Games and Fatigue
[Scatter plot: fatigue vs. hours spent, with separate male and female series]
Y: fatigue index, on a 0-100 scale
X: hrs spent on the internet and computer games, per week
Do males and females appear to respond differently? Can we model both in a single regression?
  • 52. Interaction Example: Internet, Computer Games and Fatigue
Proposed population regression model:
y_i = β0 + β_hrs x_hrs,i + β_fem x_fem,i + β_int (x_hrs,i × x_fem,i) + ε_i
    = β0 + β_hrs x_hrs,i + β_fem x_fem,i + β_int x_int,i + ε_i
  • 53. Interaction Example: Internet, Computer Games and Fatigue
Proposed population regression model:
y_i = β0 + β_hrs x_hrs,i + β_fem x_fem,i + β_int (x_hrs,i × x_fem,i) + ε_i
Males: y_i = β0 + β_hrs x_hrs,i + ε_i (intercept β0, slope β_hrs)
Females: y_i = (β0 + β_fem) + (β_hrs + β_int) x_hrs,i + ε_i (slope β_hrs + β_int)
Interpretation:
β_fem: difference in intercept between females and males, i.e., the avg. difference in fatigue index between females and males when hrs = 0.
β_int: difference in slopes between females and males, i.e., the avg. difference between females and males in the impact of 1 hr of computer games on the fatigue index.
  • 54. Interaction Example: Internet, Computer Games and Fatigue
Regression Statistics: Multiple R = 0.9680; R Square = 0.9369; Adjusted R Square = 0.9251; Standard Error = 7.5872; Observations = 20
ANOVA: Regression df = 3, SS = 13682.218, MS = 4560.739, F = 79.227, Significance F = 8.12E-10; Residual df = 16, SS = 921.050, MS = 57.566; Total df = 19, SS = 14603.268

                                  Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept                         -29.644       7.5663          -3.9180   0.0012   -45.6844   -13.6046
Hrs spent on Internet and Games   4.738         0.3464          13.6774   0.0000
Female                            50.748        10.6102         4.7829    0.0002   28.2550    73.2404
Hr*Female                         -3.233        0.5137          -6.2942   0.0000   -4.3221    -2.1442

a) Sample regression line: ŷ = −29.644 + 4.738 x_hrs + 50.748 x_fem − 3.233 x_hrs x_fem
b) R² = 0.9369; Adj. R² = 0.9251
c) Standard error of the estimate: s = 7.5872
  • 55. Interaction Example: Internet, Computer Games and Fatigue
d) Sample regression line: ŷ = −29.644 + 4.74 x_hrs + 50.75 x_fem − 3.23 x_hrs x_fem
Interpretation:
b0 = −29.644: intercept for males
b_hrs = 4.74: slope for males
b_fem = 50.75: intercept difference between females and males
b_int = −3.23: slope difference between females and males
Sample regression line for male students: ŷ = −29.644 + 4.74 x_hrs (slope 4.74)
Sample regression line for female students: ŷ = 21.11 + 1.51 x_hrs (intercept 21.11, slope 1.51)
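Assuming the fitted coefficients from the Excel output, a short sketch that evaluates the interaction model and recovers the separate male and female lines (function and variable names are mine, for illustration):

```python
# Fitted coefficients from the slides' regression output
b0, b_hrs, b_fem, b_int = -29.644, 4.738, 50.748, -3.233

def fatigue(hrs, female):
    """Predicted fatigue index; the interaction term multiplies
    hours by the female dummy (0 or 1)."""
    return b0 + b_hrs * hrs + b_fem * female + b_int * hrs * female

# Male line: intercept b0, slope b_hrs.
# Female line: intercept b0 + b_fem, slope b_hrs + b_int.
print(round(b0 + b_fem, 3))     # 21.104 (slide rounds to 21.11)
print(round(b_hrs + b_int, 3))  # 1.505  (slide rounds to 1.51)
print(round(fatigue(10, 1), 3))
```

Setting the dummy to 1 shifts both the intercept (by b_fem) and the slope (by b_int), which is exactly how one regression encodes two group-specific lines.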
  • 56. Key learning points
• Estimating Multiple Regression
• Assessing a Multiple Regression Model
• Stepwise Model Building
• Optional Materials:
  • Testing significance of regression coefficients
  • Modeling nonlinear relations
  • Encoding Categorical Variables
  • Modeling Interaction terms