Linear Regression

U N I V E R S I T Y O F S O U T H F L O R I D A //
Linear Regression Concepts
Dr. S. Shivendu

Objectives
01 Identify the mathematical basis of linear regression.
02 Differentiate statistical inferences about relationships based on regression output.
03 Analyze the concepts of p-value, hypothesis testing, and confidence intervals, and their interpretation.

Agenda
Regression Analysis: Introduction
Linear Regression: Concepts
Assumptions: Concepts
Coefficient Confidence Intervals: Concepts
Prediction Confidence Intervals: Concepts

Models
A mathematical model is a mathematical expression of some phenomenon.
Models describe relationships between variables.
Two types of models: deterministic and probabilistic.

Deterministic Models
Hypothesize exact relationships.
Suitable when the relationship is certain and known.
Example: Force is exactly mass times acceleration: $F = m \cdot a$

Probabilistic Models
Suitable when the relationship is not certain and not all factors that affect the outcome are known.
Hypothesize two components: a deterministic component and a random error.
Example: Sales volume (y) is 10 times advertising spending (x) plus random error:
$y = 10x + \varepsilon$
The random error may be due to factors other than advertising.
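The difference between the two kinds of model can be sketched in a few lines of Python. This is only an illustration: the advertising values, the normal error distribution, and the error standard deviation of 5 are assumptions, not values from the slides.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable


def deterministic_sales(x):
    """Deterministic component: sales are exactly 10 times advertising."""
    return 10 * x


def probabilistic_sales(x, error_sd=5.0):
    """Deterministic component plus a random error with mean 0.

    error_sd is an assumed value; the slides do not specify one.
    """
    return deterministic_sales(x) + random.gauss(0.0, error_sd)


advertising = [1, 2, 3, 4, 5]  # hypothetical spending levels
exact = [deterministic_sales(x) for x in advertising]
observed = [probabilistic_sales(x) for x in advertising]

# The random error is whatever separates an observation from the exact rule.
errors = [obs - det for obs, det in zip(observed, exact)]

for x, det, obs, err in zip(advertising, exact, observed, errors):
    print(f"x={x}: deterministic={det}, observed={obs:.2f}, error={err:+.2f}")
```

Each observed value differs from 10x by its own random error, which is the part of sales that advertising alone does not account for.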

Regression Models
Answers: “What is the relationship between the variables?”
Equations used:
- One numerical dependent (response) variable
- One or more numerical or categorical independent (explanatory) variables
Used mainly for estimating the strength of the relationship and for prediction

Regression Modeling Steps
1. Hypothesize the deterministic relationship between the response (dependent) variable and one or more explanatory (independent) variables in the population.
2. Specify the probability distribution of the random error term. Estimate the standard deviation of the error.
3. Estimate unknown model parameters.
4. Interpret the estimated parameters. What is a parameter?

Model Specification is Based on Theory
Theory of field (e.g., Sociology)
Mathematical theory
Previous research
“Common sense”

Types of Regression Models
Simple (1 explanatory variable)
- Linear
- Non-Linear
Multiple (2+ explanatory variables)
- Linear
- Non-Linear

Linear Regression Models
Relationship between variables is a linear function:
$y = \beta_0 + \beta_1 x + \varepsilon$
- $y$: dependent (response) variable
- $x$: independent (explanatory) variable
- $\beta_0$: population y-intercept
- $\beta_1$: population slope
- $\varepsilon$: random error

Population Linear Regression Model
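$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
$E(y) = \beta_0 + \beta_1 x$
$\varepsilon_i$ = random error
[Figure: population regression line $E(y)$ with observed values scattered around it; axes x and y]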

Sample Linear Regression Model
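$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
$\hat{\varepsilon}_i$ = random error
[Figure: sample regression line with an observed value and an unsampled observation marked; axes x and y]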

Estimating Parameters: Least Squares Method
Hypothesize deterministic component
Estimate unknown model parameters
Specify probability distribution of random error term
Evaluate model
Use model for prediction and estimation

Scattergram
Plot of all $(x_i, y_i)$ pairs
Suggests how well the model will fit
[Figure: scattergram of y against x, both axes running from 0 to 60]

Thinking Challenge
How would you draw a line through the points?
How would you determine which line fits best?
[Figure: scattergram of y against x, both axes running from 0 to 60]

Least Squares
“Best fit” means the differences between the actual y values and the estimated or predicted y values are at a minimum.
But positive differences offset negative ones, so the differences are squared.
Least Squares minimizes the Sum of the Squared Differences (SSE):
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
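A small sketch of why the differences are squared. The five data points are hypothetical (the slides show no raw data), and the second candidate line, y = -1 + x, is an arbitrary comparison line chosen for illustration.

```python
# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def sum_of_errors(b0, b1):
    """Plain sum of (actual - predicted); positives and negatives cancel."""
    return sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))


def sse(b0, b1):
    """Sum of squared differences between actual and predicted y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Candidate 1: the least squares line for this data, y-hat = -0.1 + 0.7x.
# Candidate 2: an arbitrary alternative, y-hat = -1 + x.
sum_err_ls, sse_ls = sum_of_errors(-0.1, 0.7), sse(-0.1, 0.7)
sum_err_other, sse_other = sum_of_errors(-1.0, 1.0), sse(-1.0, 1.0)

print(f"least squares line: sum of errors = {sum_err_ls:.2f}, SSE = {sse_ls:.2f}")
print(f"alternative line:   sum of errors = {sum_err_other:.2f}, SSE = {sse_other:.2f}")
```

Both lines have errors that sum to zero, so the plain sum cannot tell them apart. The SSE can: it is smaller for the least squares line.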

Least Squares Graphically
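$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
$y_2 = \hat{\beta}_0 + \hat{\beta}_1 x_2 + \hat{\varepsilon}_2$
LS minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
[Figure: fitted line with four observations; the vertical deviations from the line are labeled $\hat{\varepsilon}_1$ to $\hat{\varepsilon}_4$; axes x and y]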


Coefficient Equations
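Prediction Equation
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Slope
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$
y-intercept
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$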
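The slope and intercept formulas can be applied directly, as in the minimal sketch below. The data set is hypothetical: the slides do not show raw data, and these five points were picked only because they are easy to check by hand.

```python
# Hypothetical data (not shown in the slides), chosen for illustration.
x = [1, 2, 3, 4, 5]  # advertising
y = [1, 1, 2, 2, 4]  # sales volume
n = len(x)

sum_x = sum(x)
sum_y = sum(y)

# SS_xy = sum(x_i * y_i) - (sum x_i)(sum y_i) / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
# SS_xx = sum(x_i^2) - (sum x_i)^2 / n
ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n

b1 = ss_xy / ss_xx                  # slope estimate
b0 = sum_y / n - b1 * sum_x / n     # intercept estimate: y-bar - b1 * x-bar

print(f"SS_xy = {ss_xy}, SS_xx = {ss_xx}")
print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
```

For this data SS_xy = 7 and SS_xx = 10, so the slope is 0.7 and the intercept is 2 - 0.7(3) = -0.1.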

Interpretation of Coefficients
Slope ($\hat{\beta}_1$)
- Estimated y changes by $\hat{\beta}_1$ for each 1-unit increase in x
- If $\hat{\beta}_1 = 2$, then Sales (y) is expected to increase by 2 for each 1-unit increase in Advertising (x)
Y-Intercept ($\hat{\beta}_0$)
- The average value of y when x = 0
- If $\hat{\beta}_0 = 4$, then Average Sales (y) is expected to be 4 when Advertising (x) is 0

Parameter Estimation Computer Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP    1   -0.1000              0.6350           -0.157              0.8849
ADVERT      1    0.7000              0.1914            3.656              0.0354
The INTERCEP estimate is $\hat{\beta}_0$ and the ADVERT estimate is $\hat{\beta}_1$:
$\hat{y} = -0.1 + 0.7x$

Coefficient Interpretation Solution
Slope ($\hat{\beta}_1$)
- Sales Volume (y) is expected to increase by 0.7 units for each $1 increase in Advertising (x)
Y-Intercept ($\hat{\beta}_0$)
- Average value of Sales Volume (y) is -0.10 units when Advertising (x) is 0
- Difficult to explain to marketing manager
- Expect some sales without advertising
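The two interpretations can be checked against the fitted equation from the computer output. A minimal sketch follows; the function name `predict_sales` and the advertising levels 2 and 3 are illustrative choices, not part of the slides.

```python
# Estimates read from the computer output: y-hat = -0.1 + 0.7x
B0_HAT = -0.1
B1_HAT = 0.7


def predict_sales(advertising):
    """Predicted sales volume for a given advertising spend (hypothetical helper)."""
    return B0_HAT + B1_HAT * advertising


# Slope: change in predicted sales for a 1-unit increase in advertising.
slope_effect = predict_sales(3) - predict_sales(2)

# Intercept: predicted sales when advertising is 0.
sales_at_zero = predict_sales(0)

print(f"change per extra unit of advertising: {slope_effect:.2f}")
print(f"predicted sales with no advertising:  {sales_at_zero:.2f}")
```

The negative prediction at x = 0 is the reason the intercept is hard to explain: zero advertising probably lies outside the range of the data used to fit the line.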

Probability Distribution of Random Error
Evaluate model
Use model for prediction and estimation

Linear Regression Assumptions
The mean of the probability distribution of the error, ε, is 0
The probability distribution of the error, ε, is approximately normal
The probability distribution of the error has a constant variance
Errors are independent

Error Probability Distribution
[Figure: error probability distributions at $x_1$, $x_2$, $x_3$ around the line E(y) = β0 + β1x; axes x and y]

Random Error Variation
Variation of actual y from predicted y, $\hat{y}$
Measured by the standard error of the regression model: the sample standard deviation of $\hat{\varepsilon}$, denoted $s$
Affects several factors, such as parameter significance and prediction accuracy

Variation Measures
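$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Total sum of squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$
Unexplained sum of squares or SSE: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Explained sum of squares: $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
[Figure: at $x_i$, the deviation of $y_i$ from the mean $\bar{y}$ split at the fitted line into an explained part and an unexplained part; axes x and y]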
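For a least squares line, the total sum of squares equals the explained plus the unexplained sum of squares. The sketch below checks this numerically on a hypothetical five-point data set (the slides show no raw data); the slope is computed in the equivalent deviations-from-the-mean form.

```python
# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates, written with deviations from the means
# (algebraically the same as SS_xy / SS_xx).
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained

print(f"total = {sst:.2f}, explained = {ssr:.2f}, unexplained = {sse:.2f}")
```

Here the total of 6.0 splits into 4.9 explained by the line and 1.1 left unexplained.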

Estimation of Variance of Error σ2
 
$s^2 = \dfrac{SSE}{n - 2}$ where $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$s = \sqrt{s^2} = \sqrt{\dfrac{SSE}{n - 2}}$
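A short sketch of the variance estimate, again on a hypothetical five-point data set (not from the slides). The divisor is n - 2 because two parameters, the intercept and the slope, are estimated from the data.

```python
import math

# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# SSE: sum of squared differences between observed and predicted y.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

s_squared = sse / (n - 2)   # estimate of the error variance
s = math.sqrt(s_squared)    # standard error of the regression model

print(f"SSE = {sse:.2f}, s^2 = {s_squared:.4f}, s = {s:.4f}")
```

With SSE = 1.1 and n = 5, the estimate is s² = 1.1 / 3, so s is about 0.61.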
 


Residual Analysis
The residual for observation i, $e_i$, is the difference between its observed and predicted value:
$e_i = Y_i - \hat{Y}_i$
Check the assumptions of regression by examining the residuals:
- Examine for linearity assumption
- Evaluate independence assumption
- Evaluate normal distribution assumption
- Examine for constant variance for all levels of X (homoscedasticity)
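Computing the residuals is the first step of every check listed above. The sketch below uses a hypothetical five-point data set (not from the slides) and prints the residuals next to x, which is the raw material for a residual plot.

```python
# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residual e_i = observed value minus predicted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, ei in zip(x, residuals):
    print(f"x = {xi}: residual = {ei:+.2f}")

# Two properties that always hold for a least squares fit with an intercept:
# the residuals sum to zero and are uncorrelated with x.
residual_sum = sum(residuals)
residual_x_sum = sum(xi * ei for xi, ei in zip(x, residuals))
```

Because these two sums are zero by construction, they say nothing about the assumptions. The checks rely on the pattern of the residuals: curvature, trends over time, changing spread, or departures from normality.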

Residual Analysis for Linearity
[Figure: two panels, "Not Linear" and "Linear"; each shows Y plotted against x and the residuals plotted against x]

Residual Analysis for Independence
[Figure: residuals plotted against X in panels labeled "Not Independent" and "Independent"]

Check for Normality
Examine the Stem-and-Leaf Display of the Residuals
Examine the Boxplot of the Residuals
Examine the Histogram of the Residuals
Construct a Normal Probability Plot of the Residuals

Residual Analysis for Normality
When using a normal probability plot, normal errors will fall approximately along a straight line.
[Figure: normal probability plot; Percent (0 to 100) against Residual (-3 to 3)]

Residual Analysis for Equal Variance
[Figure: two panels, "Non-constant variance" and "Constant variance"; each shows Y plotted against x and the residuals plotted against x]

Interpreting the Model - Testing for Significance
Interpret model

Test of Slope Coefficient
Shows if there is a linear relationship between x and y
Involves population slope $\beta_1$
Hypotheses:
- $H_0$: $\beta_1 = 0$ (No Linear Relationship)
- $H_a$: $\beta_1 \neq 0$ (Linear Relationship)
Theoretical basis is sampling distribution of slope

Sampling Distribution of Sample Slopes
All Possible Sample Slopes
Sample 1: 2.5
Sample 2: 1.6
Sample 3: 1.8
Sample 4: 2.1
...
Very large number of sample slopes
Sampling distribution of $\hat{\beta}_1$: centered on $\beta_1$, with standard error $S_{\hat{\beta}_1}$
[Figure: population line with Sample 1 and Sample 2 lines; axes x and y]

Slope Coefficient Test Statistic
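$t = \dfrac{\hat{\beta}_1}{S_{\hat{\beta}_1}}, \quad df = n - 2$
where
$S_{\hat{\beta}_1} = \dfrac{s}{\sqrt{SS_{xx}}}$
$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$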
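The test statistic can be computed step by step, as in the sketch below. The data set is hypothetical (the slides show no raw data), and the critical value 3.182 is the standard t-table entry for a two-sided test at the 0.05 level with 3 degrees of freedom; the 0.05 level is an assumed choice.

```python
import math

# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = ss_xy / ss_xx
b0 = sum(y) / n - b1 * sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))      # standard error of the regression model

s_b1 = s / math.sqrt(ss_xx)       # standard error of the slope estimate
t_stat = b1 / s_b1                # test statistic for H0: beta_1 = 0
df = n - 2

T_CRITICAL = 3.182                # t-table value, alpha = 0.05 two-sided, df = 3
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"S_b1 = {s_b1:.4f}, t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Since |t| exceeds the critical value, this sample gives evidence of a linear relationship at the assumed 0.05 level.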

 


   
 
 
 
 



Test of Slope Coefficient Computer Output
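Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP    1   -0.1000              0.6350           -0.157              0.8849
ADVERT      1    0.7000              0.1914            3.656              0.0354
For ADVERT, the Parameter Estimate is $\hat{\beta}_1$, the Standard Error is $S_{\hat{\beta}_1}$, the T value is $t = \hat{\beta}_1 / S_{\hat{\beta}_1}$, and Prob>|T| is the P-Value.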
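The T and Prob>|T| columns can be recomputed from the estimates and standard errors in the output. The sketch below assumes the error degrees of freedom are 3 (that is, n = 5), which the output does not state; under that assumption the t distribution has a simple closed-form CDF, so no statistics library is needed. The closed form is valid only for exactly 3 degrees of freedom.

```python
import math


def two_sided_p_value_df3(t_stat):
    """Two-sided p-value for a t statistic with exactly 3 degrees of freedom.

    Uses the closed-form CDF of the t distribution with 3 df:
    F(t) = 1/2 + (u / (1 + u^2) + atan(u)) / pi, where u = t / sqrt(3).
    """
    u = abs(t_stat) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)


# (estimate, standard error) pairs read from the computer output.
output_rows = {"INTERCEP": (-0.1000, 0.6350), "ADVERT": (0.7000, 0.1914)}

results = {}
for name, (estimate, std_error) in output_rows.items():
    t_stat = estimate / std_error          # T for H0: Param = 0
    results[name] = (t_stat, two_sided_p_value_df3(t_stat))
    print(f"{name}: t = {t_stat:.3f}, p-value = {results[name][1]:.4f}")
```

Up to rounding of the printed standard errors, the recomputed values agree with the T and Prob>|T| columns, which supports the assumption of 3 degrees of freedom.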

Prediction with Regression Models
Types of predictions
- Point estimates
- Interval estimates
What is predicted?
- Population mean response E(y) for given x (a point on the population regression line)
- Individual response (y) for given x

Confidence Interval Estimate for Mean Value of y at $x = x_p$
$\hat{y} \pm t_{\alpha/2} \, s \sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
df = n – 2
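A worked sketch of the confidence interval for the mean response. Everything specific here is an assumption for illustration: the five-point data set, the point x_p = 4, and the 95% confidence level (t-table value 3.182 for 3 degrees of freedom).

```python
import math

# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - sum(y) / n) for xi, yi in zip(x, y)) / ss_xx
b0 = sum(y) / n - b1 * x_bar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

X_P = 4            # assumed point at which to estimate the mean of y
T_HALF_ALPHA = 3.182  # t-table value for 95% confidence, df = n - 2 = 3

y_hat_p = b0 + b1 * X_P
half_width = T_HALF_ALPHA * s * math.sqrt(1 / n + (X_P - x_bar) ** 2 / ss_xx)
ci_mean = (y_hat_p - half_width, y_hat_p + half_width)

print(f"y-hat = {y_hat_p:.2f}, 95% CI for mean y: "
      f"({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
```

The interval is centered on the point estimate 2.7 and extends about 1.06 on each side.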

Factors Affecting Interval Width
Level of confidence $(1 - \alpha)$
- Width increases as confidence increases
Data dispersion ($s$)
- Width increases as variation increases
Sample size
- Width decreases as sample size increases
Distance of $x_p$ from mean $\bar{x}$
- Width increases as distance increases
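The last factor, distance from the mean, is easy to see numerically. The sketch below evaluates the half-width of the confidence interval for the mean at three points; the data, the chosen points, and the 95% level are all assumptions for illustration.

```python
import math

# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - sum(y) / n) for xi, yi in zip(x, y)) / ss_xx
b0 = sum(y) / n - b1 * x_bar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

T_HALF_ALPHA = 3.182  # t-table value for 95% confidence, df = 3


def ci_half_width(x_p):
    """Half-width of the confidence interval for the mean of y at x_p."""
    return T_HALF_ALPHA * s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)


# x_bar is 3, so these points are 0, 1, and 2 units away from the mean.
half_widths = {x_p: ci_half_width(x_p) for x_p in (3, 4, 5)}

for x_p, hw in half_widths.items():
    print(f"x_p = {x_p}: half-width = {hw:.3f}")
```

The interval is narrowest at the mean of x and widens as x_p moves away from it, which is why predictions far from the center of the data are less precise.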

Prediction Interval of Individual Value of y at $x = x_p$
df = n – 2
$\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
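A sketch comparing the prediction interval for an individual value with the confidence interval for the mean at the same point. The data set, the point x_p = 4, and the 95% level are assumptions for illustration, not values from the slides.

```python
import math

# Hypothetical data for illustration; not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - sum(y) / n) for xi, yi in zip(x, y)) / ss_xx
b0 = sum(y) / n - b1 * x_bar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

X_P = 4               # assumed point at which to predict
T_HALF_ALPHA = 3.182  # t-table value for 95% confidence, df = 3

y_hat_p = b0 + b1 * X_P
leverage_term = 1 / n + (X_P - x_bar) ** 2 / ss_xx

# Mean response: uncertainty in the fitted line only.
ci_half_width = T_HALF_ALPHA * s * math.sqrt(leverage_term)
# Individual response: the extra "1 +" adds the random error of a single y.
pi_half_width = T_HALF_ALPHA * s * math.sqrt(1 + leverage_term)

print(f"y-hat = {y_hat_p:.2f}")
print(f"confidence interval half-width (mean y):       {ci_half_width:.2f}")
print(f"prediction interval half-width (individual y): {pi_half_width:.2f}")
```

The prediction interval is always the wider of the two, because an individual observation varies around the mean response in addition to the uncertainty in the estimated line.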


  

Key Takeaway
The statistical interpretation is the value proposition of the linear regression model.
The statistical interpretation depends on the assumptions of the linear model being met.
Understanding outliers is critical for drawing meaningful inferences from the linear regression model.

You have reached the end
of the presentation.
