Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy

Lesson 4 - The Linear Regression Model: Theory and Application - 21.01.2014
Introduction

In the past practicals we analyzed one variable at a time.
It is often also useful to analyze two or more variables together.
The question we want to answer is which relations, and which causal effects, determine changes in a variable, and whether a certain phenomenon is endogenous or exogenous.
In symbols, the idea can be represented as follows:
$y = f(x_1, x_2, \dots)$
Y is the response, which is a function of (that is, it depends on) one or more variables.
Objectives

All in all, the regression model is the instrument used to:
measure the strength of the relations between two or more variables, Y / X, and to measure the causal direction ($X \longrightarrow Y$ or vice versa?)
forecast the value of the variable Y in response to some changes in the others $X_1, X_2, \dots$ (called explanatories), or for some cases that are not considered in the sample.
Simple linear regression model
The regression model is stochastic, not deterministic.
Given two sets of values (two variables) from a random sample of size n: $x = \{x_1, x_2, \dots, x_i, \dots, x_n\}$; $y = \{y_1, y_2, \dots, y_i, \dots, y_n\}$:
Deterministic formula:
$$y_i = \beta_0 + \beta_1 x_i, \quad \forall i = 1, \dots, n$$
Stochastic formula:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \forall i = 1, \dots, n$$
where $\epsilon_i$ is the stochastic component.

$\beta_1$ defines the slope of the relation between X and Y (see the graph in chart 1).
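To make the difference between the two formulas concrete, here is a minimal simulation sketch. The parameter values, the sample size and the helper name `simulate_linear` are illustrative assumptions, not taken from the lesson.

```python
import numpy as np

def simulate_linear(beta0, beta1, sigma, x, seed=0):
    """Return (y_det, eps, y) for the deterministic and the stochastic formula."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y_det = beta0 + beta1 * x                    # deterministic formula: no error term
    eps = rng.normal(0.0, sigma, size=x.size)    # stochastic component epsilon_i
    y = y_det + eps                              # stochastic formula
    return y_det, eps, y

# Hypothetical "true" parameters, chosen only for illustration.
x = np.linspace(0.0, 10.0, 50)
y_det, eps, y = simulate_linear(beta0=2.0, beta1=0.5, sigma=1.0, x=x)

# The gap between the two formulas is exactly the stochastic component.
print(np.allclose(y - y_det, eps))
```

Plotting `y` against `x` would show points scattered around the straight line `y_det`, which is the picture behind the stochastic formula.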
Simple linear regression model - 2

We need to find $\hat\beta = \{\hat\beta_0, \hat\beta_1\}$ as estimators of $\beta_0$ and $\beta_1$.
After $\hat\beta$ is estimated, we can draw the estimated regression line, which corresponds to the estimated regression model, as follows:
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$
Here, $\hat\epsilon_i = y_i - \hat y_i$,
where $\hat y_i$ is the i-th element of the estimated Y vector, and $y_i$ is the i-th element of the real Y vector (see the graph in chart 2).
Steps in the Analysis

1. Study the relations (scatterplot, correlations) between two or more variables.
2. Estimate the parameters of the model, $\hat\beta = \{\hat\beta_0, \hat\beta_1\}$.
3. Run hypothesis tests on the estimated $\hat\beta_1$ to verify the causal effects between X and Y.
4. Check the robustness of the model.
5. Use the model to analyze the causal effect and/or to do forecasting.
Why linear?

It is simple to estimate, to analyze and to interpret.
It fits many empirical cases, in which the relation between two phenomena is approximately linear.
There are many well-established methods to transform variables in order to obtain a linear relationship (log transformation, normalization, etc.).
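As a small illustration of the log transformation, the sketch below assumes a noise-free exponential relation y = a * exp(b * x), which becomes linear in x after taking logs. The values of `a` and `b` and the helper name `log_linearize` are hypothetical.

```python
import numpy as np

def log_linearize(y):
    """Return log(y); the transformation requires strictly positive values."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("log transformation needs strictly positive values")
    return np.log(y)

# Hypothetical exponential relation, noise-free for clarity.
a, b = 3.0, 0.4
x = np.arange(1.0, 9.0)
y = a * np.exp(b * x)

log_y = log_linearize(y)

# After the transformation the slope between consecutive points is constant:
# log(y) = log(a) + b * x is a straight line in x.
steps = np.diff(log_y) / np.diff(x)
print(np.allclose(steps, b))
```

With real data the transformed points would only be approximately aligned, and the linear model would then be fitted to `log_y` instead of `y`.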
Model Hypotheses

For the estimation and the use of the model to be correct, certain hypotheses must hold:
$E(\epsilon_i) = 0, \ \forall i \ \longrightarrow \ E(y_i) = \beta_0 + \beta_1 x_i$
Homoscedasticity: $V(\epsilon_i) = \sigma_i^2 = \sigma^2, \ \forall i$
Null covariance: $Cov(\epsilon_i, \epsilon_j) = 0, \ \forall i \neq j$
Null covariance between residuals and explanatories: $Cov(x_i, \epsilon_i) = 0, \ \forall i$, since X is deterministic (known)
Normality assumption: $\epsilon_i \sim N(0, \sigma^2)$
Model Hypotheses - 2

From the hypotheses above, it follows that:
$V(y_i) = \sigma^2, \ \forall i$. Y is stochastic only through the $\epsilon$ component.
$Cov(y_i, y_j) = 0, \ \forall i \neq j$, since the residuals are uncorrelated.
$y_i \sim N[(\beta_0 + \beta_1 x_i), \sigma^2]$, since the residuals are also normally distributed.
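These consequences follow in one line each once X is treated as known, so that $\beta_0 + \beta_1 x_i$ is a constant. A short sketch of the reasoning, using only the hypotheses listed for the model:

```latex
\begin{align*}
E(y_i) &= E(\beta_0 + \beta_1 x_i + \epsilon_i)
        = \beta_0 + \beta_1 x_i + E(\epsilon_i)
        = \beta_0 + \beta_1 x_i \\
V(y_i) &= V(\beta_0 + \beta_1 x_i + \epsilon_i)
        = V(\epsilon_i)
        = \sigma^2 \\
Cov(y_i, y_j) &= Cov(\epsilon_i, \epsilon_j) = 0, \quad i \neq j
\end{align*}
```

Normality of $y_i$ then follows because adding a constant to a normal variable shifts its mean and leaves its shape unchanged.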
Ordinary Least Squares (OLS) Estimation

OLS is the method used to estimate the vector β.
The idea is to make the residuals as small as possible.
Since $e_i = y_i - \hat y_i$, we are interested in minimizing the component $y_i - \hat\beta_0 - \hat\beta_1 x_i$.
N.B. $\epsilon_i = y_i - \beta_0 - \beta_1 x_i$, while $e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$

The method consists in minimizing the sum of the squared differences:
$$\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} e_i^2 = \text{Min},$$
which is equivalent to solving the system of two equations obtained by setting the partial derivatives to zero.
Ordinary Least Squares (OLS) Estimation - 2

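$$\frac{\partial}{\partial \hat\beta_0} \sum_{i=1}^{n} e_i^2 = 0 \qquad (1)$$
$$\frac{\partial}{\partial \hat\beta_1} \sum_{i=1}^{n} e_i^2 = 0 \qquad (2)$$

After some algebra, we end up with these estimators for the vector β:

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \qquad (3)$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n} (x_i - \bar x)^2} \qquad (4)$$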
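The two closed-form estimators can be computed directly. The following is a minimal sketch: the function name `ols_simple` and the five data points are made up for illustration and are not the cement data.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates (beta0_hat, beta1_hat) for y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    ss_x = np.sum((x - x_bar) ** 2)
    if ss_x == 0:
        raise ValueError("x must not be constant")
    beta1_hat = np.sum((y - y_bar) * (x - x_bar)) / ss_x   # slope
    beta0_hat = y_bar - beta1_hat * x_bar                  # intercept
    return beta0_hat, beta1_hat

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = ols_simple(x, y)
residuals = y - (b0 + b1 * x)

# The two first-order conditions: residuals sum to zero and are orthogonal to x.
print(b0, b1, residuals.sum(), np.sum(residuals * x))
```

For these numbers the slope is about 1.99 and the intercept about 0.05. The last two printed values are zero up to rounding, which is what the first-order conditions require at the minimum.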
OLS estimators

The OLS estimators $\hat\beta_0$ and $\hat\beta_1$ are stochastic (they have a distribution over the sample space of all the possible estimates obtained from different samples).
$\hat\beta_1$ measures the estimated variation in Y determined by a unit variation in X ($\partial Y / \partial X$).
The OLS estimators are unbiased ($E(\hat\beta_1) = \beta_1$), and they are BLUE (Best Linear Unbiased Estimators: unbiased and with the lowest variance among linear unbiased estimators).
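The idea that the slope estimator has its own distribution across samples can be checked by simulation. The sketch below repeatedly draws samples from a model with known parameters and looks at the mean and variance of the estimated slopes. All parameter values and the function name `slope_estimates` are illustrative assumptions.

```python
import numpy as np

def slope_estimates(beta0, beta1, sigma, x, n_samples, seed=0):
    """Simulate n_samples datasets from y = beta0 + beta1*x + eps; return each OLS slope."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x_c = x - x.mean()
    ss_x = np.sum(x_c ** 2)
    eps = rng.normal(0.0, sigma, size=(n_samples, x.size))
    y = beta0 + beta1 * x + eps                  # one simulated sample per row
    y_c = y - y.mean(axis=1, keepdims=True)
    return (y_c @ x_c) / ss_x                    # closed-form slope, row by row

x = np.linspace(0.0, 10.0, 30)
sigma = 1.5
slopes = slope_estimates(beta0=1.0, beta1=2.0, sigma=sigma, x=x, n_samples=5000)

ss_x = np.sum((x - x.mean()) ** 2)
print(slopes.mean(), slopes.var(), sigma ** 2 / ss_x)
```

The average of the simulated slopes should be close to the true slope (unbiasedness), and their variance close to sigma^2 / SSx, the variance of the slope estimator under the model hypotheses.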
Linear dependency index ($R^2$)
The $R^2$ index is the most widely used index to measure the linear fit of the model.
$R^2$ lies in the interval [0, 1], where values near 1 mean the explanatories are useful to describe the changes in Y (it is the correlation coefficient, not $R^2$, that ranges over [−1, 1]).
Let us define SQT = SQR + SQE, or
$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2$$
The $R^2$ is defined as
$$R^2 = \frac{\sum_{i=1}^{n} (\hat y_i - \bar y)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}$$
Or, equivalently:
$$R^2 = \frac{SQR}{SQT} \quad \text{or} \quad 1 - \frac{SQE}{SQT}.$$
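A minimal sketch of the decomposition and of the two equivalent ways of computing the index follows. The function name `r_squared` and the data are made up for illustration.

```python
import numpy as np

def r_squared(x, y):
    """Return (SQT, SQR, SQE, R2) for the simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    beta1_hat = np.sum((y - y.mean()) * x_c) / np.sum(x_c ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    y_hat = beta0_hat + beta1_hat * x
    sqt = np.sum((y - y.mean()) ** 2)        # total sum of squares
    sqr = np.sum((y_hat - y.mean()) ** 2)    # sum of squares explained by the regression
    sqe = np.sum((y - y_hat) ** 2)           # sum of squared residuals
    return sqt, sqr, sqe, sqr / sqt

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

sqt, sqr, sqe, r2 = r_squared(x, y)
print(r2, 1.0 - sqe / sqt)
```

Both printed values coincide because SQT = SQR + SQE holds for an OLS fit with an intercept. In the simple regression case the index also equals the squared correlation between x and y.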
Hypothesis testing on β1
The estimated slope parameter $\hat\beta_1$ is stochastic. It is distributed as a Gaussian:
$$\hat\beta_1 \sim N[\beta_1, \sigma^2 / SSx]$$
We can use hypothesis testing to investigate the causal relation between Y and X:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$,
where the alternative hypothesis means that there is a causal relation.
The test is:
$$z = \frac{\hat\beta_1 - \beta_1}{\sqrt{\sigma^2 / SSx}} \sim N(0, 1),$$
where $SSx = \sum_{i=1}^{n} (x_i - \bar x)^2$ and $\beta_1 = 0$ under $H_0$.
When $\sigma^2$ is unknown, we estimate it as $s^2 = \sum_{i=1}^{n} e_i^2 / (n - 2)$, and we use the t-test with n − 2 degrees of freedom.
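The test statistic for the slope can be sketched as follows. This assumes the standard result for simple regression with an intercept: the error variance is estimated from the residuals with n − 2 degrees of freedom. The function name `slope_t_statistic` and the data are hypothetical.

```python
import math
import numpy as np

def slope_t_statistic(x, y):
    """Return (t, dof) for testing H0: beta1 = 0 in y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    x_c = x - x.mean()
    ss_x = np.sum(x_c ** 2)
    beta1_hat = np.sum((y - y.mean()) * x_c) / ss_x
    beta0_hat = y.mean() - beta1_hat * x.mean()
    residuals = y - (beta0_hat + beta1_hat * x)
    s2 = np.sum(residuals ** 2) / (n - 2)    # estimate of sigma^2
    se_beta1 = math.sqrt(s2 / ss_x)          # estimated standard error of the slope
    return beta1_hat / se_beta1, n - 2

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

t_stat, dof = slope_t_statistic(x, y)
print(t_stat, dof)
```

The statistic is then compared with the quantiles of a Student t distribution with `dof` degrees of freedom. A large absolute value leads to rejecting the null hypothesis of a zero slope.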
Forecasting within the regression model

The question we want to answer is the following: what is the expected value of Y (say $y_{n+1}$) for a certain observation that is not in the sample?
Suppose we have, for that observation, the value of the variable X (say $x_{n+1}$).
We make use of the estimated β to determine:
$$\hat y_{n+1} = \hat\beta_0 + \hat\beta_1 x_{n+1}$$
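A forecast is obtained by plugging the new X value into the estimated line. The sketch below uses a hypothetical helper `forecast` and made-up data.

```python
import numpy as np

def forecast(x, y, x_new):
    """Fit y = beta0 + beta1 * x by OLS and return the predicted value(s) at x_new."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    beta1_hat = np.sum((y - y.mean()) * x_c) / np.sum(x_c ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    return beta0_hat + beta1_hat * np.asarray(x_new, dtype=float)

# Made-up sample (n = 5) and a new observation with x = 6.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

y_next = forecast(x, y, 6.0)
print(y_next)
```

With these numbers the estimated line is roughly 0.05 + 1.99 x, so the forecast at x = 6 is about 11.99. Forecasts far outside the observed range of X rely more heavily on the linearity assumption.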
Model Checking
Several methods are used to test the robustness of the model, most of them based on the stochastic part of the model: the estimated residuals.
Graphical checks: plot of residuals versus fitted values
qq-plot for normality
Shapiro-Wilk test for normality
Durbin-Watson test for serial correlation
Breusch-Pagan test for heteroscedasticity
Moreover, the leverage is used to evaluate the importance of each observation in determining the estimated coefficients $\hat\beta$.
The stepwise procedure is used to choose between different model specifications.
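As an illustration of the leverage, the sketch below assumes the usual definition (the diagonal of the hat matrix), which in simple regression with an intercept reduces to 1/n + (x_i − x̄)² / SSx. The function name `leverage` and the data are made up.

```python
import numpy as np

def leverage(x):
    """Leverage of each observation in a simple linear regression with intercept."""
    x = np.asarray(x, dtype=float)
    x_c = x - x.mean()
    return 1.0 / x.size + x_c ** 2 / np.sum(x_c ** 2)

# Made-up explanatory values: the last one is far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
h = leverage(x)
print(h.round(3))
```

The observation with x = 20 gets a leverage close to 1, meaning that the fitted line is pulled almost entirely through that point. The leverages sum to 2, the number of estimated coefficients.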
Model Checking using estimated residuals - Linearity
An example of departure from the linearity assumption: we can
draw a curve (not a horizontal line) to interpolate the points

Figure: residuals (Y) versus estimated (X) values
Model Checking using estimated residuals - Homoscedasticity
An example of departure from the homoscedasticity assumption (the spread of the estimated residuals increases as the predicted values increase)
Model Checking using estimated residuals - Normality
An example of departure from the normality assumption: the
qq-points do not follow the qq-line

Figure: residuals (Y) versus estimated (X) values
Model Checking using estimated residuals - Serial correlation
An example of departure from the assumption of no serial correlation: the residual at i depends on the value at i − 1
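Serial dependence of this kind can be summarized with the Durbin-Watson statistic, the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below computes only the statistic, not the critical values of the test. The simulated series and the helper names `durbin_watson` and `ar1` are illustrative assumptions.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 with no serial correlation, near 0 if strongly positive."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def ar1(n, rho, seed=0):
    """Simulate e_t = rho * e_{t-1} + u_t with standard normal u_t (rho = 0 gives white noise)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)
    e = np.empty(n)
    e[0] = u[0]
    for t in range(1, n):
        e[t] = rho * e[t - 1] + u[t]
    return e

dw_independent = durbin_watson(ar1(2000, rho=0.0))
dw_correlated = durbin_watson(ar1(2000, rho=0.9))
print(dw_independent, dw_correlated)
```

For a long series the statistic is approximately 2(1 − r), where r is the lag-1 autocorrelation of the residuals, so the independent series gives a value near 2 and the strongly correlated one a value near 0.2.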
Homework

1. Using the cement data (n = 13), determine the $\beta_0$ and $\beta_1$ coefficients manually, using the OLS formulas on page 11, for the model $y = \beta_0 + \beta_1 x_1$
2. Using the cement data, compute the $R^2$ index of the model $y = \beta_0 + \beta_1 x_1$, using the formula on page 13.
Charts - 1

Figure: Slope coefficient in the linear model
Charts - 2

Figure: Fitted (line) versus real (points) values
