Simple and Multiple
Linear Regression
Dr. M. Ramakrishnan
Associate Professor
Department of Mathematics,
Ramakrishna Mission Vivekananda College,
Mylapore, Chennai-600004
Ph: 9543082432, 9381054104
E-mail: mramkey@mramkey.co.in
Regression Analysis
• Regression analysis is a statistical technique that measures
the average relationship between two or
more variables in terms of the original units of
the data.
• Dependent variable (also called the regressed or
explained variable)
• Independent variable (also called the regressor,
predictor, or explanatory variable)
Simple Linear Regression
• Linear regression : Y = α + βX
Where Y : Dependent variable
X : Independent variable
α and β : Two constants, called the regression
coefficients
β : Slope coefficient, i.e. the change in the value of Y
for a one-unit change in X
α : Y-intercept, the value of Y when X = 0
R is the correlation coefficient between the observed and predicted values.
• R² : R-squared is a goodness-of-fit measure for linear
regression models. This statistic indicates the percentage of the
variance in the dependent variable that the independent variables
explain collectively.
• If R² = 0.10, then only 10% of the total variation in Y can be
explained by the variation in the X variables.
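As a quick illustration (a hypothetical example using R's built-in cars dataset, not data from these notes), R-squared equals the squared correlation between the observed and fitted values:

fit <- lm(dist ~ speed, data = cars)  # stopping distance vs speed
summary(fit)$r.squared                # about 0.65
cor(cars$dist, fitted(fit))^2         # the same value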
Statistical assumptions for OLS model
• Normality —For fixed values of the
independent variables, the dependent
variable is normally distributed.
• Independence —The Yi values are
independent of each other.
• Linearity —The dependent variable is linearly
related to the independent variables.
• Homoscedasticity — The variance of the
dependent variable doesn’t vary with the
levels of the independent variables.
• Click the Regression icon
• Select Linear Regression under Classical
• Move the dependent variable into the Dependent
Variable box and the independent scale
variables into the Covariates box
For the Women Data
• From the output, you see that the prediction equation is
Weight = −87.52 + 3.45 × Height. The regression coefficient (3.45) is
significantly different from zero (p < 0.001) and indicates an
expected increase of 3.45 pounds of weight for every 1-inch increase in
height.
• The multiple R-squared (0.991) indicates that the model accounts for 99.1
percent of the variance in weights; this is an excellent fit.
• The multiple R-squared is also the squared correlation between the actual
and predicted values.
• The residual standard error (1.53 lbs.) can be thought of as the average
error in predicting weight from height using this model.
• The F statistic tests whether the predictor variables, taken together, predict
the response variable above chance levels.
• Because there’s only one predictor variable in simple regression, the F test
in this example is equivalent to the t-test for the regression coefficient
for Height.
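The output interpreted above can be reproduced in R; a minimal sketch, assuming the data are R's built-in women dataset (heights in inches and weights in pounds for 15 women):

fit <- lm(weight ~ height, data = women)
summary(fit)   # coefficients, R-squared, residual standard error, F statistic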
Multiple Linear Regression
• Ordinary least squares (OLS) regression fits models of
the form Ŷi = β0 + β1X1i + β2X2i + . . . + βkXki ; i = 1, 2, . . ., n
• Where n is the number of observations and k is
the number of predictor variables. In this equation,
Ŷi is the predicted value of the dependent variable
for observation i
• Xji is the jth
predictor value for the ith
observation
• β0 is the intercept and βj is the regression
coefficient for the jth predictor. Our aim is to
minimize the difference between the observed
and predicted values of the model.
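In symbols, OLS chooses the coefficients that minimize the residual sum of squares, RSS = Σ (Yi − Ŷi)², summed over i = 1, . . ., n. A one-line check in R, reusing the women-data fit from the earlier slide:

sum((women$weight - fitted(fit))^2)   # equals sum(residuals(fit)^2)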
• Click the Regression icon
• Select Linear Regression under Classical
• Move the dependent variable into the Dependent
Variable box and the independent scale
variables into the Covariates box
In this model, only the variables Population and
Illiteracy are significant.
Model 2 contains only these two significant
variables.
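A sketch of how Model 2 could have been fit in R, assuming the state data frame was built from base R's state.x77 matrix (the variable names below match the output that follows):

state <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy")])
MLR2 <- lm(Murder ~ Illiteracy + Population, data = state)
summary(MLR2)   # produces the output on the next slide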
Call:
lm(formula = Murder ~ Illiteracy + Population, data = state)
Residuals:
Min 1Q Median 3Q Max
-4.7652 -1.6561 -0.0898 1.4570 7.6758
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.65154974 0.81011208 2.039 0.04713 *
Illiteracy 4.08073664 0.58481561 6.978 0.00000000883 ***
Population 0.00022419 0.00007984 2.808 0.00724 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.481 on 47 degrees of freedom
Multiple R-squared: 0.5668, Adjusted R-squared: 0.5484
F-statistic: 30.75 on 2 and 47 DF, p-value: 0.000000002893
Interpretation of Model 2
• When there’s more than one predictor variable, the
regression coefficients indicate the increase in the
dependent variable for a unit change in a predictor
variable, holding all other predictor variables constant. For
example, the regression coefficient for Illiteracy is 4.081,
suggesting that a 1-percentage-point increase in illiteracy is
associated with an increase of about 4.08 in the murder
rate (murders per 100,000), controlling for population. This
coefficient is significantly different from zero (p < .0001).
The coefficient for Population is also significantly different
from zero (p = 0.00724), i.e. significant at the 0.01 level.
• Taken together, the predictor variables account for 57
percent of the variance in murder rates across states.
Checking Assumptions
plot(MLR2)
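plot(MLR2) draws the four standard lm() diagnostic plots one after another; to see them on one screen, set the plotting layout first:

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(MLR2)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))   # restore the default layout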
Linearity
plot(MLR2,1)
Ideally, the residual plot shows no fitted pattern: the red line should be
approximately horizontal at zero. The presence of a pattern may
indicate a problem with the linearity of the model.
Note that if the residual plot indicates a non-linear relationship in the
data, a simple approach is to use non-linear transformations of
the predictors, such as log(x), sqrt(x), and x^2, in the regression
model.
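As a hypothetical sketch (the fit_* names are illustrative, not models from these notes), transformed predictors can be written directly in the model formula:

fit_log  <- lm(Murder ~ log(Illiteracy) + Population, data = state)
fit_sqrt <- lm(Murder ~ sqrt(Illiteracy) + Population, data = state)
fit_sq   <- lm(Murder ~ I(Illiteracy^2) + Population, data = state)   # I() protects ^ inside a formula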
Homogeneity of variance
plot(MLR2,3)
This plot shows if residuals are spread equally along the ranges of predictors.
It’s good if you see a horizontal line with equally spread points. In our example,
this is not the case.
It can be seen that the variability (variance) of the residual points increases
with the value of the fitted outcome variable, suggesting non-constant
variance in the residual errors (heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or
square-root transformation of the outcome variable (y).
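A hypothetical sketch of such outcome transformations (illustrative names, not models from these notes):

fit_logy  <- lm(log(Murder) ~ Illiteracy + Population, data = state)
fit_sqrty <- lm(sqrt(Murder) ~ Illiteracy + Population, data = state)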
Normality of residuals
plot(MLR2,2)
The QQ plot of residuals can be used to visually check the normality
assumption. The normal probability plot of residuals should
approximately follow a straight line.
In our example, all the points fall approximately along this reference line
except point 28, so we can assume normality.
Outliers and high leverage points
Outliers:
An outlier is a point that has an extreme outcome variable value. The
presence of outliers may affect the interpretation of the model, because
they inflate the residual standard error (RSE).
Outliers can be identified by examining the standardized
residual (or studentized residual), which is the residual divided by its
estimated standard error. Standardized residuals can be interpreted as the
number of standard errors away from the regression line.
Observations whose standardized residuals are greater than 3 in
absolute value are possible outliers (James et al. 2014).
High leverage points:
A data point has high leverage if it has extreme predictor (x) values. This can
be detected by examining the leverage statistic, or hat-value. A value of
this statistic above 2(p + 1)/n indicates an observation with high leverage (P.
Bruce and Bruce 2017), where p is the number of predictors and n is the
number of observations.
Outliers and high leverage points can be identified by inspecting
the Residuals vs Leverage plot:
Outliers and high leverage points
plot(MLR2,5)
The plot highlights the two most extreme points (#11 and
#28), with standardized residuals of roughly −1.5 and 3.0. No
outlier exceeds 3 standard deviations in absolute value, which is good.
Additionally, there are no high-leverage points in the data: all
data points have a leverage statistic below 2(p + 1)/n = 6/50 = 0.12.
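These checks can also be done numerically; a minimal sketch for Model 2:

rs <- rstandard(MLR2)                   # standardized residuals
which(abs(rs) > 3)                      # possible outliers
hv <- hatvalues(MLR2)                   # leverage (hat) values
which(hv > 2 * (2 + 1) / nrow(state))   # high-leverage points, cutoff 6/50 = 0.12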
Influential values
An influential value is a value whose inclusion or exclusion can
alter the results of the regression analysis. Such a value is
associated with a large residual.
Not all outliers (or extreme data points) are influential in linear
regression analysis.
Statisticians have developed a metric called Cook’s distance to
determine the influence of a value. This metric defines
influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if
Cook’s distance exceeds 4/(n - p - 1)(P. Bruce and Bruce 2017),
where n is the number of observations and p the number of
predictor variables.
The Residuals vs Leverage plot can help us find influential
observations, if any. On this plot, outlying values are generally
located at the upper right corner or at the lower right corner.
Those spots are the places where data points can be influential
against a regression line.
par(mfrow=c(1,2))
plot(MLR2,4)
plot(MLR2,5)
By default, the top 3 most extreme values are labelled on the Cook’s
distance plot. If you want to label the top 5 extreme values, specify the
option id.n as follows:
plot(MLR2, 4, id.n = 5)
In our data, only one observation (#28) exceeds the Cook’s distance
cutoff 4/(50 − 2 − 1) ≈ 0.0851.
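The same check can be done numerically; a minimal sketch:

cd <- cooks.distance(MLR2)
which(cd > 4 / (nrow(state) - 2 - 1))   # observations above the 0.0851 cutoff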
• Keep this model as it is.