REGRESSION ANALYSIS
Presented by:
Alichy Sowmya
Parth Prajapati
Vikrant Ratnakar
Department of Pharmacoinformatics, NIPER S.A.S. Nagar
What is regression?
◉ Linear regression is a supervised modelling technique for continuous data that generates a
response from a set of input features
◉ It explains the linear relationship between a single variable Y, called the response
(output or dependent variable), and one or more predictors (input, independent or explanatory
variables)
◉ With a single predictor X it is a simple regression problem; with more than one predictor it
becomes a multiple regression problem
◉ Statistical modelling is the process of obtaining a
statistical model that adequately describes the
relationships between the variables involved
◉ The model takes the form of a prediction equation: the
values of a dependent variable (DV) are predicted by a
set of independent variables (IVs)
◉ Simplest model: simple linear regression
Simple Linear Regression
◉ SLR investigates the linear relation between two variables,
Y (DV) and X (IV or explanatory variable)
◉ “Linear”: used because the population mean of Y is
represented as a linear or straight-line function of X
◉ “Simple”: refers to the fact that there is only one independent
variable
◉ Examples:
• air quality and lung function
• medication dose and outcome of blood test
Exploring the Relationship Between Two Continuous Variables
◉ Step 1: Scatterplot
Shape of scatterplot gives form of relation
• linear
• quadratic
• more complex
◉ Step 2: Correlation coefficient
Strength of linear relation given by correlation coefficient
• r ranges from –1 to +1
• –1: perfect negative linear relationship
• +1: perfect positive linear relationship
• 0: no linear relationship
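As a minimal sketch in R (the height and FEV values here are made up purely for illustration):

# made-up height (cm) and FEV (L) values, for illustration only
height <- c(160, 165, 170, 175, 180, 185)
fev    <- c(2.4, 2.6, 3.0, 3.1, 3.5, 3.7)
plot(height, fev)    # Step 1: scatterplot of the two variables
cor(height, fev)     # Step 2: Pearson correlation r, between -1 and +1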
◉ Step 3: Simple linear regression
The population line is Y = α + βX + e
• Y = dependent or response variable; must be continuous
• X = independent / predictor / explanatory variable or covariate
• α = population regression parameter (intercept): the point where the
line crosses the vertical axis
• β = population regression parameter (slope): the change in the
mean value of Y for each increase of one unit in the value of X
• e = model error term (residual)
= the deviation between the predicted value of Y and the actual value
of Y
• Assumed Normally distributed with mean 0 and standard deviation σ
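A short simulation sketch of this model, with hypothetical values for α, β and σ:

set.seed(1)
alpha <- 1.0; beta <- 0.5; sigma <- 0.2   # hypothetical population parameters
X <- runif(50, 150, 190)                  # 50 predictor values
e <- rnorm(50, mean = 0, sd = sigma)      # errors ~ Normal(0, sigma)
Y <- alpha + beta * X + e                 # the population line plus random error
plot(X, Y)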
Objective of SLR
◉ Objective: to predict or estimate the value of the DV Y
corresponding to a given value of the IV X, through the estimated
regression line
◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n
◉ Build up: an estimated regression line using the sample. The
regression line from the observed data is an estimate of the
relationship between X and Y in the population
Estimated Regression Equation
◉ The fitted line is Ŷ = a + bX, where:
◉ a = regression coefficient
= the estimate of the parameter α
= the intercept of the estimated regression line
= the value of Y where the line crosses the Y axis
◉ b = regression coefficient
= the estimate of the parameter β
= the slope of estimated regression line
= the change in the mean value of Y for each one unit increase in the value of X
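For example, using R's built-in cars data set (speed and stopping distance), the estimates a and b can be read directly off a fitted model:

fit <- lm(dist ~ speed, data = cars)   # built-in data set: dist ~ speed
coef(fit)                              # (Intercept) = a, speed = b
a <- coef(fit)[1]                      # estimate of alpha
b <- coef(fit)[2]                      # estimate of beta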
Residual
◉ For any subject i, i = 1, 2, 3, …, n
◉ The original observed values are Xi and Yi
◉ For any given Xi, the Y value given by the line is called the
predicted value and denoted by Ŷi
◉ The residual ei is the difference between the observed value
and the predicted value: ei = Yi − Ŷi
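A sketch of predicted values and residuals, again on the built-in cars data:

fit  <- lm(dist ~ speed, data = cars)
yhat <- fitted(fit)                    # predicted values
e    <- resid(fit)                     # residuals e_i
head(cbind(observed = cars$dist, predicted = yhat, residual = e))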
Least Squares Estimation
◉ Least squares estimation is the method of estimating the equation /
fitting the model to the data in an optimal way
◉ The sum of squares of the vertical distances of the observations from
the line is minimized
◉ Least squares estimation minimizes Σ ei² = Σ (Yi − Ŷi)²
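A hand-rolled sketch of the least squares formulas, checked against lm():

x <- cars$speed; y <- cars$dist
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
a <- mean(y) - b * mean(x)                                       # intercept
c(a = a, b = b)
coef(lm(y ~ x))   # lm() gives the same least squares estimates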
Is X a significant predictor of Y?
◉ The association between X and Y is given by the
regression coefficient for the slope
◉ A zero slope means X has no "impact" on Y,
whereas a large value indicates large changes in Y
when X changes
Coefficient of Determination
◉ Denoted by R²
◉ Measures the goodness of fit of the model
◉ Assesses the usefulness or predictive value of the model
◉ Interpreted as the proportion of variability in the observed
values of Y explained by the regression of Y on X
◉ E.g. R² = 71.9%: almost 72% of the variation in lung function
(FEV) is explained by the regression of FEV on height
◉ R² = SSR/SST (e.g., 78.34/109.01 = 0.719 = 71.9%)
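A sketch computing R² from the sums of squares on the cars data; it matches the value reported by summary():

fit <- lm(dist ~ speed, data = cars)
SST <- sum((cars$dist - mean(cars$dist))^2)    # total sum of squares
SSR <- sum((fitted(fit) - mean(cars$dist))^2)  # regression sum of squares
SSR / SST                                      # R-squared
summary(fit)$r.squared                         # same value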
R² and b
◉ The coefficient of determination R² describes how well the
regression equation summarizes the data
◉ The regression coefficient b gives the nature of the
relationship between X and Y: the degree of change in Y for
a given change in X
◉ Two data sets may have the same slope b but different R²
values, and vice versa
Assumptions
1) The observations must be independent
2) The values of the dependent variable Y should be Normally distributed
(normality)
3) The variability (variance) of Y should be the same for each value of X -
homoscedasticity or constant variation
4) If X is continuous, the relation between X and Y should be linear (linearity)
Note
• X need not be a random variable nor have a Normal distribution
• Strictly, the assumptions apply to the residuals, but they can equivalently be tested on Y or on the residuals
• A transformation of Y may be required
Assumptions - Strategies for testing
◉ Normality
• Test for Y values or for standardized residuals
• using 5 measures (histogram, Normal Q-Q plot, boxplot,
skewness and kurtosis statistics)
◉ Linearity
• Assess from a scatterplot of X vs. Y
◉ Constant variation
• Plot of standardized residuals vs. X
• In plot of standardized residuals vs. X the points should scatter
randomly (without any pattern) and evenly (vertical spread the
same)
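These checks translate directly into R; a sketch on the cars data:

fit  <- lm(dist ~ speed, data = cars)
zres <- rstandard(fit)                   # standardized residuals
hist(zres)                               # Normality: histogram
qqnorm(zres); qqline(zres)               # Normality: Q-Q plot
plot(cars$speed, cars$dist)              # Linearity: scatterplot of X vs. Y
plot(cars$speed, zres); abline(h = 0)    # Constant variation: residuals vs. X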
An Example: FEV (Y) and height (X)
◉ Normality of FEV:
• skewness = 0.867
• kurtosis − 3 = 1.028
◉ Linearity: assessed from a scatterplot of FEV vs height
Constant variance?
• Constant variance does not hold
• FEV needs a natural logarithm transformation
After transformation, reassess the assumptions for the
transformed variable: ln(FEV)
◉ Normality:
• skewness = 0.040
• kurtosis − 3 = −0.433
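A base-R sketch of this check, on a hypothetical right-skewed fev vector (substitute your own FEV values):

fev  <- rlnorm(100, meanlog = 1, sdlog = 0.3)  # hypothetical skewed FEV values
lfev <- log(fev)                               # natural-log transform
z <- (lfev - mean(lfev)) / sd(lfev)
mean(z^3)        # skewness
mean(z^4) - 3    # excess kurtosis (kurtosis - 3)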
◉ Constant variation: reassessed from the plot of standardized residuals for ln(FEV)

R Code for SLR
dataset = read.csv("SLR.csv", header=T,
colClasses = c("numeric", "numeric", "numeric"))
head(dataset,5)
#/////Simple Regression/////
simple.fit = lm(Sales~Spend,data=dataset)
summary(simple.fit)
#Loading the necessary libraries
library(lmtest) #dwtest
library(fBasics) #JarqueBeraTest
#Testing normal distribution and independence assumptions
jarqueberaTest(simple.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero
dwtest(simple.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated
#Simple Regression Residual Plots
layout(matrix(c(1,1,2,3),2,2,byrow=T))
#Spend x Residuals Plot
plot(simple.fit$resid ~ dataset$Spend,
main="Spend x Residuals\nfor Simple Regression",
xlab="Marketing Spend", ylab="Residuals")
abline(h=0,lty=2)
#Histogram of Residuals
hist(simple.fit$resid, main="Histogram of Residuals",
ylab="Residuals")
#Q-Q Plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)
Multiple Regression
◉ Simple linear regression describes the linear relationship
between a dependent variable Y and a single explanatory
variable X
◉ Multiple regression is an extension to the case of one
dependent variable and two or more explanatory variables
Reasons for Performing Multiple Regression
◉ Predictions on the basis of a number of variables will be better
than those based on only one explanatory variable
◉ When testing the effect of a primary variable of interest e.g.
treatment effect / exposure, one needs to account for all other
extraneous influences
• The need to ‘control’ or ‘adjust’ for the possible effects of
‘nuisance’ explanatory variables (known as confounders)
◉ The relationships may be complex e.g. variables may have
combined or synergistic effects on the dependent variable
Reasons for Performing Multiple Regression
◉ It is almost always better to perform one comprehensive
analysis including all the relevant variables than a series of
two-way comparisons
• Reduce chances of increasing Type I error rate beyond 5%
• In multiple regression a linear model is fitted for the dependent
variable, which is expressed as a linear combination of the
independent variables
Importance of Predictors
◉ The regression coefficient bi represents the effect of that
independent variable on DV Y, after controlling for all
the other variables in the model
◉ The importance of each individual variable is tested by a
t test or an F test as for SLR
◉ The significance of an explanatory variable depends on
which other variables are included in the regression model
◉ A confidence interval gives further information
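For example, a sketch on mtcars (the data set used in the code below):

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(fit)$coefficients   # t value and Pr(>|t|) for each predictor
confint(fit)                # 95% confidence intervals for the coefficients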
Multiple regression models
◉ Multiple linear regression
• predictors all continuous and linearly related to the
dependent variable
◉ Analysis of covariance (ANCOVA)
• both continuous and categorical predictors
◉ Analysis of variance (e.g. two-way ANOVA)
• predictors all categorical
◉ Polynomial regression
• quadratic or higher order terms included
Categorical predictors
◉ Association between a continuous DV Y and a categorical IV
X is assessed by comparing the mean Y values in each
category of X
◉ A reference category is chosen, against which the other
categories are compared
◉ The regression coefficient for a comparison represents the
difference in the mean for Y for the given category vs the
reference category
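A sketch using mtcars, with the number of cylinders as a categorical predictor and 4 cylinders chosen as the reference category:

mt <- mtcars
mt$cyl <- relevel(factor(mt$cyl), ref = "4")   # 4 cylinders = reference
fit <- lm(mpg ~ cyl, data = mt)
coef(fit)   # cyl6, cyl8 = difference in mean mpg vs the reference category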
Assessing the fit of the model
◉ R² measures the usefulness or predictive value of the model
◉ R² is interpreted as the proportion of the total variability
explained by the model
◉ R² increases in value as each additional variable is added to
the model
◉ Adjusted R² (the preferred measure) takes into account the
number of explanatory variables included in the model
◉ E.g. R² = 0.482, adjusted R² = 0.462
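Both measures can be read off a fitted model; a sketch on mtcars:

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(fit)$r.squared       # R2: never decreases as predictors are added
summary(fit)$adj.r.squared   # adjusted R2: penalizes extra predictors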
◉ Also assess fit by inspecting the standardized residuals
• These should follow a Normal distribution
• Values > 3 or < −3 are large
◉ A large residual means the model does not fit well for that subject
◉ Some large residuals will occur by chance, but many large
residuals are of concern
Assumptions of multiple regression
◉ The observations must be independent
◉ The relation between each continuous X and the dependent
variable should be linear
◉ The values of the dependent variable Y should have a Normal
distribution
◉ The variability of Y should be the same for any set of values of the
explanatory variables – homoscedasticity
How to assess assumptions
◉ Assess the Normality of Y (or of the standardized residuals)
using the 5 measures listed earlier
◉ Obtain scatterplots of Y (or of the standardized residuals)
against each continuous X, primarily to assess linearity
◉ Obtain
• Levene's test for Y (or the standardized residuals), if
categorical predictors are included in the model, to assess
equal variance
• a plot of the standardized residuals against each X, if
continuous predictors are included in the model, primarily to
assess constant variation
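A sketch of Levene's test on standardized residuals, grouped by a categorical predictor (uses the car package):

library(car)                          # for leveneTest(); install.packages("car") if needed
mt <- mtcars
mt$am <- factor(mt$am)                # categorical predictor
fit <- lm(mpg ~ wt + am, data = mt)
leveneTest(rstandard(fit) ~ mt$am)    # p > 0.05 suggests equal variances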
Example - assess assumptions
◉ DV: FEV1
◉ Explanatory variables:
• Height (in cm)
• Gender (binary)
• Smoking status (3 categories)
◉ Normality of FEV1 (5 measures)
• skewness = −0.11
• kurtosis − 3 = −0.80
• Assumed
◉ Linearity: FEV1 vs height (scatterplot)
• Assumed
◉ Constant variation:
• standardized residuals vs height (scatterplot): no clear pattern
• Assumed
Equality of Variances
• Levene's Test (robust): p = 0.937 > 0.05
• Assumed
Conclusion: all the assumptions are met
Note: the test could equivalently be done using standardized residuals
R Code for MLR
#loading of the data
data("mtcars")
#viewing the data
mtcars
head(mtcars)
names(mtcars)
#attach() lets us refer to the columns without typing mtcars$ every time
attach(mtcars)
#checking the relationship between the variables
plot(mpg,cyl)
plot(mpg,disp)
plot(mpg,hp)
plot(mpg,drat)
plot(mpg,wt)
plot(mpg,qsec)
plot(mpg,vs)
plot(mpg,am)
plot(mpg,gear)
plot(mpg,carb)
#creating the multiple linear model
model <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb)
model
#checking the summary of the model
summary(model)
#various parameters to check the fitness of the model
#residual standard error: sqrt(RSS/df), df = 32 obs - 11 coefficients = 21
sqrt(sum((model$residuals)^2)/21)
summary(model)
#hypothesis testing: t-test
#the test statistic is the point estimate of a coefficient divided by its standard error
#example 1
tstat <- coef(summary(model))[3,1]/coef(summary(model))[3,2]
tstat
2*pt(tstat, 21, lower.tail=FALSE)   #two-sided p-value (valid as written when tstat > 0)
#example2
tstat2 <- coef(summary(model))[1,1]/coef(summary(model))[1,2]
tstat2
2*pt(tstat2, 21, lower.tail=FALSE)
summary(model)
#F-test
summary(model)
#Coefficient Confidence Intervals
confint(model, level=.95)
#testing the various assumptions of the model
#1 checking whether the residuals are normally distributed or not
#histogram
resid<- model$residuals
hist(resid)
#quantile plot
qqnorm(resid)
qqline(resid)
#2 checking the homoscedasticity
plot(model$residuals ~ disp)
abline(0,0)
#residual analysis
plot(model)
#transformations
model1 = lm(mpg ~cyl+log(disp)+log(hp)+drat+wt+qsec+vs+am+gear+carb)
summary(model1)
plot(model1)
#reducing the model
#calling of library
library(MASS)
#running the AIC on the initial model
stepAIC(model)
#running the AIC on the transformed model
stepAIC(model1)
#constructing new models with reduced variables
model2<-lm(mpg~qsec+wt+am)
summary(model2)
model6<-lm(mpg~log(disp)+gear+carb)
summary(model6)
#partial F-test
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model,nestmodel)
#Multicollinearity
plot(mtcars)
#checking the correlation
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
#variance inflation factor
install.packages("car")   #run once if the car package is not installed
library(car)
vif(model2)
#Polynomial Model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
quadmod = lm(mpg ~ qsec + I(qsec^2)+ wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
summary(quadmod)
AIC(quadmod)
#Interaction Model
model3<-lm(mpg~qsec+wt*am)
summary(model3)
AIC(model3)
resid3<- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0,0)
plot(model3)
#using the model for prediction
newdata <- data.frame(wt=2.92, qsec=20.1, am=1)
predy <- predict(model3, newdata, interval="prediction")   #prediction interval for a new observation
predy
confy <- predict(model3, newdata, interval="confidence")   #confidence interval for the mean response
confy
confy %*% c(0, -1, 1)   #width of the confidence interval
predy %*% c(0, -1, 1)   #width of the (wider) prediction interval
confy[1] == predy[1]    #both intervals are centred on the same fitted value
#sample prediction: mtcars row 20 is the Toyota Corolla (mpg=33.9, wt=1.835, qsec=19.9, am=1)
mtcars[20, ]
pred <- coef(summary(model3))[1,1] + coef(summary(model3))[2,1]*19.9 +
  coef(summary(model3))[3,1]*1.835 + coef(summary(model3))[4,1]*1 +
  coef(summary(model3))[5,1]*1.835*1
pred
33.9 - 31.0523   #observed minus predicted mpg for this car
THANK YOU…..