REGRESSION ANALYSIS
Presented by:
Alichy Sowmya
Parth Prajapati
Vikrant Ratnakar
Department of Pharmacoinformatics, NIPER S.A.S. Nagar
What is regression?
◉ Linear regression is a supervised modelling technique for continuous data that generates a
response from a set of input features
◉ It explains the linear relationship between a single variable Y, called the response
(output or dependent variable), and one or more predictors (input, independent or explanatory
variables)
◉ With a single predictor X it is a simple regression problem; with more than one predictor it
becomes a multiple regression problem
◉ Statistical modelling is the process of obtaining a
statistical model that adequately describes the
relationships between the variables involved
◉ The model takes the form of a prediction equation: the
values of a dependent variable (DV) are predicted by a
set of independent variables (IVs)
◉ Simplest model: simple linear regression
Simple Linear Regression
◉ SLR investigates the linear relation between two variables,
Y (DV) and X (IV or explanatory variable)
◉ “Linear”: used because the population mean of Y is
represented as a linear or straight-line function of X
◉ “Simple”: refers to the fact that there is only one independent
variable
◉ Examples:
• air quality and lung function
• medication dose and outcome of blood test
Exploring the Relationship Between Two Continuous Variables
◉ Step 1: Scatterplot
Shape of scatterplot gives form of relation
• linear
• quadratic
• more complex
◉ Step 2: Correlation coefficient
Strength of linear relation given by correlation coefficient
• r ranges from –1 to +1
• –1: perfect negative linear relationship
• +1: perfect positive linear relationship
• 0: no linear relationship
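As a minimal sketch in R (the height and FEV values here are made up purely for illustration):

# made-up height (cm) and FEV (L) values, for illustration only
height <- c(160, 165, 170, 175, 180, 185)
fev    <- c(2.4, 2.6, 3.0, 3.1, 3.5, 3.7)
plot(height, fev)    # Step 1: scatterplot of the two variables
cor(height, fev)     # Step 2: Pearson correlation r, between -1 and +1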
◉ Step 3: Simple linear regression
The population line is Y = α + βX + e
• Y = dependent or response variable; must be continuous
• X = independent / predictor / explanatory variable or covariate
• α = population regression parameter (intercept): the point where the
line crosses the vertical axis
• β = population regression parameter (slope): the change in the
mean value of Y for each increase of one unit in the value of X
• e = model error term (residual)
= the deviation between the predicted value of Y and the actual value
of Y
• Assumed Normally distributed with mean 0 and standard deviation σ
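A short simulation sketch of this model, with hypothetical values for α, β and σ:

set.seed(1)
alpha <- 1.0; beta <- 0.5; sigma <- 0.2   # hypothetical population parameters
X <- runif(50, 150, 190)                  # 50 predictor values
e <- rnorm(50, mean = 0, sd = sigma)      # errors ~ Normal(0, sigma)
Y <- alpha + beta * X + e                 # the population line plus random error
plot(X, Y)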
Objective of SLR
◉ Objective: to predict or estimate the value of the DV Y
corresponding to a given value of the IV X, through the estimated
regression line
◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n
◉ Build up: an estimated regression line using the sample. The
regression line from the observed data is an estimate of the
relationship between X and Y in the population
Estimated Regression Equation
◉ The fitted line is Ŷ = a + bX, where:
◉ a = regression coefficient
= the estimate of the parameter α
= the intercept of the estimated regression line
= the value of Y where the line crosses the Y axis
◉ b = regression coefficient
= the estimate of the parameter β
= the slope of estimated regression line
= the change in the mean value of Y for each one unit increase in the value of X
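For example, using R's built-in cars data set (speed and stopping distance), the estimates a and b can be read directly off a fitted model:

fit <- lm(dist ~ speed, data = cars)   # built-in data set: dist ~ speed
coef(fit)                              # (Intercept) = a, speed = b
a <- coef(fit)[1]                      # estimate of alpha
b <- coef(fit)[2]                      # estimate of beta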
Residual
◉ For any subject i, i = 1, 2, 3, …, n
◉ The original observed values are Xi and Yi
◉ For any given Xi, the Y value given by the line is called the
predicted value and denoted by Ŷi
◉ The residual ei is the difference between the observed value
and the predicted value: ei = Yi − Ŷi
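A sketch of predicted values and residuals, again on the built-in cars data:

fit  <- lm(dist ~ speed, data = cars)
yhat <- fitted(fit)                    # predicted values
e    <- resid(fit)                     # residuals e_i
head(cbind(observed = cars$dist, predicted = yhat, residual = e))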
Least Squares Estimation
◉ Least squares estimation is the method of estimating the equation /
fitting the model to the data in an optimal way
◉ The sum of squares of the vertical distances of the observations from
the line is minimized
◉ Least squares estimation minimizes Σ ei² = Σ (Yi − Ŷi)²
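A hand-rolled sketch of the least squares formulas, checked against lm():

x <- cars$speed; y <- cars$dist
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
a <- mean(y) - b * mean(x)                                       # intercept
c(a = a, b = b)
coef(lm(y ~ x))   # lm() gives the same least squares estimates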
Is X a significant predictor of Y?
◉ The association between X and Y is given by the
regression coefficient for the slope
◉ A zero slope means X has no "impact" on Y,
whereas a large value indicates large changes in Y
when X changes
Coefficient of Determination
◉ Denoted by R²
◉ Measures the goodness of fit of the model
◉ Assesses the usefulness or predictive value of the model
◉ Interpreted as the proportion of variability in the observed
values of Y explained by the regression of Y on X
◉ E.g. R² = 71.9%: almost 72% of the variation in lung function
(FEV) is explained by the regression of FEV on height
◉ R² = SSR/SST (e.g., 78.34/109.01 = 0.719 = 71.9%)
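A sketch computing R² from the sums of squares on the cars data; it matches the value reported by summary():

fit <- lm(dist ~ speed, data = cars)
SST <- sum((cars$dist - mean(cars$dist))^2)    # total sum of squares
SSR <- sum((fitted(fit) - mean(cars$dist))^2)  # regression sum of squares
SSR / SST                                      # R-squared
summary(fit)$r.squared                         # same value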
R² and b
◉ The coefficient of determination R² describes how well the
regression equation summarizes the data
◉ The regression coefficient b gives the nature of the
relationship between X and Y: the degree of change in Y for
a given change in X
◉ Two data sets may have the same slope b but different R²
values, and vice versa
Assumptions
1) The observations must be independent
2) The values of the dependent variable Y should be Normally distributed
(normality)
3) The variability (variance) of Y should be the same for each value of X -
homoscedasticity or constant variation
4) If X is continuous, the relation between X and Y should be linear (linearity)
Note
• X need not be a random variable nor have a Normal distribution
• Strictly, the assumptions apply to the residuals, but they can equivalently be tested on Y or on the residuals
• A transformation of Y may be required
Assumptions - Strategies for testing
◉ Normality
• Test for Y values or for standardized residuals
• using 5 measures (histogram, Normal Q-Q plot, boxplot,
skewness and kurtosis statistics)
◉ Linearity
• Assess from a scatterplot of X vs. Y
◉ Constant variation
• Plot of standardized residuals vs. X
• In plot of standardized residuals vs. X the points should scatter
randomly (without any pattern) and evenly (vertical spread the
same)
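These checks translate directly into R; a sketch on the cars data:

fit  <- lm(dist ~ speed, data = cars)
zres <- rstandard(fit)                   # standardized residuals
hist(zres)                               # Normality: histogram
qqnorm(zres); qqline(zres)               # Normality: Q-Q plot
plot(cars$speed, cars$dist)              # Linearity: scatterplot of X vs. Y
plot(cars$speed, zres); abline(h = 0)    # Constant variation: residuals vs. X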
An Example: FEV (Y) and height (X)
◉ Normality of FEV:
• skewness = 0.867
• kurtosis − 3 = 1.028
◉ Linearity: assessed from a scatterplot of FEV vs height
Constant variance?
• Constant variance does not hold
• FEV needs a natural logarithm transformation
After transformation, reassess the assumptions for the
transformed variable: ln(FEV)
◉ Normality:
• skewness = 0.040
• kurtosis − 3 = −0.433
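A base-R sketch of this check, on a hypothetical right-skewed fev vector (substitute your own FEV values):

fev  <- rlnorm(100, meanlog = 1, sdlog = 0.3)  # hypothetical skewed FEV values
lfev <- log(fev)                               # natural-log transform
z <- (lfev - mean(lfev)) / sd(lfev)
mean(z^3)        # skewness
mean(z^4) - 3    # excess kurtosis (kurtosis - 3)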
◉ Constant variation: reassessed from the plot of standardized residuals for ln(FEV)

R Code for SLR
dataset = read.csv("SLR.csv", header=T,
colClasses = c("numeric", "numeric", "numeric"))
head(dataset,5)
#/////Simple Regression/////
simple.fit = lm(Sales~Spend,data=dataset)
summary(simple.fit)
#Loading the necessary libraries
library(lmtest) #dwtest
library(fBasics) #JarqueBeraTest
#Testing normal distribution and independence assumptions
jarqueberaTest(simple.fit$resid) #Test residuals for normality
#Null Hypothesis: Skewness and Kurtosis are equal to zero
dwtest(simple.fit) #Test for independence of residuals
#Null Hypothesis: Errors are serially UNcorrelated
#Simple Regression Residual Plots
layout(matrix(c(1,1,2,3),2,2,byrow=T))
#Spend x Residuals Plot
plot(simple.fit$resid ~ dataset$Spend,
main="Spend x Residuals\nfor Simple Regression",
xlab="Marketing Spend", ylab="Residuals")
abline(h=0,lty=2)
#Histogram of Residuals
hist(simple.fit$resid, main="Histogram of Residuals",
ylab="Residuals")
#Q-Q Plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)
Multiple Regression
◉ Simple linear regression describes the linear relationship
between a dependent variable Y and a single explanatory
variable X
◉ Multiple regression is an extension to the case of one
dependent variable and two or more explanatory variables
Reasons for Performing Multiple Regression
◉ Predictions on the basis of a number of variables will be better
than those based on only one explanatory variable
◉ When testing the effect of a primary variable of interest e.g.
treatment effect / exposure, one needs to account for all other
extraneous influences
• The need to ‘control’ or ‘adjust’ for the possible effects of
‘nuisance’ explanatory variables (known as confounders)
◉ The relationships may be complex e.g. variables may have
combined or synergistic effects on the dependent variable
Reasons for Performing Multiple Regression
◉ It is almost always better to perform one comprehensive
analysis including all the relevant variables than a series of
two-way comparisons
• Reduce chances of increasing Type I error rate beyond 5%
• In multiple regression a linear model is fitted for the dependent
variable, which is expressed as a linear combination of the
independent variables
Importance of Predictors
◉ The regression coefficient bi represents the effect of that
independent variable on DV Y, after controlling for all
the other variables in the model
◉ The importance of each individual variable is tested by a
t test or an F test as for SLR
◉ The significance of an explanatory variable depends on
which other variables are included in the regression model
◉ A confidence interval gives further information
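For example, a sketch on mtcars (the data set used in the code below):

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(fit)$coefficients   # t value and Pr(>|t|) for each predictor
confint(fit)                # 95% confidence intervals for the coefficients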
Multiple regression models
◉ Multiple linear regression
• predictors all continuous and linearly related to the
dependent variable
◉ Analysis of covariance (ANCOVA)
• both continuous and categorical predictors
◉ Analysis of variance (e.g. two-way ANOVA)
• predictors all categorical
◉ Polynomial regression
• quadratic or higher order terms included
Categorical predictors
◉ Association between a continuous DV Y and a categorical IV
X is assessed by comparing the mean Y values in each
category of X
◉ A reference category is chosen, against which the other
categories are compared
◉ The regression coefficient for a comparison represents the
difference in the mean for Y for the given category vs the
reference category
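A sketch using mtcars, with the number of cylinders as a categorical predictor and 4 cylinders chosen as the reference category:

mt <- mtcars
mt$cyl <- relevel(factor(mt$cyl), ref = "4")   # 4 cylinders = reference
fit <- lm(mpg ~ cyl, data = mt)
coef(fit)   # cyl6, cyl8 = difference in mean mpg vs the reference category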
Assessing the fit of the model
◉ R² measures the usefulness or predictive value of the model
◉ R² is interpreted as the proportion of the total variability
explained by the model
◉ R² increases in value as each additional variable is added to
the model
◉ Adjusted R² (the preferred measure) takes into account the
number of explanatory variables included in the model
◉ E.g. R² = 0.482, adjusted R² = 0.462
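Both measures can be read off a fitted model; a sketch on mtcars:

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(fit)$r.squared       # R2: never decreases as predictors are added
summary(fit)$adj.r.squared   # adjusted R2: penalizes extra predictors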
◉ Also assess fit by inspecting the standardized residuals
• These should follow a Normal distribution
• Values > 3 or < −3 are large
◉ A large residual means the model does not fit well for that subject
◉ Some large residuals will occur by chance, but many large
residuals are of concern
Assumptions of multiple regression
◉ The observations must be independent
◉ The relation between each continuous X and the dependent
variable should be linear
◉ The values of the dependent variable Y should have a Normal
distribution
◉ The variability of Y should be the same for any set of values of the
explanatory variables – homoscedasticity
How to assess assumptions
◉ Assess the Normality of Y (or of the standardized residuals)
using the 5 measures listed earlier
◉ Obtain scatterplots of Y (or of the standardized residuals)
against each continuous X, primarily to assess linearity
◉ Obtain
• Levene's test for Y (or the standardized residuals), if
categorical predictors are included in the model, to assess
equal variance
• a plot of the standardized residuals against each X, if
continuous predictors are included in the model, primarily to
assess constant variation
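A sketch of Levene's test on standardized residuals, grouped by a categorical predictor (uses the car package):

library(car)                          # for leveneTest(); install.packages("car") if needed
mt <- mtcars
mt$am <- factor(mt$am)                # categorical predictor
fit <- lm(mpg ~ wt + am, data = mt)
leveneTest(rstandard(fit) ~ mt$am)    # p > 0.05 suggests equal variances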
Example - assess assumptions
◉ DV: FEV1
◉ Explanatory variables:
• Height (in cm)
• Gender (binary)
• Smoking status (3 categories)
◉ Normality of FEV1 (5 measures)
• skewness = −0.11
• kurtosis − 3 = −0.80
• Assumed
◉ Linearity: FEV1 vs height (scatterplot)
• Assumed
◉ Constant variation:
• standardized residuals vs height (scatterplot): no clear pattern
• Assumed
Equality of Variances
• Levene's Test (robust): p = 0.937 > 0.05
• Assumed
Conclusion: all the assumptions are met
Note: the test could equivalently be done using standardized residuals
R Code for MLR
#loading of the data
data("mtcars")
#viewing the data
mtcars
head(mtcars)
names(mtcars)
#attach() lets us refer to the columns without typing mtcars$ every time
attach(mtcars)
#checking the relationship between the variables
plot(mpg,cyl)
plot(mpg,disp)
plot(mpg,hp)
plot(mpg,drat)
plot(mpg,wt)
plot(mpg,qsec)
plot(mpg,vs)
plot(mpg,am)
plot(mpg,gear)
plot(mpg,carb)
#creating the multiple linear model
model <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb)
model
#checking the summary of the model
summary(model)
#various parameters to check the fitness of the model
#residual standard error: sqrt(RSS/df), df = 32 obs - 11 coefficients = 21
sqrt(sum((model$residuals)^2)/21)
summary(model)
#hypothesis testing: t-test
#the test statistic is the point estimate of a coefficient divided by its standard error
#example 1
tstat <- coef(summary(model))[3,1]/coef(summary(model))[3,2]
tstat
2*pt(tstat, 21, lower.tail=FALSE)   #two-sided p-value (valid as written when tstat > 0)
#example2
tstat2 <- coef(summary(model))[1,1]/coef(summary(model))[1,2]
tstat2
2*pt(tstat2, 21, lower.tail=FALSE)
summary(model)
#F-test
summary(model)
#Coefficient Confidence Intervals
confint(model, level=.95)
#testing the various assumptions of the model
#1 checking whether the residuals are normally distributed or not
#histogram
resid<- model$residuals
hist(resid)
#quantile plot
qqnorm(resid)
qqline(resid)
#2 checking the homoscedasticity
plot(model$residuals ~ disp)
abline(0,0)
#residual analysis
plot(model)
#transformations
model1 = lm(mpg ~cyl+log(disp)+log(hp)+drat+wt+qsec+vs+am+gear+carb)
summary(model1)
plot(model1)
#reducing the model
#calling of library
library(MASS)
#running the AIC on the initial model
stepAIC(model)
#running the AIC on the transformed model
stepAIC(model1)
#constructing new models with reduced variables
model2<-lm(mpg~qsec+wt+am)
summary(model2)
model6<-lm(mpg~log(disp)+gear+carb)
summary(model6)
#partial F-test
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model,nestmodel)
#Multicollinearity
plot(mtcars)
#checking the correlation
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
#variance inflation factor
install.packages("car")   #run once if the car package is not installed
library(car)
vif(model2)
#Polynomial Model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
quadmod = lm(mpg ~ qsec + I(qsec^2)+ wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
summary(quadmod)
AIC(quadmod)
#Interaction Model
model3<-lm(mpg~qsec+wt*am)
summary(model3)
AIC(model3)
resid3<- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0,0)
plot(model3)
#using the model for prediction
newdata <- data.frame(wt=2.92, qsec=20.1, am=1)
predy <- predict(model3, newdata, interval="prediction")   #prediction interval for a new observation
predy
confy <- predict(model3, newdata, interval="confidence")   #confidence interval for the mean response
confy
confy %*% c(0, -1, 1)   #width of the confidence interval
predy %*% c(0, -1, 1)   #width of the (wider) prediction interval
confy[1] == predy[1]    #both intervals are centred on the same fitted value
#sample prediction: mtcars row 20 is the Toyota Corolla (mpg=33.9, wt=1.835, qsec=19.9, am=1)
mtcars[20, ]
pred <- coef(summary(model3))[1,1] + coef(summary(model3))[2,1]*19.9 +
  coef(summary(model3))[3,1]*1.835 + coef(summary(model3))[4,1]*1 +
  coef(summary(model3))[5,1]*1.835*1
pred
33.9 - 31.0523   #observed minus predicted mpg for this car
THANK YOU…..