Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE/Data Science Mentor at Springboard
Data Science Serbia
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science Serbia
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
Multiple Linear Regression in R
• Dummy coding of categorical predictors
• Multiple regression
• Nested models and Partial F-test
• Partial and Part Correlation
• Multicollinearity
• {lattice} plots
• Prediction, Confidence Intervals, Residuals
• Influential Cases and the Influence Plot
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
#### load packages
library(datasets)
library(broom)
library(ggplot2)
library(lattice)
#### load data
data(iris)
str(iris)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  geom_smooth(aes(colour = "black"),
              method = 'lm') +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# And now for something completely different (but in R)...
#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           colour = Species)) +
  geom_smooth(method = "lm") +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length", ylab = "Petal Length"
       )
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Petal Length", ylab = "Density",
            main = "P(Petal Length|Species)",
            col.line = 'red'
            )
densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Sepal Length", ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = 'blue'
            )
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w1,])
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w2,])
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w3,])
tidy(reg)
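The three sub-group fits above can also be written as a single pass over the species. This is a minimal base-R sketch, not part of the original session script: `d`, `fits` and `coefTable` are illustrative names, and `datasets::iris` is used so that the result does not depend on any earlier change to `iris`.

```r
# Fit Petal.Length ~ Sepal.Length separately within each species (base R only).
d <- datasets::iris                      # pristine copy of the data
fits <- lapply(split(d, d$Species),      # one data frame per species
               function(g) lm(Petal.Length ~ Sepal.Length, data = g))
# Collect intercepts and slopes in one matrix: rows = species
coefTable <- t(sapply(fits, coef))
coefTable
```

Each row of `coefTable` should match the corresponding `tidy(reg)` estimates from the step-by-step version.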
Multiple Regression in R
• Simple linear regressions in sub-groups
#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# it tells us about the effect of moving from the baseline (reference) level
# to the level that the coefficient is named after!
# Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species,
                       levels = c("versicolor",
                                  "virginica",
                                  "setosa"))
levels(iris$Species)
# baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg) # The regression coefficients (!): figure out what has happened!
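One way to check the interpretation of dummy coefficients is to compare them with the group means directly. The sketch below assumes the default level order (setosa first) and uses illustrative names (`d`, `groupMeans`, `b`) that do not appear in the original script.

```r
# With treatment (dummy) coding, the intercept is the baseline group mean and
# each remaining coefficient is (group mean - baseline mean).
d <- datasets::iris                                   # default level order: setosa first
groupMeans <- tapply(d$Petal.Length, d$Species, mean) # mean Petal.Length per species
b <- coef(lm(Petal.Length ~ Species, data = d))
b
groupMeans - groupMeans["setosa"]   # differences from the baseline (setosa)
```

Reordering the factor levels changes which mean plays the role of the intercept, which is why the coefficients change while the fitted values do not.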
Multiple Regression in R
• Dummy coding of categorical predictors
### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species,
                       levels = c("virginica", "versicolor", "setosa"))
levels(iris$Species)
contrasts(iris$Species) = contr.treatment(3, base = 1)
# baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do
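To see the actual 0/1 columns that enter the regression, the design matrix can be inspected with `model.matrix()`. This is a small sketch using base R only; `relevel()` is shown as one more way to change the baseline and is not used in the original script, and `d`, `X`, `X2` are illustrative names.

```r
# Inspect the 0/1 dummy columns that lm() builds from a factor.
d <- datasets::iris
X <- model.matrix(~ Species, data = d)    # default baseline: setosa
unique(X)                                 # one distinct row per species
# relevel() moves one level to the front, making it the new baseline
d$Species <- relevel(d$Species, ref = "virginica")
X2 <- model.matrix(~ Species, data = d)   # baseline is now virginica
colnames(X2)
```

The baseline level never gets its own column: it is the row of `unique(X)` in which every dummy is 0.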
Multiple Regression in R
• Dummy coding of categorical predictors
### Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# BTW: since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
Multiple Regression in R
• Multiple regression with dummy-coded categorical predictors
### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris) # reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression
# NOTE: Nested models
# There is a set of coefficient values for the full model (reg2) such that it
# reduces to the nested model (reg1); in our case it is simple
# HOME: - figure it out.
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
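The partial F statistic reported by `anova()` can be reproduced from the residual sums of squares of the two models. The following is a sketch under the same model definitions as above; `rss1`, `rss2`, `df1`, `df2`, `Fstat` and `pval` are illustrative names, and `datasets::iris` is used to keep the block self-contained.

```r
# Partial F-test by hand, compared with anova().
d <- datasets::iris
reg1 <- lm(Petal.Length ~ Sepal.Length, data = d)            # nested model
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = d)  # full model
rss1 <- deviance(reg1); rss2 <- deviance(reg2)   # residual sums of squares
df1 <- df.residual(reg1); df2 <- df.residual(reg2)
# F = [(RSS_nested - RSS_full) / (extra parameters)] / [RSS_full / residual df of full model]
Fstat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
pval <- pf(Fstat, df1 - df2, df2, lower.tail = FALSE)
c(F = Fstat, p = pval)
anova(reg1, reg2)   # the F and Pr(>F) columns should agree with the values above
```

The numerator degrees of freedom equal the number of extra coefficients in the full model, here the two Species dummies.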
Multiple Regression in R
• Comparison of nested models
#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description
# URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
stacklossModel = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                    data = stackloss)
# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}
# predict new data
obs = data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(stacklossModel, obs)
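`predict()` can also return interval estimates around the point prediction. The sketch below refits the same model on `datasets::stackloss` so that it runs on its own; `m`, `ci` and `pint` are illustrative names.

```r
# Point prediction with a confidence interval (for the mean response)
# and a prediction interval (for a single new observation).
m <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = datasets::stackloss)
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
ci   <- predict(m, obs, interval = "confidence", level = 0.95)
pint <- predict(m, obs, interval = "prediction", level = 0.95)
ci
pint   # same fitted value, wider limits: the residual variance is added
```

The confidence interval describes uncertainty about the expected stack loss at these settings, while the prediction interval describes where one new measurement is likely to fall.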
Multiple Regression in R
• By the book: two or three continuous predictors…
# confidence intervals
confint(stacklossModel, level = .95) # 95% CI
confint(stacklossModel, level = .99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level = .95)
# default regression plots in R
plot(stacklossModel)
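The session outline lists residuals and influential cases, and the default diagnostic plots include a Residuals vs Leverage panel. As a hedged complement, the quantities behind that panel can be extracted numerically with base R functions; `m` and `infl` are illustrative names and the choice to show the top three cases is arbitrary.

```r
# Numeric counterparts of the Residuals vs Leverage diagnostic plot.
m <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = datasets::stackloss)
infl <- data.frame(
  stdRes   = rstandard(m),       # standardized residuals
  leverage = hatvalues(m),       # diagonal of the hat matrix
  cooksD   = cooks.distance(m)   # Cook's distance
)
# Show the observations with the largest Cook's distance first
head(infl[order(-infl$cooksD), ], 3)
```

The leverages always sum to the number of estimated coefficients (four here), which gives a quick sanity check on the output.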
Multiple Regression in R
• By the book: two or three continuous predictors…
# multicollinearity
library(car) # John Fox's car package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF)
# The increase in the ***variance*** of a regression coeff. due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg. coeff. is vs. what it would be
# if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.
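The VIF can be computed without {car} by regressing each predictor on the remaining ones. This sketch uses base R only; `X` and `vifByHand` are illustrative names, and for a model with only numeric predictors the values should match what `vif()` reports.

```r
# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
X <- datasets::stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")]
vifByHand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vifByHand
# Equivalent shortcut: the diagonal of the inverse correlation matrix
diag(solve(cor(X)))
```

Tolerance is then simply `1 - R^2_j`, which is why it carries the same information as the VIF.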
Multiple Regression in R
• Assumptions: multicollinearity
#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R
#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = num. of variables conditioned
# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                        dataSet$Sepal.Width,
                        method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic
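A partial correlation can be reproduced without {ppcor}: residualize both variables on the control variable and correlate the residuals. The sketch below uses the same three iris variables as the `pcor.test()` call; `x`, `y`, `z`, `rResid` and `rFormula` are illustrative names.

```r
# Partial correlation of x and y controlling for z, computed two ways.
d <- datasets::iris
x <- d$Sepal.Length; y <- d$Petal.Length; z <- d$Sepal.Width
# (1) correlate the residuals after regressing each variable on z
rResid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
# (2) closed-form expression from the three pairwise correlations
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
rFormula <- (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
c(residuals = rResid, formula = rFormula)
```

Both numbers should agree with `partialCor$estimate` when {ppcor} is available.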
Multiple Regression in R
• Partial Correlation in R
#### semi-partial correlation in R
# NOTE: ... Semi-partial correlation is the correlation of two variables
# with variation from a third or more other variables removed only
# from the ***second variable***
# NOTE: The first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
# Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
# NOTE: this is a correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width
# is removed!
partCor$estimate
partCor$p.value
Multiple Regression in R
• Part (semi-partial) Correlation in R
# NOTE: In multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion
# You need a semi-partial of X1 and Y following the removal of X2 and X3 from Y
# It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
# in Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate)
# RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for.
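The three steps above can be sketched on the stackloss data. The mapping of variables to roles (Y = stack.loss, X1 = Air.Flow, X2 = Water.Temp, X3 = Acid.Conc.) is an illustrative assumption, as are all object names. For comparison, the block also computes the variant found in many textbooks, where X2 and X3 are removed from the predictor X1 rather than from Y; the square of that variant equals the gain in R-squared from adding X1.

```r
# Steps 1-3 as described, with Y = stack.loss, X1 = Air.Flow,
# X2 = Water.Temp, X3 = Acid.Conc. (variable roles are an illustrative choice).
d <- datasets::stackloss
# Step 1: regress Y on X2 and X3
step1 <- lm(stack.loss ~ Water.Temp + Acid.Conc., data = d)
# Step 2: keep the residuals of Y
RY <- resid(step1)
# Step 3: correlate the residuals with X1
partY <- cor(RY, d$Air.Flow)
partY

# Variant: residualize X1 instead of Y and correlate with the raw Y.
RX1 <- resid(lm(Air.Flow ~ Water.Temp + Acid.Conc., data = d))
partX <- cor(d$stack.loss, RX1)
# Its square equals the gain in R^2 from adding X1 to the model
r2Full    <- summary(lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = d))$r.squared
r2Reduced <- summary(step1)$r.squared
c(partX^2, r2Full - r2Reduced)
```

The two variants generally give different values, so it is worth stating which variable was residualized whenever a part correlation is reported.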
Multiple Regression in R
• NOTE on semi-partial (part) correlation in multiple regression…