R Bootcamp Day 3 Part 1 - Statistics in R

R Bootcamp Day 3 Part 1
Jefferson Davis
Olga Scrivner

Day 2 stuff
From yesterday and the day before
• R values have types/classes such as numeric, character,
logical, data frame, and matrix.
• Much of R functionality is in libraries
• For help on a function, run
?t.test
from the R console.
• The plot() function will usually do something useful.

R: Common stats functions
Common statistical tests are very straightforward in R. Let's try
one on yesterday's cars dataset, which records car speeds and
stopping distances from the 1920s.
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

Here's a t-test of the null hypothesis that the mean speed in cars is 12, against the alternative that it is not.
t.test(cars$speed, mu=12)
One Sample t-test
data: cars$speed
t = 4.5468, df = 49, p-value = 3.588e-05
alternative hypothesis: true mean is not equal
to 12
95 percent confidence interval:
13.89727 16.90273
sample estimates:
mean of x
15.4
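
The printed report is only one view of the result. As a small sketch, t.test() returns a list-like object of class "htest" whose pieces can be pulled out by name. The variable name res is ours, not from the slides.

```r
# One-sample t-test of H0: mean speed = 12, stored instead of printed
res <- t.test(cars$speed, mu = 12)

class(res)      # "htest"
res$statistic   # the t statistic
res$parameter   # the degrees of freedom
res$p.value     # the two-sided p-value
res$conf.int    # the 95 percent confidence interval
res$estimate    # the sample mean
```

Storing the result this way is handy when the p-value or interval feeds into later code rather than being read off the console.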

We can change the arguments of t.test(). Here we test against the
one-sided alternative that the mean is less than 12, at the 99% level.
t.test(cars$speed, mu=12, alternative="less",
conf.level=.99)
One Sample t-test
data: cars$speed
t = 4.5468, df = 49, p-value = 1
alternative hypothesis: true mean is less than
12
99 percent confidence interval:
-Inf 17.19834
sample estimates:
mean of x
15.4
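
A p-value of 1 can look odd at first. The sketch below runs both one-sided tests on the same data to show why: the sample mean (15.4) sits above 12, so the "less" alternative gets essentially no support while the "greater" alternative gets nearly all of it. The names p_less and p_greater are ours.

```r
# Run both one-sided tests against mu = 12 and keep only the p-values
p_less    <- t.test(cars$speed, mu = 12, alternative = "less")$p.value
p_greater <- t.test(cars$speed, mu = 12, alternative = "greater")$p.value

p_less               # very close to 1: no evidence the mean is below 12
p_greater            # very small: strong evidence the mean is above 12
p_less + p_greater   # the two one-sided p-values sum to 1
```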

Anything you would see in a year-long stats sequence will have
an implementation in R.
chisq.test() #Chi-squared
prop.test() #Proportions test
binom.test() #Exact binomial test
ks.test() #Kolmogorov–Smirnov
sd() #Standard deviation
cor() #Correlation
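
As a quick illustration, here are a few of these functions in use. The first two lines use the cars data from above. The counts passed to binom.test() and chisq.test() are invented for the example and do not come from any real study.

```r
sd(cars$speed)               # standard deviation of the speeds
cor(cars$speed, cars$dist)   # Pearson correlation, about 0.81

# Exact binomial test: are 7 successes in 20 trials consistent with p = 0.5?
# (7 and 20 are made-up numbers.)
bt <- binom.test(x = 7, n = 20, p = 0.5)
bt$p.value

# Chi-squared test of independence on a made-up 2x2 table of counts
tab <- matrix(c(12, 8, 5, 15), nrow = 2)
ct <- chisq.test(tab)
ct$p.value
```

Like t.test(), these tests return "htest" objects, so the same fields (p.value, statistic, and so on) are available.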

R: Linear regression
Regression analysis is one of the most popular and important
tools in statistics. If R goofed here, it would be worthless.
R uses the function lm() for linear models. The regression
formula is given in Wilkinson-Rogers notation.
Predictor terms       Wilkinson notation
Intercept             1 (default)
No intercept          -1
x1                    x1
x1, x2                x1 + x2
x1, x2, x1x2          x1*x2 (or x1 + x2 + x1:x2)
x1x2                  x1:x2
x1², x1               x1 + I(x1^2) (in R, a bare x1^2 reduces to x1)
x1 + x2               I(x1 + x2) (the letter I)

Regression analysis is one of the most important tools in
statistics. R uses Wilkinson-Rogers notation to specify linear
models. So a model such as
yi = β0 + β1 xi1 + εi
shows up in the R syntax as
y ~ x1
Let's review this syntax.
(Tables from https://www.mathworks.com/help/stats/wilkinson-notation.html)


Model                                                   Wilkinson notation
yi = β0 + β1 xi1 + β2 xi2 + εi                          y ~ x1 + x2
  (two predictors)
yi = β1 xi1 + β2 xi2 + εi                               y ~ x1 + x2 - 1
  (two predictors and no intercept)
yi = β0 + β1 xi1 + β2 xi2 + β3 xi1 xi2 + εi             y ~ x1 * x2
  (two predictors with the interaction term)            y ~ x1 + x2 + x1:x2
yi = β0 + β1 (xi1 + xi2) + εi                           y ~ I(x1 + x2)
  (regressing on the sum of predictors)
yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi1 xi2 + εi    y ~ x1 * x2 + x3
  (three predictors with one interaction)
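
One way to check how R reads a formula is to look at the columns of the design matrix it builds. The sketch below does this with model.matrix() on a toy data frame. The data frame d and its numbers are made up purely for illustration.

```r
# Toy data; x1, x2 and y are invented values
d <- data.frame(x1 = c(1, 2, 3, 4, 5),
                x2 = c(2, 1, 4, 3, 6),
                y  = c(1.1, 2.3, 2.9, 4.2, 5.1))

# The column names of the design matrix are the terms R will fit
colnames(model.matrix(y ~ x1 + x2, d))        # "(Intercept)" "x1" "x2"
colnames(model.matrix(y ~ x1 + x2 - 1, d))    # "x1" "x2"
colnames(model.matrix(y ~ x1 * x2, d))        # "(Intercept)" "x1" "x2" "x1:x2"
colnames(model.matrix(y ~ I(x1 + x2), d))     # "(Intercept)" "I(x1 + x2)"
colnames(model.matrix(y ~ x1 + I(x1^2), d))   # "(Intercept)" "x1" "I(x1^2)"
```

Each column corresponds to one β in the model equation, so this is a cheap sanity check before fitting.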

Model terms                                          Wilkinson notation
yi = β1 xi1 + β2 xi2 + β3 xi1 xi2 + εi               y ~ x1*x2 - 1
  (two predictors with interaction, no intercept)
yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 +                 y ~ x1 * x2 * x3
     β4 xi1 xi2 + β5 xi1 xi3 + β6 xi2 xi3 +
     β7 xi1 xi2 xi3 + εi
  (three predictors, all interaction terms)
yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 +                 y ~ x1 * x2 * x3 - x1:x2:x3
     β4 xi1 xi2 + β5 xi1 xi3 + β6 xi2 xi3 + εi
  (three predictors, all two-way interaction terms)
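
To see which terms a three-predictor formula expands into without fitting anything, we can ask terms() for its term labels. This is a small sketch. The names f_all, f_two, and f_pow are ours, and the (x1 + x2 + x3)^2 shorthand is an alternative spelling not shown in the tables.

```r
f_all <- y ~ x1 * x2 * x3
f_two <- y ~ x1 * x2 * x3 - x1:x2:x3

attr(terms(f_all), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3" "x1:x2:x3"

attr(terms(f_two), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"

# Crossing to degree 2 is another way to request all two-way interactions
f_pow <- y ~ (x1 + x2 + x3)^2
identical(attr(terms(f_two), "term.labels"),
          attr(terms(f_pow), "term.labels"))   # TRUE
```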

• R uses the function lm() for linear models.
• Generic syntax
lm(DV ~ IV1, NAME_OF_DATAFRAME)
• The above tells R to regress the dependent variable (DV)
on the independent variable IV1. We can include other
variables and interaction effects.
lm(DV ~ IV1 + IV2 + IV1:IV2,
NAME_OF_DATAFRAME)
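
Here is a concrete instance of the generic syntax. It assumes the built-in mtcars data set, which is not used elsewhere in these slides, with mpg as the DV and wt and hp as the IVs. It also checks that the * shorthand and the spelled-out form give the same fit.

```r
# Two predictors plus their interaction, written two equivalent ways
fit_star  <- lm(mpg ~ wt * hp, mtcars)
fit_colon <- lm(mpg ~ wt + hp + wt:hp, mtcars)

names(coef(fit_star))    # "(Intercept)" "wt" "hp" "wt:hp"
all.equal(coef(fit_star), coef(fit_colon))   # TRUE
```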

• Let's do an example using the cars data set. How about
regressing stopping distance on speed?
lm(dist ~ speed, cars)
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept)        speed
    -17.579        3.932
• To work with the fit further, let's store it in a variable.
car.fit <- lm(dist ~ speed, cars)

summary(car.fit)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
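
The summary is itself an object, so its pieces can be used in later code. The sketch below pulls out a few of them. The variable name s is ours.

```r
car.fit <- lm(dist ~ speed, cars)
s <- summary(car.fit)

s$coefficients     # the coefficient table as a numeric matrix
s$r.squared        # proportion of variance explained, about 0.65
s$sigma            # residual standard error
confint(car.fit)   # 95 percent confidence intervals for the coefficients
```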

• We can also look at individual fields of the lm object.
car.fit$coefficients
(Intercept) speed
-17.579095 3.932409
car.fit$residuals[1:3]
1 2 3
3.849460 11.849460 -5.947766
car.fit$fitted.values[1:3]
1 2 3
-1.849460 -1.849460 9.947766
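
The same information is available through accessor functions, which also work for other model classes. As a sketch, the code below uses them, checks that fitted values plus residuals recover the observed distances, and predicts stopping distance at two new speeds. The speeds 10 and 20 are chosen arbitrarily.

```r
car.fit <- lm(dist ~ speed, cars)

coef(car.fit)              # same as car.fit$coefficients
head(resid(car.fit), 3)    # same as car.fit$residuals[1:3]
head(fitted(car.fit), 3)   # same as car.fit$fitted.values[1:3]

# fitted + residual gives back the observed response
all.equal(unname(fitted(car.fit) + resid(car.fit)), cars$dist)   # TRUE

# Predicted stopping distance at new speeds: about 21.7 and 61.1
predict(car.fit, newdata = data.frame(speed = c(10, 20)))
```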

• Plot the fit
plot(cars$speed,
cars$dist,
xlab = "speed",
ylab = "distance")
abline(car.fit,
col="red")

• Objects of class lm have their
own plot() method
plot(car.fit)
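
By default plot(car.fit) steps through several diagnostic plots one at a time. A common arrangement, sketched here, is to put them on a single 2-by-2 page. The variable name old_par is ours.

```r
car.fit <- lm(dist ~ speed, cars)

old_par <- par(mfrow = c(2, 2))   # split the device into a 2 x 2 grid
plot(car.fit)                     # draws the default diagnostic plots
par(old_par)                      # restore the previous layout
```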

R: Mixed models
• Let's take a look at a mixed model. We need a more complex
dataset. We use a subset of the Orthodont data set from the
nlme (Linear and Nonlinear Mixed Effects Models) package.
library(nlme)
head(Orthodont)
Grouped Data: distance ~ age | Subject
  distance age Subject  Sex
1     26.0   8     M01 Male
2     25.0  10     M01 Male
3     29.0  12     M01 Male
4     31.0  14     M01 Male

OrthoFem <- Orthodont[Orthodont$Sex == "Female", ]
plot(OrthoFem)

It doesn't seem crazy to fit a common slope for age but use a
random effect for each subject's intercept.
fmOrthF <-
lme(distance ~ age,
data = OrthoFem,
random = ~ 1 | Subject)

In fact, it isn't crazy.
summary(fmOrthF)
Linear mixed-effects model fit by REML
 Data: OrthoFem
       AIC     BIC    logLik
  149.2183 156.169 -70.60916
Random effects:
 Formula: ~1 | Subject
        (Intercept)  Residual
StdDev:     2.06847 0.7800331
Fixed effects: distance ~ age
                Value Std.Error DF   t-value p-value
(Intercept) 17.372727 0.8587419 32 20.230440       0
age          0.479545 0.0525898 32  9.118598       0
 Correlation:
    (Intr)
age -0.674
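
To look at the pieces of the mixed model separately, nlme provides extractor functions. The sketch below refits the model so that it runs on its own, then pulls out the fixed effects, the per-subject random intercepts, and the per-subject coefficients.

```r
library(nlme)

# Female subset of Orthodont and the random-intercept model from above
OrthoFem <- Orthodont[Orthodont$Sex == "Female", ]
fmOrthF <- lme(distance ~ age, data = OrthoFem, random = ~ 1 | Subject)

fixef(fmOrthF)        # population-level intercept and age slope
ranef(fmOrthF)        # each subject's deviation from the population intercept
head(coef(fmOrthF))   # per-subject intercepts, with the shared age slope
```

The age column of coef(fmOrthF) is the same for every subject, because only the intercept was given a random effect.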

R: Conditional trees
At this point, I tag Olga in.
