24 modelling

Hadley Wickham
Stat405: Intro to modelling
Tuesday, 16 November 2010

1. What is a linear model?
2. Removing trends
3. Transformations
4. Categorical data
5. Visualising models

What is a linear model?

[Figure: observed value, predicted value, and the residual between them]

y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat
z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat
z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
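The three formulas above can be tried on simulated data. This is a minimal sketch: the data frame `sim`, its coefficients, and the noise level are made up for illustration and do not come from the lecture.

```r
# Simulated data with known coefficients (illustrative values only)
set.seed(1)
n <- 200
sim <- data.frame(x = runif(n), y = runif(n))
sim$z <- 1 + 2 * sim$x + 3 * sim$y + 4 * sim$x * sim$y + rnorm(n, sd = 0.1)

# Main effects only: zhat = b2x + b1y + b0
coef(lm(z ~ x + y, data = sim))

# Main effects plus interaction: zhat = b3(x.y) + b2x + b1y + b0
coefs_star <- coef(lm(z ~ x * y, data = sim))
coefs_star

# x * y is shorthand for x + y + x:y, so this fit is identical
coefs_long <- coef(lm(z ~ x + y + x:y, data = sim))
```

The interaction coefficient is reported under the name `x:y`.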

Assumptions
X is measured without error.
Relationship is linear.
Errors are independent.
Errors have normal distribution.
Errors have constant variance.

Removing trends

library(ggplot2)
diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$y[diamonds$y > 30] <- NA
diamonds$z[diamonds$z == 0] <- NA
diamonds$z[diamonds$z > 30] <- NA
diamonds <- subset(diamonds, carat < 2)
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)

mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)
coef(mody)
# yhat = 0.05 + 0.99⋅x
# Plot x vs yhat
qplot(x, predict(mody), data = diamonds)
# Plot x vs (y - yhat) = residual
qplot(x, resid(mody), data = diamonds)
# Standardised residual:
qplot(x, rstandard(mody), data = diamonds)
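Why ask for `na.exclude`? A small sketch on a made-up data frame `d` (not the diamonds data) suggests the reason: with `na.exclude`, residuals and predictions are padded with `NA` so they line up with the rows of the original data, which is what lets them be plotted against `x` above. In the lecture code the argument is written `na = na.exclude`, which R partially matches to `na.action`.

```r
# Toy data with one missing x and one missing y (illustrative values)
d <- data.frame(x = c(1, 2, 3, NA, 5, 6),
                y = c(1.1, 1.9, 3.2, 4.0, NA, 6.1))

m_omit <- lm(y ~ x, data = d, na.action = na.omit)
m_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(resid(m_omit))  # incomplete rows are dropped
length(resid(m_excl))  # same length as nrow(d), NA for incomplete rows

# A residual is the observed value minus the predicted value
ok <- complete.cases(d)
manual <- d$y - predict(m_excl)
all.equal(unname(manual[ok]), unname(resid(m_excl)[ok]))
```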

# Compare the model residuals with the simple difference y - x
qplot(x, resid(mody), data = diamonds)
qplot(x, y - x, data = diamonds)

Your turn
Do the same thing for z and x. What
threshold might you use to remove
outlying values?
Are the errors from predicting z and y
from x related?

modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)
coef(modz)
# zhat = 0.03 + 0.61x
qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)
qplot(rstandard(mody), rstandard(modz))

Transformations

Can we use a
linear model to
remove this trend?
Linear models are linear in
their parameters; the variables can be
any transformation of the data.
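As a rough illustration of "linear in the parameters", the sketch below fits a curved relationship with `lm()` by transforming the predictor. The data and the square-root relationship are invented for this example.

```r
# A curved relationship: y depends on sqrt(x), not on x directly
set.seed(2)
x <- seq(1, 10, length.out = 100)
y <- 3 + 2 * sqrt(x) + rnorm(100, sd = 0.05)

# Still a linear model: yhat = b1 * sqrt(x) + b0 is linear in b0 and b1
curve_fit <- lm(y ~ sqrt(x))
coef(curve_fit)

# A straight line in untransformed x leaves more unexplained variation
line_fit <- lm(y ~ x)
c(curve = sd(resid(curve_fit)), line = sd(resid(line_fit)))
```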

Your turn
Use a linear model to remove the effect of
carat on price. Confirm that this worked
by plotting model residuals vs. color.
How can you interpret the model
coefficients and residuals?

modprice <- lm(log(price) ~ log(carat),
  data = diamonds, na.action = na.exclude)
diamonds$relprice <- exp(resid(modprice))
qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)
qplot(carat, relprice, data = diamonds) +
  facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds,
  colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds,
  colour = cut, geom = "freqpoly", binwidth = 0.2)

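Multiplicative model
log(Y) = a * log(X) + b
Y = c * X^a, where c = exp(b)
An additive model becomes a
multiplicative model.
Intercept becomes starting point (the value of Y at X = 1),
slope becomes the exponent: multiplying X by a fixed
factor multiplies Y by a fixed factor.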
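The back-transformation can be checked numerically. This sketch uses simulated data with an assumed power-law relationship (the constants 5 and 1.7 are arbitrary) and shows that a fit on the log-log scale recovers a multiplicative model on the original scale.

```r
# Simulated power law with multiplicative noise (illustrative constants)
set.seed(3)
X <- runif(200, 0.2, 2)
Y <- 5 * X^1.7 * exp(rnorm(200, sd = 0.05))

mod <- lm(log(Y) ~ log(X))
b <- coef(mod)[[1]]  # intercept on the log scale
a <- coef(mod)[[2]]  # slope on the log scale

# Back on the original scale: Y = exp(b) * X^a
c(multiplier = exp(b), exponent = a)

# Doubling X multiplies the prediction by 2^a
p <- exp(predict(mod, data.frame(X = c(0.5, 1))))
p[[2]] / p[[1]]
2^a
```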

Residuals
resid(mod) = log(Y) - log(Yhat)
exp(resid(mod)) = Y / (Yhat)
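A quick check of the identity above, on made-up data: exponentiating a residual from a model of log(Y) gives the ratio of the observed value to the predicted value.

```r
# Illustrative data; the names X, Y, and mod mirror the slide
set.seed(4)
X <- runif(50, 1, 10)
Y <- 2 * X^0.5 * exp(rnorm(50, sd = 0.1))

mod <- lm(log(Y) ~ log(X))
Yhat <- exp(fitted(mod))  # prediction on the original scale

ratio_from_resid <- unname(exp(resid(mod)))
ratio_direct <- unname(Y / Yhat)
all.equal(ratio_from_resid, ratio_direct)
```

A value of 1.2 therefore reads as "20% above what the model predicts", and 0.8 as "20% below".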

# Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1)
# Relative error of approximating exp(x) by x + 1
# (older ggplot2 versions used formatter = percent)
qplot(x, (exp(x) - (x + 1)) / exp(x)) +
  scale_y_continuous("Percent error", labels = scales::percent)
# Not so useful here because the x is also
# transformed
coef(modprice)

Categorical data

Your turn
Compare the results of the following two
function calls. What can you say about the
model?
library(plyr)
ddply(diamonds, "color", summarise,
  mean = mean(price))
coef(lm(price ~ color, data = diamonds))

Categorical data
Converted into a numeric matrix, with one
column for each level. Contains 1 if that
observation has that level, 0 otherwise.
However, if we just do that naively, we end
up with too many columns (because we
have one extra column for the intercept).
So the column for the first level is dropped
and everything is relative to the first level.
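The encoding described above can be inspected with `model.matrix()`. This is a small sketch on a made-up data frame with an unordered factor `g`; note that R uses a different default coding (polynomial contrasts) for ordered factors, so coefficient names will differ in that case.

```r
# Toy data: three levels, two observations each (illustrative values)
d <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  y = c(1, 3, 4, 6, 10, 12)
)

# Naive coding: one 0/1 column per level, no intercept
naive <- model.matrix(~ 0 + g, data = d)
naive

# Default coding: intercept plus columns for every level except the first
coded <- model.matrix(~ g, data = d)
coded

# Coefficients are the first level's mean and differences from it
group_means <- tapply(d$y, d$g, mean)
cf <- coef(lm(y ~ g, data = d))
cf
group_means - group_means[["a"]]
```

Here the intercept equals the mean of level `a`, and `gb` and `gc` are the differences between the means of `b` and `c` and the mean of `a`.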

Visualising models

# What do you think this model does?
lm(log(price) ~ log(carat) + color,
  data = diamonds)
# What about this one?
lm(log(price) ~ log(carat) * color,
  data = diamonds)
# Or this one?
lm(log(price) ~ cut * color,
  data = diamonds)
# How can we interpret the results?

mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)
# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy
grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE)
str(grid)
grid
grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))

Your turn
Plot the predictions from the two
models. How are they different?

qplot(carat, p1, data = grid, colour = cut,
  geom = "line")
qplot(carat, p2, data = grid, colour = cut,
  geom = "line")
qplot(log(carat), log(p1), data = grid,
  colour = cut, geom = "line")
qplot(log(carat), log(p2), data = grid,
  colour = cut, geom = "line")
qplot(carat, p1 / p2, data = grid, colour = cut,
  geom = "line")
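One way to see the difference between the `+` and `*` models without plotting is to compare predictions across groups. The sketch below uses simulated data, not the diamonds data: the data frame `sim`, the three cut levels, and their slopes are assumptions made for illustration.

```r
# Simulated prices where each cut has its own slope on the log scale
set.seed(5)
n <- 300
sim <- data.frame(
  carat = runif(n, 0.2, 2),
  cut = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE))
)
slope <- unname(c(Fair = 1.5, Good = 1.7, Ideal = 1.9)[as.character(sim$cut)])
sim$price <- exp(8 + slope * log(sim$carat) + rnorm(n, sd = 0.05))

add_mod <- lm(log(price) ~ log(carat) + cut, data = sim)  # common slope
int_mod <- lm(log(price) ~ log(carat) * cut, data = sim)  # slope per cut

grid <- expand.grid(carat = c(0.5, 1, 2), cut = levels(sim$cut),
                    KEEP.OUT.ATTRS = FALSE)
grid$p_add <- exp(predict(add_mod, grid))
grid$p_int <- exp(predict(int_mod, grid))

# Ratio of Ideal to Fair predictions at each carat value
ratio <- function(p) p[grid$cut == "Ideal"] / p[grid$cut == "Fair"]
r_add <- ratio(grid$p_add)  # constant: parallel lines on the log scale
r_int <- ratio(grid$p_int)  # varies with carat: lines are not parallel
rbind(additive = r_add, interaction = r_int)
```

In the additive model the groups differ by a fixed percentage at every carat value; the interaction model lets that percentage change with carat.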

# Another approach is the effects package
# install.packages("effects")
library(effects)
effect("cut", mod1)
cut <- as.data.frame(effect("cut", mod1))
qplot(fit, reorder(cut, fit), data = cut)
qplot(fit, reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = lower, xmax = upper),
    height = 0.1)
qplot(exp(fit), reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = exp(lower),
    xmax = exp(upper)), height = 0.1)
