Hadley Wickham
Stat405
Intro to modelling
Tuesday, 16 November 2010
1. What is a linear model?
2. Removing trends
3. Transformations
4. Categorical data
5. Visualising models
What is a linear model?
[Figure sequence; labels: observed value, predicted value, residual]
y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat
z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat
z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
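As a rough sketch of how these formulas map to fitted coefficients, the example below simulates data with known b's and checks that lm() recovers them. The data, the true values (1, 2, 3, 4) and the names mod_add and mod_int are made up for illustration; only base R is used.

```r
set.seed(405)
n <- 1000
x <- runif(n)
y <- runif(n)
# Hypothetical true model: z = 1 + 2x + 3y + 4(x * y) + noise
z <- 1 + 2 * x + 3 * y + 4 * x * y + rnorm(n, sd = 0.1)

# z ~ x + y fits an intercept plus one slope per variable
mod_add <- lm(z ~ x + y)
names(coef(mod_add))   # "(Intercept)" "x" "y"

# z ~ x * y expands to x + y + x:y, adding the interaction term
mod_int <- lm(z ~ x * y)
names(coef(mod_int))   # "(Intercept)" "x" "y" "x:y"
coef(mod_int)          # should be close to the simulated 1, 2, 3, 4

# "Distance" here is the sum of squared residuals, which lm() minimises
sum(resid(mod_add)^2)
sum(resid(mod_int)^2)
```

The model with the interaction can never have a larger residual sum of squares than the additive one, because the additive model is a special case of it (b3 = 0).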
Assumptions
X is measured without error.
Relationship is linear.
Errors are independent.
Errors have normal distribution.
Errors have constant variance.
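One informal way to probe some of these assumptions is to look at the residuals. The sketch below uses simulated data that deliberately violates constant variance (the error spread grows with x), so the check has something to find. The data and variable names are hypothetical, and this is only one of many possible diagnostics.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
# Hypothetical data: the error standard deviation grows with x
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

mod <- lm(y ~ x)
r <- resid(mod)

# Constant variance: residual spread should be similar for small and large x
sd_low  <- sd(r[x <  median(x)])
sd_high <- sd(r[x >= median(x)])
c(low = sd_low, high = sd_high)

# Normality: standardised residuals should follow the line in a normal
# quantile plot
qqnorm(rstandard(mod))
qqline(rstandard(mod))
```

Here the spread for large x comes out clearly bigger than for small x, which is the pattern a fan-shaped residual plot would show.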
Removing trends
library(ggplot2)
# Dimensions of 0 mm, or more than 30 mm, are implausible: treat as missing
diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$y[diamonds$y > 30] <- NA
diamonds$z[diamonds$z == 0] <- NA
diamonds$z[diamonds$z > 30] <- NA
# Focus on diamonds under 2 carats
diamonds <- subset(diamonds, carat < 2)
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)
mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)
coef(mody)
# yhat = 0.05 + 0.99⋅x
# Plot x vs yhat
qplot(x, predict(mody), data = diamonds)
# Plot x vs (y - yhat) = residual
qplot(x, resid(mody), data = diamonds)
# Standardised residual:
qplot(x, rstandard(mody), data = diamonds)
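The na.exclude setting matters because the residuals are plotted against a column of the original data frame, so the two must have the same length. A minimal sketch on a made-up five-row data frame (d, mod_omit and mod_exclude are illustrative names) shows the difference from the default na.omit:

```r
# Hypothetical toy data with one missing x
d <- data.frame(x = c(1, 2, NA, 4, 5), y = c(1.1, 1.9, 3.2, 3.9, 5.1))

mod_omit    <- lm(y ~ x, data = d)                           # default: na.omit
mod_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

length(resid(mod_omit))      # 4: the incomplete row is dropped
length(resid(mod_exclude))   # 5: padded with NA to line up with the rows of d
resid(mod_exclude)[3]        # NA

# The fitted coefficients are the same; only the padding differs
all.equal(coef(mod_omit), coef(mod_exclude))
```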
qplot(x, resid(mody), data = diamonds)
# Slope is close to 1 and intercept close to 0, so y - x is nearly the residual
qplot(x, y - x, data = diamonds)
Your turn
Do the same thing for z and x. What
threshold might you use to remove
outlying values?
Are the errors from predicting z and y
from x related?
modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)
coef(modz)
# zhat = 0.03 + 0.61x
qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)
qplot(rstandard(mody), rstandard(modz))
Transformations
Can we use a linear model to remove this trend?
Linear models are linear in their parameters. The variables those
parameters multiply can be any transformation of the data.
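To illustrate "linear in the parameters", the sketch below fits a curved relationship with lm() by transforming the predictor. The square-root relationship, the true values 5 and 2, and the variable names are assumptions chosen for the example.

```r
set.seed(1)
x <- runif(200, 0, 100)
# Hypothetical curved relationship: y depends on the square root of x
y <- 5 + 2 * sqrt(x) + rnorm(200, sd = 0.5)

# Still a linear model: yhat = b0 + b1 * sqrt(x) is linear in b0 and b1
mod_sqrt <- lm(y ~ sqrt(x))
coef(mod_sqrt)

# A straight line in x leaves more variation unexplained
mod_line <- lm(y ~ x)
c(sqrt = summary(mod_sqrt)$r.squared, line = summary(mod_line)$r.squared)
```

The same idea applies to the response: log(y) ~ log(x) is also fitted by lm().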
Your turn
Use a linear model to remove the effect of
carat on price. Confirm that this worked
by plotting model residuals vs. color.
How can you interpret the model
coefficients and residuals?
modprice <- lm(log(price) ~ log(carat),
  data = diamonds, na.action = na.exclude)
diamonds$relprice <- exp(resid(modprice))
qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)
qplot(carat, relprice, data = diamonds) +
  facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds,
  colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds,
  colour = cut, geom = "freqpoly", binwidth = 0.2)
Multiplicative model
log(Y) = a * log(X) + b
Y = c * X^a, where c = exp(b)
An additive model on the log scale becomes a
multiplicative model on the original scale.
Intercept becomes the starting point c (the predicted Y at X = 1),
slope becomes the power a: doubling X multiplies Y by 2^a.
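For the log-log model log(Y) = a * log(X) + b, exponentiating both sides gives Y = exp(b) * X^a. The sketch below checks this on simulated data. The power law with constant 4000 and exponent 1.7 is invented for the example and is not an estimate from the diamonds data.

```r
set.seed(2)
carat <- runif(500, 0.2, 2)
# Hypothetical power law with multiplicative noise
price <- 4000 * carat^1.7 * exp(rnorm(500, sd = 0.1))

mod <- lm(log(price) ~ log(carat))
b <- coef(mod)[["(Intercept)"]]
a <- coef(mod)[["log(carat)"]]

exp(b)  # multiplicative constant: predicted price at 1 carat (about 4000 here)
a       # exponent (about 1.7 here)

# Doubling carat multiplies the predicted price by 2^a at any starting size
p <- exp(predict(mod, data.frame(carat = c(0.5, 1, 2))))
p[2] / p[1]
p[3] / p[2]
2^a
```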
Residuals
resid(mod) = log(Y) - log(Yhat)
exp(resid(mod)) = Y / (Yhat)
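The identity exp(resid(mod)) = Y / Yhat can be checked directly. This is a small sketch on simulated data; x, y and the model are hypothetical stand-ins for carat, price and modprice.

```r
set.seed(3)
x <- runif(100, 1, 10)
y <- 3 * x^2 * exp(rnorm(100, sd = 0.2))   # hypothetical multiplicative data

mod <- lm(log(y) ~ log(x))
yhat <- exp(fitted(mod))   # predictions back on the original scale

# exp(residual) is the ratio of observed to predicted
rel <- exp(resid(mod))
all.equal(rel, y / yhat)

# A value of 1.2 means 20% above the prediction, 0.8 means 20% below
summary(rel)
```

This is why the residuals are stored as relprice above: they read as a price relative to what the model expects for that carat.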
# Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1, slope = 1)
# Relative error of the approximation
qplot(x, (1 + x) / exp(x) - 1) +
  scale_y_continuous("Percent error", labels = scales::percent)
# Not so useful here because the x is also
# transformed
coef(modprice)
Categorical data
Your turn
Compare the results of the following two
functions. What can you say about the
model?
library(plyr)
ddply(diamonds, "color", summarise,
  mean = mean(price))
coef(lm(price ~ color, data = diamonds))
Categorical data
Converted into a numeric matrix, with one
column for each level. A column contains 1 if that
observation has that level, 0 otherwise.
However, if we do that naively, we end
up with one column too many, because the
intercept already adds a column and the level columns sum to it.
So the column for the first level is dropped, and
everything is relative to the first level.
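The sketch below shows this encoding on a tiny made-up data frame, using an unordered factor. That choice is deliberate: for ordered factors (which cut and color are in recent versions of ggplot2's diamonds) R uses polynomial contrasts by default, so the coefficients are labelled .L, .Q and so on, and are not differences from the first level.

```r
# Hypothetical data: an unordered factor with three levels
d <- data.frame(
  color = factor(rep(c("D", "E", "F"), each = 4)),
  price = c(10, 12, 11, 11,  15, 14, 16, 15,  20, 19, 21, 20)
)

# Design matrix: an intercept column plus indicator columns for every
# level except the first
X <- model.matrix(price ~ color, data = d)
colnames(X)   # "(Intercept)" "colorE" "colorF"

group_means <- tapply(d$price, d$color, mean)
coefs <- coef(lm(price ~ color, data = d))

group_means   # D = 11, E = 15, F = 20
coefs         # 11, 4, 9: mean of D, then differences of E and F from D
```

Adding the intercept to each of the other coefficients recovers the group means.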
Visualising models
# What do you think this model does?
lm(log(price) ~ log(carat) + color,
  data = diamonds)
# What about this one?
lm(log(price) ~ log(carat) * color,
  data = diamonds)
# Or this one?
lm(log(price) ~ cut * color,
  data = diamonds)
# How can we interpret the results?
mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)
# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy
grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE)
str(grid)
grid
grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))
Your turn
Plot the predictions from the two
models. How are they different?
qplot(carat, p1, data = grid, colour = cut,
  geom = "line")
qplot(carat, p2, data = grid, colour = cut,
  geom = "line")
qplot(log(carat), log(p1), data = grid,
  colour = cut, geom = "line")
qplot(log(carat), log(p2), data = grid,
  colour = cut, geom = "line")
qplot(carat, p1 / p2, data = grid, colour = cut,
  geom = "line")
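The difference between + and * can also be checked numerically rather than by eye. The sketch below uses simulated data in place of diamonds: the three cut levels, their slopes and the data frame d are all hypothetical. It compares the ratio of predicted prices between two cuts across the grid.

```r
set.seed(5)
n <- 600
d <- data.frame(
  carat = runif(n, 0.2, 2),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE))
)
# Hypothetical truth: same intercept, but each cut has its own slope
# on the log scale
slope <- c(Fair = 1.5, Good = 1.7, Ideal = 1.9)[as.character(d$cut)]
d$price <- exp(8 + slope * log(d$carat) + rnorm(n, sd = 0.1))

mod1 <- lm(log(price) ~ log(carat) + cut, data = d)  # parallel lines on the log scale
mod2 <- lm(log(price) ~ log(carat) * cut, data = d)  # one slope per cut

grid <- expand.grid(carat = c(0.5, 1, 2), cut = levels(d$cut),
                    KEEP.OUT.ATTRS = FALSE)
grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))

# Ratio of Ideal to Fair predictions at each carat value
ratio <- function(p) p[grid$cut == "Ideal"] / p[grid$cut == "Fair"]
r1 <- ratio(grid$p1)  # constant: the additive model forces a fixed ratio
r2 <- ratio(grid$p2)  # varies with carat: the interaction lets the ratio change
r1
r2
```

On the log scale this is the difference between parallel lines (mod1) and lines with different slopes (mod2).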
# Another approach is the effects package
# install.packages("effects")
library(effects)
effect("cut", mod1)
cut <- as.data.frame(effect("cut", mod1))
qplot(fit, reorder(cut, fit), data = cut)
qplot(fit, reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = lower, xmax = upper),
    height = 0.1)
qplot(exp(fit), reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = exp(lower),
    xmax = exp(upper)), height = 0.1)