Regression Modelling
Lecture 1
Lecturer (Me)
Contact details:
Dale Roberts
E: dale.roberts@anu.edu.au
T: +61 2 612 57336
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
STAT6014 - Additional material
Contact details:
Lucy Yunxi Hu
E: yunxi.hu@anu.edu.au
T: +61 2 612 50836
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
Communication
- Please consult with your allocated tutor for course content questions
- And/or use the discussion forum on Wattle
- Please contact the course convenor (me) for issues and concerns including grades, illness, falling behind, and academic accessibility issues
Lecture times
- Wednesday, 13:00 - 15:00 (2 hour lecture)
- Friday, 11:00 - 12:00 (1 hour lecture / workshop)
Tutorials
- Begin week 2; take time in Week 1 to visit the computer lab and check you can log on, etc.
- Tutorial sign-up: see instructions on Wattle and the course outline
- Read through the tutorial sheet, think about the questions, and attempt them before class
- Tutorials are the best opportunity to learn the skills and techniques required in the quizzes and exams
- Your tutors are your main source of help
Textbook
- The required textbook for this course is Linear Regression by Michael H. Kutner
- This is a custom-printed textbook available in print at the Harry Hartog bookstore
- An eBook is available from McGraw Hill; use the link and discount code on Wattle to buy it
- Multiple copies of this text are available in the Hancock library on 2-hour loan
- Linear Models with R by Julian J. Faraway is another good resource, available in the Hancock library on 2-day loan
Course website
- http://wattle.anu.edu.au
- Access for all enrolled students
- Course announcements
- Lecture resources
- Echo360 lecture recordings
- Data sets
- Tutorial questions, selected solutions
- Online quizzes
- Please check this site frequently!
Assessment
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         20%     Week 10
Final Examination    65%     Central Exam Period
Hints for success
- Attend lectures and tutorials; supplement the given materials with your own comments and notes
- Be prepared for classes (read the textbook, attempt tutorial questions)
- Do the tutorials: statistics is a discipline in which hands-on participation ⇒ learning
- Time spent trying questions is time well spent
R and RStudio
- We will be using the R software throughout the course
- Please see the course website for installation instructions for R and RStudio
- Please attempt Tutorial 0 - Intro to R before your first tutorial
Linear Regression
What is regression?
- Statistical methodology that uses the relation between two or more quantitative variables so that a response or outcome variable can be predicted from the other (or others)
- A core and important methodology in Statistics and Machine Learning
What is regression?
Examples:
- Predict sales of a product using the relationship between sales and the amount spent on advertising
- Predict the performance of an employee using the relationship between performance and aptitude test scores
Relations between variables
- We should distinguish between a functional relation and a statistical relation between variables
- A functional relation between two variables is expressed as a mathematical formula. If X is the independent variable and Y the dependent variable, a functional relation is
Y = f(X)
- A functional relation is a “perfect” mapping from X to Y
Relations between variables
[Figure: Dollar Sales (Y) plotted against Units Sold (X); every point falls exactly on the line Y = 2X]
Relations between variables
- A statistical relationship is not perfect and the observations do not fall directly on the curve of relationship
- There is (hopefully) a function/curve that captures a general tendency, but the observations are typically scattered around this curve
Relations between variables
[Figure: Year-end Evaluation (Y) plotted against Mid-year Evaluation (X); the points scatter around a general trend]
Regression Models
History of regression
I The term regression was first used by Francis Galton in the late
19th century to explain a biological phenomenon he observed:
“regression towards the mean”
Galton’s dataset
library(HistData)
help(GaltonFamilies)
This data set lists the individual observations for 934 children in 205
families on which Galton (1886) based his cross-tabulation.
- midparentHeight: mid-parent height, calculated as (father + 1.08*mother)/2
- childHeight: height of child
Galton’s dataset
[Figure: scatterplot of childHeight against midparentHeight]
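The scatterplot can be reproduced with a few lines of R (a minimal sketch, assuming the HistData package is installed):

library(HistData)                      # provides GaltonFamilies
data(GaltonFamilies)
plot(childHeight ~ midparentHeight, data = GaltonFamilies,
     xlab = "midparentHeight", ylab = "childHeight")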
Basic concepts
A regression model is a formal means of expressing two essential
ingredients of a statistical relation:
- A tendency of the response variable Y to vary with the predictor variable X in a systematic fashion
- A scattering of points around the curve of statistical relationship
These two characteristics are embodied in a regression model by
postulating that:
- There is a probability distribution of Y for each level of X
- The means of these probability distributions vary in some systematic fashion with X
Probability distributions varying with X
[Figure: a probability distribution of Year-end Evaluation (Y) at each level of Mid-year Evaluation (X)]
Construction of Regression Models
Selection of predictor variables / covariates
- Note on terminology:
  - Independent variable X, aka. predictor, regressor, covariate, feature (ML), ...
  - Dependent variable Y, aka. response, outcome, output, ...
- Only a limited number of covariates should be included in the regression model
- How do you choose? Through exploratory studies, theory, etc.
Choice of functional form of regression relation
- The choice of f in the functional form Y = f(X) is tied to the choice of covariate(s)
- Sometimes the relevant theory may indicate the appropriate form for f
- Typically f needs to be determined empirically from the data
- Linear or quadratic regression functions are often a good first approximation, as sketched below
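As an illustration, here is a short R sketch comparing a linear and a quadratic regression function; the data and coefficients are invented for illustration, not taken from the textbook:

set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x - 0.1 * x^2 + rnorm(50)  # true relation is quadratic
fit_lin  <- lm(y ~ x)                   # linear regression function
fit_quad <- lm(y ~ x + I(x^2))          # quadratic regression function
anova(fit_lin, fit_quad)                # empirical comparison of the two forms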
Scope of model
- We usually need to restrict the coverage of the model to some interval or region of values
- We may not have observed the full range of possible observations, nor their effect on our model
- The model may perform badly on previously unobserved data
- Training / fitting a model vs. predicting given new observations
Use of regression
- Regression serves three major purposes:
  - Description (how one variable influences another)
  - Control (set standards, monitor operations, etc.)
  - Prediction (given new observations)
Regression and Causality
- The existence of a statistical relation between the response Y and covariate X does not imply in any way that Y depends causally on X
- Amusing examples of spurious correlations abound
Use of computers
- Regression analysis requires lots of tedious calculations
- So we will make extensive use of R to perform these calculations
Simple Linear Regression Model
Formal statement of model
Only one covariate and a linear regression function f(x) = β0 + β1x, giving
Yi = β0 + β1Xi + εi
where:
- Yi: response from the ith trial / observation
- β0 and β1: parameters to be determined
- Xi: observed covariate from the ith trial / observation
- εi: random error term with mean zero and variance σ²
- εi and εj are uncorrelated for all i ≠ j
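To make the model concrete, the following R sketch simulates data from it; the parameter values β0 = 2, β1 = 0.5, σ = 1 are arbitrary choices for illustration:

set.seed(1)
n     <- 100
beta0 <- 2                              # intercept
beta1 <- 0.5                            # slope
sigma <- 1                              # sd of the error term
X   <- runif(n, 0, 10)                  # observed covariates
eps <- rnorm(n, mean = 0, sd = sigma)   # errors: mean zero, variance sigma^2
Y   <- beta0 + beta1 * X + eps          # responses
plot(X, Y)                              # points scatter around the true line
abline(a = beta0, b = beta1)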
Fitting model
- We are given or we observe n pairs of values (Y1, X1), (Y2, X2), ..., (Yn, Xn)
- The process that relates X to Y is a black box, but we assume it applies some linear transformation, and we are trying to determine what the parameters are
- We must fit a linear model
Important features of the model
- The response Yi is a random variable as it is the sum of two components:
  - the constant term β0 + β1Xi
  - the random term εi
- Since E[εi] = 0, we have
E[Yi] = E[β0 + β1Xi + εi]
      = β0 + β1Xi + E[εi]
      = β0 + β1Xi
Important features of the model
- So the response Yi, for level Xi, has a probability distribution with mean
E[Yi] = β0 + β1Xi
- So we know the regression function for the model is
E[Y] = β0 + β1X
- The response Yi falls above or below the regression line based on the random fluctuations of εi
- We have that
Var[Yi] = Var[β0 + β1Xi + εi] = Var[εi] = σ²
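These moments can be checked roughly against the simulation above (sample estimates, so only approximate):

eps_hat <- Y - (beta0 + beta1 * X)  # recover the errors (true betas are known here)
mean(eps_hat)                       # approximately 0
var(eps_hat)                        # approximately sigma^2 = 1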
Important features of the model
- Error terms εi and εj are uncorrelated; this implies that Yi and Yj are too
- Our model assumes that the Yi come from a probability distribution with mean β0 + β1Xi and variance σ²
Summary of model
- Linear models can be specified as: Yi = β0 + β1Xi + εi
- The assumptions are E[εi] = 0, Var[εi] = σ², Cor[εi, εj] = 0
- Which gives E[Yi] = β0 + β1Xi, Var[Yi] = σ², Cor[Yi, Yj] = 0
Regression parameters
- The parameters are called regression coefficients
  - The intercept: β0
  - The slope: β1
- The slope gives the change in the mean of the probability distribution of Y per unit increase in X
- The intercept, when the scope of the model includes X = 0, gives the mean of the probability distribution at X = 0
Before fitting the model
- What is your question of interest?
- Statistical formulation of the question
- Source of the data
- Sample size
- Missing data
- Coding of data and inconsistencies
- Exploratory Data Analysis (see the sketch below)
- Scatterplots
- Summary statistics
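A minimal EDA pass in R might look like this, assuming a hypothetical data frame dat with covariate x and response y:

summary(dat)             # summary statistics; also helps spot coding inconsistencies
colSums(is.na(dat))      # missing data, per column
plot(y ~ x, data = dat)  # scatterplot of response against covariate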
Least squares estimation
- To find a “good” estimator of the regression parameters β0 and β1, we employ the method of least squares
- For each observation pair (Yi, Xi), we consider the deviation of Yi from its expected value, Yi − E[Yi], given by
Yi − (β0 + β1Xi)
Least squares estimation
- The method of “least squares” considers the sum of the n squared deviations
- The criterion is denoted by Q:
Q = Σᵢ₌₁ⁿ (Yi − β0 − β1Xi)²
- The estimators of β0 and β1 are the values b0 and b1 that minimise Q given the observation pairs (Y1, X1), ..., (Yn, Xn)
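A sketch of this computation in R, using the standard closed-form least-squares solution and continuing with the simulated X and Y from earlier:

b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)  # slope estimate
b0 <- mean(Y) - b1 * mean(X)                                     # intercept estimate
c(b0, b1)
coef(lm(Y ~ X))  # lm() minimises the same criterion Q and agrees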
Least squares estimation (Figure 1.9)
[Figure 1.9: Attempts (Y) plotted against Age (X), comparing two candidate lines: Y = 2.8 + 0.18X (Q = 5.7) and Y = 9.0 + 0X (Q = 26)]
Properties of LS estimators
- Unbiased and minimum variance (among unbiased linear estimators):
E[b0] = β0, E[b1] = β1
- An estimate of σ² = Var[εi] = Var[Yi] is also needed
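Continuing the earlier simulation, the usual unbiased estimate of σ² is the mean squared error SSE/(n − 2); a quick sketch in R:

fit <- lm(Y ~ X)
s2  <- sum(resid(fit)^2) / (n - 2)  # MSE: unbiased estimate of sigma^2
s2
summary(fit)$sigma^2                # lm() reports the same quantity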
Summary
What is regression?
- Modelling of a relationship or an association between variables of interest
- Model the outcome variable on one or more predictor variables
Linear modelling
- Our core analytical method in this course
- Can be extended to nonlinear modelling
- Linear models help us in:
  - Description
  - Prediction
  - Control
More than just fitting a model
- Fitting a model is the easy part
- Consider the appropriateness of the model
- Ensure the assumptions are met
- Run diagnostics to check the model's validity and significance
- Apply remedies for violations of assumptions
- Finally, make inferences
Pitfalls in regression
- Is a linear model the right model based on theory?
- Correlation does not mean causation
  - Do high ice-cream sales lead to higher homicide rates?
  - Does high temperature lead to higher homicide rates?
- Reverse causality
  - e.g., GDP and unemployment: higher GDP causes lower unemployment, but a regression of unemployment on GDP cannot establish that direction of causation
Pitfalls in regression
- Omitted variable bias
  - A study finds “Golfers more prone to heart disease, cancer and arthritis”
  - Modelling mistake: the effect of age was omitted
- Multicollinearity
  - A child’s education performance predicted by “mother’s education” and “father’s education”
- Extrapolating beyond the data, and data mining (too many variables)