Regression Modelling
Lecture 1
Lecturer (Me)
Contact details:
Dale Roberts
E: dale.roberts@anu.edu.au
T: +61 2 612 57336
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
STAT6014 - Additional material
Contact details:
Lucy Yunxi Hu
E: yunxi.hu@anu.edu.au
T: +61 2 612 50836
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
Communication
- Please consult with your allocated tutor for course content questions
- And/or use the discussion forum on Wattle
- Please contact the course convenor (me) for issues and concerns including grades, illness, falling behind, and academic accessibility issues
Lecture times
- Wednesday, 13:00 - 15:00 (2 hour lecture)
- Friday, 11:00 - 12:00 (1 hour lecture / workshop)
Tutorials
- Begin week 2; take time in Week 1 to visit the computer lab and check you can log on, etc.
- Tutorial sign-up: see instructions on Wattle and the course outline
- Read through the tutorial sheet, think about the questions, and attempt them before class
- Tutorials are the best opportunity to learn the skills and techniques required in the quizzes and exams
- Your tutors are your main source of help
Textbook
- The required textbook for this course is Linear Regression by Michael H. Kutner
- This is a custom-printed textbook available in print at the Harry Hartog bookstore
- An eBook is available from McGraw Hill; use the link and discount code on Wattle to buy it
- Multiple copies of this text are available in the Hancock library on 2-hour loan
- Linear Models with R by Julian J. Faraway is another good resource, available in the Hancock library on 2-day loan
Course website
- http://wattle.anu.edu.au
- Access for all enrolled students
- Course announcements
- Lecture resources
- Echo360 lecture recordings
- Data sets
- Tutorial questions, selected solutions
- Online quizzes
- Please check this site frequently!
Assessment
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         20%     Week 10
Final Examination    65%     Central Exam Period
Hints for success
- Attend lectures and tutorials; supplement the given materials with your own comments and notes
- Be prepared for classes (read the textbook, attempt tutorial questions)
- Do the tutorials: statistics is a discipline in which hands-on participation ⇒ learning
- Time spent trying questions is time well spent
R and RStudio
- We will be using the R software throughout the course
- Please see the course website for installation instructions for R and RStudio
- Please attempt Tutorial 0 - Intro to R before your first tutorial
Linear Regression
What is regression?
- Statistical methodology that uses the relation between two or more quantitative variables so that a response or outcome variable can be predicted from the other (or others)
- A core and important methodology in Statistics and Machine Learning
What is regression?
Examples:
- Predict sales of a product using the relationship between sales and the amount spent on advertising
- Predict the performance of an employee using the relationship between performance and aptitude test scores
Relations between variables
- We should distinguish between a functional relation and a statistical relation between variables
- A functional relation between two variables is expressed as a mathematical formula. If X is the independent variable and Y the dependent variable, a functional relation is
Y = f(X)
- A functional relation is a “perfect” mapping from X to Y
Relations between variables
[Figure: Dollar Sales (Y) plotted against Units Sold (X); every point falls exactly on the line Y = 2X]
Relations between variables
- A statistical relationship is not perfect and the observations do not fall directly on the curve of relationship
- There is (hopefully) a function/curve that captures a general tendency, but the observations are typically scattered around this curve
Relations between variables
[Figure: Year-end Evaluation (Y) plotted against Mid-year Evaluation (X); the points scatter around a general trend]
Regression Models
History of regression
I The term regression was first used by Francis Galton in the late
19th century to explain a biological phenomenon he observed:
“regression towards the mean”
Galton’s dataset
library(HistData)
help(GaltonFamilies)
This data set lists the individual observations for 934 children in 205
families on which Galton (1886) based his cross-tabulation.
- midparentHeight: mid-parent height, calculated as (father + 1.08*mother)/2
- childHeight: height of child
Galton’s dataset
[Figure: scatterplot of childHeight against midparentHeight]
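The scatterplot can be reproduced with a few lines of R (a minimal sketch, assuming the HistData package is installed):

library(HistData)                      # provides GaltonFamilies
data(GaltonFamilies)
plot(childHeight ~ midparentHeight, data = GaltonFamilies,
     xlab = "midparentHeight", ylab = "childHeight")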
Basic concepts
A regression model is a formal means of expressing two essential
ingredients of a statistical relation:
- A tendency of the response variable Y to vary with the predictor variable X in a systematic fashion
- A scattering of points around the curve of statistical relationship
These two characteristics are embodied in a regression model by
postulating that:
- There is a probability distribution of Y for each level of X
- The means of these probability distributions vary in some systematic fashion with X
Probability distributions varying with X
[Figure: a probability distribution of Year-end Evaluation (Y) at each level of Mid-year Evaluation (X)]
Construction of Regression Models
Selection of predictor variables / covariates
- Note on terminology:
  - Independent variable X, aka. predictor, regressor, covariate, feature (ML), ...
  - Dependent variable Y, aka. response, outcome, output, ...
- Only a limited number of covariates should be included in the regression model
- How do you choose? Through exploratory studies, theory, etc.
Choice of functional form of regression relation
- The choice of f in the functional form Y = f(X) is tied to the choice of covariate(s)
- Sometimes the relevant theory may indicate the appropriate form for f
- Typically f needs to be determined empirically from the data
- Linear or quadratic regression functions are often a good first approximation, as sketched below
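As an illustration, here is a short R sketch comparing a linear and a quadratic regression function; the data and coefficients are invented for illustration, not taken from the textbook:

set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x - 0.1 * x^2 + rnorm(50)  # true relation is quadratic
fit_lin  <- lm(y ~ x)                   # linear regression function
fit_quad <- lm(y ~ x + I(x^2))          # quadratic regression function
anova(fit_lin, fit_quad)                # empirical comparison of the two forms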
Scope of model
- We usually need to restrict the coverage of the model to some interval or region of values
- We may not have observed the full range of possible observations, nor their effect on our model
- The model may perform badly on previously unobserved data
- Training / fitting a model vs. predicting given new observations
Use of regression
- Regression serves three major purposes:
  - Description (how one variable influences another)
  - Control (set standards, monitor operations, etc.)
  - Prediction (given new observations)
Regression and Causality
- The existence of a statistical relation between the response Y and covariate X does not imply in any way that Y depends causally on X
- Amusing examples of spurious correlations abound
Use of computers
- Regression analysis requires lots of tedious calculations
- So we will make extensive use of R to perform these calculations
Simple Linear Regression Model
Formal statement of model
Only one covariate and a linear regression function f(x) = β0 + β1x, giving
Yi = β0 + β1Xi + εi
where:
- Yi: response from the ith trial / observation
- β0 and β1: parameters to be determined
- Xi: observed covariate from the ith trial / observation
- εi: random error term with mean zero and variance σ²
- εi and εj are uncorrelated for all i ≠ j
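To make the model concrete, the following R sketch simulates data from it; the parameter values β0 = 2, β1 = 0.5, σ = 1 are arbitrary choices for illustration:

set.seed(1)
n     <- 100
beta0 <- 2                              # intercept
beta1 <- 0.5                            # slope
sigma <- 1                              # sd of the error term
X   <- runif(n, 0, 10)                  # observed covariates
eps <- rnorm(n, mean = 0, sd = sigma)   # errors: mean zero, variance sigma^2
Y   <- beta0 + beta1 * X + eps          # responses
plot(X, Y)                              # points scatter around the true line
abline(a = beta0, b = beta1)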
Fitting model
- We are given or we observe n pairs of values (Y1, X1), (Y2, X2), ..., (Yn, Xn)
- The process that relates X to Y is a black box, but we assume it applies some linear transformation, and we are trying to determine what the parameters are
- We must fit a linear model
Important features of the model
- The response Yi is a random variable as it is the sum of two components:
  - the constant term β0 + β1Xi
  - the random term εi
- Since E[εi] = 0, we have
E[Yi] = E[β0 + β1Xi + εi]
      = β0 + β1Xi + E[εi]
      = β0 + β1Xi
Important features of the model
- So the response Yi, for level Xi, has a probability distribution with mean
E[Yi] = β0 + β1Xi
- So we know the regression function for the model is
E[Y] = β0 + β1X
- The response Yi falls above or below the regression line based on the random fluctuations of εi
- We have that
Var[Yi] = Var[β0 + β1Xi + εi] = Var[εi] = σ²
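These moments can be checked roughly against the simulation above (sample estimates, so only approximate):

eps_hat <- Y - (beta0 + beta1 * X)  # recover the errors (true betas are known here)
mean(eps_hat)                       # approximately 0
var(eps_hat)                        # approximately sigma^2 = 1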
Important features of the model
- Error terms εi and εj are uncorrelated; this implies that Yi and Yj are too
- Our model assumes that the Yi come from a probability distribution with mean β0 + β1Xi and variance σ²
Summary of model
- Linear models can be specified as: Yi = β0 + β1Xi + εi
- The assumptions are E[εi] = 0, Var[εi] = σ², Cor[εi, εj] = 0
- Which gives E[Yi] = β0 + β1Xi, Var[Yi] = σ², Cor[Yi, Yj] = 0
Regression parameters
- The parameters are called regression coefficients
  - The intercept: β0
  - The slope: β1
- The slope gives the change in the mean of the probability distribution of Y per unit increase in X
- The intercept, when the scope of the model includes X = 0, gives the mean of the probability distribution at X = 0
Before fitting the model
- What is your question of interest?
- Statistical formulation of the question
- Source of the data
- Sample size
- Missing data
- Coding of data and inconsistencies
- Exploratory Data Analysis (see the sketch below)
- Scatterplots
- Summary statistics
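A minimal EDA pass in R might look like this, assuming a hypothetical data frame dat with covariate x and response y:

summary(dat)             # summary statistics; also helps spot coding inconsistencies
colSums(is.na(dat))      # missing data, per column
plot(y ~ x, data = dat)  # scatterplot of response against covariate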
Least squares estimation
- To find a “good” estimator of the regression parameters β0 and β1, we employ the method of least squares
- For each observation pair (Yi, Xi), we consider the deviation of Yi from its expected value, Yi − E[Yi], given by
Yi − (β0 + β1Xi)
Least squares estimation
- The method of “least squares” considers the sum of the n squared deviations
- The criterion is denoted by Q:
Q = Σᵢ₌₁ⁿ (Yi − β0 − β1Xi)²
- The estimators of β0 and β1 are the values b0 and b1 that minimise Q given the observation pairs (Y1, X1), ..., (Yn, Xn)
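A sketch of this computation in R, using the standard closed-form least-squares solution and continuing with the simulated X and Y from earlier:

b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)  # slope estimate
b0 <- mean(Y) - b1 * mean(X)                                     # intercept estimate
c(b0, b1)
coef(lm(Y ~ X))  # lm() minimises the same criterion Q and agrees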
Least squares estimation (Figure 1.9)
[Figure 1.9: Attempts (Y) plotted against Age (X), comparing two candidate lines: Y = 2.8 + 0.18X (Q = 5.7) and Y = 9.0 + 0X (Q = 26)]
Properties of LS estimators
- Unbiased and minimum variance (among unbiased linear estimators):
E[b0] = β0, E[b1] = β1
- An estimate of σ² = Var[εi] = Var[Yi] is also needed
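Continuing the earlier simulation, the usual unbiased estimate of σ² is the mean squared error SSE/(n − 2); a quick sketch in R:

fit <- lm(Y ~ X)
s2  <- sum(resid(fit)^2) / (n - 2)  # MSE: unbiased estimate of sigma^2
s2
summary(fit)$sigma^2                # lm() reports the same quantity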
Summary
What is regression?
- Modelling of a relationship or an association between variables of interest
- Model the outcome variable on one or more predictor variables
Linear modelling
- Our core analytical method in this course
- Can be extended to nonlinear modelling
- Linear models help us in:
  - Description
  - Prediction
  - Control
More than just fitting a model
- Fitting a model is the easy part
- Consider the appropriateness of the model
- Ensure the assumptions are met
- Run diagnostics to check the model's validity and significance
- Apply remedies for violations of assumptions
- Finally, make inferences
Pitfalls in regression
- Is a linear model the right model based on theory?
- Correlation does not mean causation
  - Do high ice-cream sales lead to higher homicide rates?
  - Does high temperature lead to higher homicide rates?
- Reverse causality
  - e.g., GDP and unemployment: higher GDP causes lower unemployment, but a regression of unemployment on GDP cannot establish that direction of causation
Pitfalls in regression
- Omitted variable bias
  - A study finds “Golfers more prone to heart disease, cancer and arthritis”
  - Modelling mistake: the effect of age was omitted
- Multicollinearity
  - A child’s education performance predicted by “mother’s education” and “father’s education”
- Extrapolating beyond the data, and data mining (too many variables)