Regression models 1
Choosing regression models
An elementary introduction
Stephen Senn
Explanation
• I am not presenting these things because I
think you don’t know them
• I am presenting them because the people
you work with don’t know them
• And you need to explain these things to
them
Outline
• Basic considerations in modelling
• Choosing predictors
• Transformation of the predictor(s)
• Transformation of the outcome
• Advice
Basic considerations
Thinking before you model
Some Modelling Tasks
• Choose a generally suitable probability model
• Choose a set of suitable predictors
• Consider whether these need to be transformed
• Consider whether the outcome needs to be transformed
• Choose a technique for fitting the model
• Fit the model
• Assess goodness of fit of model
• Make causal inferences
• Issue predictions
Factors Affecting Choice of
Model
• Purpose of model
– Causal, predictive, classification
• Design of study
– Designed experiment, observational study, survey
• Temporal sequence
• Prior knowledge
• Type of data
– Continuous measurements, binary, ordinal, counts, censored
lifetimes
• Case ascertainment
• Results of model fitting
Preliminaries
• Choosing good regression models is not a question
of throwing some data at a stepwise selection
algorithm
• Two things are important
– Being clear about the purpose
– Insight (which in turn is based on)
• Experience
• Understanding
• Logic
Two Extremes
Causal analysis
• The putative causal factor(s) must
be in the model
• Other factors are in the model
because they help us understand
the causal factor(s)
• They are of no interest in
themselves
• We pay particular attention to the
significance of the putative causal
factor(s)
Predictive modelling
• We are trying to find predictors of
some outcome
• It is their joint value as predictors
that is important
• We simply want the most
predictive model
• We compare entire models to
judge which is best
Example
• Modelling the effect of treatment in a clinical trial
• Treatment must be in any model whether or not it
is significant
• Other factors will be in the model to help me
improve my estimate of the effect of treatment
– They are of little interest in themselves
– They are nearly always predetermined
Does Smoking Cause Lung Cancer?
A Tale of Two Statisticians
Works in public health
• I wish to establish whether it
is causal
• If so I can warn smokers to
quit and this will benefit
their health
• It is important for me to rule
out possible confounding
factors
Works in life insurance
• I don’t care if it is causal or
not
• The data show that smokers
are much more likely to get
lung cancer
• That’s enough for me to
take account of it in setting
the premiums
Warning
• Regression models are there to help you use
your insight, experience and prior
knowledge to understand your datasets
• They are not a substitute for scientific
understanding
Choosing predictors
It’s not just a matter of significance
An Example
• Multicentre trial of asthma comparing formoterol, salbutamol and
placebo for their effects on forced expiratory volume in one second
(FEV1).
• Randomisation stratified by steroid use (yes/no) and centre
• Sex, age, height of patient and baseline FEV1 also measured
• Definitely in the model
– Blocking factors: centre & steroid use
– Treatment factor (3 levels: formoterol, salbutamol, placebo)
• Possibly in the model
– Covariates: sex, age, height of patient and baseline FEV1
– NB sex, age and height are very predictive of baseline FEV1; therefore, if
you put them all in the model, none may be significant
– This does not matter
Temporal Sequence I
• If we are interested in causal inferences it is
usually inappropriate to include in a model
variables that were measured later than the
putative causal variables.
• A later variable cannot have caused an earlier
one, and it may itself have been affected by the
putative cause, so it should not be included.
Example
• It is desired to study whether the type of school attended
(private or state school) affects students’ chances of
success in final degree examinations at university
• Data are obtained for a large group of students
• In addition to information on degree results and type of
school attended, information is obtained on
– sex of student,
– high school results
– parents’ income
• Which of these factors is it inappropriate to include in the
model and why?
Temporal Sequence II
• The same does not apply if the purpose of
the model is simply classification
• It may then be helpful to have factors in the
model even if they are measured after the
“outcome variable of interest”
• Indeed they can be included even if they
have been “caused” by the variable of
interest
Example
• We wish to develop a model for classifying
patients who present with abdominal pain as
either suffering from appendicitis or
non-specific abdominal pain
• We use location of pain, degree of pain,
absence/presence of nausea, body
temperature as “predictor” variables
– Even though these are consequences of rather
than causes of appendicitis
Prior Knowledge
• Frequently when fitting models we already have strong opinions about
the effect of some factors even if we are ignorant about others.
– For example we may be examining the effect of a previously
unstudied environmental exposure on health
– we know, however, that age is an important determinant of health
• We will tend to put factors we believe are important in the model
irrespective of their significance according to the current data set.
• Similarly, implicitly, there will always be a host of factors we believe
are irrelevant.
• We will not put these in the model on prior grounds
Type of Data
and Choice of Basic Model
Type of Data → Possible Basic Model
• Continuous measurement → General linear model (Normal outcomes)
• Count data → Poisson regression
• Binary data → Logistic regression
• Ordered categorical → Proportional odds
• Censored lifetimes → Proportional hazards
• Multinomial → Log-linear
Case Ascertainment
• The way in which data are obtained (ascertained) can
affect the way that we build a model
• For example in a case-control study we sample by outcome
(cases and controls) and then measure how these two differ
by exposure
– Example
• Case: lung cancer, Control: other cancer
• Exposure: smoker versus non-smoker
• We cannot model relative risk using such data
• We can only model (log) odds ratios
• For a cohort study where we sample by exposure we could
model either
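The point about case-control sampling can be checked with simple arithmetic. The sketch below is only an illustration: the cohort counts, the 1-in-10 sampling of non-cases and the function names are all hypothetical, not taken from any study. It compares the odds ratio and the naive "relative risk" before and after sampling by outcome.

```python
def odds_ratio(a, b, c, d):
    """a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical full cohort (sampled by exposure): 1000 exposed, 1000 unexposed.
cohort = (100, 900, 50, 950)

# Hypothetical case-control sample drawn from it: keep every case,
# but only 1 in 10 of the non-cases (sampling by outcome).
sampling_divisor = 10
a, b, c, d = cohort
case_control = (a, b // sampling_divisor, c, d // sampling_divisor)

or_cohort = odds_ratio(*cohort)
or_case_control = odds_ratio(*case_control)
rr_cohort = risk_ratio(*cohort)
rr_case_control = risk_ratio(*case_control)  # not a valid relative risk

print(f"odds ratio:  cohort {or_cohort:.3f}, case-control {or_case_control:.3f}")
print(f"risk ratio:  cohort {rr_cohort:.3f}, case-control {rr_case_control:.3f}")
```

The sampling fraction for non-cases cancels out of the odds ratio but not out of the risk ratio, which is why only the (log) odds ratio can be modelled from such data.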
Social Status: Longer life expectancy for Oscar winners
A study of actors and actresses found that Oscar winners lived, on average,
almost four years longer than nominees who went home empty-handed, reports
the March issue of the Harvard Health Letter. Actors aren’t the only people who
reap benefits. Dr. Donald Redelmeier of Toronto’s Sunnybrook and Women’s
College Health Sciences Centre found that Oscar-winning directors live longer
than non-winners, and male directors live 4.5 years longer on average than
actors. These findings add to a large body of evidence delineating connections
between social status and health and longevity, reports the Harvard Health
Letter. Redelmeier theorizes that an Oscar on the mantel moves the winner up
the Hollywood pecking order. Winners find it easier to get work, and when they
do, they’re better appreciated and better paid.
Not Harvard Health Publications
A study has shown that getting a telegram from The Queen can add 20 years to
your life
An extensive study of individuals who have received telegrams from The
Queen has shown that an astonishing proportion of them have lived to be 100.
Age at death of a control group of non-recipients was typically 20 years less.
Researchers have postulated that esteem is an important determinant of health.
Joked lead researcher, Prof Morton Gullible, ‘Our advice to Her Majesty is: send
yourself a telegram, Ma'am.’
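The spoof turns on how the "cases" were ascertained: a centenary telegram is only sent to people who have already reached 100. A minimal Python sketch, using simulated ages at death in which the telegram has no effect at all, shows the same spurious gap appearing. The distribution, its parameters and the variable names are assumptions made purely for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical ages at death. Nothing in this simulation lets a
# telegram change anyone's lifespan.
ages = [rng.gauss(80, 10) for _ in range(10_000)]

# Group membership is defined by the outcome itself: only those who
# survive to 100 can receive the telegram.
recipients = [age for age in ages if age >= 100]
non_recipients = [age for age in ages if age < 100]


def mean(values):
    return sum(values) / len(values)


gap = mean(recipients) - mean(non_recipients)
print(f"{len(recipients)} recipients; mean age at death exceeds "
      f"that of non-recipients by {gap:.1f} years")
```

Any comparison in which membership of a group requires surviving to a given point will show this kind of difference, whatever the true effect.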
Results of Model Fitting
• Statisticians have developed a number of
techniques for assessing the adequacy of various
models using the data in hand
– Standard errors, significance tests on coefficients
– Analysis of variance/ deviance on factors
– Goodness of fit generally
– Residual plots
– AIC, BIC
• These are important tools but are by no means the
only tools for assessing the adequacy of a model
Transforming predictors
The X Files
Luxembourg Temperature Example
Data on temperatures in Luxembourg
Month   Normal temperature (°C)
January 0.6
February 1.4
March 4.7
April 7.7
May 12.4
June 15.1
July 17.5
August 17.3
September 13.5
October 8.9
November 4.0
December 1.8
Modelling the temperature
Note that in the yearly rhythm, January follows December even
though January is point 1 and December point 12.
The data are periodic and we need a model that reflects this.
The simplest periodic pattern is a sine wave.
$Y = \alpha + \beta \sin(X + \theta)$
$\alpha$ = level (the average temperature)
$\beta$ = amplitude (the difference max to average)
$\theta$ = phase (governs point at which maximum is reached)
Fitting a sine wave
A sine wave model can be fitted by using the fact that
$\sin(X + \theta) = \cos\theta \sin X + \sin\theta \cos X$
This is linear in $\sin X$, $\cos X$. Hence by regressing Y on two
variables $\sin X$, $\cos X$ we can obtain a periodic fit.
Note that X must be transformed from linear to angular
measure. So we can write
$X = 360 \times \mathit{month} / 12$
if we measure in degrees or
$X = 2\pi \times \mathit{month} / 12$
if we measure in radians
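The regression on sin X and cos X can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software used for the slides: it uses only the standard library, solves the ordinary least squares normal equations with a small hand-written Gaussian elimination (the helper `solve` is hypothetical), and takes the twelve temperatures from the table above.

```python
import math

# Normal monthly temperatures for Luxembourg (deg C), January..December.
temps = [0.6, 1.4, 4.7, 7.7, 12.4, 15.1, 17.5, 17.3, 13.5, 8.9, 4.0, 1.8]
months = list(range(1, 13))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# Design matrix columns: intercept, sin X, cos X, with X = 2*pi*month/12.
design = [[1.0,
           math.sin(2 * math.pi * mth / 12),
           math.cos(2 * math.pi * mth / 12)] for mth in months]

# Normal equations (X'X) b = X'y for ordinary least squares.
xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
       for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(design, temps)) for i in range(3)]
alpha, b_sin, b_cos = solve(xtx, xty)

# Recover amplitude and phase from b_sin = beta*cos(theta),
# b_cos = beta*sin(theta).
beta = math.hypot(b_sin, b_cos)
theta = math.atan2(b_cos, b_sin)

fitted = [alpha + beta * math.sin(2 * math.pi * mth / 12 + theta)
          for mth in months]
mean_y = sum(temps) / len(temps)
ss_res = sum((y - f) ** 2 for y, f in zip(temps, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in temps)
r_squared = 1 - ss_res / ss_tot

print(f"level={alpha:.2f} amplitude={beta:.2f} "
      f"phase={theta:.2f} rad R^2={r_squared:.3f}")
```

Because the sine and cosine columns sum to zero over a complete year, the fitted level is simply the mean of the twelve monthly values.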
Figure: sine-wave fit to the monthly temperatures. 3 parameters fit 12 points rather well.
Transforming the outcome
Being wise about Ys
An Example of a One-way Layout
• Four experimental p38 kinase inhibitors
• Vehicle and marketed product as controls
• Thromboxane B2 (TXB2) is used as a
marker of COX-1 activity
• Six rats per group were treated for a total of
36 rats
• At the end of the study rats are sacrificed
and TXB2 is measured.
GenStat® ANOVA
(Original data)
A2WAY [TREATMENTS=Treatment] TXB2
Analysis of variance
Variate: TXB2
Source of variation   d.f.      s.s.     m.s.   v.r.   F pr.
Treatment                5   184596.   36919.   6.31   <.001
Residual                30   175439.    5848.
Total                   35   360035.
Slide annotations: $\sigma_2^2$ marks the Treatment m.s. and $\sigma_1^2$ the Residual m.s.; v.r. = $\sigma_2^2 / \sigma_1^2$
GenStat plot of residuals (original data)
GenStat ANOVA
(log transformed)
A2WAY [TREATMENTS=Treatment] logTXB2
Analysis of variance
Variate: logTXB2
Source of variation   d.f.      s.s.      m.s.    v.r.
Treatment                5   62.6760   12.5352   40.09
Residual                30    9.3800    0.3127
Total                   35   72.0559
Signal to noise ratio is
now much higher
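The variance ratios in the two GenStat tables can be reproduced from the sums of squares and degrees of freedom alone. The short Python check below is only a sketch of that arithmetic; the function name is hypothetical and the inputs are the figures printed in the two tables.

```python
def variance_ratio(ss_treatment, df_treatment, ss_residual, df_residual):
    """Return the F-type variance ratio: treatment m.s. / residual m.s."""
    ms_treatment = ss_treatment / df_treatment
    ms_residual = ss_residual / df_residual
    return ms_treatment / ms_residual


# Original scale (TXB2): s.s. and d.f. from the first ANOVA table.
vr_original = variance_ratio(184596.0, 5, 175439.0, 30)

# Log scale (logTXB2): s.s. and d.f. from the second ANOVA table.
vr_log = variance_ratio(62.6760, 5, 9.3800, 30)

print(f"v.r. original: {vr_original:.2f}, v.r. log: {vr_log:.2f}")
```

The same treatment contrast is being tested in both cases; only the scale of measurement has changed.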
GenStat plot of residuals (log-transformed data)
Homogeneity of Variances
(Bartlett's Test: GenStat)
Untransformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 50.87 on 5 degrees of freedom: probability < 0.001
Log-transformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 8.95 on 5 degrees of freedom: probability 0.111
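Why a log transformation can remove heterogeneity of variance is easiest to see when errors are multiplicative. The Python sketch below computes Bartlett's statistic on deliberately artificial data: six groups of six values whose medians differ but whose proportional deviations are identical. These are not the TXB2 data, and the medians, deviations and function names are assumptions for illustration only.

```python
import math


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def bartlett_statistic(groups):
    """Bartlett's chi-square statistic for homogeneity of variances
    (refer to chi-square on k - 1 degrees of freedom)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [sample_variance(g) for g in groups]
    pooled = sum(df * v for df, v in zip(dfs, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        df * math.log(v) for df, v in zip(dfs, variances))
    correction = 1 + (sum(1 / df for df in dfs)
                      - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction


# Artificial data: each value is median * exp(deviation), so the spread
# is proportional to the group's level (multiplicative error).
medians = [10, 20, 40, 80, 160, 320]
deviations = [-0.6, -0.3, -0.1, 0.1, 0.3, 0.6]
raw = [[m * math.exp(z) for z in deviations] for m in medians]
logged = [[math.log(x) for x in group] for group in raw]

stat_raw = bartlett_statistic(raw)
stat_log = bartlett_statistic(logged)
print(f"Bartlett chi-square: raw {stat_raw:.2f}, log {stat_log:.2f}")
```

On the raw scale the group variances grow with the square of the median, so the statistic is large; on the log scale every group has the same variance by construction and the statistic is essentially zero.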
Data-filtering examples
or find the flaw
• A 20-year follow-up study of women in an English village
found higher survival amongst smokers than non-smokers
• Transplant recipients on the highest doses of cyclosporine had
a higher probability of graft rejection than those on lower doses
• Left-handers observed to die younger on average than
right-handers
• Obese infarct survivors have a better prognosis than the
non-obese
Advice
Statistics is a way of improving your thinking, not a substitute for it
Advice
• Think before you model
• Purpose is key
– Causal
– Predictive
– Classification
• Think about time
• Think about case ascertainment
• Testing is a small part of discerning
• Don’t use stepwise regression as a substitute for
understanding