Applied Multivariate Statistics
Ralf B. Schäfer
University of Koblenz-Landau 2017/18
Short introduction
● Professor for Quantitative Landscape Ecology
● Current teaching: Statistics (M.Sc.); GIS (B.Sc./M.Sc.);
Environmental Modelling (B.Sc./M.Sc.); Aquatic
Ecotoxicology (M.Sc.); Environmental Philosophy (B.Sc.)
● Research focus:
● Community ecology of freshwater invertebrates and
microorganisms
● Response of freshwater ecosystems to different
(anthropogenic) stressors (e.g. pollution)
● Trophic linkages between aquatic & terrestrial systems
● Primarily field studies/experiments and data analyses/
modelling
www.landscapecology.uni-landau.de
Organisation
● Lecture material (including course schedule and literature
list) can be found on github and website:
https://github.com/rbslandau/statistics_multi
https://goo.gl/EhPVFG
● Inverted classroom: Self study of lecture and demonstration,
Q&A and exercises in class room
● Contact time: 2 hours per week; Own study time:
approximately 1 day per week
Using your own notebook
● Feel free to use your own WLAN-enabled notebook!
● Install R (http://mirrors.softliste.de/cran/) and RStudio
(recommended for beginners - http://www.rstudio.com/)
● Run "0_Install_packgs.R", provided on GitHub
● For installation of additional packages run
install.packages("package to be installed")
Course objectives: Learning outcomes
● Classify, explain and interpret the different types of
(multivariate) statistical approaches
● Select and apply the appropriate statistical method
for the research goal
● Demonstrate a moderate level of statistical modelling
skills, including scripting in R
Two incorrect ways of thinking about stats
1. Overconfidence: Statistics is like mathematics and provides a single, correct answer
But: statistical thinking differs from mathematical thinking
2. Disbelief: Anything goes – statistics cannot be trusted
But: statistics provide quantitative support of the complete research process
Tintle (2015) Amer. Statist. 69: 362
Contents
Statistical modelling, simulation and the linear model
1. Framework for data analysis and tools for data exploration
2. Statistical modelling and simulation-based tools
3. Permutation and Monte Carlo simulation
4. Bootstrapping
5. Cross-Validation and Bias-variance trade-off
6. Revisiting the linear model
Learning targets
● Explain the data analysis cycle and apply tools
for exploratory data analysis
● Explain approaches to statistical modelling and
simulation and apply simulation-based methods
● Diagnose and interpret the linear model
Learning targets and study questions
● Explain the data analysis cycle and apply tools for
exploratory data analysis
● Explain the steps of the data analysis cycle.
● Summarise the elements of exploratory analysis. Which
graphical tools are essential?
● Explain approaches to statistical modelling and
simulation and apply simulation-based methods
● Discuss the two different approaches to statistical modelling
and links through simulation-based approaches.
● Explain the purpose and critically discuss permutation tests.
● Explain the purpose and critically discuss bootstrapping.
● Explain the main idea of cross-validation and discuss the
selection of k with respect to the bias-variance trade-off.
Learning targets and study questions
● Diagnose and interpret the linear model
● Describe the assumptions of the linear regression and
explain how they can be checked.
● Which types of outliers exist? When is an outlier important?
● Discuss the application of bootstrapping and cross-validation
for the linear model.
Data analysis cycle
Modified from Zumel & Mount 2014: 6
[Figure: cyclic diagram with the following steps]
1. Research goal and questions
2. e.g. empirical study, data compilation
3. e.g. formatting, handling missing values
4. Data exploration: e.g. outliers, skewness, distribution, linearity
5. Statistical modelling
6. Model diagnosis, evaluation and interpretation: Check model assumptions and interpret results.
7. Publish results; identify limitations and open questions → back to research goal and questions
Define research goal and question
● Research goals (e.g. prediction, estimation, inference) and questions should inform study design and methods
● Aim: Test scientific hypothesis → Formulate testable hypothesis
Example
Scientific hypothesis: Restoring stream stretches alters aquatic communities, resulting in different emerging insects on which riparian spiders prey. This affects the spiders' body condition derived from prosomal (pr.) and opisthosomal (op.) width.
[Figure labels: Opisthosoma, Prosoma]
Question: Does the body condition of riparian spiders differ between restored and non-restored stream stretches?
● Testable hypothesis: The sample means for the body condition are drawn from populations with the same µ:
$H_0: \mu_{\text{restored}} = \mu_{\text{non-restored}}$, $H_1: \mu_{\text{restored}} \neq \mu_{\text{non-restored}}$
Tools for data exploration
GIGO: Garbage in – Garbage out
Elements of data exploration – Checking for:
1. Outliers (e.g. boxplot)
2. Variance homogeneity (e.g. conditional boxplot)
3. Normal distribution (e.g. QQ-plot)
4. (Double) zeros (e.g. frequency plot)
5. Collinearity (e.g. pairwise scatterplots)
6. Relationship between explanatory and response variable (e.g. scatterplots)
7. Spatial or temporal autocorrelation (e.g. variograms)
● Useful for inspecting data before the modelling but also for model diagnosis
● Zuur et al. (2009) urge data inspection before modelling
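Two of the checks listed above can also be done numerically. The following is a minimal sketch in plain Python (standard library only), not the course's own R workflow. It flags potential outliers with the usual boxplot whisker rule (1.5 × IQR beyond the quartiles) and uses Pearson's correlation as a simple stand-in for inspecting pairwise scatterplots. The data and the helper names `boxplot_outliers` and `pearson_r` are made up for illustration.

```python
import statistics

def boxplot_outliers(values):
    """Return values outside the boxplot whiskers (1.5 * IQR beyond the quartiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

def pearson_r(a, b):
    """Pearson correlation, used here as a simple numerical collinearity check."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Hypothetical data, made up for illustration
response = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 9.7, 2.6]
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
predictor_b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

outliers = boxplot_outliers(response)
r_ab = pearson_r(predictor_a, predictor_b)
print("Potential outliers:", outliers)
print("Correlation between predictors:", round(r_ab, 3))
```

A flagged value is only a candidate for closer inspection, and a high correlation between predictors only hints at collinearity. The graphical tools named in the list remain the primary check.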
Data exploration
[Figure: common plots for looking at the data, annotated with the questions they address: Outliers? Asymmetry of distribution? Normality? Linearity? Collinearity?]
Statistical modelling: The two cultures
Breiman 2001 Statistical Science 16: 199
Real world: Processes lead to association between x and y
Examples for goals of statistical modelling: predict unknown y
from x, estimate how x is related to y
Data modelling culture (classical statistics)
● Common data model
● Estimate parameters from data
● Model validation: Check residuals
Algorithmic modelling culture (machine learning)
● Find algorithm that operates on x to predict y
● Model validation: Predictive accuracy
Statistical modelling: the classical view
● Fit model to data to inform estimation, inference or prediction (e.g. estimate point or interval, test hypothesis)
● Example: The arithmetic mean $\bar{x}$ is an estimate of the true population mean $\mu$ and $s^2$ is an estimate of the true variance $\sigma^2$
● Most models incorporate a deterministic (fixed effect) and a stochastic component (random effect)
● Example: $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
● All models rely on assumptions → Model diagnosis
● e.g. normal distribution, independence of observations
● Goodness of fit measures help to choose between multiple models that fit the data
● e.g. AIC, $R^2$, RMSE
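To make the deterministic and stochastic components concrete, the sketch below simulates data from the linear model above and then treats the sample mean and sample variance of the errors as estimates of the population parameters. It is a plain-Python illustration under assumed values: the parameters `b0`, `b1`, `sigma`, the sample size and the seed are arbitrary choices, not values from the course.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed "true" parameters of the data-generating model (illustrative only)
b0, b1, sigma = 2.0, 0.5, 1.0

x = [i / 10 for i in range(200)]
# Stochastic component: eps ~ N(0, sigma^2)
eps = [random.gauss(0, sigma) for _ in x]
# Deterministic component b0 + b1 * x plus the stochastic component
y = [b0 + b1 * xi + e for xi, e in zip(x, eps)]

# Sample statistics as estimates of the population parameters of eps
eps_mean = statistics.mean(eps)      # estimates mu = 0
eps_var = statistics.variance(eps)   # s^2 (n - 1 denominator), estimates sigma^2 = 1
print(round(eps_mean, 2), round(eps_var, 2))
```

In a real analysis the errors are not observed directly. They are approximated by the residuals of the fitted model, which is why model diagnosis works with residuals.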
Simulation-based approaches in
data analysis
● Compatible with both cultures
● Infuses algorithm-based thinking into classical statistics
● Examples for simulation-based approaches for estimation,
inference or model diagnosis in classical statistics:
1. Permutation test → Permuting (shuffling) the data to derive
null distribution. Mainly used for inference
2. Bootstrapping → Randomly sampling subsets from the
data with replacement. Mainly used for estimation of
parameter distribution
3. Cross-validation (CV) → Splitting data into sets (i.e.
sampling without replacement). Mainly used for validation of
predictive models
Permutation test: Algorithm
1) Permute values in data set
2) Compute test statistic t* for permuted data
(Repeat steps 1 and 2 k times)
3) Compare the test statistic of the original data, t0, to the generated null distribution
Example: Permutation test of difference in group mean
[Figure: the original dataset (x1 to x15), split into Group 1 and Group 2, gives the test statistic $t_0 = \bar{x}_{\text{group1}} - \bar{x}_{\text{group2}}$. Permutations 1 to k each reassign the same values to the two groups and give the permuted statistics $t_1^*, \dots, t_k^*$.]
Permutation test: Generated distribution
[Figure: generated distribution of t* with t0 marked]
● Test informs whether pattern in data is due to chance
● Inference regarding statistical population only valid if distribution of sample data matches actual distribution of statistical population → particularly problematic for small n
$$p = \frac{\sum_{i=1}^{k} I_i}{k+1}, \quad I_i = 1 \text{ if } t_i^* \le t_0, \text{ else } 0$$
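The permutation algorithm can be sketched in a few lines of plain Python for the difference in group means. This is an illustration, not the course's implementation. It uses a two-sided comparison (|t*| ≥ |t0|) and counts the observed statistic as one of the permutations by adding 1 to the numerator, a common convention that differs slightly from the one-sided formula above. The body condition values, the function name `permutation_test` and the seed are made up.

```python
import random
import statistics

def permutation_test(group1, group2, k=10_000, seed=42):
    """Monte Carlo permutation test for a difference in group means (two-sided)."""
    rng = random.Random(seed)
    t0 = statistics.mean(group1) - statistics.mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(k):
        rng.shuffle(pooled)  # 1) permute values across both groups
        # 2) test statistic t* for the permuted data
        t_star = statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:])
        if abs(t_star) >= abs(t0):  # 3) compare with t0
            extreme += 1
    # Assumed convention: the observed statistic counts as one permutation
    p = (extreme + 1) / (k + 1)
    return t0, p

# Hypothetical body condition values (made up for illustration)
restored = [1.32, 1.41, 1.28, 1.45, 1.38, 1.50, 1.36]
non_restored = [1.21, 1.18, 1.30, 1.15, 1.25, 1.22, 1.27]

t0, p = permutation_test(restored, non_restored, k=2000)
print(f"t0 = {t0:.3f}, p = {p:.4f}")
```

With the +1 convention the p-value can never be exactly zero. Its smallest possible value is 1/(k+1), which is one reason to use many permutations.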
Permutation test: Advantages and limitations
● Advantages
● Free from distributional assumptions
● Applicable to complex designs through restricting permutations
● Limitations
● Generalisation to statistical population requires matching
distribution
● Statistical hypothesis testing can imply distributional assumptions
that apply to the permutation test, if aiming to infer to the
statistical population (e.g. testing for mean differences affected by
variance)
● Computationally intensive: Number of all possible permutations for a dataset is factorial n, i.e. n! (e.g. $35! \approx 10^{40}$)
→ Monte Carlo simulation
Legendre & Legendre 2012: 25ff
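The size of n! is easy to check directly. The short Python sketch below computes 35! exactly and, under the purely hypothetical assumption of one billion permutations evaluated per second, shows why exhaustive enumeration is out of reach.

```python
import math

n = 35
n_permutations = math.factorial(n)  # exact integer
print(f"{n}! = {n_permutations:.3e}")

# Hypothetical throughput, assumed only for illustration
permutations_per_second = 1e9
seconds_per_year = 60 * 60 * 24 * 365
years = n_permutations / permutations_per_second / seconds_per_year
print(f"Exhaustive enumeration would take about {years:.1e} years")
```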
Monte-Carlo simulation
● Uses repeated random sampling to solve problems
probabilistically (even though they can be deterministic in
reality)
● Permutation tests use random numbers to randomly permute
data → approximate with MC simulation
● Legendre & Legendre (2012): use at least 10,000 permutations
for inference
[Images: Edvard Munch, At the Roulette Table in Monte Carlo; entrance of the casino in Monte Carlo, Monaco]
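A classic illustration of solving a deterministic problem probabilistically is estimating π from random points in the unit square. The Python sketch below is a generic textbook example, not part of the course material. The function name `estimate_pi`, the number of draws and the seed are arbitrary choices.

```python
import random

def estimate_pi(n_draws, seed=1):
    """Estimate pi from the share of random points that fall inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of unit square = pi / 4
    return 4 * inside / n_draws

pi_hat = estimate_pi(100_000)
print(pi_hat)
```

The estimate becomes more precise as the number of draws grows, which is the same reason why a larger number of random permutations gives a more stable p-value.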
Bootstrapping: Idea and algorithm
● Inference on statistic t is based on sampling distribution
● Ideally: Draw all or many samples from statistical population
● Reality: Most frequently only one sample available
➔ Idea: Draw samples from an estimate of the statistical population
(i.e. the sample) and use these to estimate property (e.g. variance)
of the statistic t
● Algorithm:
1) Draw random sample with replacement from data
2) Compute statistic t* for bootstrap sample
(Repeat steps 1 and 2 k times)
3) Use the k estimates to derive property of statistic
● Exhaustive bootstrapping ($k = n^n$) computationally demanding → approximate with Monte Carlo simulation
● Given today's computer power, $10^4$ to $10^5$ simulations are viable
Bootstrapping: Example
Example: Bootstrap to the mean (to derive variance)
Sampling with replacement; t (here: mean)
Original dataset: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 → $\bar{x} = 8$
BS sample 1: 15 7 8 4 15 11 9 1 3 6 14 2 11 12 1 → $\bar{x}^* = 7.93$
BS sample 2: 5 7 8 8 15 10 10 1 13 6 13 2 10 12 3 → $\bar{x}^* = 8.2$
...
BS sample k: 7 5 8 2 12 14 10 8 13 6 11 7 15 12 1 → $\bar{x}^* = 8.73$
→ Distribution of statistic t
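The example can be reproduced with a short plain-Python sketch that bootstraps the mean of the dataset 1 to 15 and uses the resulting distribution to estimate the standard error and a simple percentile interval. The function name `bootstrap_means`, the number of resamples and the seed are assumptions made for illustration. The percentile interval is the basic variant, not an adjusted one.

```python
import random
import statistics

def bootstrap_means(data, k=10_000, seed=7):
    """Return k bootstrap estimates of the mean."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(k):
        sample = rng.choices(data, k=n)  # 1) draw n values with replacement
        means.append(sum(sample) / n)    # 2) statistic t* for the bootstrap sample
    return means

data = list(range(1, 16))  # original dataset from the example, mean = 8
boot = sorted(bootstrap_means(data))

# 3) use the k estimates to derive properties of the statistic
se_boot = statistics.stdev(boot)
se_formula = statistics.stdev(data) / len(data) ** 0.5
ci_95 = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])
print(f"bootstrap SE = {se_boot:.3f}, formula SE = {se_formula:.3f}, 95% interval = {ci_95}")
```

For the mean, the bootstrap standard error should be close to the familiar s/√n, which gives a simple plausibility check before applying the method to statistics without a closed-form standard error.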
Bootstrapping: Limitations
Hesterberg 2015 Amer. Statist. 69:371
● Do not use for hypothesis testing
● No distributional assumptions implied, but not reliable for all
distributions, particularly at small n (see Hesterberg 2015)
● Small n: use adjusted bootstrap percentiles (BCa) or switch to parametric statistics (allow for additional assumptions)
● Bootstrap does not improve estimate of population parameter µ: the bootstrap distribution is centred at x̄
Cross-validation (CV)
● Objective: Evaluate predictive accuracy of a fitted model
● Can be checked if independent training data (used to fit
model) and test data (new data) are available → Rare case
● Idea: Split the available data into training and test set and
predict the (known) observations in the test set from a model
fitted with the training data
● Algorithm:
1. Split the data randomly into k parts (i.e. sampling without replacement)
2. For each part:
   1. Fit the model to the other k-1 parts
   2. Predict the held-out part from the model and calculate the prediction error
3. Calculate prediction error as average over the k estimates
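The algorithm above can be sketched for a simple linear regression in plain Python. This is an illustration under stated assumptions: the helper names `fit_line` and `k_fold_cv`, the simulated data (intercept 1, slope 2, error standard deviation 1) and the seeds are made up, and the mean squared error is used as the prediction error.

```python
import random

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def k_fold_cv(x, y, k=5, seed=3):
    """Average mean squared prediction error over k folds."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)                          # 1. random split ...
    folds = [idx[i::k] for i in range(k)]     #    ... into k parts
    fold_mse = []
    for test_idx in folds:                    # 2. each part is held out once
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k                  # 3. average over the k estimates

# Simulated data (assumed parameters, for illustration only)
rng = random.Random(0)
x = [i / 5 for i in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1.0) for xi in x]

cv_mse = k_fold_cv(x, y, k=5)
print(f"5-fold CV mean squared error: {cv_mse:.2f}")
```

Because the simulated error variance is 1, a cross-validated mean squared error near 1 indicates that the linear model predicts about as well as the noise allows.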
Cross-validation (CV)
● Problem of choosing k:
● k = n (Leave-one-out CV predicts each observation from all others)
→ low bias, but high variance
● k = 2 (split data into halves) → low variance, but high bias
● k typically set to 5 or 10
[Figure: example with k = 5; taken from James et al. 2013: 181]
Bias-variance trade-off
Definitions in context of model validation:
● Bias: error when approximating training data
● Variance: variability in error when approximating test data
Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point
[Figure, taken from James et al. 2013: 33: function used to simulate data, fitted with a linear regression, a little flexible smoother and a highly flexible smoother; error curves for training data and test data]
Bias-variance trade-off
Higher flexibility (higher k in CV) → lower error for training
data (i.e. lower bias), but variance will increase from some
point → Optimise combined error
Taken from Hastie, Tibshirani and Friedman 2011: 38
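The trade-off can be illustrated with a small simulation in plain Python. As a stand-in for the smoothers in the cited figures, the sketch uses a nearest-neighbour average whose flexibility is controlled by the number of neighbours: 1 neighbour is the most flexible fit, all points give the least flexible fit (the global mean). The true function sin(x), the noise level, the sample sizes and all names are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(11)

def simulate(n):
    """Draw n noisy observations from an assumed true function, sin(x)."""
    x = [rng.uniform(0, 6) for _ in range(n)]
    y = [math.sin(xi) + rng.gauss(0, 0.3) for xi in x]
    return x, y

def smooth(x_train, y_train, x_new, n_neighbours):
    """Predict y at x_new as the mean of the n_neighbours closest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return sum(y_train[i] for i in nearest[:n_neighbours]) / n_neighbours

def mean_squared_error(x_train, y_train, x_eval, y_eval, n_neighbours):
    squared = [(yi - smooth(x_train, y_train, xi, n_neighbours)) ** 2
               for xi, yi in zip(x_eval, y_eval)]
    return sum(squared) / len(squared)

x_train, y_train = simulate(60)
x_test, y_test = simulate(60)

errors = {}  # n_neighbours -> (training error, test error)
for n_neighbours in (1, 5, 20, 60):  # from most to least flexible
    errors[n_neighbours] = (
        mean_squared_error(x_train, y_train, x_train, y_train, n_neighbours),
        mean_squared_error(x_train, y_train, x_test, y_test, n_neighbours),
    )
    print(n_neighbours, errors[n_neighbours])
```

The most flexible fit reproduces the training data exactly (training error 0) but still makes errors on the test data, whereas the least flexible fit has a large error on both. Intermediate settings are the candidates for the lowest test error.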
Relationship between two continuous
variables: linear regression model
● Bivariate relationship between an explanatory variable and a response variable with:
$y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
● Example: Can we approximate pesticide runoff concentrations with passive sampling? (Fernandez et al. 2014)
● Aim: minimise the sum of the squared errors, $\sum_i \epsilon_i^2$ (also called error sum of squares: SSE)
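For a single explanatory variable, the coefficients that minimise the SSE have a closed form: b1 = Sxy / Sxx and b0 = ȳ − b1·x̄. The plain-Python sketch below computes them for made-up paired measurements and confirms that shifting either coefficient increases the SSE. The helper names `least_squares` and `sse` and the data are hypothetical.

```python
def sse(x, y, b0, b1):
    """Error sum of squares for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Closed-form least-squares estimates for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical paired measurements (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]

b0, b1 = least_squares(x, y)
best = sse(x, y, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {best:.3f}")

# Any other line has a larger SSE
print(sse(x, y, b0 + 0.1, b1) > best, sse(x, y, b0, b1 - 0.1) > best)
```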
Linear regression model
$SSY = SSR + SSE$
(total variation = explained variation + unexplained variation)
% of explained variance: $R^2 = \frac{SSR}{SSY}$
$\text{adj. } R^2 = 1 - (1 - R^2) \frac{n-1}{n-p-1}$, with n observations and p explanatory variables
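The decomposition and both versions of R² can be verified numerically. The plain-Python sketch below fits a simple linear regression (p = 1) to made-up data and computes SSY, SSR and SSE from the fitted values. The data and the helper name `fit_line` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]
n, p = len(x), 1  # n observations, p explanatory variables

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
y_mean = sum(y) / n

ssy = sum((yi - y_mean) ** 2 for yi in y)               # total variation
ssr = sum((fi - y_mean) ** 2 for fi in fitted)          # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation

r2 = ssr / ssy
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"SSY = {ssy:.3f}, SSR + SSE = {ssr + sse:.3f}")
print(f"R2 = {r2:.4f}, adj. R2 = {adj_r2:.4f}")
```

The adjusted R² is always at most R², and the gap widens as more explanatory variables are added relative to the number of observations.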
Linear regression model
● Assumptions:
● Linear relationship (graphical diagnostics)
● Normal distribution of error (graphical diagnostics)
● Variance homogeneity (graphical diagnostics)
● Independence of errors (graphical diagnostics)
● If one or more assumptions are not met, alternatives include:
● Generalised linear model, Generalised least squares, Mixed
models
● Variable transformation (but using an appropriate model such as
a Generalised linear model is usually the better option)
42
Model diagnostics: Variance homogeneity
[Figure: Residuals vs. fitted values plots; panels: "normal", "slight increase", "strong increase", "non-linear"]
43
Further model diagnostics
Leverage points (predictor outlier)
[Figure: scatterplots contrasting a leverage point that exerts high influence with a non-influential leverage point]
How to deal with leverage points/outliers?
● Check whether values are plausible
● Check robustness of model results when removing observations
● Fit different statistical model or transform data
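For one explanatory variable, the leverage of observation i has the closed form $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$. The Python sketch below (standard library only) computes it for hypothetical toy data and flags values above 2(p + 1)/n, a common rule of thumb that is not stated on the slide and is used here as an assumption.

```python
def leverages(x):
    """Leverage (hat value) of each observation in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical predictor values with one point far from the rest
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages(x)

# Rule-of-thumb cut-off 2(p + 1)/n with p = 1 explanatory variable (assumption)
threshold = 2 * (1 + 1) / len(x)
flagged = [xi for xi, hi in zip(x, h) if hi > threshold]
print(h, flagged)
```

Leverage depends on x alone. Whether a flagged point is also influential depends on its response value, which is why the robustness check listed above (refitting without the observation) is still needed.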
44
Flowchart for simple linear regression
Taken from Sheather 2009: p.103
45
Simulation-based approaches to simple
linear regression
● Predictive accuracy measured with Mean square
prediction error (MSPE):
$MSPE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$ for the new observations 1 to m
● Cross-validation (CV): Calculate CV-MSPE and CV-R2
● Bootstrapping (BS) in regression analysis:
● of residuals: BS residuals, add to $\hat{y}$ to generate new $y^*$ and
calculate regression coefficients → x fixed
● of cases: BS complete cases and calculate regression
coefficients → x random
● If x and y are a random sample (e.g. x not fixed in experiment), or if
residuals are correlated or exhibit non-constant variance → BS cases
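The three procedures on this slide can be sketched compactly in Python (standard library only; the course scripts use R). All function names, the simulated data, k = 5 folds and 1000 bootstrap replicates are assumptions made for illustration.

```python
import random
from statistics import mean


def ols(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def cv_mspe(x, y, k, rng):
    """k-fold cross-validated mean square prediction error."""
    idx = list(range(len(x)))
    rng.shuffle(idx)
    squared_errors = []
    for f in range(k):
        test = idx[f::k]                     # every k-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, b1 = ols([x[i] for i in train], [y[i] for i in train])
        squared_errors += [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
    return mean(squared_errors)


def bootstrap_slopes(x, y, n_boot, rng, method="cases"):
    """Bootstrap distribution of the slope, resampling cases or residuals."""
    n = len(x)
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    while len(slopes) < n_boot:
        if method == "cases":                # x random: resample (x, y) pairs
            pick = [rng.randrange(n) for _ in range(n)]
            xs, ys = [x[i] for i in pick], [y[i] for i in pick]
            if len(set(xs)) < 2:             # slope undefined, draw again
                continue
        else:                                # x fixed: y* = fitted + resampled residual
            xs = x
            ys = [fi + rng.choice(residuals) for fi in fitted]
        slopes.append(ols(xs, ys)[1])
    return slopes


# Simulated data (assumption): y = 2 + 0.5 x + N(0, 1) noise
rng = random.Random(42)
x = [float(i) for i in range(1, 21)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

mspe = cv_mspe(x, y, k=5, rng=rng)
slopes_cases = bootstrap_slopes(x, y, 1000, rng, method="cases")
slopes_resid = bootstrap_slopes(x, y, 1000, rng, method="residuals")
print(mspe, mean(slopes_cases), mean(slopes_resid))
```

Percentiles of the bootstrapped slopes could then serve as an interval estimate for the slope; which resampling scheme is appropriate follows the criteria in the last bullet above.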
46
Exercise
We will work with the data set “possum” that includes
biometric measurements of possums in Victoria, Australia.
Conduct a linear regression analysis, diagnose and
interpret the model and apply simulation-based
approaches.
1
Applied Multivariate Statistics
Ralf B. Schäfer
University of Koblenz-Landau 2017/18
These slides and notes complement the lecture with exercises
“Applied multivariate statistics for environmental scientists”.
Do not hesitate to contact me if you have any comments or
you found any errors (text or slides):
schaefer-ralf@uni-landau.de
While I made notes below the slides, some aspects are only
mentioned in the R scripts associated with the lecture.
2
Short introduction
● Professor for Quantitative Landscape Ecology
● Current teaching: Statistics (M.Sc.); GIS (B.Sc./M.Sc.);
Environmental Modelling (B.Sc./M.Sc.); Aquatic
Ecotoxicology (M.Sc.); Environmental Philosophy (B.Sc.)
● Research focus:
● Community ecology of freshwater invertebrates and
microorganisms
● Response of freshwater ecosystems to different
(anthropogenic) stressors (e.g. pollution)
● Trophic linkages between aquatic & terrestrial systems
● Primarily field studies/experiments and data analyses/
modelling
www.landscapecology.uni-landau.de
3
Organisation
● Lecture material (including course schedule and literature
list) can be found on github and website:
https://github.com/rbslandau/statistics_multi
https://goo.gl/EhPVFG
● Inverted classroom: Self study of lecture and demonstration,
Q&A and exercises in class room
● Contact time: 2 hours per week; Own study time:
approximately 1 day per week
Literature references that are listed in the literature list are cited
in short form on slides. For literature not contained in the
literature list, I give the complete reference on the slide or in the
notes for the respective slide.
4
Using your own notebook
● feel free to use your own WLAN-enabled notebook!
● install R (http://mirrors.softliste.de/cran/) and optionally RStudio
(recommended for beginners - http://www.rstudio.com/)
● Run “0_Install_packgs.R”, provided on github
● for installation of additional packages run
install.packages("package to be installed")
5
Course objectives: Learning outcomes
● Classify, explain and interpret the different types of
(multivariate) statistical approaches
● Select and apply the appropriate statistical method
for the research goal
● Demonstrate moderate level of statistical modelling
skills, including scripting in R
6
Two incorrect ways of thinking about stats
1.Overconfidence: Statistics is like mathematics and
provides a single, correct answer
But statistical thinking differs from mathematical thinking
2.Disbelief: Anything goes – statistics cannot be trusted
But: statistics provide quantitative support of the complete
research process
Tintle (2015) Amer. Statist. 69: 362
Tintle N., Chance B., Cobb G., Roy S., Swanson T. &
VanderStoep J. (2015) Combating Anti-Statistical Thinking
Using Simulation-Based Methods Throughout the
Undergraduate Curriculum. The American Statistician 69, 362–
370.
7
Statistical modelling,
simulation and the linear model
1.Framework for data analysis and tools for
data exploration
2.Statistical modelling and simulation-based tools
3.Permutation and Monte Carlo simulation
4.Bootstrapping
5.Cross-Validation and Bias-variance trade-off
6.Revisiting the linear model
Contents
8
Learning targets
● Explain the data analysis cycle and apply tools
for exploratory data analysis
● Explain approaches to statistical modelling and
simulation and apply simulation-based methods
● Diagnosing and interpreting the linear model
9
Learning targets and study questions
● Explain the data analysis cycle and apply tools for
exploratory data analysis
● Explain the steps of the data analysis cycle.
● Summarise the elements of exploratory analysis. Which
graphical tools are essential?
● Explain approaches to statistical modelling and
simulation and apply simulation-based methods
● Discuss the two different approaches to statistical modelling
and links through simulation-based approaches.
● Explain the purpose and critically discuss permutation tests.
● Explain the purpose and critically discuss bootstrapping.
● Explain the main idea of cross-validation and discuss the
selection of k with respect to the bias-variance trade-off.
10
Learning targets and study questions
● Diagnosing and interpreting the linear model
● Describe the assumptions of the linear regression and
explain how they can be checked.
● Which types of outliers exist? When is an outlier important?
● Discuss the application of bootstrapping and cross-validation
for the linear model.
11
Data analysis cycle
Modified from Zumel & Mount 2014: 6
[Figure: cycle diagram with the following elements]
● Research goal and questions
● e.g. empirical study, data compilation
● e.g. formatting, handling missing values
● Data exploration: e.g. outliers, skewness, distribution, linearity
● Statistical modelling
● Model diagnosis, evaluation and interpretation: Check model
assumptions and interpret results.
● Publish results
● Identify limitations and open questions
Zumel N. & Mount J. (2014) Practical data science with R. Manning
Publications Co, Shelter Island, NY.
Data exploration is visualised with a dashed line because it depends on
the research context if and when data exploration is conducted. Most
frequently, data exploration (e.g. descriptive statistics such as
data summaries) is employed before statistical modelling, and the
characteristics of the data set are explored to aid model selection. In
some studies and disciplines, no statistical modelling may be done at
all and only descriptive statistics are reported. If a clear research
hypothesis has been established before data collection, data
exploration may not be required before statistical modelling. Still,
the techniques related to data exploration will be needed to check
model assumptions. Note that you must not establish research or
statistical hypotheses after data exploration.
12
Data analysis cycle
Modified from Zumel & Mount 2014: 6
[Figure: data analysis cycle diagram, repeated from slide 11]
13
Define research goal and question
● Research goals (e.g. prediction, estimation, inference) and
questions should inform study design and methods
● Aim: Test scientific hypothesis → Formulate testable hypothesis
Example
Scientific hypothesis: Restoring stream stretches
alters aquatic communities, resulting in different
emerging insects on which riparian spiders prey.
This affects the spiders’ body condition derived
from prosomal (pr.) and opisthosomal (op.) width.
[Figure: spider with labelled body parts: Opisthosoma, Prosoma]
Question: Does the body condition of riparian spiders differ between
restored and non-restored stream stretches?
● Testable hypothesis: The sample means for the body condition are
drawn from populations with the same µ:
$H_0: \mu_{\text{restored}} = \mu_{\text{non-restored}}$, $H_1: \mu_{\text{restored}} \neq \mu_{\text{non-restored}}$
River restoration may lead to improvements such as increased
species richness of the aquatic invertebrate community.
Terrestrial predators in the riparian zone such as spiders, in
turn, may benefit from an increase in the biomass and diversity
of aquatic emergent prey. In a study we therefore compared the
body condition, using a proxy based on prosomal and
opisthosomal width, between non-restored and restored stream
reaches.
Statistical hypothesis testing consisted of comparing the central
tendencies (sample means) using a paired t-test (each line
corresponds to a different stream).
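The test statistic of the paired t-test mentioned above is the mean of the within-stream differences divided by its standard error, with n − 1 degrees of freedom. A minimal Python sketch (standard library only) follows; the body-condition values are invented for illustration and are not the study data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = [ai - bi for ai, bi in zip(a, b)]   # one difference per stream
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))      # mean difference / standard error
    return t, n - 1


# Hypothetical body-condition proxy per stream (assumption, not real data)
restored = [1.2, 0.9, 1.4, 1.3, 1.1]
non_restored = [1.0, 1.0, 1.0, 1.0, 1.0]

t_stat, df = paired_t(restored, non_restored)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The p-value would be obtained by comparing the statistic with a t distribution with the given degrees of freedom, which is omitted here because the Python standard library does not provide that distribution.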
14
Data analysis cycle
Modified from Zumel & Mount 2014: 6
[Figure: data analysis cycle diagram, repeated from slide 11]
15
Tools for data exploration
GIGO: Garbage in – Garbage out
Elements of data exploration – Checking for:
1. Outliers (e.g. boxplot)
2. Variance homogeneity (e.g. conditional boxplot)
3. Normal distribution (e.g. QQ-plot)
4. (Double) zeros (e.g. frequency plot)
5. Collinearity (e.g. pairwise scatterplots)
6. Relationship between explanatory and response variable (e.g.
scatterplots)
7. Spatial or temporal autocorrelation (e.g. variograms)
● Useful for inspecting data before the modelling but also for
model diagnosis
● Zuur et al. (2009) urge data inspection before modelling
A recommended read is:
Zuur, A.F.; Ieno, E.N.; Elphick, C.S. (2009): A protocol for data exploration
to avoid common statistical problems. Methods in Ecology and
Evolution 1: 3–14.
http://onlinelibrary.wiley.com/wol1/doi/10.1111/j.2041-210X.2009.00001.x/full
You have already encountered several of the elements of data
exploration in the course and you will meet them again later.
Double zeros often occur in species data (i.e. absence of a
species in pairs of sites) and may complicate interpretation. In
addition, many zeros in the response variable can lead to biased
parameter estimates; in such a situation, models tailored for
zero-inflated data should be used. For zero-inflated models see: Zuur, A.F.
& Ieno, E.N. (2016): Beginner’s Guide to Zero-Inflated Models with R.
Highland Statistics.
http://highstat.com/index.php/beginner-s-guide-to-zero-inflated-models
16
Data exploration
Common plots for looking at the data
[Figure: example plots annotated with the questions: Outliers? Asymmetry of distribution? Normality? Linearity? Collinearity?]
There are several rules of thumb as to what can be regarded as an
outlier, but it remains a more or less subjective decision. John Tukey
suggested defining Y as an outlier if: Y < (Q1 − 1.5 IQR) or Y > (Q3 +
1.5 IQR), where Q1 denotes the lower quartile, Q3 denotes the upper
quartile, and IQR = (Q3 − Q1) denotes the interquartile range. In
practice, the type of data, number of observations and knowledge
about the data should be taken into account when deciding whether an
observation is classified as an outlier.
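Tukey's rule is easy to apply by hand or in code. The Python sketch below (standard library only) uses hypothetical data; note that software packages compute quartiles with slightly different conventions, so the fences can differ a little between tools. The "inclusive" method used here is one such convention and is an assumption of this example.

```python
from statistics import quantiles


def tukey_outliers(values):
    """Return the values outside Tukey's fences and the fences themselves."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Hypothetical measurements with one unusually large value
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
outliers, fences = tukey_outliers(data)
print(outliers, fences)
```

For these data Q1 = 4.25 and Q3 = 8.75, so the fences are −2.5 and 15.5 and only the value 30 is flagged. Whether it is then treated as an outlier remains the contextual decision described above.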
A beanplot represents an alternative to a boxplot that has several
advantages. Beanplots were introduced by Peter Kampstra:
Kampstra P. 2008: Beanplot: A Boxplot Alternative for Visual
Comparison of Distributions. Journal of Statistical Software, Code
Snippets. 28 (1): 1-9. Freely available at
http://www.jstatsoft.org/v28/c01/
We will briefly look at a beanplot in the practical part.
17
Statistical modelling,
simulation and the linear model
1.Framework for data analysis and tools for data
exploration
2.Statistical modelling and simulation-based tools
3.Permutation and Monte Carlo simulation
4.Bootstrapping
5.Cross-Validation and Bias-variance trade-off
6.Revisiting the linear model
Contents
18
Data analysis cycle
Modified from Zumel & Mount 2014: 6
[Figure: data analysis cycle diagram, repeated from slide 11]
19
Statistical modelling: The two cultures
Breiman 2001 Statistical Science 16: 199
Real world: Processes lead to association between x and y
Examples for goals of statistical modelling: predict unknown y
from x, estimate how x is related to y
Data modelling culture (classical statistics)
● Common data model
● Estimate parameters from data
● Model validation: Check residuals
Algorithmic modeling culture (machine learning)
● Find algorithm that operates on x to predict y
● Model validation: Predictive accuracy
Breiman L. (2001) Statistical modeling: The two cultures.
Statistical Science 16, 199–215.
The very readable debate is available here:
https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
20
Statistical modelling: the classical view
● Fit model to data to inform estimation, inference or
prediction (e.g. estimate point or interval, test hypothesis)
● Example: The arithmetic mean $\bar{x}$ is an estimate of the true
population mean $\mu$ and $s^2$ is an estimate of the true variance $\sigma^2$
● Most models incorporate a deterministic (fixed effect) and
a stochastic component (random effect)
● Example: $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$
● All models rely on assumptions → Model diagnosis
● e.g. normal distribution, independence of observations
● Goodness of fit measures help to choose between multiple
models that fit the data
● e.g. AIC, $R^2$, RMSE
Any observation contains signal and noise. In a statistical model,
this relates to the fitted value and the residual.
21
Simulation-based approaches in
data analysis
● Compatible with both cultures
● Infuses algorithm-based thinking into classical statistics
● Examples for simulation-based approaches for estimation,
inference or model diagnosis in classical statistics:
1. Permutation test → Permuting (shuffling) the data to derive
null distribution. Mainly used for inference
2. Bootstrapping → Randomly sampling subsets from the
data with replacement. Mainly used for estimation of
parameter distribution
3. Cross-validation (CV) → Splitting data into sets (i.e.
sampling without replacement). Mainly used for validation of
predictive models
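The three approaches differ mainly in how the data are resampled. The following Python sketch (standard library only) shows just the resampling step for each; the toy data, the seed and k = 5 folds are assumptions for illustration.

```python
import random

rng = random.Random(1)
data = list(range(1, 11))          # hypothetical observations 1..10

# 1. Permutation: same values, shuffled order (no value lost or repeated)
permuted = rng.sample(data, k=len(data))

# 2. Bootstrap: sample of the same size drawn WITH replacement,
#    so some values can appear several times and others not at all
boot = rng.choices(data, k=len(data))

# 3. Cross-validation: split into k disjoint folds (WITHOUT replacement);
#    each fold serves once as test set, the remaining folds as training set
k = 5
idx = rng.sample(range(len(data)), k=len(data))
folds = [[data[i] for i in idx[f::k]] for f in range(k)]

print(permuted, boot, folds, sep="\n")
```

In an actual analysis, a test statistic, a parameter estimate or a prediction error would be computed on each resampled data set and the procedure repeated many times.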
22
Statistical modelling,
simulation and the linear model
1.Framework for data analysis and tools for data
exploration
2.Statistical modelling and simulation-based tools
3.Permutation and Monte Carlo simulation
4.Bootstrapping
5.Cross-Validation and Bias-variance trade-off
6.Revisiting the linear model
Contents
23
Permutation test: Algorithm
1) Permute values in data set
2) Compute test statistic t* for permuted data
→ Repeat steps 1) and 2) k times
3) Compare test statistic $t_0$ of the original data to the generated null distribution
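The algorithm can be sketched for the difference in group means, the example used on the following slide. This is a minimal Python version (standard library only); the function name, the toy data and k = 999 permutations are assumptions.

```python
import random
from statistics import mean


def permutation_test(group1, group2, k, rng):
    """Return the observed statistic t0 and k permuted statistics t*."""
    t0 = mean(group1) - mean(group2)        # statistic for the original data
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    t_star = []
    for _ in range(k):
        rng.shuffle(pooled)                 # step 1: permute values
        # step 2: recompute the statistic with shuffled group membership
        t_star.append(mean(pooled[:n1]) - mean(pooled[n1:]))
    return t0, t_star


# Hypothetical measurements for two groups
group1 = [5.1, 4.9, 6.0, 5.5]
group2 = [4.2, 4.8, 4.5, 4.0]

t0, t_star = permutation_test(group1, group2, k=999, rng=random.Random(7))
print(t0, min(t_star), max(t_star))
```

Step 3 then compares t0 with the distribution of the values in `t_star`, which approximates the null distribution of no group difference.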
24
Permutation test: Algorithm
Example: Permutation test of difference in group mean
[Figure: The original dataset (x1, ..., x15) is divided into Group 1 and Group 2 and yields the test statistic t0 = x̄group1 − x̄group2. In each of the permutations 1 to k, the values are shuffled across the two groups, which yields the test statistics t1*, ..., tk*.]
1) Permute values in data set
2) Compute test statistic t* for permuted data
→ Repeat steps 1) and 2) k times
3) Compare test statistic t0 to generated null distribution
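The three steps of the algorithm can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data, the function names (`mean_difference`, `permutation_test`), the seed and the one-sided comparison (counting t* ≤ t0) are chosen for this example and are not taken from the lecture material.

```python
import random


def mean_difference(group1, group2):
    """Test statistic: difference in group means."""
    return sum(group1) / len(group1) - sum(group2) / len(group2)


def permutation_test(group1, group2, k=10_000, seed=1):
    """Monte Carlo permutation test for a difference in group means.

    Returns the observed statistic t0 and the fraction of permuted
    statistics t* that are less than or equal to t0 (one-sided p-value).
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    t0 = mean_difference(group1, group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(k):
        rng.shuffle(pooled)                                 # 1) permute values
        t_star = mean_difference(pooled[:n1], pooled[n1:])  # 2) compute t*
        if t_star <= t0:                                    # 3) compare with t0
            count += 1
    return t0, count / k


# Hypothetical data (assumption, not from the lecture): 15 values in two groups
group1 = [12, 15, 11, 14, 16, 10, 13]
group2 = [18, 21, 17, 23, 19, 20, 22, 16]

t0, p_value = permutation_test(group1, group2)
print(f"t0 = {t0:.2f}, p = {p_value:.4f}")
```

For a hypothesis in the other direction, the comparison would be `t_star >= t0`; a two-sided version could compare absolute values.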
Permutation test: Generated distribution
[Figure: generated distribution of the test statistic t* with the position of t0 marked]
● Test informs whether pattern in data is due to chance
● Inference regarding statistical population only valid if distribution of sample data matches actual distribution of statistical population → particularly problematic for small n
$$p = \frac{1}{k} \sum_{i=1}^{k} \begin{cases} 1 & \text{if } t_i^* \le t_0 \\ 0 & \text{else} \end{cases}$$

The p-value is computed as the fraction of test statistics t*, which are based on permuted data, that are at least as extreme (lower or higher depending on the hypothesis) as the non-permuted test statistic t0.
If the sample distribution deviates from the actual distribution of the statistical population, the permutation test only supports conclusions that apply to the data at hand. These may not be very interesting. For the example of the mean comparison, this would translate to being unable to test the null hypothesis:
$H_0: \mu_{group1} = \mu_{group2}$
What counts as a small sample size n depends on the context, and no single number applies to all situations. For example, it depends on the statistical distribution, the statistical test etc. However, as a rule of thumb, samples of n < 30 from a population are small. Still, much larger sample sizes can be required to reliably generalize from the permutation test to the statistical population.
Permutation test: Advantages and limitations
● Advantages
  ● Free from distributional assumptions
  ● Applicable to complex designs through restricting permutations
● Limitations
  ● Generalisation to statistical population requires matching distribution
  ● Statistical hypothesis testing can imply distributional assumptions that also apply to the permutation test if the aim is inference on the statistical population (e.g. a test for mean differences is affected by differences in variance)
  ● Computationally intensive: Number of all possible permutations for a dataset of size n is n factorial, i.e. n! (e.g. 35! ≈ 10^40) → Monte Carlo simulation
Legendre & Legendre 2012: 25ff

For instance, two null hypotheses are tested simultaneously (1. equality of means, 2. equality of variances) when testing for differences among sample means. This dual aspect of classical tests such as analysis of variance or the t-test also applies to the related permutation test and prevents unequivocal conclusions regarding the mean difference unless the equality of variances is considered.
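To get a feeling for how quickly the number of possible permutations n! grows, the following short sketch prints it for a few dataset sizes. The chosen values of n are arbitrary illustrations, apart from n = 35, which is the example given above.

```python
import math

# Number of possible permutations n! for a few (arbitrarily chosen) dataset sizes
for n in (5, 10, 15, 35):
    print(f"n = {n:>2}: n! = {math.factorial(n):.2e}")

# n =  5: n! = 1.20e+02
# n = 10: n! = 3.63e+06
# n = 15: n! = 1.31e+12
# n = 35: n! = 1.03e+40
```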
Monte-Carlo simulation
● Uses repeated random sampling to solve problems probabilistically (even though they can be deterministic in reality)
● Permutation tests use random numbers to randomly permute data → approximate with MC simulation
● Legendre & Legendre (2012): use at least 10,000 permutations for inference
[Pictures: Edvard Munch - At the Roulette Table in Monte Carlo; Entrance of casino in Monte Carlo, Monaco]

The name refers to the city of Monte Carlo; it was chosen as the code name for a secret project in the context of nuclear weapon research in Los Alamos, USA.
The larger the number of MC-based permutations, the lower the error when approximating the distribution of all possible permutations with the distribution of the MC-based permutations.
Picture sources:
Photo of Casino: https://pixabay.com/de/spielbank-casino-monte-carlo-monaco-188882/
Picture of Munch: https://upload.wikimedia.org/wikipedia/commons/1/1f/Edvard_Munch_-_At_the_Roulette_Table_in_Monte_Carlo_-_Google_Art_Project.jpg
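The statement that more MC-based permutations reduce the approximation error can be checked on a dataset small enough to enumerate exhaustively. The sketch below is an illustration under assumptions: the data, the seed and the function names are hypothetical. For a difference in two group means, all n! permutations reduce to the distinct ways of assigning values to group 1 (here 252), because the order within a group does not change the mean.

```python
import random
from itertools import combinations

# Hypothetical data (assumption): the first 5 values form group 1, the rest group 2
values = [12, 15, 11, 18, 16, 14, 21, 17, 23, 19]
n, n1 = len(values), 5


def mean_difference(idx1):
    """Difference in group means, with group 1 given as a set of indices."""
    g1 = [values[i] for i in idx1]
    g2 = [values[i] for i in range(n) if i not in idx1]
    return sum(g1) / len(g1) - sum(g2) / len(g2)


t0 = mean_difference(set(range(n1)))

# Exhaustive reference distribution: every possible assignment to group 1
exact = [mean_difference(set(c)) for c in combinations(range(n), n1)]
p_exact = sum(t <= t0 for t in exact) / len(exact)

rng = random.Random(42)  # seeded only for reproducibility


def mc_p_value(k):
    """Monte Carlo approximation of the p-value from k random assignments."""
    hits = 0
    for _ in range(k):
        idx1 = set(rng.sample(range(n), n1))
        hits += mean_difference(idx1) <= t0
    return hits / k


p_mc = {k: mc_p_value(k) for k in (100, 1_000, 10_000)}
for k, p in p_mc.items():
    print(f"k = {k:>6}: MC p = {p:.4f} (exact p = {p_exact:.4f})")
```

With increasing k, the MC-based p-value should scatter more and more tightly around the exact value.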
Bootstrapping: Idea and algorithm
● Inference on statistic t is based on sampling distribution
● Ideally: Draw all or many samples from statistical population
● Reality: Most frequently only one sample available
➔ Idea: Draw samples from an estimate of the statistical population (i.e. the sample) and use these to estimate a property (e.g. variance) of the statistic t
● Algorithm:
  1) Draw random sample with replacement from data
  2) Compute statistic t* for bootstrap sample
  → Repeat steps 1) and 2) k times
  3) Use the k estimates to derive property of statistic
● Exhaustive bootstrapping (k = n^n) computationally demanding → approximate with Monte Carlo simulation
● Given today's computing power, 10^4 to 10^5 simulations are viable

The name bootstrapping alludes to the phrase “pulling oneself up by one’s bootstraps”, which is attributed to the fictional character Baron Münchhausen.
In analogy to permutation tests, the following applies to bootstrapping: The larger the number of MC-based bootstrap samples, the lower the error when approximating the bootstrap distribution with the MC-based samples.
Bootstrapping: Example
Example: Bootstrapping the mean (to derive its variance)
[Figure: The original dataset (values 1 to 15; statistic t, here the mean: x̄ = 8) is sampled with replacement to obtain BS sample 1, BS sample 2, ..., BS sample k. Each bootstrap sample yields a bootstrap mean (shown: x̄* = 7.93, x̄* = 8.2, ..., x̄* = 8.73). Together, the bootstrap means form the distribution of statistic t.]
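The example can be reproduced with a short sketch that bootstraps the mean of the values 1 to 15. The function names, the number of bootstrap samples k and the seed are assumptions made for this illustration.

```python
import random
from statistics import stdev


def bootstrap(data, statistic, k=10_000, seed=1):
    """Return k bootstrap estimates t* of a statistic."""
    rng = random.Random(seed)  # seeded only for reproducibility
    n = len(data)
    estimates = []
    for _ in range(k):
        sample = rng.choices(data, k=n)      # 1) draw n values with replacement
        estimates.append(statistic(sample))  # 2) compute t* for the sample
    return estimates


def sample_mean(values):
    return sum(values) / len(values)


data = list(range(1, 16))  # original dataset of the example: 1, 2, ..., 15
boot_means = bootstrap(data, sample_mean)

# 3) use the k estimates to derive a property of the statistic
centre = sample_mean(boot_means)
standard_error = stdev(boot_means)
print(f"mean of data:            {sample_mean(data):.2f}")
print(f"centre of bootstrap dist: {centre:.2f}")
print(f"bootstrap standard error: {standard_error:.2f}")
```

The bootstrap means scatter around the sample mean of 8, and their standard deviation serves as an estimate of the standard error of the mean.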
Bootstrapping: Limitations
● Do not use for hypothesis testing
● No distributional assumptions implied, but not reliable for all distributions, particularly at small n (see Hesterberg 2015)
● Small n: use adjusted bootstrap percentiles (BCa) or switch to parametric statistics (allow for additional assumptions)
● Bootstrap does not improve estimate of population parameter µ; the bootstrap distribution is centred at x̄
Hesterberg 2015 Amer. Statist. 69:371

Bootstrapping is generally less accurate than permutation tests for hypothesis testing.
BCa corrects for bias and skewness in the distribution of bootstrap estimates.
A very nice introduction to and overview of bootstrapping is provided by:
Hesterberg T.C. (2015) What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. The American Statistician 69, 371–386.
Freely available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784504/pdf/utas-69-371.pdf
Cross-validation (CV)
● Objective: Evaluate predictive accuracy of a fitted model
● Can be checked directly if independent training data (used to fit the model) and test data (new data) are available → Rare case
● Idea: Split the available data into a training and a test set and predict the (known) observations in the test set from a model fitted with the training data
● Algorithm:
  1. Split the data randomly into k parts (i.e. draw k random samples without replacement)
  2. For each of the k parts:
     1. Fit the model to the other k−1 parts
     2. Predict the held-out part from the model and calculate the prediction error
  3. Calculate the prediction error as average over the k estimates

Predictive accuracy measures the accuracy of predictions for new data.
CV is typically used in validation, but can also be used as a goodness-of-fit measure to guide parameter estimation (see shrinkage methods later).
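The CV algorithm can be sketched for a simple linear regression as follows. Everything specific in this sketch is an assumption for illustration: the simulated data, the helper `fit_line`, the use of the mean squared prediction error as error measure, and the seeds.

```python
import random


def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def k_fold_cv(x, y, k=5, seed=1):
    """Average mean squared prediction error over k folds."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # 1. k disjoint random parts
    fold_errors = []
    for test_idx in folds:                 # 2. each part is the test set once
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        squared_errors = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_errors) / k            # 3. average over the k estimates


# Simulated data (assumption): y = 2 + 0.5 x + normally distributed noise
data_rng = random.Random(0)
x = [i / 2 for i in range(40)]
y = [2 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

cv_error = k_fold_cv(x, y, k=5)
loocv_error = k_fold_cv(x, y, k=len(x))  # leave-one-out CV as special case k = n
print(f"5-fold CV error: {cv_error:.3f}, LOOCV error: {loocv_error:.3f}")
```

Because the noise was simulated with a variance of 1, both error estimates should be in the vicinity of 1.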
Cross-validation (CV)
● Problem of choosing k:
  ● k = n (Leave-one-out CV predicts each observation from all others) → low bias, but high variance
  ● k = 2 (split data into halves) → low variance, but high bias
  ● k typically set to 5 or 10
[Figure: Example: k = 5. Taken from James et al. 2013: 181]

The bias-variance trade-off will be discussed in detail on the next slides. In brief, there is a trade-off between bias (error when estimating the 'true' prediction accuracy of the sample data) and variance (variability of the error when predicting new (test) data). If we use a major fraction of the data in model fitting (extreme case: k = n, where we use n−1 observations), the error of estimating the prediction accuracy of the full data is probably very low (low bias). However, the variability of the error when predicting a few observations (or only one for k = n) from different training sets is most likely high, which translates to a high variance. Conversely, if we use only half of the data (k = 2) in model fitting, we decrease the variance. In other words, the error when predicting the test set is most likely similar for the two training sets. But this comes at the cost of bias. In the case of k = 2, we are estimating the predictive accuracy from a model fitted to only a fraction of the data, whereas in practice all observations will be used in prediction. The prediction accuracy estimated from the fraction of the data is likely to differ (i.e. be lower or higher) from that of the complete data set, i.e. exhibit bias. Thus, the bias increases when the relative size of the training set in CV decreases.
k is typically set to 5 or 10, i.e. the data are partitioned into 5 or 10 groups during CV, as a compromise between bias and variance. Leave-one-out CV is considered less reliable than 5- or 10-fold CV (see Harrell 2015: 172).
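How the choice of k changes the relative size of the training set can be made concrete with a small calculation. The sample size n = 100 is a hypothetical value chosen so that all k divide n evenly.

```python
n = 100  # hypothetical sample size (assumption)

split_sizes = {}
for k in (2, 5, 10, n):
    test_size = n // k            # observations predicted in each fold
    train_size = n - test_size    # observations used to fit the model
    split_sizes[k] = (train_size, test_size)
    print(f"k = {k:>3}: fit on {train_size} observations, predict {test_size}")

# k =   2: fit on 50 observations, predict 50
# k =   5: fit on 80 observations, predict 20
# k =  10: fit on 90 observations, predict 10
# k = 100: fit on 99 observations, predict 1
```

For k = 2 only half of the data enter model fitting, whereas for k = n the training set is almost the complete data set.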
Bias-variance trade-off
Definitions in context of model validation:
● Bias: error when approximating training data
● Variance: variability in error when approximating test data
[Figure, taken from James et al. 2013: 33. Left: data with the function used to simulate the data and the fits of a linear regression, a less flexible smoother and a highly flexible smoother. Right: error for training data and test data, with the variance indicated.]
Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point

The left figure displays the fit of different models to data originating from the function plotted in black.
The models rank regarding bias: linear regression > less flexible smoother > highly flexible smoother.
Regarding variance (see right figure), the ranking is: highly flexible smoother > less flexible smoother > linear regression.
Bias-variance trade-off
Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point → Optimise combined error
[Figure taken from Hastie, Tibshirani and Friedman 2011: 38]

For a mathematical derivation of the bias-variance trade-off see Matloff (2017): 48f.
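As a pointer to what such a derivation arrives at, the commonly used decomposition of the expected squared prediction error for a new observation is given below, in the form found for example in James et al. (2013). The notation is an assumption and not taken from the slides: $\hat{f}$ is the fitted model, $(x_0, y_0)$ a new observation and $\epsilon$ the irreducible error. Note that bias and variance here refer to the fitted model across training sets, which is more formal than the working definitions given above.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\epsilon)
```

Increasing flexibility typically lowers the squared bias and raises the variance, so the combined error is smallest at an intermediate flexibility.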
Relationship between two continuous variables: linear regression model
● Bivariate relationship between an explanatory variable and a response variable with:
$y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
● Example: Can we approximate pesticide runoff concentrations with passive sampling?
[Figure: Fernandez et al. 2014]

The figure shows the concentrations of pesticides measured with passive samplers (TWA concentrations) and event-driven samplers (EDS peak concentrations). The concentrations are relatively similar for pesticides that were quantified in samples from both sampling devices, i.e. they follow almost a 1:1 relationship, which means that passive sampling is a suitable technique to approximate peak concentrations. The non-filled points indicate cases where a compound was only quantified in samples of one of the sampling devices. Further details can be found in:
Fernández D., Vermeirssen E.L.M., Bandow N., Muñoz K. & Schäfer R.B. (2014) Calibration and field application of passive sampling for episodic exposure to polar organic pesticides in streams. Environmental Pollution 194, 196–202.
Relationship between two continuous variables: linear regression model
● Bivariate relationship between an explanatory variable and a response variable with:
$y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
● Aim: minimise the sum of the squared errors $\sum_{i=1}^{n} \epsilon_i^2$ (also called error sum of squares: SSE)

A measure that is similar to the SSE is the Mean Squared Error (MSE), which is given as:
$\mathrm{MSE} = \frac{1}{n-p-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The fitted values for the regression model, i.e. the estimates for y, are given as:
$\hat{y}_i = b_0 + b_1 x_i$
The denominator accounts for the number of explanatory variables p and the intercept and requires adjustment in case of no-intercept models (i.e. the denominator would turn into n−p). The MSE is typically used when assessing the quality of the estimation. If the predictive accuracy is assessed, the Mean Squared Prediction Error (MSPE) is used for new observations $y_{n+1}$ to $y_m$.
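The calculation of the fitted values, the SSE and the MSE can be sketched for a small dataset. The data and the helper `fit_line` are assumptions for illustration. The closed-form least-squares estimates used in the sketch are the standard ones for simple linear regression.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) that minimise the SSE."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical data (assumption, not from the lecture)
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n, p = len(x), 1  # p = number of explanatory variables

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                  # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # estimates of the errors
sse = sum(r ** 2 for r in residuals)                 # error sum of squares
mse = sse / (n - p - 1)                              # mean squared error

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.4f}, MSE = {mse:.4f}")
```

In a model with intercept, the residuals sum to zero (up to rounding error), which is a quick check of the fit.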
Linear regression model
SSY = SSR + SSE
(SSY: total variation; SSR: explained variation; SSE: unexplained variation)
% of explained variance:
$R^2 = \frac{SSR}{SSY}$
$\text{adj. } R^2 = 1 - (1 - R^2) \frac{n-1}{n-p-1}$

SSR refers to the regression sum of squares and can be calculated as the sum of the squared differences between the fitted values and the mean of the response variable. SSE and SSY are defined as for the analysis of variance (ANOVA). Indeed, both ANOVA and linear regression are linear models and in R most functions apply to either of them.
The square root of the R² has the same absolute value as the Pearson correlation coefficient. The R² is typically used to measure the goodness of fit of a regression model.
The adjusted R² should be preferred over the normal R² as it takes the number of explanatory variables p into account (n is the sample size). The denominator is n−p−1, accounting for the p explanatory variables and the intercept. However, this is more important in the case of multiple linear regression.
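The partitioning of the variation and the two R² measures can be verified numerically. The sketch below uses hypothetical data and hypothetical names and also checks that the R² equals the squared Pearson correlation coefficient for a regression with one explanatory variable.

```python
import math

# Hypothetical data (assumption, not from the lecture)
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n, p = len(x), 1  # p = number of explanatory variables

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * xi for xi in x]

ssy = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ssr = sum((fi - mean_y) ** 2 for fi in fitted)         # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained variation

r_squared = ssr / ssy
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
pearson_r = sxy / math.sqrt(sxx * ssy)

print(f"SSY = {ssy:.3f}, SSR + SSE = {ssr + sse:.3f}")
print(f"R2 = {r_squared:.4f}, adj. R2 = {adj_r_squared:.4f}, r^2 = {pearson_r ** 2:.4f}")
```

The adjusted R² is slightly lower than the R² here; the difference becomes larger when more explanatory variables are added relative to the sample size.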
Linear regression model
● Assumptions:
  ● Linear relationship (graphical diagnostics)
  ● Normal distribution of error (graphical diagnostics)
  ● Variance homogeneity (graphical diagnostics)
  ● Independence of errors (graphical diagnostics)
● If one or more assumptions not met, alternatives include:
  ● Generalised linear model, Generalised least squares, Mixed models
  ● Variable transformation (but using an appropriate model such as a Generalised linear model is usually the better option)

Although hypothesis tests for checking the assumptions exist, most textbooks recommend graphical diagnostics. For data that are spatially or temporally structured or data that are nested/hierarchically structured, the independence of errors assumption is typically violated. Since time series and spatial data are beyond the scope of this course (and are discussed in the “Advanced GIS” course), I refer to Faraway (2015): Linear models in R, p. 81-83, for diagnostic tools to spot serial correlation, or see Plant (2012): Spatial data analysis in ecology and agriculture using R. Spatial and temporal structure can be incorporated into the model using generalised least squares (see chapter 4 in Zuur, A. F. et al. 2009: Mixed effects models and extensions in ecology with R. Springer: New York). Nested/hierarchically structured data can be modelled with mixed effect models, which are discussed in the first part of this course. We will also discuss generalised linear models later in this course. Variable transformation and robust regression are discussed in many textbooks (e.g. Maindonald & Braun 2010, Quinn & Keough 2002) and are beyond the scope of this course (but variable transformation has been extensively discussed in the preceding course on univariate statistics).
In linear regression analysis, we usually do not take the measurement error in x into account. This is discussed in detail in Warton, D.I., Wright, I.J., Falster, D.S., and Westoby, M. (2006). Bivariate line-fitting methods for allometry. Biological Reviews 81, 259-291. Warton et al. (2006) also provide information on alternatives to linear regression that should be used if the measurement error is relevant.
Model diagnostics: Variance homogeneity
„normal“
„strong increase“
„non-linear“
„slight increase“
Residuals vs. fitted values plots
The graphical diagnostics for checking variance homogeneity are the
same for linear regression and ANOVA (and other linear models), but
for ANOVA (and the t-test) the x-axis displays the factor levels and,
consequently, the plots do not show a continuous pattern.
The residuals vs. fitted values plots can be used to check whether
the assumption of variance homogeneity, also termed
homoscedasticity (and, in the case of regression, the assumption of
linearity), holds. Randomly scattered residuals (upper right) are
consistent with these assumptions. If the residuals display patterns
instead, this may indicate variance heterogeneity, also termed
heteroscedasticity (bottom and top left), or non-linearity
(bottom right). In case of departures from the assumption of
homoscedasticity, generalized least squares, generalized linear or
additive models can represent a suitable alternative for continuous
data. Depending on the data, data transformation or weighting of
observations may also be used to alleviate the issue, though
transformation should only be considered if the data cannot be
modelled non-transformed.
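The quantities behind a residuals vs. fitted values plot can be computed by hand. The following is a minimal sketch in plain Python (not the R code of the course) on simulated data. The helper names `fit_line`, `residuals_vs_fitted` and `spread_ratio` are hypothetical, and the spread ratio is only a crude numerical stand-in for the visual check, not an established test.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def residuals_vs_fitted(x, y):
    """Return the fitted values and residuals that the diagnostic plot displays."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


def spread_ratio(fitted, residuals):
    """Residual SD in the upper half of the fitted values divided by the SD in the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    return statistics.stdev(ordered[half:]) / statistics.stdev(ordered[:half])


# Simulated data (assumption for illustration): same mean structure,
# once with constant error variance and once with variance increasing in x.
rng = random.Random(1)
x = [i / 10 for i in range(1, 201)]
y_const = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y_incr = [2 + 0.5 * xi + rng.gauss(0, 0.2 * xi) for xi in x]

for label, y in (("constant variance", y_const), ("increasing variance", y_incr)):
    fitted, residuals = residuals_vs_fitted(x, y)
    print(label, "spread ratio:", round(spread_ratio(fitted, residuals), 2))
```

A ratio close to 1 is what a "normal" panel would suggest, whereas a ratio well above 1 corresponds to the funnel shape of increasing variance. Plotting `residuals` against `fitted` remains the recommended diagnostic.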
43
Further model diagnostics
Leverage points (predictor outlier)
How to deal with leverage points/outliers?
● Check whether values are plausible
● Check robustness of model results when removing observations
● Fit different statistical model or transform data
[Figure: two annotated points: “Leverage point that exerts high influence” and “Non-influential leverage point”]
Besides checking the assumptions, model diagnostics are used to detect influential points, leverage points (predictor
outliers) and model outliers (outliers in the response variable that indicate model failure). Influential points exert high
influence on the model fit, but need not be outliers. Conversely, leverage points are unusual in x and outliers are
poorly fitted by the model, but neither is necessarily influential.
Leverage points (1) exert high influence on the fitted y (but not necessarily on the model fit) and (2) are distant from
the other x-values. The leverage is calculated in terms of so-called hat values, which will be explained later in the
course. The average hat value is p/n, where p is the number of parameters in the model (including the intercept)
and n is the number of observations. Faraway (2015): 83 and Sheather (2009) suggest looking more closely at points
with hat values > 2 p/n. However, hat values depend only on the predictors and not on the response variable y, so
graphical inspection is most suitable to check whether a high leverage point is really problematic. A nice illustration of
leverage can be found here: http://www.rob-mcculloch.org/teachingApplets/Leverage/index.html.
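For simple linear regression the hat values have a closed form, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², which makes the 2 p/n rule easy to try out. The sketch below uses plain Python and made-up x-values; the function name `hat_values` is hypothetical.

```python
def hat_values(x):
    """Hat values (leverages) for simple linear regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Made-up predictor values: nine clustered points and one distant point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
h = hat_values(x)

p = 2                    # parameters: intercept and slope
n = len(x)
threshold = 2 * p / n    # rule of thumb: inspect points with h > 2 p/n

# Indices of observations that deserve a closer look.
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print("hat values:", [round(hi, 3) for hi in h])
print("flagged indices:", flagged)
```

Note that y does not appear anywhere in the computation: the hat values sum to p and only describe how unusual each x-value is.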
Outliers in the response variable can be identified with studentized residuals. Here, points whose studentized residual
exceeds 2 in absolute value, i.e. that deviate by more than 2 standard deviations from the regression line, may be
considered outliers (see Sheather 2009: p. 60). There are of course different rules of thumb as to what can be regarded
as an outlier, and the decision remains more or less subjective. John Tukey suggested defining Y as an outlier if
Y < (Q1 − 1.5 IQR) or Y > (Q3 + 1.5 IQR), where Q1 denotes the lower quartile, Q3 denotes the upper quartile, and
IQR = (Q3 − Q1) denotes the interquartile range. Hence, you could use a boxplot to identify an outlier.
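Tukey's rule can be written down in a few lines. The sketch below is plain Python with made-up values; `tukey_outliers` is a hypothetical helper. Quartile definitions differ between software packages, so the choice of `method="inclusive"` is an assumption and other definitions can move the fences slightly.

```python
import statistics


def tukey_outliers(values):
    """Return the values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up measurements with one suspiciously large value.
values = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.4, 9.0]
print("Tukey outliers:", tukey_outliers(values))
```

These fences are the same limits that a boxplot's whiskers are usually based on, which is why points drawn beyond the whiskers are candidates for closer inspection.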
Another important measure in diagnostic plots is Cook's distance. Cook's distance measures the influence of
observations on the model fit by combining the effect of leverage and of the magnitude of the residual.
The higher Cook's distance, the larger the change in model fit when the point is removed from the model. A point
with a high Cook's distance tends to be either an outlier or a leverage point or both. There are different rules of
thumb as to when to consider a point influential (e.g. Cook's D > 1 or Cook's D > 4/(n − 2)), but in practice it is important
to look for gaps in the values of Cook's distance (Sheather 2009: p. 68).
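The combination of leverage and residual size can be made explicit with the textbook formulas r_i = e_i / (s √(1 − h_i)) and D_i = r_i² h_i / (p (1 − h_i)). The following plain Python sketch uses made-up data. The helper names are hypothetical, and r_i is the internally studentized residual (often called the standardized residual), which is an assumption about which variant is meant.

```python
import math


def fit_line(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def influence_measures(x, y):
    """Hat values, internally studentized residuals and Cook's distances."""
    n, p = len(x), 2                      # p: intercept and slope
    b0, b1 = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)         # residual variance
    hat = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    stud = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, hat)]
    cook = [r ** 2 * h / (p * (1 - h)) for r, h in zip(stud, hat)]
    return hat, stud, cook


# Made-up data: nine points scattered around y = 1 + 2x plus one leverage point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.2]
y_base = [1 + 2 * xi + d for xi, d in zip(x, noise)]
y_on_line = y_base + [61.0]    # leverage point that follows the trend
y_off_line = y_base + [10.0]   # leverage point that contradicts the trend

for label, y in (("on the line", y_on_line), ("off the line", y_off_line)):
    hat, stud, cook = influence_measures(x, y)
    print(label, "-> Cook's D of the leverage point:", round(cook[-1], 2))
```

Both data sets share the same hat values, because these depend on x only. The leverage point becomes influential only when its response contradicts the trend of the remaining data.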
Methods such as robust regression and quantile regression have been developed to reduce the influence of influential
points. However, they are outside the scope of this course.
44
Flowchart for simple linear regression
Taken from Sheather 2009: p.103
Note that this flowchart only serves as orientation and that its suggestions may differ
from those presented in this lecture. For example, if the errors do not have constant variance,
the flowchart suggests the addition of new terms to the model and/or variable transformation. However, we
have discussed in the lecture that other model types such as the Generalized linear model can be more
appropriate for the data. Hence, before transforming the data, you should check whether the data can be
modelled directly using a Generalized linear model or another model type (cf. Szöcs & Schäfer 2015). Note also that
the bootstrap may not be reliable for small sample sizes (see the part on bootstrapping).
Szöcs E. & Schäfer R. (2015) Ecotoxicology is not normal. Environmental Science and Pollution Research
22, 13990–13999.
45
Simulation-based approaches to simple
linear regression
● Predictive accuracy measured with Mean square
prediction error (MSPE):
$\mathrm{MSPE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$ for the new observations 1 to m
● Cross-validation (CV): Calculate CV-MSPE and CV-R²
● Bootstrapping (BS) in regression analysis:
● of residuals: BS residuals, add to $\hat{\mathbf{y}}$ to generate new $\mathbf{y}^*$ and
calculate regression coefficients → x fixed
● of cases: BS complete cases and calculate regression
coefficients → x random
● If x and y random sample (e.g. x not fixed in experiment),
residuals correlated or exhibit non-constant variance → BS cases
Bold variables indicate vectors.
The mean square prediction error measures how well a new observation is predicted. For the relationship
with other error measures see here.
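The MSPE and its cross-validated version can be sketched as follows in plain Python on simulated data. The helper names are hypothetical, k = 5 is an arbitrary choice, and CV-R² is computed here as 1 − CV-MSPE / (mean squared deviation of y from its mean), which is one common definition and an assumption about the definition used in the lecture.

```python
import random


def fit_line(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def mspe(y_new, y_pred):
    """Mean square prediction error for m new observations."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y_new, y_pred)) / len(y_new)


def cross_validate(x, y, k=5, seed=0):
    """k-fold cross-validation; returns (CV-MSPE, CV-R2)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    y_obs, y_pred = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Fit on the training folds only, then predict the held-out fold.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        y_obs.extend(y[i] for i in fold)
        y_pred.extend(b0 + b1 * x[i] for i in fold)
    cv_mspe = mspe(y_obs, y_pred)
    mean_y = sum(y) / len(y)
    total_var = sum((yi - mean_y) ** 2 for yi in y) / len(y)
    return cv_mspe, 1 - cv_mspe / total_var


# Simulated data (assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(7)
x = [0.2 * i for i in range(50)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]

cv_mspe, cv_r2 = cross_validate(x, y, k=5)
print("CV-MSPE:", round(cv_mspe, 3), "CV-R2:", round(cv_r2, 3))
```

Because every observation is predicted by a model that was fitted without it, the CV-MSPE estimates the error for new observations rather than the in-sample fit.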
The algorithm for residual bootstrapping is most easily understood when the equation for the
ordinary linear regression model (see here for details) is rearranged to:
$\epsilon_i = y_i - (b_0 + b_1 x_i) \Leftrightarrow \epsilon_i = y_i - \hat{y}_i$
The bootstrap samples (samples with replacement) are drawn from the n residuals
$\epsilon_1, \epsilon_2, \ldots, \epsilon_n$, yielding a bootstrap sample of residuals
$\epsilon_1^*, \epsilon_2^*, \ldots, \epsilon_n^*$.
These bootstrapped residuals are added to the vector of fitted responses ($\hat{\mathbf{y}}$) to obtain a vector of new
responses $\mathbf{y}^*$:
$y_i^* = \hat{y}_i + \epsilon_i^*$
These new responses are used to calculate new bootstrapped regression coefficients (i.e. $b_0^*, b_1^*$). The
procedure is repeated 1,000 to 10,000 times and, as usual in bootstrapping, delivers the distribution for a
test statistic t* (here for $b_0$ and $b_1$).
For bootstrapping cases, pairs of x and y are bootstrapped and then the regression model is fitted, also
providing bootstrapped regression coefficients (i.e. $b_0^*, b_1^*$).
Now when to use what? If the residuals exhibit non-constant variance or are correlated, a bootstrap
sample of residuals does not preserve these properties of the original sample and bootstrapping of cases
should be preferred. However, if the observations for the predictors (x) have not been drawn randomly, but
are fixed (for example, fixed concentration levels in an experiment), bootstrapping residuals should be
preferred as it preserves the original x. For further details see Fox (2015): 658-660 and Hesterberg
(2015) Americ. Statist. 69: 371–386.
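Both bootstrap variants described above can be sketched in plain Python on simulated data. The function names, the seed and the simple percentile interval are illustrative assumptions and not the course's R implementation.

```python
import random


def fit_line(x, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def bootstrap_residuals(x, y, n_boot=1000, seed=1):
    """Resample residuals and add them to the fitted values; x stays fixed."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    coefs = []
    for _ in range(n_boot):
        resid_star = rng.choices(resid, k=len(resid))      # with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, resid_star)]
        coefs.append(fit_line(x, y_star))                  # (b0*, b1*)
    return coefs


def bootstrap_cases(x, y, n_boot=1000, seed=1):
    """Resample complete (x, y) pairs; x is treated as random."""
    rng = random.Random(seed)
    n = len(x)
    coefs = []
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)                   # with replacement
        # Note: fit_line fails if a resample contains a single distinct x,
        # which is extremely unlikely for this sample size.
        coefs.append(fit_line([x[i] for i in idx], [y[i] for i in idx]))
    return coefs


def percentile_interval(values, level=0.95):
    """Simple percentile interval from the bootstrap distribution."""
    s = sorted(values)
    k = round(len(s) * (1 - level) / 2)
    return s[k], s[len(s) - 1 - k]


# Simulated data (assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(42)
x = [float(i) for i in range(1, 31)]
y = [1 + 2 * xi + rng.gauss(0, 2) for xi in x]

for name, sampler in (("residuals", bootstrap_residuals), ("cases", bootstrap_cases)):
    slopes = [b1 for _, b1 in sampler(x, y)]
    print(name, "95% interval for b1:", percentile_interval(slopes))
```

The only difference between the two functions is what gets resampled: the residuals (keeping the original x) or the complete cases (so that x varies between bootstrap samples).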
46
Exercise
We will work with the data set “possum” that includes
biometric measurements of possums in Victoria, Australia.
Conduct a linear regression analysis, diagnose and
interpret the model and apply simulation-based
approaches.

  • 1. Applied Multivariate Statistics Ralf B. Schäfer University of Koblenz-Landau 2017/18 Applied Multivariate Statistics Ralf B. Schäfer University of Koblenz-Landau 2017/18
  • 2. 2 Short introduction ● Professor for Quantitative Landscape Ecology ● Current teaching: Statistics (M.Sc.); GIS (B.Sc./M.Sc.); Environmental Modelling (B.Sc./M.Sc.); Aquatic Ecotoxicology (M.Sc.); Environmental Philosophy (B.Sc.) ● Research focus: ● Community ecology of freshwater invertebrates and microorganisms ● Response of freshwater ecosystems to different (anthropogenic) stressors (e.g. pollution) ● Trophic linkages between aquatic & terrestrial systems ● Primarily field studies/experiments and data analyses/ modelling www.landscapecology.uni-landau.de 2 Short introduction ● Professor for Quantitative Landscape Ecology ● Current teaching: Statistics (M.Sc.); GIS (B.Sc./M.Sc.); Environmental Modelling (B.Sc./M.Sc.); Aquatic Ecotoxicology (M.Sc.); Environmental Philosophy (B.Sc.) ● Research focus: ● Community ecology of freshwater invertebrates and microorganisms ● Response of freshwater ecosystems to different (anthropogenic) stressors (e.g. pollution) ● Trophic linkages between aquatic & terrestrial systems ● Primarily field studies/experiments and data analyses/ modelling www.landscapecology.uni-landau.de
  • 3. 3 Organisation ● Lecture material (including course schedule and literature list) can be found on github and website: https://github.com/rbslandau/statistics_multi https://goo.gl/EhPVFG ● Inverted classroom: Self study of lecture and demonstration, Q&A and exercises in class room ● Contact time: 2 hours per week; Own study time: approximately 1 day per week 3 Organisation ● Lecture material (including course schedule and literature list) can be found on github and website: https://github.com/rbslandau/statistics_multi https://goo.gl/EhPVFG ● Inverted classroom: Self study of lecture and demonstration, Q&A and exercises in class room ● Contact time: 2 hours per week; Own study time: approximately 1 day per week
  • 4. 4 Using your own notebook ● feel free to you use your own WLAN-enabled notebook! ● install R (http://mirrors.softliste.de/cran/) oder RStudio (recommended for beginners - http://www.rstudio.com/) ● Run “0_Install_packgs.R”, provided on github ● for installation of additional packages run install.packages(“package to be installed”) 4 Using your own notebook ● feel free to you use your own WLAN-enabled notebook! ● install R (http://mirrors.softliste.de/cran/) oder RStudio (recommended for beginners - http://www.rstudio.com/) ● Run “0_Install_packgs.R”, provided on github ● for installation of additional packages run install.packages(“package to be installed”)
  • 5. 5 Course objectives: Learning outcomes ● Classify, explain and interpret the different types of (multivariate) statistical approaches ● Select and apply the appropriate statistical method for the research goal ● Demonstrate moderate level of statistical modelling skills, including scripting in R 5 Course objectives: Learning outcomes ● Classify, explain and interpret the different types of (multivariate) statistical approaches ● Select and apply the appropriate statistical method for the research goal ● Demonstrate moderate level of statistical modelling skills, including scripting in R
  • 6. 6 Two incorrect ways of thinking about stats 1.Overconfidence: Statistics is like mathematics and provides a single, correct answer But statistical thinking differs from mathematical thinking 2.Disbelief: Anything goes – statistics cannot be trusted But: statistics provide quantitative support of the complete research process Tintle (2015) Amer. Statist. 69: 362 6 Two incorrect ways of thinking about stats 1.Overconfidence: Statistics is like mathematics and provides a single, correct answer But statistical thinking differs from mathematical thinking 2.Disbelief: Anything goes – statistics cannot be trusted But: statistics provide quantitative support of the complete research process Tintle (2015) Amer. Statist. 69: 362
  • 7. 7 Statistical modelling, simulation and the linear model 1.Framework for data analysis and tools for data exploration 2.Statistical modelling and simulation-based tools 3.Permutation and Monte Carlo simulation 4.Bootstrapping 5.Cross-Validation and Bias-variance trade-off 6.Revisiting the linear model Contents 7 Statistical modelling, simulation and the linear model 1.Framework for data analysis and tools for data exploration 2.Statistical modelling and simulation-based tools 3.Permutation and Monte Carlo simulation 4.Bootstrapping 5.Cross-Validation and Bias-variance trade-off 6.Revisiting the linear model Contents
  • 8. 8 Learning targets ● Explain the data analysis cycle and apply tools for exploratory data analysis ● Explain approaches to statistical modelling and simulation and apply simulation-based methods ● Diagnosing and interpreting the linear model 8 Learning targets ● Explain the data analysis cycle and apply tools for exploratory data analysis ● Explain approaches to statistical modelling and simulation and apply simulation-based methods ● Diagnosing and interpreting the linear model
  • 9. 9 Learning targets and study questions ● Explain the data analysis cycle and apply tools for exploratory data analysis ● Explain the steps of the data analysis cycle. ● Summarise the elements of exploratory analysis. Which graphical tools are essential? ● Explain approaches to statistical modelling and simulation and apply simulation-based methods ● Discuss the two different approaches to statistical modelling and links through simulation-based approaches. ● Explain the purpose and critically discuss permutation tests. ● Explain the purpose and critically discuss bootstrapping. ● Explain the main idea of cross-validation and discuss the selection of k with respect to the bias-variance trade-off. 9 Learning targets and study questions ● Explain the data analysis cycle and apply tools for exploratory data analysis ● Explain the steps of the data analysis cycle. ● Summarise the elements of exploratory analysis. Which graphical tools are essential? ● Explain approaches to statistical modelling and simulation and apply simulation-based methods ● Discuss the two different approaches to statistical modelling and links through simulation-based approaches. ● Explain the purpose and critically discuss permutation tests. ● Explain the purpose and critically discuss bootstrapping. ● Explain the main idea of cross-validation and discuss the selection of k with respect to the bias-variance trade-off.
  • 10. 10 Learning targets and study questions ● Diagnosing and interpreting the linear model ● Describe the assumptions of the linear regression and explain how they can be checked. ● Which types of outliers exist? When is an outlier important? ● Discuss the application of bootstrapping and cross-validation for the linear model. 10 Learning targets and study questions ● Diagnosing and interpreting the linear model ● Describe the assumptions of the linear regression and explain how they can be checked. ● Which types of outliers exist? When is an outlier important? ● Discuss the application of bootstrapping and cross-validation for the linear model.
  • 11. 11 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions 11 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions
  • 12. 12 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions 12 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions
  • 13. 13 Define research goal and question Scientific hypothesis: Restoring stream stretches alters aquatic communities, resulting in different emerging insects on which riparian spiders prey. This affects the spiders’ body condition derived from prosomal (pr.) and opisthosomal (op.) width. Question: Does the body condition of riparian spiders differ between restored and non-restored stream stretches? Opisthosoma Prosoma H0 : µrestored=µnon−restored H 1: µrestored≠µnon−restored ● Research goals (e.g. prediction, estimation, inference) and questions should inform study design and methods ● Aim: Test scientific hypothesis → Formulate testable hypothesis Example ● Testable hypothesis: The sample means for the body condition are drawn from populations with the same µ: 13 Define research goal and question Scientific hypothesis: Restoring stream stretches alters aquatic communities, resulting in different emerging insects on which riparian spiders prey. This affects the spiders’ body condition derived from prosomal (pr.) and opisthosomal (op.) width. Question: Does the body condition of riparian spiders differ between restored and non-restored stream stretches? Opisthosoma Prosoma H0 : µrestored=µnon−restored H 1: µrestored≠µnon−restored ● Research goals (e.g. prediction, estimation, inference) and questions should inform study design and methods ● Aim: Test scientific hypothesis → Formulate testable hypothesis Example ● Testable hypothesis: The sample means for the body condition are drawn from populations with the same µ:
  • 14. 14 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions 14 Data analysis cycle Modified from Zumel & Mount 2014: 6 → Research goal and questions e.g. empirical study, data compilation e.g. formatting, handling missing values Data exploration e.g. outliers, skewness, distribution, linearity Model diagnosis, evaluation and interpretation Check model assumptions and interpret results. Publish results Statistical modelling Identify limitations and open questions
  • 15. 15 Tools for data exploration GIGA: Garbage in – Garbage out 1.Outliers (e.g. boxplot) 2.Variance homogeneity (e.g. conditional boxplot) 3.Normal distribution (e.g. QQ-plot) 4.(Double) zeros (e.g. frequency plot) 5.Collinearity (e.g. pairwise scatterplots) 6.Relationship explanatory and response variable (e.g. scatterplots) 7.Spatial- or temporal autocorrelation (e.g. variograms) Elements of data exploration – Checking for: ● Useful for inspecting data before the modelling but also for model diagnosis ● Zuur et al. (2009) urge data inspection before modelling 15 Tools for data exploration GIGA: Garbage in – Garbage out 1.Outliers (e.g. boxplot) 2.Variance homogeneity (e.g. conditional boxplot) 3.Normal distribution (e.g. QQ-plot) 4.(Double) zeros (e.g. frequency plot) 5.Collinearity (e.g. pairwise scatterplots) 6.Relationship explanatory and response variable (e.g. scatterplots) 7.Spatial- or temporal autocorrelation (e.g. variograms) Elements of data exploration – Checking for: ● Useful for inspecting data before the modelling but also for model diagnosis ● Zuur et al. (2009) urge data inspection before modelling
  • 16. Data exploration: Common plots for looking at the data
    [Plot panels:] Outliers? Asymmetry of distribution? Normality? Linearity? Collinearity?
  • 17. Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 18. Data analysis cycle (modified from Zumel & Mount 2014: 6)
    [Flowchart elements:] Research goal and questions; (unlabelled step) e.g. empirical study, data compilation; (unlabelled step) e.g. formatting, handling missing values; Data exploration, e.g. outliers, skewness, distribution, linearity; Statistical modelling; Model diagnosis, evaluation and interpretation: check model assumptions and interpret results; Publish results; Identify limitations and open questions
  • 19. Statistical modelling: The two cultures (Breiman 2001 Statistical Science 16: 199)
    Real world: Processes lead to association between x and y
    Examples for goals of statistical modelling: predict unknown y from x, estimate how x is related to y
    ● Data modelling culture (classical statistics): Common data model; estimate parameters from data; model validation: check residuals
    ● Algorithmic modelling culture (machine learning): Find algorithm that operates on x to predict y; model validation: predictive accuracy
  • 20. Statistical modelling: the classical view
    ● Fit model to data to inform estimation, inference or prediction (e.g. estimate point or interval, test hypothesis)
      ● Example: The arithmetic mean $\bar{x}$ is an estimate of the true population mean µ and $s^2$ is an estimate of the true variance $\sigma^2$
    ● Most models incorporate a deterministic (fixed effect) and a stochastic component (random effect)
      ● Example: $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
    ● All models rely on assumptions → Model diagnosis
      ● e.g. normal distribution, independence of observations
    ● Goodness of fit measures help to choose between multiple models that fit the data
      ● e.g. AIC, $R^2$, RMSE
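The split into a deterministic and a stochastic component can be made concrete by simulating from the example model. The sketch below is in Python (the course itself uses R); the parameter values, the seed and the function name `simulate_linear` are assumptions chosen for illustration only.

```python
import random
import statistics

def simulate_linear(n, b0, b1, sigma, seed=1):
    """Draw n points from y_i = b0 + b1 * x_i + eps_i, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    # deterministic part b0 + b1 * x plus stochastic part eps
    y = [b0 + b1 * xi + rng.gauss(0, sigma) for xi in x]
    return x, y

x, y = simulate_linear(n=200, b0=2.0, b1=0.5, sigma=1.0)

# Recover the stochastic component using the known (simulated) parameters
eps = [yi - (2.0 + 0.5 * xi) for xi, yi in zip(x, y)]
mean_eps = statistics.mean(eps)      # estimate of the true mean 0
var_eps = statistics.variance(eps)   # s^2, estimate of sigma^2 = 1
print(round(mean_eps, 2), round(var_eps, 2))
```

Because the sample is finite, the estimates only approximate the true values 0 and 1, which is exactly the distinction between $\bar{x}$, $s^2$ and µ, $\sigma^2$ made on the slide.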
  • 21. Simulation-based approaches in data analysis
    ● Compatible with both cultures
    ● Infuses algorithm-based thinking into classical statistics
    ● Examples for simulation-based approaches for estimation, inference or model diagnosis in classical statistics:
      1. Permutation test → Permuting (shuffling) the data to derive null distribution. Mainly used for inference
      2. Bootstrapping → Randomly sampling subsets from the data with replacement. Mainly used for estimation of parameter distribution
      3. Cross-validation (CV) → Splitting data into sets (i.e. sampling without replacement). Mainly used for validation of predictive models
  • 22. Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 23. Permutation test: Algorithm
    1) Permute values in data set
    2) Compute test statistic t* for permuted data
    (Repeat steps 1 and 2 k times)
    3) Compare the observed test statistic t0 to the generated null distribution
  • 24. Permutation test: Algorithm
    1) Permute values in data set
    2) Compute test statistic t* for permuted data
    (Repeat steps 1 and 2 k times)
    3) Compare the observed test statistic t0 to the generated null distribution
    Example: Permutation test of difference in group mean (values split into Group 1 and Group 2)
    Test statistic: $\bar{x}_{\text{group1}} - \bar{x}_{\text{group2}}$
    Original dataset: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 → t0
    Permutation 1: x9 x6 x12 x7 x13 x2 x10 x4 x1 x15 x14 x8 x3 x11 x5 → t1*
    ...
    Permutation k: x14 x2 x7 x9 x5 x15 x3 x6 x8 x1 x12 x13 x10 x4 x11 → tk*
  • 25. Permutation test: Generated distribution
    [Figure: generated null distribution with the observed statistic t0 marked]
    $p = \frac{\sum_{i=1}^{k} \left(1 \text{ if } t_i^* \le t_0, \text{ else } 0\right)}{k+1}$
    ● Test informs whether pattern in data is due to chance
    ● Inference regarding statistical population only valid if distribution of sample data matches actual distribution of statistical population → particularly problematic for small n
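The three steps of the algorithm and the p-value can be sketched as follows. This is a Python illustration (the course itself uses R) under stated assumptions: the data are hypothetical body-condition values, the function name `permutation_test` is invented, and the p-value uses a two-sided convention that counts permuted statistics at least as extreme as t0 and adds the observed statistic to numerator and denominator. The slide's formula is a one-sided count with denominator k + 1, so the two are related but not identical.

```python
import random

def permutation_test(group1, group2, k=9999, seed=42):
    """Monte Carlo permutation test for a difference in group means."""
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)

    def mean_diff(values):
        g1, g2 = values[:n1], values[n1:]
        return sum(g1) / len(g1) - sum(g2) / len(g2)

    t0 = mean_diff(pooled)              # observed test statistic
    extreme = 0
    for _ in range(k):
        rng.shuffle(pooled)             # 1) permute values
        t_star = mean_diff(pooled)      # 2) statistic for permuted data
        if abs(t_star) >= abs(t0):      # 3) compare with t0 (two-sided)
            extreme += 1
    return t0, (extreme + 1) / (k + 1)

# Hypothetical body-condition values for two groups of spiders
restored = [1.9, 2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.0]
non_restored = [1.5, 1.7, 1.6, 1.9, 1.4, 1.8, 1.6, 1.7]
t0, p = permutation_test(restored, non_restored)
print(round(t0, 2), p)
```

Shuffling the pooled values and cutting them at the original group size is equivalent to randomly reassigning group labels, which is what the null hypothesis of no group difference implies.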
  • 26. Permutation test: Advantages and limitations (Legendre & Legendre 2012: 25ff)
    ● Advantages
      ● Free from distributional assumptions
      ● Applicable to complex designs through restricting permutations
    ● Limitations
      ● Generalisation to statistical population requires matching distribution
      ● Statistical hypothesis testing can imply distributional assumptions that apply to the permutation test, if aiming to infer to the statistical population (e.g. testing for mean differences affected by variance)
      ● Computationally intensive: Number of all possible permutations for a dataset is factorial n, i.e. n! (e.g. $35! \approx 10^{40}$) → Monte Carlo simulation
  • 27. Monte Carlo simulation
    ● Uses repeated random sampling to solve problems probabilistically (even though they can be deterministic in reality)
    ● Permutation tests use random numbers to randomly permute data → approximate with MC simulation
    ● Legendre & Legendre (2012): use at least 10,000 permutations for inference
    [Images: Edvard Munch, At the Roulette Table in Monte Carlo; entrance of the casino in Monte Carlo, Monaco]
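A quick calculation shows why exhaustive permutation is out of reach and random sampling is used instead. The Python snippet below (the course itself uses R) only evaluates the factorial for n = 35 and compares it with the 10,000 permutations mentioned above; the variable names are illustrative.

```python
import math

n = 35
all_permutations = math.factorial(n)   # n! possible orderings of n values
mc_permutations = 10_000               # number of random permutations drawn

print(f"{n}! = {all_permutations:.3e}")
print(f"fraction visited: {mc_permutations / all_permutations:.1e}")
```

The Monte Carlo approach therefore inspects a vanishingly small random subset of all permutations and treats it as an approximation of the full permutation distribution.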
  • 28. Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 29. Bootstrapping: Idea and algorithm
    ● Inference on statistic t is based on sampling distribution
    ● Ideally: Draw all or many samples from statistical population
    ● Reality: Most frequently only one sample available
    ➔ Idea: Draw samples from an estimate of the statistical population (i.e. the sample) and use these to estimate property (e.g. variance) of the statistic t
    ● Algorithm:
      1) Draw random sample with replacement from data
      2) Compute statistic t* for bootstrap sample
      (Repeat steps 1 and 2 k times)
      3) Use the k estimates to derive property of statistic
    ● Exhaustive bootstrapping ($k = n^n$) computationally demanding → approximate with Monte Carlo simulation
    ● Given today's computer power, $10^4$ to $10^5$ simulations are viable
  • 30. Bootstrapping: Example
    Example: Bootstrap of the mean (to derive variance), sampling with replacement; statistic t (here: mean)
    Original dataset: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 → $\bar{x} = 8$
    BS sample 1: 15 7 8 4 15 11 9 1 3 6 14 2 11 12 1 → $\bar{x}^* = 7.93$
    BS sample 2: 5 7 8 8 15 10 10 1 13 6 13 2 10 12 3 → $\bar{x}^* = 8.2$
    ...
    BS sample k: 7 5 8 2 12 14 10 8 13 6 11 7 15 12 1 → $\bar{x}^* = 8.73$
    [Figure: distribution of statistic t]
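The example can be reproduced with a short simulation. The sketch below is in Python (the course itself uses R) and applies the three-step algorithm to the integers 1 to 15; the function name `bootstrap_means`, the seed, and the simple percentile interval at the end are assumptions for illustration, not the lecture's own code.

```python
import random
import statistics

def bootstrap_means(data, k=10_000, seed=1):
    """Return k bootstrap replicates of the sample mean."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(k):
        sample = rng.choices(data, k=n)      # 1) draw with replacement
        replicates.append(sum(sample) / n)   # 2) compute statistic t*
    return replicates

data = list(range(1, 16))                    # original dataset, mean = 8
boot = bootstrap_means(data)

# 3) use the k estimates to derive properties of the statistic
boot_se = statistics.stdev(boot)             # bootstrap standard error
ordered = sorted(boot)
ci_95 = (ordered[int(0.025 * len(boot))], ordered[int(0.975 * len(boot))])
print(round(statistics.mean(boot), 2), round(boot_se, 2), ci_95)
```

The bootstrap distribution is centred close to the sample mean of 8, so it describes the variability of the estimate rather than moving the estimate nearer to the population mean.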
  • 31. Bootstrapping: Limitations (Hesterberg 2015 Amer. Statist. 69: 371)
    ● Do not use for hypothesis testing
    ● No distributional assumptions implied, but not reliable for all distributions, particularly at small n (see Hesterberg 2015)
    ● Small n: use adjusted bootstrap percentiles (BCa) or switch to parametric statistics (allow for additional assumptions)
    ● Bootstrap does not improve the estimate of the population parameter µ: the bootstrap distribution is centred at $\bar{x}$
  • 32. Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 33. Cross-validation (CV)
    ● Objective: Evaluate predictive accuracy of a fitted model
    ● Can be checked if independent training data (used to fit model) and test data (new data) are available → Rare case
    ● Idea: Split the available data into training and test set and predict the (known) observations in the test set from a model fitted with the training data
    ● Algorithm:
      1. Split the data into k random parts (sampling without replacement)
      2. For each of the k parts:
         1. Fit the model to the other k-1 parts
         2. Predict the held-out part from the model and calculate the prediction error
      3. Calculate prediction error as average over the k estimates
  • 34. Cross-validation (CV)
    ● Problem of choosing k:
      ● k = n (Leave-one-out CV predicts each observation from all others) → low bias, but high variance
      ● k = 2 (split data into half) → low variance, but high bias
      ● k typically set to 5 or 10
    [Figure: Example with k = 5, taken from James et al. 2013: 181]
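The CV algorithm can be sketched for a simple linear regression. The code below is in Python (the course itself uses R); the helper names `fit_ols` and `kfold_mspe`, the simulated data and the seed are hypothetical, and the prediction error used is the mean squared prediction error per fold.

```python
import random

def fit_ols(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def kfold_mspe(x, y, k=5, seed=1):
    """k-fold cross-validated mean squared prediction error."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # 1. split into k parts
    fold_errors = []
    for test_idx in folds:                         # 2. for each part
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_ols([x[i] for i in train], [y[i] for i in train])
        sq_err = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(sq_err) / len(sq_err))
    return sum(fold_errors) / k                    # 3. average over parts

# Simulated data: y = 1 + 2x + noise with standard deviation 0.5
rng = random.Random(3)
x = [rng.uniform(0, 10) for _ in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]
print(round(kfold_mspe(x, y, k=5), 3))
```

Setting `k=len(x)` turns the same function into leave-one-out CV, which makes it easy to compare the choices of k discussed above.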
  • 35. Bias-variance trade-off
    Definitions in context of model validation:
    ● Bias: error when approximating training data
    ● Variance: variability in error when approximating test data
    Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point
    [Figure, taken from James et al. 2013: 33. Legend: function used to simulate data; linear regression; less flexible smoother; highly flexible smoother; training data; test data; variance]
  • 36. Bias-variance trade-off
    Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point → Optimise combined error
    [Figure taken from Hastie, Tibshirani and Friedman 2011: 38]
  • 37. Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 38. Relationship between two continuous variables: linear regression model
    ● Bivariate relationship between an explanatory variable and a response variable:
      $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
    ● Example: Can we approximate pesticide runoff concentrations with passive sampling? (Fernandez et al. 2014)
  • 39. Relationship between two continuous variables: linear regression model
    ● Bivariate relationship between an explanatory variable and a response variable:
      $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
    ● Aim: minimise the sum of the squared errors $\sum_i \epsilon_i^2$ (also called error sum of squares: SSE)
  • 40. Linear regression model
    $SSY = SSR + SSE$ (total variation = explained variation + unexplained variation)
    % of explained variance: $R^2 = \frac{SSR}{SSY}$
    $\text{adj. } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$
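The decomposition of variation can be checked numerically on a small data set. The Python sketch below (the course itself uses R) fits the simple regression by least squares and computes SSY, SSR, SSE, R² and adjusted R² with p = 1 explanatory variable; the function name `variance_partition` and the five data points are hypothetical.

```python
def variance_partition(x, y, p=1):
    """Sums of squares, R^2 and adjusted R^2 for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    fitted = [b0 + b1 * xi for xi in x]
    ssy = sum((yi - my) ** 2 for yi in y)                  # total variation
    ssr = sum((fi - my) ** 2 for fi in fitted)             # explained
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained
    r2 = ssr / ssy
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"SSY": ssy, "SSR": ssr, "SSE": sse, "R2": r2, "adjR2": adj_r2}

result = variance_partition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({name: round(value, 3) for name, value in result.items()})
```

For these points the total variation of 6 splits into 3.6 explained and 2.4 unexplained, so R² is 0.6, and the adjustment for n = 5 and p = 1 lowers it to about 0.47.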
  • 41. Linear regression model
    ● Assumptions:
      ● Linear relationship (graphical diagnostics)
      ● Normal distribution of error (graphical diagnostics)
      ● Variance homogeneity (graphical diagnostics)
      ● Independence of errors (graphical diagnostics)
    ● If one or more assumptions are not met, alternatives include:
      ● Generalised linear model, Generalised least squares, Mixed models
      ● Variable transformation (but using an appropriate model such as a Generalised linear model is usually the better option)
  • 42. Model diagnostics: Variance homogeneity
    [Figure: Residuals vs. fitted values plots. Panels: "normal", "slight increase", "strong increase", "non-linear"]
  • 43. Further model diagnostics
    Leverage points (predictor outlier)
    [Figure panels: leverage point that exerts high influence; non-influential leverage point]
    How to deal with leverage points/outliers?
    ● Check whether values are plausible
    ● Check robustness of model results when removing observations
    ● Fit different statistical model or transform data
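Leverage can be quantified for simple linear regression with the hat value $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$, which depends on the predictor only. The Python sketch below (the course itself uses R) computes these values; the cut-off of 4/n is a widely used rule of thumb and an assumption here, not something stated on the slide, and the data and function name `leverages` are hypothetical.

```python
def leverages(x):
    """Hat values h_i for simple linear regression with an intercept."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 20]          # last value is far from the others
h = leverages(x)
threshold = 4 / len(x)           # rule-of-thumb cut-off (assumption)
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print([round(hi, 3) for hi in h], flagged)
```

A high hat value only says that an observation is unusual in x. Whether it is influential depends on its response value, which is why the slide recommends checking how the results change when the observation is removed.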
  • 44. Flowchart for simple linear regression [Figure taken from Sheather 2009: p.103]
  • 45. Simulation-based approaches to simple linear regression
    ● Predictive accuracy measured with Mean square prediction error (MSPE):
      $MSPE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$ for the new observations 1 to m
    ● Cross-validation (CV): Calculate CV-MSPE and CV-$R^2$
    ● Bootstrapping (BS) in regression analysis:
      ● of residuals: BS residuals, add to $\hat{y}$ to generate new $y^*$ and calculate regression coefficients → x fixed
      ● of cases: BS complete cases and calculate regression coefficients → x random
    ● If x and y random sample (e.g. x not fixed in experiment), residuals correlated or exhibit non-constant variance → BS cases
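The two bootstrap variants differ only in what is resampled. The sketch below is a Python illustration (the course itself uses R) on simulated data; the function names `fit_ols`, `bootstrap_cases` and `bootstrap_residuals`, the number of replicates and the seeds are hypothetical choices, and only the slope is tracked for brevity.

```python
import random
import statistics

def fit_ols(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def bootstrap_cases(x, y, k=2000, seed=1):
    """Resample complete (x, y) cases: x is treated as random."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(k):
        idx = rng.choices(range(n), k=n)
        slopes.append(fit_ols([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes

def bootstrap_residuals(x, y, k=2000, seed=1):
    """Resample residuals and add them to the fitted values: x is fixed."""
    rng = random.Random(seed)
    b0, b1 = fit_ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(k):
        y_star = [fi + e for fi, e in zip(fitted, rng.choices(resid, k=len(x)))]
        slopes.append(fit_ols(x, y_star)[1])
    return slopes

# Simulated data: y = 1 + 2x + noise with standard deviation 1
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(40)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

b0_hat, b1_hat = fit_ols(x, y)
slopes_cases = bootstrap_cases(x, y)
slopes_resid = bootstrap_residuals(x, y)
se_cases = statistics.stdev(slopes_cases)
se_resid = statistics.stdev(slopes_resid)
print(round(b1_hat, 3), round(se_cases, 3), round(se_resid, 3))
```

With independent errors of constant variance, as simulated here, both variants should give similar standard errors for the slope; the case bootstrap is the one that remains appropriate when the error variance is not constant.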
  • 46. Exercise: We will work with the data set "possum" that includes biometric measurements of possums in Victoria, Australia. Conduct a linear regression analysis, diagnose and interpret the model and apply simulation-based approaches.
  • 47. 1 Applied Multivariate Statistics. Ralf B. Schäfer, University of Koblenz-Landau 2017/18
    Notes: These slides and notes complement the lecture with exercises "Applied multivariate statistics for environmental scientists". Do not hesitate to contact me if you have any comments or find any errors (text or slides): schaefer-ralf@uni-landau.de. While I made notes below the slides, some aspects are only mentioned in the R scripts associated with the lecture.
  • 48. 2 Short introduction
    ● Professor for Quantitative Landscape Ecology
    ● Current teaching: Statistics (M.Sc.); GIS (B.Sc./M.Sc.); Environmental Modelling (B.Sc./M.Sc.); Aquatic Ecotoxicology (M.Sc.); Environmental Philosophy (B.Sc.)
    ● Research focus:
      ● Community ecology of freshwater invertebrates and microorganisms
      ● Response of freshwater ecosystems to different (anthropogenic) stressors (e.g. pollution)
      ● Trophic linkages between aquatic & terrestrial systems
    ● Primarily field studies/experiments and data analyses/modelling
    www.landscapecology.uni-landau.de
  • 49. 3 Organisation
    ● Lecture material (including course schedule and literature list) can be found on github and website: https://github.com/rbslandau/statistics_multi https://goo.gl/EhPVFG
    ● Inverted classroom: Self study of lecture and demonstration, Q&A and exercises in class room
    ● Contact time: 2 hours per week; Own study time: approximately 1 day per week
    Notes: Literature references that are listed in the literature list are cited in short form on slides. For literature not contained in the literature list, I give the complete reference on the slide or in the notes for the respective slide.
  • 50. 4 Using your own notebook
    ● Feel free to use your own WLAN-enabled notebook!
    ● Install R (http://mirrors.softliste.de/cran/) or RStudio (recommended for beginners, http://www.rstudio.com/)
    ● Run "0_Install_packgs.R", provided on github
    ● For installation of additional packages run install.packages("package to be installed")
  • 51. 5 Course objectives: Learning outcomes
    ● Classify, explain and interpret the different types of (multivariate) statistical approaches
    ● Select and apply the appropriate statistical method for the research goal
    ● Demonstrate moderate level of statistical modelling skills, including scripting in R
  • 52. 6 Two incorrect ways of thinking about stats (Tintle et al. 2015 Amer. Statist. 69: 362)
    1. Overconfidence: Statistics is like mathematics and provides a single, correct answer. But: statistical thinking differs from mathematical thinking
    2. Disbelief: Anything goes, statistics cannot be trusted. But: statistics provide quantitative support of the complete research process
    Notes: Tintle N., Chance B., Cobb G., Roy S., Swanson T. & VanderStoep J. (2015) Combating Anti-Statistical Thinking Using Simulation-Based Methods Throughout the Undergraduate Curriculum. The American Statistician 69, 362-370.
  • 53. 7 Statistical modelling, simulation and the linear model. Contents: 1. Framework for data analysis and tools for data exploration 2. Statistical modelling and simulation-based tools 3. Permutation and Monte Carlo simulation 4. Bootstrapping 5. Cross-validation and bias-variance trade-off 6. Revisiting the linear model
  • 54. 8 Learning targets
    ● Explain the data analysis cycle and apply tools for exploratory data analysis
    ● Explain approaches to statistical modelling and simulation and apply simulation-based methods
    ● Diagnose and interpret the linear model
  • 55. 9 Learning targets and study questions
    ● Explain the data analysis cycle and apply tools for exploratory data analysis
      ● Explain the steps of the data analysis cycle.
      ● Summarise the elements of exploratory analysis. Which graphical tools are essential?
    ● Explain approaches to statistical modelling and simulation and apply simulation-based methods
      ● Discuss the two different approaches to statistical modelling and links through simulation-based approaches.
      ● Explain the purpose and critically discuss permutation tests.
      ● Explain the purpose and critically discuss bootstrapping.
      ● Explain the main idea of cross-validation and discuss the selection of k with respect to the bias-variance trade-off.
  • 56. Learning targets and study questions
    ● Diagnose and interpret the linear model
      ● Describe the assumptions of linear regression and explain how they can be checked.
      ● Which types of outliers exist? When is an outlier important?
      ● Discuss the application of bootstrapping and cross-validation to the linear model.
  • 57. Data analysis cycle
    Figure (cycle diagram, modified from Zumel & Mount 2014: 6), elements:
    ● Research goal and questions
    ● e.g. empirical study, data compilation
    ● e.g. formatting, handling missing values
    ● Data exploration: e.g. outliers, skewness, distribution, linearity
    ● Statistical modelling
    ● Model diagnosis, evaluation and interpretation: Check model assumptions and interpret results.
    ● Publish results
    ● Identify limitations and open questions
    Notes: Zumel N. & Mount J. (2014) Practical data science with R. Manning Publications Co, Shelter Island, NY.
    Data exploration is drawn with a dashed line because whether and when it is conducted depends on the research context. Most frequently, data exploration (e.g. descriptive statistics such as data summaries) precedes statistical modelling, and the characteristics of the data set are explored to aid model selection. In some studies and disciplines, no statistical modelling is done at all and only descriptive statistics are reported. If a clear research hypothesis has been established before data collection, data exploration may not be required before statistical modelling. Still, the techniques of data exploration are needed to check model assumptions. Note that you must not establish research or statistical hypotheses after data exploration and then test them with the same data.
  • 59. Define research goal and question
    ● Research goals (e.g. prediction, estimation, inference) and questions should inform study design and methods
    ● Aim: Test scientific hypothesis → Formulate testable hypothesis
    Example
    ● Scientific hypothesis: Restoring stream stretches alters aquatic communities, resulting in different emerging insects on which riparian spiders prey. This affects the spiders’ body condition, derived from prosomal (pr.) and opisthosomal (op.) width.
    ● Question: Does the body condition of riparian spiders differ between restored and non-restored stream stretches?
    ● Testable hypothesis: The sample means for the body condition are drawn from populations with the same µ:
      $H_0: \mu_{\text{restored}} = \mu_{\text{non-restored}}$
      $H_1: \mu_{\text{restored}} \neq \mu_{\text{non-restored}}$
    Figure labels: Opisthosoma, Prosoma
    Notes: River restoration may lead to improvements such as increased species richness of the aquatic invertebrate community. Terrestrial predators in the riparian zone, such as spiders, may in turn benefit from an increase in the biomass and diversity of aquatic emergent prey. In a study we therefore compared the body condition, using a proxy based on prosomal and opisthosomal width, between non-restored and restored stream reaches. Statistical hypothesis testing consisted of comparing the central tendencies (sample means) using a paired t-test (each line corresponds to a different stream).
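The paired comparison mentioned in the notes can be sketched as follows. This is a minimal illustration in Python; the function name `paired_t_statistic` and the body-condition values are hypothetical and are not taken from the study. It only computes the test statistic and the degrees of freedom; the p-value would be looked up in a t distribution.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(restored, non_restored):
    """Return (t, df) for a paired t-test on two equally long samples."""
    # One difference per stream: restored reach minus non-restored reach
    diffs = [r - n for r, n in zip(restored, non_restored)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical body-condition values for five streams (illustration only)
restored = [1.32, 1.41, 1.28, 1.50, 1.37]
non_restored = [1.25, 1.38, 1.30, 1.41, 1.29]
t, df = paired_t_statistic(restored, non_restored)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Pairing by stream removes between-stream variation from the comparison, which is why the differences rather than the two group means are analysed.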
  • 61. Tools for data exploration
    GIGA: Garbage in – Garbage out
    Elements of data exploration – Checking for:
    1. Outliers (e.g. boxplot)
    2. Variance homogeneity (e.g. conditional boxplot)
    3. Normal distribution (e.g. QQ-plot)
    4. (Double) zeros (e.g. frequency plot)
    5. Collinearity (e.g. pairwise scatterplots)
    6. Relationship between explanatory and response variable (e.g. scatterplots)
    7. Spatial or temporal autocorrelation (e.g. variograms)
    ● Useful for inspecting data before modelling but also for model diagnosis
    ● Zuur et al. (2009) urge data inspection before modelling
    Notes: A recommended read is Zuur A.F., Ieno E.N. & Elphick C.S. (2009) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1, 3–14. http://onlinelibrary.wiley.com/wol1/doi/10.1111/j.2041-210X.2009.00001.x/full
    You have already encountered several of the elements of data exploration in the course and you will meet them again later. Double zeros often occur in species data (i.e. absence of a species in pairs of sites) and may complicate interpretation. In addition, many zeros in the response variable can lead to biased parameter estimates; in such a situation, models tailored for zero-inflated data should be used. For zero-inflated models see Zuur A.F. & Ieno E.N. (2016) Beginner’s Guide to Zero-Inflated Models with R. Highland Statistics. http://highstat.com/index.php/beginner-s-guide-to-zero-inflated-models
  • 62. Data exploration
    Figure: Common plots for looking at the data, used to check: Outliers? Asymmetry of distribution? Normality? Linearity? Collinearity?
    Notes: There are several rules of thumb for what can be regarded as an outlier, but the decision remains more or less subjective. John Tukey suggested defining Y as an outlier if $Y < Q_1 - 1.5\,\mathrm{IQR}$ or $Y > Q_3 + 1.5\,\mathrm{IQR}$, where $Q_1$ denotes the lower quartile, $Q_3$ the upper quartile, and $\mathrm{IQR} = Q_3 - Q_1$ the interquartile range. In practice, the type of data, the number of observations and knowledge about the data should be taken into account when deciding whether an observation is classified as an outlier.
    A beanplot is an alternative to a boxplot that has several advantages. Beanplots were introduced by Peter Kampstra: Kampstra P. (2008) Beanplot: A Boxplot Alternative for Visual Comparison of Distributions. Journal of Statistical Software, Code Snippets 28 (1), 1–9. Freely available at http://www.jstatsoft.org/v28/c01/ We will briefly look at a beanplot in the practical part.
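Tukey's rule of thumb from the notes can be written down in a few lines. The sketch below is in Python and uses only the standard library; the function name `tukey_outliers` and the example values are assumptions for illustration. Quartile definitions differ slightly between software packages, so the fences may not match other tools exactly.

```python
from statistics import quantiles

def tukey_outliers(values, factor=1.5):
    """Return the values outside Q1 - factor*IQR and Q3 + factor*IQR."""
    # Quartiles with linear interpolation between observations
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one suspiciously large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(tukey_outliers(data))
```

The rule only flags candidates; as the notes stress, whether a flagged value is treated as an outlier still depends on the type of data and on subject knowledge.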
  • 65. Statistical modelling: The two cultures (Breiman 2001, Statistical Science 16: 199)
    ● Real world: Processes lead to association between x and y
    ● Examples for goals of statistical modelling: predict unknown y from x, estimate how x is related to y
    ● Data modelling culture (classical statistics)
      ● Common data model
      ● Estimate parameters from data
      ● Model validation: Check residuals
    ● Algorithmic modelling culture (machine learning)
      ● Find algorithm that operates on x to predict y
      ● Model validation: Predictive accuracy
    Notes: Breiman L. (2001) Statistical modeling: The two cultures. Statistical Science 16, 199–215. The very readable debate is available here: https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
  • 66. Statistical modelling: the classical view
    ● Fit model to data to inform estimation, inference or prediction (e.g. estimate point or interval, test hypothesis)
      ● Example: The arithmetic mean $\bar{x}$ is an estimate of the true population mean $\mu$ and $s^2$ is an estimate of the true variance $\sigma^2$
    ● Most models incorporate a deterministic (fixed effect) and a stochastic component (random effect)
      ● Example: $y_i = b_0 + b_1 x_i + \varepsilon_i$ with $\varepsilon \sim N(0, \sigma^2)$
    ● All models rely on assumptions → Model diagnosis
      ● e.g. normal distribution, independence of observations
    ● Goodness-of-fit measures help to choose between multiple models that fit the data
      ● e.g. AIC, $R^2$, RMSE
    Notes: Any observation contains signal and noise. In a statistical model, these correspond to the fitted value and the residual.
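The example model on this slide, read as y_i = b0 + b1·x_i + ε_i with normally distributed errors, can be made concrete by simulating data from it and recovering the coefficients by least squares. This is a minimal sketch in Python; the parameter values (b0 = 2, b1 = 0.5, σ = 1), the sample size and the function names are assumptions chosen for illustration.

```python
import random

def simulate_linear(n, b0, b1, sigma, seed=1):
    """Simulate n observations: deterministic part plus normal noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [b0 + b1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_least_squares(xs, ys):
    """Ordinary least-squares estimates (b0_hat, b1_hat) for a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1_hat = sxy / sxx
    b0_hat = mean_y - b1_hat * mean_x
    return b0_hat, b1_hat

xs, ys = simulate_linear(n=500, b0=2.0, b1=0.5, sigma=1.0)
b0_hat, b1_hat = fit_least_squares(xs, ys)

# Signal and noise: fitted values and residuals
fitted = [b0_hat + b1_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}")
```

The fitted values estimate the deterministic component and the residuals estimate the stochastic component, which is why model diagnosis usually starts from the residuals.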
  • 67. Simulation-based approaches in data analysis
    ● Compatible with both cultures
    ● Infuses algorithm-based thinking into classical statistics
    ● Examples of simulation-based approaches for estimation, inference or model diagnosis in classical statistics:
      1. Permutation test → Permuting (shuffling) the data to derive a null distribution. Mainly used for inference
      2. Bootstrapping → Randomly sampling subsets from the data with replacement. Mainly used for estimating the distribution of a parameter
      3. Cross-validation (CV) → Splitting data into sets (i.e. sampling without replacement). Mainly used for validation of predictive models
  • 69. Permutation test: Algorithm
    1) Permute values in data set
    2) Compute test statistic t* for permuted data
    (repeat steps 1 and 2 k times)
    3) Compare test statistic t0 of the non-permuted data to the generated null distribution
  • 70. Permutation test: Algorithm
    1) Permute values in data set
    2) Compute test statistic t* for permuted data
    (repeat steps 1 and 2 k times)
    3) Compare test statistic t0 to generated null distribution
    Example: Permutation test of difference in group means, with test statistic $\bar{x}_{\text{group1}} - \bar{x}_{\text{group2}}$ (values split into Group 1 and Group 2)

    Data set          Values                                                      Test statistic
    Original dataset  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15          t0
    Permutation 1     x9 x6 x12 x7 x13 x2 x10 x4 x1 x15 x14 x8 x3 x11 x5          t1*
    ...               ...                                                         ...
    Permutation k     x14 x2 x7 x9 x5 x15 x3 x6 x8 x1 x12 x13 x10 x4 x11          tk*
  • 71. Permutation test: Generated distribution
    Figure: generated distribution of t* with t0 marked
    ● Test informs whether pattern in data is due to chance
    ● Inference regarding the statistical population is only valid if the distribution of the sample data matches the actual distribution of the statistical population → particularly problematic for small n
    $p = \dfrac{\sum_{i=1}^{k} I(t_i^* \le t_0)}{k + 1}$, where $I(\cdot)$ is 1 if the condition holds and 0 otherwise
    Notes: The p-value is computed as the fraction of test statistics t*, which are based on permuted data, that are more extreme (lower or higher, depending on the hypothesis) than the non-permuted test statistic t0.
    If the sample distribution deviates from the actual distribution of the statistical population, the permutation test only allows conclusions that apply to the data at hand. These may not be very interesting. For the example of the mean comparison, this would translate to being unable to test the null hypothesis $H_0: \mu_{\text{group1}} = \mu_{\text{group2}}$.
    What counts as a small sample size n depends on the context, and no single number applies to all situations. For example, it depends on the statistical distribution, the statistical test etc. As a rule of thumb, however, sample sizes < 30 for a population are small. Still, much larger sample sizes can be required to generalise reliably from the permutation test to the statistical population.
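The permutation algorithm and the p-value computation can be sketched together for the difference in group means. This is a minimal Python illustration, not a reference implementation: the function names, the `alternative` argument and the example data are assumptions, and the denominator of the p-value is read as k + 1 from the slide formula. The option "less" corresponds to the condition t* ≤ t0 shown on the slide; "greater" and "two-sided" cover the other hypotheses mentioned in the notes.

```python
import random

def mean_difference(values, n1):
    """Mean of the first n1 values minus mean of the remaining values."""
    group1, group2 = values[:n1], values[n1:]
    return sum(group1) / len(group1) - sum(group2) / len(group2)

def permutation_test(group1, group2, k=9999, alternative="two-sided", seed=1):
    """Monte Carlo permutation test; returns (t0, p-value)."""
    rng = random.Random(seed)
    n1 = len(group1)
    pooled = list(group1) + list(group2)
    t0 = mean_difference(pooled, n1)          # statistic of the original data
    extreme = 0
    for _ in range(k):
        rng.shuffle(pooled)                   # 1) permute the values
        t_star = mean_difference(pooled, n1)  # 2) statistic of permuted data
        if alternative == "less":
            extreme += t_star <= t0
        elif alternative == "greater":
            extreme += t_star >= t0
        else:                                 # two-sided comparison
            extreme += abs(t_star) >= abs(t0)
    # 3) compare t0 with the generated null distribution
    return t0, extreme / (k + 1)

# Hypothetical example data for two groups
t0, p = permutation_test([1, 2, 3, 4, 5, 6, 7, 8], [11, 12, 13, 14, 15, 16, 17])
print(f"t0 = {t0:.2f}, p = {p:.4f}")
```

A common variant adds 1 to the numerator as well, which counts the observed statistic as one of the permutations and avoids a p-value of exactly zero.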
  • 72. Permutation test: Advantages and limitations (Legendre & Legendre 2012: 25ff)
    ● Advantages
      ● Free from distributional assumptions
      ● Applicable to complex designs through restricting permutations
    ● Limitations
      ● Generalisation to statistical population requires matching distribution
      ● Statistical hypothesis testing can imply distributional assumptions that apply to the permutation test, if aiming to infer to the statistical population (e.g. testing for mean differences affected by variance)
      ● Computationally intensive: Number of all possible permutations for a dataset is factorial n, i.e. n! (e.g. $35! \approx 10^{40}$) → Monte Carlo simulation
    Notes: For instance, two null hypotheses are tested simultaneously (1. equality of means, 2. equality of variances) when testing for differences among sample means. This dual aspect of classical tests such as analysis of variance or the t-test also applies to the related permutation test and prevents unequivocal conclusions regarding the mean difference unless variance equality is considered.
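The size of the permutation space quoted on the slide is easy to check. The short Python sketch below computes 35! and, as an aside that is not on the slide, the number of distinct ways to split 35 values into two groups; the group sizes 17 and 18 are a hypothetical choice. Reordering values within a group does not change a difference in group means, so the number of distinct splits is the relevant count for that test statistic.

```python
import math

n = 35

# Number of all orderings of n values: n!
all_orderings = math.factorial(n)
print(f"{n}! = {all_orderings:.3e}")

# Distinct assignments of n values to groups of size 17 and 18
# (hypothetical group sizes): binomial coefficient C(35, 17)
distinct_splits = math.comb(n, 17)
print(f"C({n}, 17) = {distinct_splits:.3e}")
```

Both numbers are far too large for routine exhaustive enumeration, which motivates approximating the permutation distribution with a random subset of permutations.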
  • 73. Monte-Carlo simulation
    ● Uses repeated random sampling to solve problems probabilistically (even though they can be deterministic in reality)
    ● Permutation tests use random numbers to permute the data → the set of all possible permutations is approximated with MC simulation
    ● Legendre & Legendre (2012): use at least 10,000 permutations for inference
    Figures: Edvard Munch, At the Roulette Table in Monte Carlo; entrance of the casino in Monte Carlo, Monaco
    Notes: The name refers to the city; it was chosen as the code name for a secret project in the context of nuclear weapon research in Los Alamos, USA.
    The larger the number of MC-based permutations, the lower the error when approximating the distribution of all possible permutations with the MC-based permutations.
    Picture sources: Photo of casino: https://pixabay.com/de/spielbank-casino-monte-carlo-monaco-188882/ Picture by Munch: https://upload.wikimedia.org/wikipedia/commons/1/1f/Edvard_Munch_-_At_the_Roulette_Table_in_Monte_Carlo_-_Google_Art_Project.jpg
  • 75. Bootstrapping: Idea and algorithm
    ● Inference on statistic t is based on its sampling distribution
    ● Ideally: Draw all or many samples from statistical population
    ● Reality: Most frequently only one sample available
    ➔ Idea: Draw samples from an estimate of the statistical population (i.e. the sample) and use these to estimate a property (e.g. variance) of the statistic t
    ● Algorithm:
      1) Draw random sample of size n with replacement from data
      2) Compute statistic t* for bootstrap sample
      (repeat steps 1 and 2 k times)
      3) Use the k estimates to derive property of statistic
    ● Exhaustive bootstrapping ($k = n^n$) is computationally demanding → approximate with Monte Carlo simulation
    ● Given today's computing power, $10^4$ to $10^5$ simulations are viable
    Notes: The name bootstrapping alludes to the phrase “pulling oneself up by one’s bootstraps”, which is associated with the fictional character Baron Münchhausen.
    In analogy to permutation tests, the following applies to bootstrapping: The larger the number of MC-based bootstrap samples, the lower the error when approximating the bootstrap distribution with the MC-based samples.
  • 76. Bootstrapping: Example
    Example: Bootstrap of the mean (to derive its variance); statistic t (here: mean); sampling with replacement

    Data set          Values                                      Mean
    Original dataset  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15         x̄ = 8
    BS sample 1       15 7 8 4 15 11 9 1 3 6 14 2 11 12 1         x̄* = 7.93
    BS sample 2       5 7 8 8 15 10 10 1 13 6 13 2 10 12 3        x̄* = 8.2
    ...               ...                                         ...
    BS sample k       7 5 8 2 12 14 10 8 13 6 11 7 15 12 1        x̄* = 8.73

    The k bootstrap means form the distribution of statistic t.
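The bootstrap of the mean for the data set 1 to 15 can be reproduced with a short Monte Carlo sketch. The Python code below uses only the standard library; the function names, the choice of k = 10,000 and the simple percentile interval are assumptions for illustration and not the method prescribed in the lecture.

```python
import random
from statistics import stdev

def mean(values):
    return sum(values) / len(values)

def bootstrap_statistic(data, statistic, k=10000, seed=1):
    """Return k bootstrap estimates of `statistic`."""
    rng = random.Random(seed)
    n = len(data)
    # 1) draw n values with replacement, 2) compute the statistic; repeat k times
    return [statistic(rng.choices(data, k=n)) for _ in range(k)]

def percentile_interval(estimates, level=0.95):
    """Simple percentile interval from the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    k = len(ordered)
    alpha = (1 - level) / 2
    return ordered[int(alpha * k)], ordered[int((1 - alpha) * k) - 1]

data = list(range(1, 16))                 # original dataset, mean = 8
boot_means = bootstrap_statistic(data, mean)

# 3) use the k estimates to derive properties of the statistic
standard_error = stdev(boot_means)
lower, upper = percentile_interval(boot_means)
print(f"bootstrap SE of the mean: {standard_error:.2f}")
print(f"95% percentile interval: {lower:.2f} to {upper:.2f}")
```

The bootstrap distribution is centred near the sample mean of 8, so it describes the variability of the estimate rather than improving the estimate itself.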
  • 77. Bootstrapping: Limitations (Hesterberg 2015, Amer. Statist. 69: 371)
    ● Do not use for hypothesis testing
    ● No distributional assumptions implied, but not reliable for all distributions, particularly at small n (see Hesterberg 2015)
    ● Small n: use adjusted bootstrap percentiles (BCa) or switch to parametric statistics (allow for additional assumptions)
    ● Bootstrap does not improve the estimate of the population parameter µ: the bootstrap distribution is centred at x̄
    Notes: Bootstrapping is generally less accurate than permutation tests for hypothesis testing. BCa corrects for bias and skewness in the distribution of bootstrap estimates.
    A very nice introduction and overview on bootstrapping is provided by: Hesterberg T.C. (2015) What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. The American Statistician 69, 371–386. Freely available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784504/pdf/utas-69-371.pdf
  • 78. Statistical modelling, simulation and the linear model
    Contents
    1. Framework for data analysis and tools for data exploration
    2. Statistical modelling and simulation-based tools
    3. Permutation and Monte Carlo simulation
    4. Bootstrapping
    5. Cross-Validation and Bias-variance trade-off
    6. Revisiting the linear model
  • 79. Cross-validation (CV)
    ● Objective: evaluate the predictive accuracy of a fitted model
    ● Predictive accuracy can be checked directly if independent training data (used to fit the model) and test data (new data) are available → rare case
    ● Idea: split the available data into a training and a test set and predict the (known) observations in the test set from a model fitted with the training data
    ● Algorithm:
      1. Split the data at random into k parts (i.e. draw k random samples without replacement)
      2. For each of the k parts:
         1. Fit the model to the other k-1 parts
         2. Predict the held-out part from the model and calculate the prediction error
      3. Calculate the overall prediction error as the average over the k estimates
    Predictive accuracy measures the accuracy of predictions for new data. CV is typically used in validation, but can also be used as a goodness-of-fit measure to guide parameter estimation (see shrinkage methods later).
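The algorithm above can be sketched for a simple linear regression as follows. This is a minimal illustration in Python (standard library only) on simulated data; the helper names `fit_line` and `k_fold_mspe`, the simulated relationship and the seeds are assumptions for this example.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def k_fold_mspe(x, y, k=5, seed=1):
    """k-fold cross-validated mean square prediction error (illustrative sketch)."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    rng.shuffle(idx)                        # step 1: random split without replacement
    folds = [idx[i::k] for i in range(k)]   # k disjoint parts of similar size
    fold_errors = []
    for test in folds:                      # step 2: each part is held out once
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        squared = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k, folds      # step 3: average over the k estimates

# Simulated data: y = 1 + 2x + N(0, 1) noise (assumption for this example)
noise = random.Random(42)
x = [i / 2 for i in range(40)]
y = [1 + 2 * xi + noise.gauss(0, 1) for xi in x]

cv_mspe, folds = k_fold_mspe(x, y, k=5)
print(f"5-fold CV-MSPE: {cv_mspe:.2f}")
```

Because the simulated error variance is 1, a CV-MSPE in the vicinity of 1 is what one would expect here; setting `k=len(x)` turns the same function into leave-one-out CV.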
  • 80. Cross-validation (CV)
    ● Problem of choosing k:
      ● k = n (leave-one-out CV predicts each observation from all others) → low bias, but high variance
      ● k = 2 (split data into halves) → low variance, but high bias
      ● k typically set to 5 or 10
    Figure (taken from James et al. 2013: 181): example for k = 5.
    The bias-variance trade-off will be discussed in detail on the next slides. In brief, there is a trade-off between bias (error when estimating the 'true' prediction accuracy of the sample data) and variance (variability of the error when estimating new (test) data).
    If we use a major fraction of the data in model fitting (extreme case: k = n, where we use n-1 observations), the error of estimating the prediction accuracy of the full data is probably very low (low bias). However, the variability of the error when predicting a few (or only one for k = n) observations from different training sets is most likely high, which translates to a high variance.
    Conversely, if we use only half of the data (k = 2) in model fitting, we decrease the variance. In other words, the error when predicting the test set is most likely similar for the two training sets. But this comes at the cost of bias. In the case of k = 2, we are estimating the predictive accuracy from only a fraction of the data, whereas in practice all observations will be used in prediction. The prediction accuracy estimated from the fraction of the data is likely to differ (i.e. be lower or higher) from that of the complete data set, i.e. exhibit bias. Thus, the bias increases when the relative size of the training set in CV decreases.
    k is typically set to 5 or 10, i.e. the data are partitioned into 5 or 10 groups during CV as a compromise between bias and variance. Leave-one-out CV is considered less reliable than 5- or 10-fold CV (see Harrell 2015: 172).
  • 81. Bias-variance trade-off
    Definitions in the context of model validation:
    ● Bias: error when approximating training data
    ● Variance: variability in error when approximating test data
    Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point
    Figure (taken from James et al. 2013: 33). Left: function used to simulate data, with fits of a linear regression, a little flexible smoother and a highly flexible smoother. Right: error for training data and test data (variance).
    The left figure displays the fit of different models to data originating from the function plotted in black. The models rank regarding bias: linear regression > little flexible smoother > highly flexible smoother. Regarding variance (see right figure), the ranking is: highly flexible smoother > little flexible smoother > linear regression.
  • 82. Bias-variance trade-off
    Higher flexibility (higher k in CV) → lower error for training data (i.e. lower bias), but variance will increase from some point → Optimise combined error
    Figure taken from Hastie, Tibshirani and Friedman 2011: 38.
    For a mathematical derivation of the bias-variance trade-off see Matloff (2017): 48f.
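As a pointer to what the "combined error" refers to, the standard decomposition of the expected squared prediction error at a new point x_0 is sketched below. It uses the usual statistical definitions of bias and variance of an estimated function (as in James et al. 2013), which are phrased somewhat differently from the working definitions on the previous slide; the symbols f, \hat{f}, x_0 and y_0 are introduced here for illustration only.

```latex
% Assumption: y_0 = f(x_0) + \epsilon with E[\epsilon] = 0 and
% \epsilon independent of the fitted function \hat{f}.
\begin{align*}
E\left[(y_0 - \hat{f}(x_0))^2\right]
  &= E\left[(f(x_0) - \hat{f}(x_0))^2\right] + \mathrm{Var}(\epsilon) \\
  &= \underbrace{\left(f(x_0) - E[\hat{f}(x_0)]\right)^2}_{\text{squared bias}}
   + \underbrace{\mathrm{Var}\left(\hat{f}(x_0)\right)}_{\text{variance}}
   + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible error}}
\end{align*}
```

The first step uses that the cross term vanishes because the new error has mean zero and is independent of the fitted function. More flexible models tend to reduce the squared bias but increase the variance, so the sum is typically minimised at an intermediate flexibility.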
  • 83. Statistical modelling, simulation and the linear model
    Contents
    1. Framework for data analysis and tools for data exploration
    2. Statistical modelling and simulation-based tools
    3. Permutation and Monte Carlo simulation
    4. Bootstrapping
    5. Cross-Validation and Bias-variance trade-off
    6. Revisiting the linear model
  • 84. Relationship between two continuous variables: linear regression model
    ● Bivariate relationship between an explanatory variable x and a response variable y: $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
    ● Example: Can we approximate pesticide runoff concentrations with passive sampling? (Fernández et al. 2014)
    The figure shows the concentrations of pesticides measured with passive samplers (TWA concentrations) and event-driven samplers (EDS peak concentrations). The concentrations are relatively similar for pesticides that were quantified in samples from both sampling devices, i.e. they follow almost a 1:1 relationship, which means that passive sampling is a suitable technique to approximate peak concentrations. The non-filled points indicate cases where a compound was only quantified in samples of one of the sampling devices.
    Further details can be found in: Fernández D., Vermeirssen E.L.M., Bandow N., Muñoz K. & Schäfer R.B. (2014) Calibration and field application of passive sampling for episodic exposure to polar organic pesticides in streams. Environmental Pollution 194, 196–202.
  • 85. Relationship between two continuous variables: linear regression model
    ● Bivariate relationship between an explanatory variable x and a response variable y: $y_i = b_0 + b_1 x_i + \epsilon_i$ with $\epsilon \sim N(0, \sigma^2)$
    ● Aim: minimise $\sum_{i=1}^{n} \epsilon_i^2$ (also called error sum of squares: SSE)
    The fitted values for the regression model, i.e. the estimates for y, are given as: $\hat{y}_i = b_0 + b_1 x_i$
    A measure that is similar to the SSE is the Mean Squared Error (MSE), which is given as: $MSE = \frac{1}{n-p-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    The denominator accounts for the number of explanatory variables p and the intercept and requires adjustment in case of no-intercept models (i.e. the denominator would turn into n-p). MSE is typically used when assessing the quality of the estimation. If the predictive accuracy is assessed, the Mean Squared Prediction Error (MSPE) is used for new observations $y_{n+1}$ to $y_m$.
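The quantities on this slide can be computed by hand for a small data set. The sketch below (Python, standard library only) uses five made-up observations; the data and variable names are assumptions for illustration.

```python
# Hypothetical data: five (x, y) pairs
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
p = 1                                    # one explanatory variable
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares estimates of slope (b1) and intercept (b0)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * xi for xi in x]                 # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # estimates of the errors

sse = sum(e ** 2 for e in residuals)     # error sum of squares
mse = sse / (n - p - 1)                  # denominator: n minus slope minus intercept

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SSE = {sse:.3f}, MSE = {mse:.4f}")
```

For these numbers the fit is b0 = 0.05 and b1 = 1.99, with SSE = 0.107 and MSE = 0.107 / 3 ≈ 0.0357.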
  • 86. Linear regression model
    $SSY = SSR + SSE$ (total variation = explained variation + unexplained variation)
    % of explained variance: $R^2 = \frac{SSR}{SSY}$
    adj. $R^2 = 1 - (1 - R^2) \frac{n-1}{n-p-1}$
    SSR refers to the regression sum of squares and can be calculated as the summed squared differences between the fitted values and the mean of the response variable. SSE and SSY are defined as for the analysis of variance (ANOVA). Indeed, both ANOVA and linear regression are linear models and in R most functions apply to either of them.
    In simple linear regression, the square root of $R^2$ has the same absolute value as the Pearson correlation coefficient. $R^2$ is typically used to measure the goodness of fit of a regression model. The adjusted $R^2$ should be preferred over the normal $R^2$ as it takes the number of explanatory variables p into account (n is the sample size). The denominator is n-p-1, accounting for the number of explanatory variables p and the intercept. However, this is more important in the case of multiple linear regression.
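The decomposition and both versions of R² can be verified numerically. The following sketch (Python, standard library only) again uses five made-up observations; data and variable names are assumptions for illustration.

```python
import math

# Hypothetical data: five (x, y) pairs
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n, p = len(x), 1
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * xi for xi in x]

ssy = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ssr = sum((fi - mean_y) ** 2 for fi in fitted)         # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained variation

r2 = ssr / ssy
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
pearson_r = sxy / math.sqrt(sxx * ssy)   # Pearson correlation of x and y

print(f"SSY = {ssy:.3f}, SSR + SSE = {ssr + sse:.3f}")
print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, r^2 = {pearson_r ** 2:.4f}")
```

The printed values illustrate that SSY equals SSR + SSE, that the squared Pearson correlation matches R² in the bivariate case, and that the adjusted R² is slightly smaller than R².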
  • 87. Linear regression model
    ● Assumptions:
      ● Linear relationship (graphical diagnostics)
      ● Normal distribution of error (graphical diagnostics)
      ● Variance homogeneity (graphical diagnostics)
      ● Independence of errors (graphical diagnostics)
    ● If one or more assumptions are not met, alternatives include:
      ● Generalised linear model, Generalised least squares, Mixed models
      ● Variable transformation (but using an appropriate model such as a Generalised linear model is usually the better option)
    Although hypothesis tests for checking the assumptions exist, most textbooks recommend graphical diagnostics.
    For data that are spatially or temporally structured, or nested/hierarchically structured, the independence of errors assumption is typically violated. Since time series and spatial data are beyond the scope of this course (and are discussed in the “Advanced GIS” course), I refer to Faraway (2015): Linear models with R. p.81-83 for diagnostic tools to spot serial correlation, or see Plant (2012): Spatial data analysis in ecology and agriculture using R. Spatial and temporal structure can be incorporated into the model using generalised least squares (see chapter 4 in Zuur, A. F. et al. 2009: Mixed effects models and extensions in ecology with R. Springer: New York). Nested/hierarchically structured data can be modelled with mixed effect models, which are discussed in the first part of this course. We will also discuss generalised linear models later in this course.
    Variable transformation and robust regression are discussed in many textbooks (e.g. Maindonald & Braun 2010, Quinn & Keough 2002) and are beyond the scope of this course (but variable transformation has been extensively discussed in the preceding course on univariate statistics).
    In linear regression analysis, we usually do not take the measurement error in x into account. This is discussed in detail in Warton, D.I., Wright, I.J., Falster, D.S., and Westoby, M. (2006). Bivariate line-fitting methods for allometry. Biological Reviews 81, 259-291. Warton et al. (2006) also provide information on alternatives to linear regression that should be used if the measurement error is relevant.
  • 88. Model diagnostics: Variance homogeneity
    Figure: residuals vs. fitted values plots with the panels “normal”, “strong increase”, “non-linear” and “slight increase”.
    The graphical diagnostics for checking variance homogeneity are the same for linear regression and ANOVA (and other linear models), but the x-axis of ANOVA (and t-test) would display the factor levels and, consequently, the plots would not describe a continuous pattern.
    The displayed residuals vs. fitted values plots can be used to check whether the assumption of variance homogeneity, also termed homoscedasticity, (and the assumption of linearity in the case of regression) holds. The residuals should be randomly distributed (upper right). If they display patterns instead, this may indicate variance heterogeneity, also termed heteroscedasticity (bottom and top left), or non-linearity (bottom right).
    In case of departures from the assumption of homoscedasticity, generalized least squares, generalized linear or additive models can represent a suitable alternative for continuous data. Depending on the data, data transformation or weighting observations may also be used to alleviate the issue, though transformation should only be considered if the data cannot be modelled non-transformed.
  • 89. Further model diagnostics
    Figure: leverage points (predictor outliers), showing a leverage point that exerts high influence and a non-influential leverage point.
    How to deal with leverage points/outliers?
    ● Check whether values are plausible
    ● Check robustness of model results when removing observations
    ● Fit a different statistical model or transform data
    Besides checking the assumptions, model diagnostics are used to detect influential points, leverage points (predictor outliers) and model outliers (outliers in the response variable indicating model failure). Influential points exert high influence on the model fit, but may not be outliers. Leverage points and outliers do not fit the model, but are not necessarily influential.
    Leverage points (1) exert high influence on the fitted y (but not necessarily on the model fit) and (2) are distant from the other x-values. The leverage is calculated in terms of so-called hat values, which will be explained later in the course, and the average hat value is p/n, where p is the number of parameters in the model (including the intercept) and n is the number of observations. Faraway (2015): 83 and Sheather (2009) suggest looking more closely at points with hat values > 2p/n. However, hat values are independent of the response variable y and graphical inspection is most suitable to check whether a high leverage point is really problematic. A nice illustration of leverage can be found here: http://www.rob-mcculloch.org/teachingApplets/Leverage/index.html.
    Outliers in the response variable can be identified with studentized residuals. Here, points that deviate more than 2 standard deviations from the regression line may be considered as outliers (see Sheather 2009: p.60). There are of course different rules of thumb as to what can be regarded as an outlier, but it remains more or less a subjective decision. John Tukey suggested defining Y as an outlier if $Y < Q_1 - 1.5\,IQR$ or $Y > Q_3 + 1.5\,IQR$, where $Q_1$ denotes the lower quartile, $Q_3$ denotes the upper quartile, and $IQR = Q_3 - Q_1$ denotes the interquartile range. Hence, you could use a boxplot to identify an outlier.
    Another important measure in diagnostic plots is Cook's distance. Cook's distance measures the influence of observations on the model fit by calculating the combined effect of leverage and of the magnitude of the residual. The higher Cook's distance, the larger the change in model fit when the point is removed from the model. A point with a high Cook's distance tends to be either an outlier or a leverage point or both. There are different rules of thumb as to when to consider a point as influential (e.g. Cook's D > 1 or Cook's D > 4/(n-2)), but in practice it is important to look for gaps in the values of Cook's distance (Sheather 2009: p.68).
    Methods such as robust regression and quantile regression have been developed to reduce the influence of influential points. However, they are outside the scope of this course.
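For simple linear regression, the diagnostics described above have closed forms, which the following sketch computes (Python, standard library only). The data are made up and contain one point with an unusual x value and a response below the trend of the other points; the residuals used are the standardised (internally studentized) residuals, and all names are assumptions for this example.

```python
import math

# Hypothetical data: the last point is far from the other x values
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 25.0]

n, p = len(x), 2                      # p = number of parameters (intercept + slope)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - p))   # residual standard error

# Hat values (leverage): depend on x only, average is p / n
hat = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Standardised (internally studentized) residuals
std_res = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, hat)]

# Cook's distance: combines residual size and leverage
cooks = [r ** 2 / p * h / (1 - h) for r, h in zip(std_res, hat)]

# Rules of thumb named in the notes above
high_leverage = [i for i, h in enumerate(hat) if h > 2 * p / n]
outliers = [i for i, r in enumerate(std_res) if abs(r) > 2]
influential = [i for i, d in enumerate(cooks) if d > 4 / (n - 2)]

print("high leverage:", high_leverage)
print("possible outliers:", outliers)
print("influential (Cook's D > 4/(n-2)):", influential)
```

In this constructed example the last observation (index 9) is flagged by all three rules; with real data, the flagged points should be inspected graphically rather than removed automatically.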
  • 90. Flowchart for simple linear regression
    Figure taken from Sheather 2009: p.103.
    Note that this flowchart only serves the purpose of giving orientation, whereas the suggestions may differ from the suggestions presented in this lecture. For example, if the errors do not have constant variance, the flowchart suggests the addition of new terms to the model and/or variable transformation. However, we have discussed in the lecture that other model types such as the Generalized linear model can be more appropriate for the data. Hence, before transformation of data, you should check whether the data can be directly modelled using a Generalized linear model or others (cf. Szöcs & Schäfer 2015). Note also that the bootstrap may not be reliable for small sample sizes, see the part on bootstrapping.
    Szöcs E. & Schäfer R. (2015) Ecotoxicology is not normal. Environmental Science and Pollution Research 22, 13990–13999.
  • 91. Simulation-based approaches to simple linear regression
    ● Predictive accuracy is measured with the mean square prediction error (MSPE): $MSPE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$ for the new observations 1 to m
    ● Cross-validation (CV): calculate CV-MSPE and CV-$R^2$
    ● Bootstrapping (BS) in regression analysis:
      ● of residuals: BS the residuals, add them to $\hat{\mathbf{y}}$ to generate new $\mathbf{y}^*$ and calculate regression coefficients → x fixed
      ● of cases: BS complete cases and calculate regression coefficients → x random
    ● If x and y are a random sample (e.g. x not fixed in an experiment), or the residuals are correlated or exhibit non-constant variance → BS cases
    Bold variables indicate vectors. The mean square prediction error measures how well a new observation is predicted. For the relationship with other error measures see here.
    The algorithm for residual bootstrapping is most easily understood when reformulating the equation for the ordinary linear regression model (see here for details) to: $\epsilon_i = y_i - (b_0 + b_1 x_i) \Leftrightarrow \epsilon_i = y_i - \hat{y}_i$
    The bootstrap samples (samples with replacement) are drawn from the n residuals $\epsilon_1, \epsilon_2, ..., \epsilon_n$, yielding a bootstrap sample of residuals $\epsilon_1^*, \epsilon_2^*, ..., \epsilon_n^*$. These bootstrapped residuals are added to the vector of fitted responses ($\hat{\mathbf{y}}$) to obtain a vector of new responses $\mathbf{y}^*$: $y_i^* = \hat{y}_i + \epsilon_i^*$
    These new responses are used to calculate new bootstrapped regression coefficients (i.e. $b_0^*, b_1^*$). The procedure is repeated 1,000 to 10,000 times and, as usual in bootstrapping, delivers the distribution for a statistic t* (here for $b_0$ and $b_1$).
    For bootstrapping cases, pairs of x and y are bootstrapped and then the regression model is fitted, also providing bootstrapped regression coefficients (i.e. $b_0^*, b_1^*$).
    Now when to use what? If the residuals exhibit non-constant variance or are correlated, the bootstrap sample of residuals does not preserve the properties of the population sample and bootstrapping of cases should be preferred. However, if the observations for the predictors (x) have not been drawn randomly, but are fixed (for example, fixed concentration levels in an experiment), bootstrapping residuals should be preferred as it preserves these original x. For further details see Fox (2015): 658-660 and Hesterberg (2015) Amer. Statist. 69: 371–386.
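The two bootstrap variants described above can be contrasted in a short sketch. The code below (Python, standard library only) uses simulated data; the helper names, the simulated relationship, the number of replicates and the seeds are assumptions for illustration, not the procedure used in the course exercises.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def bootstrap_residuals(x, y, n_boot=1000, seed=1):
    """Residual bootstrap: x stays fixed, new responses y* = fitted + resampled residual."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    coefs = []
    for _ in range(n_boot):
        e_star = rng.choices(resid, k=len(resid))        # resample residuals
        y_star = [f + e for f, e in zip(fitted, e_star)]
        coefs.append(fit_line(x, y_star))
    return coefs

def bootstrap_cases(x, y, n_boot=1000, seed=1):
    """Case bootstrap: (x, y) pairs are resampled together, so x is random."""
    rng = random.Random(seed)
    n = len(x)
    coefs = []
    while len(coefs) < n_boot:
        idx = rng.choices(range(n), k=n)                 # resample complete cases
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2:                             # skip degenerate resamples
            continue
        coefs.append(fit_line(xs, ys))
    return coefs

# Simulated data: y = 1 + 2x + N(0, 1) noise (assumption for this example)
noise = random.Random(7)
x = [i * 0.5 for i in range(30)]
y = [1 + 2 * xi + noise.gauss(0, 1) for xi in x]

b0, b1 = fit_line(x, y)
slopes_res = sorted(c[1] for c in bootstrap_residuals(x, y))
slopes_case = sorted(c[1] for c in bootstrap_cases(x, y))

# Simple 95 % percentile intervals for the slope
print(f"slope estimate: {b1:.3f}")
print(f"residual bootstrap: {slopes_res[25]:.3f} to {slopes_res[974]:.3f}")
print(f"case bootstrap:     {slopes_case[25]:.3f} to {slopes_case[974]:.3f}")
```

With homoscedastic, independent errors as simulated here, both variants should give similar intervals; they are expected to diverge when the error variance changes with x, which is the situation in which the notes above recommend bootstrapping cases.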
  • 92. Exercise
    We will work with the data set “possum” that includes biometric measurements of possums in Victoria, Australia. Conduct a linear regression analysis, diagnose and interpret the model and apply simulation-based approaches.