Cross-validation: what, how and which?
Pradeep Reddy Raamana
raamana.com

“Statistics [from cross-validation] are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Goals for Today
• What is cross-validation? (training set vs. test set)
• How to perform it?
• What are the effects of different CV choices? (negative bias, unbiased, positive bias)
What is generalizability?
• Available: data (a sample*). Desired: accuracy on unseen data (the population*).
• Generalizability is estimated through out-of-sample predictions, which help avoid overfitting.
*sample and population have statistical definitions.
CV helps quantify generalizability.
Why cross-validate?
• A bigger training set gives better learning; a bigger test set gives better testing.
• Key: the training and test sets must be disjoint, and the dataset (sample size) is fixed.
• They grow at the expense of each other, so cross-validate to maximize both.
Use cases
• “When setting aside data for parameter estimation and validation of results cannot be afforded, cross-validation (CV) is typically used.”
• Use cases:
  • to estimate generalizability (test accuracy)
  • to pick optimal parameters (model selection)
  • to compare performance (model comparison)
[Figure: accuracy distribution (%) from repetitions of CV.]
Key Aspects of CV
1. How you split the dataset into train/test
  • Maximal independence between the training and test sets is desired.
  • The split could be over samples (rows; e.g. individual diagnosis, healthy vs. disease) or over time (columns; e.g. task prediction in fMRI).
2. How often you repeat the randomized splits
  • To expose the classifier to the full variability of the data.
  • As many times as you can, e.g. 100.
Both kinds of split are sketched below.
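A minimal sketch of these two kinds of splits, assuming scikit-learn and toy arrays X, y plus a per-sample block label (all illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

# toy data: 40 samples x 5 features, binary labels (e.g. healthy vs. disease)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)
groups = np.repeat(np.arange(8), 5)  # block (session/subject) label per sample

# split over samples (rows), preserving class proportions in every fold
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# split over blocks: all samples of one session/subject stay on the same side
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```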
Validation set
• Training set: measures goodness of fit of the model; biased* towards the training set.
• Test set: used to optimize parameters; biased towards the test set.
• Validation set: used to evaluate generalization; independent of both the training and test sets.
• Within the whole dataset, the train/test splits form the inner loop and the held-out validation set forms the outer loop (see the nested-CV sketch below).
*biased towards X, i.e. overfit to X.
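One common way to realize this inner/outer structure is nested cross-validation. A hedged sketch with scikit-learn follows; the SVC classifier, the parameter grid and the toy data are illustrative assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# inner loop: tune hyperparameters via train/test splits inside the training portion
inner = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)

# outer loop: each outer fold acts as a validation set the tuning never touched
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```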
Terminology
Training set
  • Purpose (do’s): train the model to learn its core parameters.
  • Don’ts (invalid use): don’t report the training error as the test error!
  • Alternative name: training set (no confusion).
Testing set
  • Purpose (do’s): optimize hyperparameters.
  • Don’ts (invalid use): don’t do feature selection or anything supervised on the test set to learn or optimize!
  • Alternative names: validation set (or tweaking, tuning, optimization set).
Validation set
  • Purpose (do’s): evaluate the fully-optimized classifier to report performance.
  • Don’ts (invalid use): don’t use it in any way to train the classifier or optimize parameters.
  • Alternative name: test set (more accurately, reporting set).
A minimal three-way split is sketched below.
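For intuition, a minimal sketch of carving one dataset into these three splits with scikit-learn; the 60/20/20 proportions and the toy data are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# hold out 20% as the validation (reporting) set, untouched until the very end
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# split the remainder into training (learn core parameters)
# and testing (tune hyperparameters); 0.25 of 80% = 20% of the whole dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```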
K-fold CV
• The data is split into k folds; in each trial, one fold (e.g. the 4th) serves as the test set and the remaining folds are used for training.
• Test sets in different trials are indeed mutually disjoint.
• Note: different folds won’t be contiguous.
A minimal sketch follows.
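A minimal k-fold sketch in scikit-learn, assuming toy data and a logistic-regression classifier (both are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)

accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    # each trial tests on a fold that is disjoint from all other test folds
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(accuracies))
```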
Repeated Holdout CV
• In each of n trials, set aside an independent subsample of the whole dataset (e.g. 30%) for testing and train on the rest.
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended (see the sketch below).
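A hedged sketch of repeated holdout using scikit-learn's ShuffleSplit; the 30% test size follows the slide, while the classifier and the toy data are assumptions:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)

# n trials, each setting aside an independent 30% subsample for testing;
# test sets from different trials may overlap, hence the large n
holdout = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = [LogisticRegression().fit(X[tr], y[tr]).score(X[te], y[te])
          for tr, te in holdout.split(X)]
print(np.mean(scores), np.std(scores))
```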
CV has many variations!
• k-fold, k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling)
  • train % = 50, 63.2, 75, 80, 90
• stratified
  • across train/test
  • across classes
  (Illustration: Controls vs. MCIc, with each class split into its own training and test portions.)
• inverted: very small training set, large test set
• leave one [unit] out, where the unit can be a sample / pair / tuple / condition / task / block
Sketches of stratified and leave-one-group-out splits follow.
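Two of these variants sketched with scikit-learn; the class labels, group structure and split counts are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 6))
y = np.repeat([0, 1], 24)             # e.g. controls vs. MCIc
groups = np.repeat(np.arange(12), 4)  # e.g. one subject/block label per sample

# stratified repeated hold-out: class proportions preserved in train and test
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    pass  # fit / evaluate here

# leave one [unit] out, where the unit is a subject / block / condition
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    pass  # the test set holds exactly one group in each trial
```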
Measuring bias in CV measurements
• Split the whole dataset into an inner-CV portion (training and test sets) and a held-out validation set.
• Compare the cross-validation accuracy (from the inner CV) against the validation accuracy: ideally they are ≈ equal.
• The direction of the mismatch indicates positive bias, no bias, or negative bias; a minimal sketch of the comparison follows.
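A minimal sketch of this comparison, assuming scikit-learn, a toy dataset, and an SVC classifier (all illustrative; the slides do not specify an implementation):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

# carve off a validation set that plays no role in the inner CV
X_in, X_val, y_in, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(C=1.0)
cv_accuracy = cross_val_score(clf, X_in, y_in, cv=10).mean()  # inner-CV estimate
val_accuracy = clf.fit(X_in, y_in).score(X_val, y_val)        # held-out estimate

# positive difference -> the CV estimate is optimistically (positively) biased
print(cv_accuracy - val_accuracy)
```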
fMRI datasets
Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
Haxby   | Intra | 209 | 12 sessions | various
Duncan  | Inter | 196 | 49 subjects | various
Wager   | Inter | 390 | 34 subjects | various
Cohen   | Inter |  80 | 24 subjects | various
Moran   | Inter | 138 | 36 subjects | various
Henson  | Inter | 286 | 16 subjects | various
Knops   | Inter |  14 | 19 subjects | various
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Repeated holdout (10 trials, 20% test)
[Figure: classifier accuracy via cross-validation plotted against classifier accuracy on the validation set; points are annotated as unbiased, negatively biased, or positively biased.]
CV vs. Validation: real data
[Figure: CV accuracy vs. validation accuracy on real data; annotations: negative bias, unbiased, positive bias, conservative.]
Simulations: known ground truth
CV vs. Validation
[Figure: CV accuracy vs. validation accuracy; annotations: negative bias, unbiased, positive bias.]
Commensurability across folds
• It’s not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures are commensurate across folds.
  • e.g. decision scores from an SVM: the reference hyperplane (and hence its zero) differs from fold to fold,
  • hence they cannot be pooled across folds to construct a single ROC!
• Instead, make an ROC per fold, compute the AUC per fold (AUC1, AUC2, …, AUCn), and then average the AUC across folds, as sketched below.
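A hedged sketch of the per-fold recipe with scikit-learn; the SVM, the fold count and the toy data are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.integers(0, 2, size=120)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])  # comparable only within this fold
    fold_aucs.append(roc_auc_score(y[test_idx], scores))  # ROC/AUC per fold

# average the per-fold AUCs instead of pooling decision scores into one ROC
print(np.mean(fold_aucs))
```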
Performance Metrics
• Accuracy / error rate
  • Commensurate across folds? Yes.
  • Advantages: universally applicable; handles multi-class problems.
  • Disadvantages: sensitive to class- and cost-imbalance.
• Area under the ROC (AUC)
  • Commensurate across folds? Only when the ROC is computed within each fold.
  • Advantages: averages over all ratios of misclassification costs.
  • Disadvantages: not easily extendable to multi-class problems.
• F1 score
  • Commensurate across folds? Yes.
  • Advantages: standard in information retrieval.
  • Disadvantages: does not take true negatives into account.
Subtle Sources of Bias in CV
• k-hacking: trying many k’s in k-fold CV (or different training %) and reporting only the best.
  • How to avoid it: pick k=10, repeat it many times (n>200 or as many as possible) and report the full distribution (not box plots).
• metric-hacking (m-hacking): trying different performance metrics (accuracy, AUC, F1, error rate) and reporting the best.
  • How to avoid it: choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification.
• ROI-hacking (r-hacking): assessing many ROIs (or their features, or combinations), but reporting only the best.
  • How to avoid it: adopt a whole-brain data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy.
• feature- or dataset-hacking (d-hacking): trying subsets of feature[s] or subsamples of dataset[s], but reporting only the best.
  • How to avoid it: use and report on everything: all analyses on all datasets, try inter-dataset CV, run non-parametric statistical comparisons!
*The exact incidence of these hacking approaches (the “sexy names” are made up) is unknown, but non-zero.
Overfitting
[Figure: fits to the same data, labelled Underfit, Overfit, and Good fit.]
50 shades of overfitting
[Figure: population extrapolated over decades from an overfit model (“human annihilation?”). © MathWorks]
See also the Google Flu Trends episode:
Reference: David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science, 14 March, 343: 1203-1205.
“Clever forms of overfitting”
From http://hunch.net/?p=22
Is overfitting always bad?
• Some credit card fraud detection systems use it successfully.
• Others?
Limitations of CV
• The number of CV repetitions needed increases with:
  • sample size: a larger sample means more repetitions, which is costly especially if model training is computationally expensive;
  • the number of model parameters (exponentially), in order to choose the best combination.
Recommendations
• Ensure the test set is truly independent of the training set!
  • It is easy to commit mistakes in complicated analyses.
• Use repeated holdout (10-50% for testing),
  • respecting the sample/dependency structure,
  • ensuring independence between the train and test sets.
• Use the biggest test set and a large number of repetitions when possible.
  • This is not possible with leave-one-sample-out.
Conclusions
• Results could vary considerably with a different CV scheme.
• CV results can have high variance (>10%).
• Document the CV scheme in detail:
  • type of split
  • number of repetitions
  • the full distribution of estimates
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples of reporting: just the mean (𝜇%), or mean ± std. dev. (𝜇±𝜎%).
• Good example: “Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC.” A minimal sketch of such reporting follows.
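A minimal sketch of producing such a distribution with scikit-learn; the classifier and toy data are assumptions, and the 250 repeats mirror the example wording (reduce n_repeats for a quick run):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 250 iterations of 10-fold CV -> 2500 estimates; report the full distribution
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=250, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring='roc_auc')

print(len(scores))
print(np.percentile(scores, [5, 25, 50, 75, 95]))  # summarize the whole spread
```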
References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.
neuropredict: github.com/raamana/neuropredict
Acknowledgements
Gael Varoquaux

Now, it’s time to cross-validate! (xkcd)
Useful tools
Software/toolbox | Target audience | Language | # of ML techniques | Neuroimaging-oriented? | Coding required? | Effort needed | Use case
scikit-learn | Generic ML | Python | Many | No | Yes | High | To try many ML techniques
neuropredict | Neuroimagers | Python | 1 (more soon) | Yes | No | Easy | Quick evaluation of predictive performance!
nilearn | Neuroimagers | Python | Few | Yes | Yes | Medium | Some image processing is required
PRoNTo | Neuroimagers | Matlab | Few | Yes | Yes | High | Integration with Matlab
PyMVPA | Neuroimagers | Python | Few | Yes | Yes | High | Integration with Python
Weka | Generic ML | Java | Many | No | Yes | High | GUI to try many techniques
Shogun | Generic ML | C++ | Many | No | Yes | High | Efficient
Model selection
Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer, Berlin: Springer Series in Statistics.
Datasets
• 7 fMRI datasets (intra-subject and inter-subject)
• OASIS: gender discrimination from VBM maps
• Simulations
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
