Cross-validation: what, how and which?
Pradeep Reddy Raamana
raamana.com

“Statistics [from cross-validation] are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Goals for Today
• What is cross-validation? (training set vs. test set)
• How to perform it?
• What are the effects of different CV choices? (negative bias, unbiased, positive bias)
What is generalizability?
• Available: data (a sample*). Desired: accuracy on unseen data (the population*).
• Generalizability is estimated through out-of-sample predictions, which help avoid overfitting.
*sample and population have statistical definitions.
CV helps quantify generalizability.
Why cross-validate?
• A bigger training set gives better learning; a bigger test set gives better testing.
• Key: the training and test sets must be disjoint, and the dataset (sample size) is fixed.
• They grow at the expense of each other, so cross-validate to maximize both.
Use cases
• “When setting aside data for parameter estimation and validation of results cannot be afforded, cross-validation (CV) is typically used.”
• Use cases:
  • to estimate generalizability (test accuracy)
  • to pick optimal parameters (model selection)
  • to compare performance (model comparison)
[Figure: accuracy distribution (%) from repetitions of CV.]
Key Aspects of CV
1. How you split the dataset into train/test
  • Maximal independence between the training and test sets is desired.
  • The split could be over samples (rows; e.g. individual diagnosis, healthy vs. disease) or over time (columns; e.g. task prediction in fMRI).
2. How often you repeat the randomized splits
  • To expose the classifier to the full variability of the data.
  • As many times as you can, e.g. 100.
Both kinds of split are sketched below.
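A minimal sketch of these two kinds of splits, assuming scikit-learn and toy arrays X, y plus a per-sample block label (all illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

# toy data: 40 samples x 5 features, binary labels (e.g. healthy vs. disease)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)
groups = np.repeat(np.arange(8), 5)  # block (session/subject) label per sample

# split over samples (rows), preserving class proportions in every fold
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# split over blocks: all samples of one session/subject stay on the same side
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```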
Validation set
• Training set: measures goodness of fit of the model; biased* towards the training set.
• Test set: used to optimize parameters; biased towards the test set.
• Validation set: used to evaluate generalization; independent of both the training and test sets.
• Within the whole dataset, the train/test splits form the inner loop and the held-out validation set forms the outer loop (see the nested-CV sketch below).
*biased towards X, i.e. overfit to X.
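One common way to realize this inner/outer structure is nested cross-validation. A hedged sketch with scikit-learn follows; the SVC classifier, the parameter grid and the toy data are illustrative assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# inner loop: tune hyperparameters via train/test splits inside the training portion
inner = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)

# outer loop: each outer fold acts as a validation set the tuning never touched
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```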
Terminology
Training set
  • Purpose (do’s): train the model to learn its core parameters.
  • Don’ts (invalid use): don’t report the training error as the test error!
  • Alternative name: training set (no confusion).
Testing set
  • Purpose (do’s): optimize hyperparameters.
  • Don’ts (invalid use): don’t do feature selection or anything supervised on the test set to learn or optimize!
  • Alternative names: validation set (or tweaking, tuning, optimization set).
Validation set
  • Purpose (do’s): evaluate the fully-optimized classifier to report performance.
  • Don’ts (invalid use): don’t use it in any way to train the classifier or optimize parameters.
  • Alternative name: test set (more accurately, reporting set).
A minimal three-way split is sketched below.
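For intuition, a minimal sketch of carving one dataset into these three splits with scikit-learn; the 60/20/20 proportions and the toy data are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# hold out 20% as the validation (reporting) set, untouched until the very end
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# split the remainder into training (learn core parameters)
# and testing (tune hyperparameters); 0.25 of 80% = 20% of the whole dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
```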
K-fold CV
• The data is split into k folds; in each trial, one fold (e.g. the 4th) serves as the test set and the remaining folds are used for training.
• Test sets in different trials are indeed mutually disjoint.
• Note: different folds won’t be contiguous.
A minimal sketch follows.
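A minimal k-fold sketch in scikit-learn, assuming toy data and a logistic-regression classifier (both are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)

accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    # each trial tests on a fold that is disjoint from all other test folds
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(accuracies))
```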
Repeated Holdout CV
• In each of n trials, set aside an independent subsample of the whole dataset (e.g. 30%) for testing and train on the rest.
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended (see the sketch below).
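A hedged sketch of repeated holdout using scikit-learn's ShuffleSplit; the 30% test size follows the slide, while the classifier and the toy data are assumptions:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)

# n trials, each setting aside an independent 30% subsample for testing;
# test sets from different trials may overlap, hence the large n
holdout = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = [LogisticRegression().fit(X[tr], y[tr]).score(X[te], y[te])
          for tr, te in holdout.split(X)]
print(np.mean(scores), np.std(scores))
```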
CV has many variations!
• k-fold, k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling)
  • train % = 50, 63.2, 75, 80, 90
• stratified
  • across train/test
  • across classes
  (Illustration: Controls vs. MCIc, with each class split into its own training and test portions.)
• inverted: very small training set, large test set
• leave one [unit] out, where the unit can be a sample / pair / tuple / condition / task / block
Sketches of stratified and leave-one-group-out splits follow.
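Two of these variants sketched with scikit-learn; the class labels, group structure and split counts are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 6))
y = np.repeat([0, 1], 24)             # e.g. controls vs. MCIc
groups = np.repeat(np.arange(12), 4)  # e.g. one subject/block label per sample

# stratified repeated hold-out: class proportions preserved in train and test
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    pass  # fit / evaluate here

# leave one [unit] out, where the unit is a subject / block / condition
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    pass  # the test set holds exactly one group in each trial
```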
Measuring bias in CV measurements
• Split the whole dataset into an inner-CV portion (training and test sets) and a held-out validation set.
• Compare the cross-validation accuracy (from the inner CV) against the validation accuracy: ideally they are ≈ equal.
• The direction of the mismatch indicates positive bias, no bias, or negative bias; a minimal sketch of the comparison follows.
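A minimal sketch of this comparison, assuming scikit-learn, a toy dataset, and an SVC classifier (all illustrative; the slides do not specify an implementation):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

# carve off a validation set that plays no role in the inner CV
X_in, X_val, y_in, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(C=1.0)
cv_accuracy = cross_val_score(clf, X_in, y_in, cv=10).mean()  # inner-CV estimate
val_accuracy = clf.fit(X_in, y_in).score(X_val, y_val)        # held-out estimate

# positive difference -> the CV estimate is optimistically (positively) biased
print(cv_accuracy - val_accuracy)
```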
fMRI datasets
Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
Haxby   | Intra | 209 | 12 sessions | various
Duncan  | Inter | 196 | 49 subjects | various
Wager   | Inter | 390 | 34 subjects | various
Cohen   | Inter |  80 | 24 subjects | various
Moran   | Inter | 138 | 36 subjects | various
Henson  | Inter | 286 | 16 subjects | various
Knops   | Inter |  14 | 19 subjects | various
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Repeated holdout (10 trials, 20% test)
[Figure: classifier accuracy via cross-validation plotted against classifier accuracy on the validation set; points are annotated as unbiased, negatively biased, or positively biased.]
CV vs. Validation: real data
[Figure: CV accuracy vs. validation accuracy on real data; annotations: negative bias, unbiased, positive bias, conservative.]
Simulations: known ground truth
CV vs. Validation
[Figure: CV accuracy vs. validation accuracy; annotations: negative bias, unbiased, positive bias.]
Commensurability across folds
• It’s not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures are commensurate across folds.
  • e.g. decision scores from an SVM: the reference hyperplane (and hence its zero) differs from fold to fold,
  • hence they cannot be pooled across folds to construct a single ROC!
• Instead, make an ROC per fold, compute the AUC per fold (AUC1, AUC2, …, AUCn), and then average the AUC across folds, as sketched below.
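A hedged sketch of the per-fold recipe with scikit-learn; the SVM, the fold count and the toy data are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.integers(0, 2, size=120)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])  # comparable only within this fold
    fold_aucs.append(roc_auc_score(y[test_idx], scores))  # ROC/AUC per fold

# average the per-fold AUCs instead of pooling decision scores into one ROC
print(np.mean(fold_aucs))
```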
Performance Metrics
• Accuracy / error rate
  • Commensurate across folds? Yes.
  • Advantages: universally applicable; handles multi-class problems.
  • Disadvantages: sensitive to class- and cost-imbalance.
• Area under the ROC (AUC)
  • Commensurate across folds? Only when the ROC is computed within each fold.
  • Advantages: averages over all ratios of misclassification costs.
  • Disadvantages: not easily extendable to multi-class problems.
• F1 score
  • Commensurate across folds? Yes.
  • Advantages: standard in information retrieval.
  • Disadvantages: does not take true negatives into account.
Subtle Sources of Bias in CV
• k-hacking: trying many k’s in k-fold CV (or different training %) and reporting only the best.
  • How to avoid it: pick k=10, repeat it many times (n>200 or as many as possible) and report the full distribution (not box plots).
• metric-hacking (m-hacking): trying different performance metrics (accuracy, AUC, F1, error rate) and reporting the best.
  • How to avoid it: choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification.
• ROI-hacking (r-hacking): assessing many ROIs (or their features, or combinations), but reporting only the best.
  • How to avoid it: adopt a whole-brain data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy.
• feature- or dataset-hacking (d-hacking): trying subsets of feature[s] or subsamples of dataset[s], but reporting only the best.
  • How to avoid it: use and report on everything: all analyses on all datasets, try inter-dataset CV, run non-parametric statistical comparisons!
*The exact incidence of these hacking approaches (the “sexy names” are made up) is unknown, but non-zero.
Overfitting
[Figure: fits to the same data, labelled Underfit, Overfit, and Good fit.]
50 shades of overfitting
[Figure: population extrapolated over decades from an overfit model (“human annihilation?”). © MathWorks]
See also the Google Flu Trends episode:
Reference: David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science, 14 March, 343: 1203-1205.
“Clever forms of overfitting”
From http://hunch.net/?p=22
Is overfitting always bad?
• Some credit card fraud detection systems use it successfully.
• Others?
Limitations of CV
• The number of CV repetitions needed increases with:
  • sample size: a larger sample means more repetitions, which is costly especially if model training is computationally expensive;
  • the number of model parameters (exponentially), in order to choose the best combination.
Recommendations
• Ensure the test set is truly independent of the training set!
  • It is easy to commit mistakes in complicated analyses.
• Use repeated holdout (10-50% for testing),
  • respecting the sample/dependency structure,
  • ensuring independence between the train and test sets.
• Use the biggest test set and a large number of repetitions when possible.
  • This is not possible with leave-one-sample-out.
Conclusions
• Results could vary considerably with a different CV scheme.
• CV results can have high variance (>10%).
• Document the CV scheme in detail:
  • type of split
  • number of repetitions
  • the full distribution of estimates
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples of reporting: just the mean (𝜇%), or mean ± std. dev. (𝜇±𝜎%).
• Good example: “Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC.” A minimal sketch of such reporting follows.
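A minimal sketch of producing such a distribution with scikit-learn; the classifier and toy data are assumptions, and the 250 repeats mirror the example wording (reduce n_repeats for a quick run):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 250 iterations of 10-fold CV -> 2500 estimates; report the full distribution
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=250, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring='roc_auc')

print(len(scores))
print(np.percentile(scores, [5, 25, 50, 75, 95]))  # summarize the whole spread
```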
References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.
neuropredict: github.com/raamana/neuropredict
Acknowledgements
Gael Varoquaux

Now, it’s time to cross-validate! (xkcd)
Useful tools
Software/toolbox | Target audience | Language | # of ML techniques | Neuroimaging-oriented? | Coding required? | Effort needed | Use case
scikit-learn | Generic ML | Python | Many | No | Yes | High | To try many ML techniques
neuropredict | Neuroimagers | Python | 1 (more soon) | Yes | No | Easy | Quick evaluation of predictive performance!
nilearn | Neuroimagers | Python | Few | Yes | Yes | Medium | Some image processing is required
PRoNTo | Neuroimagers | Matlab | Few | Yes | Yes | High | Integration with Matlab
PyMVPA | Neuroimagers | Python | Few | Yes | Yes | High | Integration with Python
Weka | Generic ML | Java | Many | No | Yes | High | GUI to try many techniques
Shogun | Generic ML | C++ | Many | No | Yes | High | Efficient
Model selection
Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer, Berlin: Springer Series in Statistics.
Datasets
• 7 fMRI datasets (intra-subject and inter-subject)
• OASIS: gender discrimination from VBM maps
• Simulations
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
