Cross-Validation Schemes
Bradley Buchsbaum
2024-04-23
CrossValidation.Rmd
Cross-Validation Approaches
Cross-validation is used in rMVPA to evaluate the performance of a trained classifier. In general this is achieved by splitting the data into training and test sets, fitting a model on the training data, and then evaluating its performance on the test set. The methods described below are for cases where one does not have a predefined “test set”, but rather wants to examine test-set performance by repeatedly analyzing the training data itself.
Blocked Cross-Validation
In fMRI analyses, images are generally acquired over a number of scans or “runs”, which form natural breaks in the data. Due to temporal autocorrelation in the data, it is generally not a good idea to train and test a classifier on trials collected in the same run. Therefore, when dividing the data into blocks for cross-validation, it is natural to use scanning “run” as the means of splitting the data into training and test folds.
There is a special data structure to help set this up called `blocked_cross_validation`. All we need is a variable that indicates the block index of each trial in the experiment. For example, imagine we have five scans/blocks, each with 100 images.
block_var <- rep(1:5, each=100)
cval <- blocked_cross_validation(block_var)
print(cval)
## cross-validation: blocked
## nobservations: 500
## nfolds: 5
## block sizes: 100 100 100 100 100
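Note that the blocks need not, in principle, all be the same length; the fold structure is derived from whatever block indices are supplied (the printed “block sizes” are simply tabulated from the variable). A sketch with hypothetical unequal run lengths (output not shown):

```r
# Three runs of different lengths; each run still forms one fold
block_var <- rep(1:3, times=c(120, 100, 140))
cval <- blocked_cross_validation(block_var)
print(cval)
```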
Now, to generate cross-validation samples, we use the `crossval_samples` generic function. We need to give it the data for the independent variables, `dat`, and a response variable, `y`. The result will be a data.frame (or `tibble`, to be precise) that contains in each row the samples necessary to conduct a complete leave-one-block-out cross-validation analysis.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 5 × 5
## ytrain ytest train test .id
## <named list> <named list> <named list> <named list> <chr>
## 1 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 1
## 2 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 2
## 3 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 3
## 4 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 4
## 5 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 5
Notice that the data.frame contains five variables: `ytrain`, `ytest`, `train`, `test`, and `.id`, which contain the training responses, the test responses, the training data, the test data, and an id variable, respectively. The first four variables are `list` columns because they contain vector- or matrix-valued elements in each cell. Indeed, the `train` and `test` variables are S3 objects of class `resample` from the `modelr` package. For example, to access the training data for the first cross-validation fold, we can do the following:
train_dat <- as.data.frame(sam$train[[1]])
print(train_dat[1:5,])
## x1 x2 x3
## 101 -0.3872136 0.2360958 -1.7055817
## 102 -0.7854327 0.6289534 -0.8554131
## 103 -1.0567369 0.4179257 -0.1449016
## 104 -0.7955414 1.9767585 -0.3244470
## 105 -1.7562754 -0.5062863 -0.1725649
As a toy example, we can loop through the cross-validation sets using the dplyr `rowwise` function, fit an `sda` model for each fold, and put the fitted models into a new data.frame.
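A minimal sketch of such a loop, written here with a plain `lapply` for clarity (the dplyr `rowwise` version is analogous), and assuming the `sda` package is installed:

```r
library(tibble)

# Fit a shrinkage discriminant analysis (sda) model on each fold's
# training data and collect the fitted models in a new data frame.
fits <- lapply(seq_len(nrow(sam)), function(i) {
  xtrain <- as.matrix(as.data.frame(sam$train[[i]]))  # realize the resample
  ytrain <- factor(sam$ytrain[[i]])
  sda::sda(xtrain, ytrain, verbose=FALSE)
})

fit_df <- tibble(.id = sam$.id, fit = fits)
```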
Bootstrap Blocked Cross-Validation
The blocked cross-validation above iterates through the full dataset once, with each block held out as the test set for one cross-validation fold. In some cases we might want a cross-validation scheme that resamples the full dataset more extensively, while still respecting the block sampling structure of fMRI data. In this case we can use the `bootstrap_blocked_cross_validation` scheme. Suppose, as above, we have an integer-valued block variable. Now we create a `bootstrap_blocked_cross_validation` with 20 bootstrap replications:
block_var <- rep(1:5, each=100)
cval <- bootstrap_blocked_cross_validation(block_var, nreps=20)
print(cval)
## cross-validation: bootstrap blocked
## n observations: 500
## n bootstrap reps: 20
## block sizes: 100 100 100 100 100
Now we create a new set of resamples. This time, instead of 5 resamples (one for each block), we have 100 resamples (20 for each block). Each block is used as a test set 20 times, and for each of those 20 resamples the training data is sampled with replacement from the remaining runs.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 100 × 5
## ytrain ytest train test .id
## <list> <list> <list> <list> <chr>
## 1 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 001
## 2 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 002
## 3 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 003
## 4 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 004
## 5 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 005
## 6 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 006
## 7 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 007
## 8 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 008
## 9 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 009
## 10 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 010
## # ℹ 90 more rows
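As an illustrative sketch (not part of the package API), per-resample test accuracy could then be computed and averaged over the bootstrap replicates, again assuming the `sda` package:

```r
# For each bootstrap resample: fit on the training rows, predict the
# held-out block, and record the proportion of correct predictions.
accs <- sapply(seq_len(nrow(sam)), function(i) {
  xtrain <- as.matrix(as.data.frame(sam$train[[i]]))
  xtest  <- as.matrix(as.data.frame(sam$test[[i]]))
  fit    <- sda::sda(xtrain, factor(sam$ytrain[[i]]), verbose=FALSE)
  pred   <- predict(fit, xtest, verbose=FALSE)$class
  mean(pred == sam$ytest[[i]])
})

mean(accs)  # average test accuracy over the 100 bootstrap resamples
```

With the random predictors simulated above, this average should hover around chance (0.2 for five classes).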
Repeated Split-Half Resampling over Blocks
Another approach for repeatedly sampling the data while respecting block structure is encapsulated in the `twofold_blocked_cross_validation` resampling scheme. Here every training resample is drawn from a random half of the blocks, and the corresponding test set is formed from the other half. This is done `nreps` times, yielding a set of split-half (or “two-fold”) resamples. Note that this approach requires more than two blocks, since with 2 blocks it would always split the data in the identical way; i.e., it does not subsample or bootstrap trials within blocks.
block_var <- rep(1:5, each=100)
cval <- twofold_blocked_cross_validation(block_var, nreps=20)
print(cval)
## twofold cross-validation: blocked
## nobservations: 500
## nreps: 20
## block sizes: 100 100 100 100 100
Again, we resample using `crossval_samples` and supply a data set and a response variable.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 10 × 5
## ytrain ytest train test .id
## <list> <list> <list> <list> <chr>
## 1 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 01
## 2 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 02
## 3 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 03
## 4 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 04
## 5 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 05
## 6 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 06
## 7 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 07
## 8 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 08
## 9 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 09
## 10 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 10