Cross-Validation Schemes
Bradley Buchsbaum
2024-04-23
CrossValidation.Rmd
Cross-Validation Approaches
Cross-validation is used in rMVPA to evaluate the performance of a trained classifier. In general this is achieved by splitting the data into training and test sets, fitting a model on the training data, and then evaluating its performance on the test set. The methods described below are for cases where one does not have a predefined “test set”, but rather wants to examine test-set performance by repeatedly analyzing the training data itself.
Blocked Cross-Validation
In fMRI analyses, images are generally acquired over a number of scans or “runs”, which form natural breaks in the data. Due to temporal autocorrelation in the data, it is generally not a good idea to train and test a classifier on trials collected in the same run. Therefore, when dividing the data into blocks for cross-validation, it is natural to use scanning “run” as the means of splitting the data into training and test folds.
There is a special data structure to help set this up called `blocked_cross_validation`. All we need is a variable that indicates the block index of each trial in the experiment. For example, imagine we have five scans/blocks, each with 100 images.
block_var <- rep(1:5, each=100)
cval <- blocked_cross_validation(block_var)
print(cval)
## cross-validation: blocked
## nobservations: 500
## nfolds: 5
## block sizes: 100 100 100 100 100
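Note that the blocks need not, in principle, all be the same length; the fold structure is derived from whatever block indices are supplied (the printed “block sizes” are simply tabulated from the variable). A sketch with hypothetical unequal run lengths (output not shown):

```r
# Three runs of different lengths; each run still forms one fold
block_var <- rep(1:3, times=c(120, 100, 140))
cval <- blocked_cross_validation(block_var)
print(cval)
```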
Now, to generate cross-validation samples, we use the `crossval_samples` generic function. We need to give it the data for the independent variables, `dat`, and a response variable, `y`. The result will be a data.frame (or `tibble`, to be precise) that contains in each row the samples necessary to conduct a complete leave-one-block-out cross-validation analysis.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 5 × 5
## ytrain ytest train test .id
## <named list> <named list> <named list> <named list> <chr>
## 1 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 1
## 2 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 2
## 3 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 3
## 4 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 4
## 5 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 5
Notice that the data.frame contains five variables: `ytrain`, `ytest`, `train`, `test`, and `.id`, which contain the training responses, the test responses, the training data, the test data, and an id variable, respectively. The first four variables are `list` columns because they contain vector- or matrix-valued elements in each cell. Indeed, the `train` and `test` variables are S3 objects of class `resample` from the `modelr` package. For example, to access the training data for the first cross-validation fold, we can do the following:
train_dat <- as.data.frame(sam$train[[1]])
print(train_dat[1:5,])
## x1 x2 x3
## 101 -0.3872136 0.2360958 -1.7055817
## 102 -0.7854327 0.6289534 -0.8554131
## 103 -1.0567369 0.4179257 -0.1449016
## 104 -0.7955414 1.9767585 -0.3244470
## 105 -1.7562754 -0.5062863 -0.1725649
As a toy example, we can loop through the cross-validation sets using the dplyr `rowwise` function, fit an `sda` model for each fold, and put the fitted models into a new data.frame.
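A minimal sketch of such a loop, written here with a plain `lapply` for clarity (the dplyr `rowwise` version is analogous), and assuming the `sda` package is installed:

```r
library(tibble)

# Fit a shrinkage discriminant analysis (sda) model on each fold's
# training data and collect the fitted models in a new data frame.
fits <- lapply(seq_len(nrow(sam)), function(i) {
  xtrain <- as.matrix(as.data.frame(sam$train[[i]]))  # realize the resample
  ytrain <- factor(sam$ytrain[[i]])
  sda::sda(xtrain, ytrain, verbose=FALSE)
})

fit_df <- tibble(.id = sam$.id, fit = fits)
```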
Bootstrap Blocked Cross-Validation
The blocked cross-validation above iterates through the full dataset once, with each block held out as the test set for one cross-validation fold. In some cases we might want a cross-validation scheme that resamples the full dataset more extensively, while still respecting the block sampling structure of fMRI data. In this case we can use the `bootstrap_blocked_cross_validation` scheme. Suppose, as above, we have an integer-valued block variable. Now we create a `bootstrap_blocked_cross_validation` with 20 bootstrap replications:
block_var <- rep(1:5, each=100)
cval <- bootstrap_blocked_cross_validation(block_var, nreps=20)
print(cval)
## cross-validation: bootstrap blocked
## n observations: 500
## n bootstrap reps: 20
## block sizes: 100 100 100 100 100
Now we create a new set of resamples. This time, instead of 5 resamples (one for each block), we have 100 resamples (20 for each block). Each block is used as a test set 20 times, and for each of those 20 resamples the training data is sampled with replacement from the remaining runs.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 100 × 5
## ytrain ytest train test .id
## <list> <list> <list> <list> <chr>
## 1 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 001
## 2 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 002
## 3 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 003
## 4 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 004
## 5 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 005
## 6 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 006
## 7 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 007
## 8 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 008
## 9 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 009
## 10 <chr [400]> <chr [100]> <resample [400 x 3]> <resample [100 x 3]> 010
## # ℹ 90 more rows
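As an illustrative sketch (not part of the package API), per-resample test accuracy could then be computed and averaged over the bootstrap replicates, again assuming the `sda` package:

```r
# For each bootstrap resample: fit on the training rows, predict the
# held-out block, and record the proportion of correct predictions.
accs <- sapply(seq_len(nrow(sam)), function(i) {
  xtrain <- as.matrix(as.data.frame(sam$train[[i]]))
  xtest  <- as.matrix(as.data.frame(sam$test[[i]]))
  fit    <- sda::sda(xtrain, factor(sam$ytrain[[i]]), verbose=FALSE)
  pred   <- predict(fit, xtest, verbose=FALSE)$class
  mean(pred == sam$ytest[[i]])
})

mean(accs)  # average test accuracy over the 100 bootstrap resamples
```

With the random predictors simulated above, this average should hover around chance (0.2 for five classes).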
Repeated Split-Half Resampling over Blocks
Another approach for repeatedly sampling the data while respecting block structure is encapsulated in the `twofold_blocked_cross_validation` resampling scheme. Here every training resample is drawn from a random half of the blocks, and the corresponding test set is formed from the other half. This is done `nreps` times, yielding a set of split-half (or “two-fold”) resamples. Note that this approach requires more than two blocks, since with 2 blocks it would always split the data in the identical way; i.e., it does not subsample or bootstrap trials within blocks.
block_var <- rep(1:5, each=100)
cval <- twofold_blocked_cross_validation(block_var, nreps=20)
print(cval)
## twofold cross-validation: blocked
## nobservations: 500
## nreps: 20
## block sizes: 100 100 100 100 100
Again, we resample using `crossval_samples` and supply a data set and a response variable.
dat <- data.frame(x1=rnorm(500), x2=rnorm(500), x3=rnorm(500))
y <- rep(letters[1:5], length.out=500)
sam <- crossval_samples(cval, dat, y)
sam
## # A tibble: 10 × 5
## ytrain ytest train test .id
## <list> <list> <list> <list> <chr>
## 1 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 01
## 2 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 02
## 3 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 03
## 4 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 04
## 5 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 05
## 6 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 06
## 7 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 07
## 8 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 08
## 9 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 09
## 10 <chr [300]> <chr [200]> <resample [300 x 3]> <resample [200 x 3]> 10