Introduction

Cross-validation is a critical component in evaluating the performance and generalizability of MVPA models. The rMVPA package provides several cross-validation strategies specifically designed for neuroimaging data, where temporal structure and run/block organization must be carefully considered.

This vignette covers:

  • The importance of cross-validation in MVPA
  • Available cross-validation schemes
  • How to implement each scheme
  • Best practices and considerations
  • Working examples

Cross-Validation Fundamentals

Why Cross-Validation Matters in MVPA

In neuroimaging analyses, cross-validation serves multiple purposes:

  1. Performance Estimation: Provides approximately unbiased estimates of out-of-sample performance
  2. Generalizability: Tests how well decoded patterns generalize to new data
  3. Overfitting Detection: Helps detect overfitting to noise in the training data
  4. Temporal Independence: Keeps training and test sets temporally independent
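The last point is worth making concrete: a random split can place trials from the same run on both sides of the train/test boundary, so temporally autocorrelated noise leaks across it, while a blocked split cannot. A minimal base-R sketch of the idea (independent of rMVPA):

```r
# Blocked split: hold out one entire run, train on the rest
block_var <- rep(1:5, each = 20)          # 5 runs, 20 trials each
held_out  <- 2
test_idx  <- which(block_var == held_out)
train_idx <- which(block_var != held_out)

# No run contributes trials to both sides of the split
stopifnot(length(intersect(block_var[train_idx], block_var[test_idx])) == 0)
```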

Available Cross-Validation Schemes

The rMVPA package implements several cross-validation strategies:

  1. Blocked Cross-Validation: Uses scanning runs as natural validation blocks
  2. K-Fold Cross-Validation: Randomly partitions data into k folds
  3. Bootstrap Blocked Cross-Validation: Resamples within blocks with replacement
  4. Sequential Blocked Cross-Validation: Creates sequential folds within blocks
  5. Two-Fold Blocked Cross-Validation: Splits blocks into two groups
  6. Custom Cross-Validation: Allows user-defined validation schemes

Blocked Cross-Validation

Overview

Blocked cross-validation is the most common approach for fMRI data. It respects the temporal structure of the data by using scanning runs as natural validation blocks.

# Create a simple blocked structure: 5 runs with 20 trials each
block_var <- rep(1:5, each = 20)
cval <- blocked_cross_validation(block_var)
print(cval)
## 
##  █▀▀ Blocked Cross-Validation ▀▀█ 
## 
## ├─ Dataset Information 
## │  ├─ Observations:  100 
## │  └─ Number of Folds:  5 
## └─ Block Information 
##    ├─ Total Blocks:  5 
##    ├─ Mean Block Size:  20  (SD:  0 ) 
##    └─ Block Sizes:  1: 20, 2: 20, 3: 20, 4: 20, 5: 20

Implementation Example

# Generate example data
set.seed(123)
dat <- data.frame(
  x1 = rnorm(100),  # 100 trials total
  x2 = rnorm(100),
  x3 = rnorm(100)
)
y <- factor(rep(letters[1:5], length.out = 100))  # 5 conditions

# Generate cross-validation samples
samples <- crossval_samples(cval, dat, y)
print(samples)
## # A tibble: 5 × 5
##   ytrain       ytest        train               test                .id  
##   <named list> <named list> <named list>        <named list>        <chr>
## 1 <fct [80]>   <fct [20]>   <resample [80 x 3]> <resample [20 x 3]> 1    
## 2 <fct [80]>   <fct [20]>   <resample [80 x 3]> <resample [20 x 3]> 2    
## 3 <fct [80]>   <fct [20]>   <resample [80 x 3]> <resample [20 x 3]> 3    
## 4 <fct [80]>   <fct [20]>   <resample [80 x 3]> <resample [20 x 3]> 4    
## 5 <fct [80]>   <fct [20]>   <resample [80 x 3]> <resample [20 x 3]> 5

Understanding the Output

Each row in the samples tibble contains:

  • ytrain: Training labels
  • ytest: Test labels
  • train: Training data subset
  • test: Test data subset
  • .id: Fold identifier
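For intuition, the fold membership behind this output can be reproduced in base R: fold k tests on block k and trains on all remaining blocks. This is a conceptual sketch, not the package's internal code:

```r
# Leave-one-block-out fold membership, built by hand
block_var <- rep(1:5, each = 20)
folds <- lapply(sort(unique(block_var)), function(b) {
  list(train = which(block_var != b),
       test  = which(block_var == b))
})

# Each fold holds out exactly one 20-trial run
sapply(folds, function(f) length(f$test))   # 20 20 20 20 20
```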

Bootstrap Blocked Cross-Validation

Overview

This method combines blocking with bootstrap resampling, providing more stable performance estimates while respecting the run structure.

# Create bootstrap blocked CV with 20 repetitions
boot_cval <- bootstrap_blocked_cross_validation(block_var, nreps = 20)
print(boot_cval)
## 
##  █▀▀ Bootstrap Blocked Cross-Validation ▀▀█ 
## 
## ├─ Configuration 
## │  ├─ Observations:  100 
## │  └─ Bootstrap Repetitions:  20 
## ├─ Block Information 
## │  ├─ Total Blocks:  5 
## │  ├─ Mean Block Size:  20  (SD:  0 ) 
## │  └─ Block Sizes:  1: 20, 2: 20, 3: 20, 4: 20, 5: 20 
## └─ Sampling Weights 
##    └─ Status:  None  (uniform sampling)
# Generate samples
boot_samples <- crossval_samples(boot_cval, dat, y)
print(boot_samples)
## # A tibble: 100 × 5
##    ytrain     ytest      train               test                .id  
##    <list>     <list>     <list>              <list>              <chr>
##  1 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 001  
##  2 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 002  
##  3 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 003  
##  4 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 004  
##  5 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 005  
##  6 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 006  
##  7 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 007  
##  8 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 008  
##  9 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 009  
## 10 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 010  
## # ℹ 90 more rows
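Note that with nreps = 20 and 5 blocks, the scheme yields 20 × 5 = 100 fold rows. Conceptually, each repetition resamples training trials with replacement separately within each training block, which can be sketched in base R (an illustration of the idea, not the package's implementation):

```r
set.seed(1)
block_var <- rep(1:5, each = 20)
held_out  <- 1
train_blocks <- setdiff(unique(block_var), held_out)

# Resample trials with replacement, independently within each training block
boot_train <- unlist(lapply(train_blocks, function(b) {
  idx <- which(block_var == b)
  sample(idx, length(idx), replace = TRUE)
}))

# Resampled indices never cross into the held-out block
stopifnot(all(block_var[boot_train] != held_out))
```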

Optional Weighted Sampling

You can provide weights to influence the sampling probability:

# Create weights (e.g., based on motion parameters)
weights <- runif(length(block_var))
weighted_boot_cval <- bootstrap_blocked_cross_validation(
  block_var, 
  nreps = 20,
  weights = weights
)
print(weighted_boot_cval)
## 
##  █▀▀ Bootstrap Blocked Cross-Validation ▀▀█ 
## 
## ├─ Configuration 
## │  ├─ Observations:  100 
## │  └─ Bootstrap Repetitions:  20 
## ├─ Block Information 
## │  ├─ Total Blocks:  5 
## │  ├─ Mean Block Size:  20  (SD:  0 ) 
## │  └─ Block Sizes:  1: 20, 2: 20, 3: 20, 4: 20, 5: 20 
## └─ Sampling Weights 
##    ├─ Status:  Present 
##    ├─ Range:  [0.001, 0.020] 
##    └─ Non-zero Weights:  100  ( 100.0% )
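To see how weights could steer the resampling, consider this base-R sketch of weighted within-block sampling via the prob argument of sample(). It illustrates the mechanism only; the package's internal sampling code may differ:

```r
set.seed(7)
block_var <- rep(1:5, each = 20)
weights   <- runif(length(block_var))

# Within one block, draw trials with replacement in proportion to their weights
idx  <- which(block_var == 1)
draw <- sample(idx, length(idx), replace = TRUE, prob = weights[idx])

# The draw stays inside the block; higher-weight trials appear more often
stopifnot(all(block_var[draw] == 1))
```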

Sequential Blocked Cross-Validation

Overview

This method creates sequential folds within each block, useful when temporal order matters.

# Create sequential blocked CV with 2 folds and 4 repetitions
seq_cval <- sequential_blocked_cross_validation(
  block_var,
  nfolds = 2,
  nreps = 4
)
print(seq_cval)
## $block_var
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##  [75] 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 
## $nfolds
## [1] 2
## 
## $nreps
## [1] 4
## 
## $block_ind
## [1] 1 2 3 4 5
## 
## attr(,"class")
## [1] "sequential_blocked_cross_validation" "cross_validation"                   
## [3] "list"
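The sequential scheme divides each block into consecutive segments rather than random subsets. A base-R sketch of the idea (hypothetical illustration; see crossval.R for the actual fold construction):

```r
block_var <- rep(1:5, each = 20)
nfolds <- 2

# Assign the first half of each block to fold 1, the second half to fold 2
fold_id <- unlist(lapply(split(seq_along(block_var), block_var), function(idx) {
  rep(1:nfolds, each = ceiling(length(idx) / nfolds))[seq_along(idx)]
}))

table(fold_id, block_var)   # 10 trials per fold in every block
```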

Two-Fold Blocked Cross-Validation

Overview

This approach randomly splits blocks into two groups, useful for rapid performance estimation.

# Create two-fold blocked CV with 10 repetitions
twofold_cval <- twofold_blocked_cross_validation(block_var, nreps = 10)
print(twofold_cval)
## 
##  █▀▀ Two-Fold Blocked Cross-Validation ▀▀█ 
## 
## ├─ Configuration 
## │  ├─ Observations:  100 
## │  ├─ Number of Folds:  2 
## │  └─ Repetitions:  10 
## └─ Block Information 
##    ├─ Total Blocks:  5 
##    ├─ Mean Block Size:  20  (SD:  0 ) 
##    └─ Block Sizes:  1: 20, 2: 20, 3: 20, 4: 20, 5: 20
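Each repetition randomly assigns whole blocks to one of two folds, so a single repetition trains on roughly half the runs and tests on the other half. A base-R sketch of one repetition (illustrative only):

```r
set.seed(42)
blocks <- 1:5

# Randomly send each block to fold 1 or fold 2 (both folds non-empty)
grp <- sample(rep(1:2, length.out = length(blocks)))
split(blocks, grp)
```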

Custom Cross-Validation

Overview

For specialized validation schemes, you can define custom training/testing splits:

# Define custom splits
custom_splits <- list(
  list(train = 1:60, test = 61:100),
  list(train = 1:40, test = 41:100),
  list(train = 1:80, test = 81:100)
)

# Create custom CV
custom_cval <- custom_cross_validation(custom_splits)
print(custom_cval)
## $sample_set
## $sample_set[[1]]
## $sample_set[[1]]$train
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60
## 
## $sample_set[[1]]$test
##  [1]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
## [20]  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98
## [39]  99 100
## 
## 
## $sample_set[[2]]
## $sample_set[[2]]$train
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 
## $sample_set[[2]]$test
##  [1]  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59
## [20]  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
## [39]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97
## [58]  98  99 100
## 
## 
## $sample_set[[3]]
## $sample_set[[3]]$train
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## [76] 76 77 78 79 80
## 
## $sample_set[[3]]$test
##  [1]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
## [20] 100
## 
## 
## 
## $nfolds
## [1] 3
## 
## attr(,"class")
## [1] "custom_cross_validation" "cross_validation"       
## [3] "list"
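When building custom splits it is easy to leak observations between training and test sets, so a sanity check before passing the list to custom_cross_validation is worthwhile. A plain base-R check:

```r
custom_splits <- list(
  list(train = 1:60, test = 61:100),
  list(train = 1:40, test = 41:100),
  list(train = 1:80, test = 81:100)
)

# Verify that no split shares indices between train and test
ok <- vapply(custom_splits, function(s) {
  length(intersect(s$train, s$test)) == 0
}, logical(1))
stopifnot(all(ok))
```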

Practical Example: Model Training

Here’s a complete example using blocked cross-validation with an SDA classifier:

# Setup cross-validation
block_var <- rep(1:5, each = 20)
cval <- blocked_cross_validation(block_var)

# Generate data
set.seed(123)
dat <- data.frame(matrix(rnorm(100 * 10), 100, 10))
y <- factor(rep(letters[1:5], 20))

# Generate CV samples
samples <- crossval_samples(cval, dat, y)

# Train models for each fold (dplyr supplies %>%, rowwise(), and do())
library(dplyr)
model_fits <- samples %>% 
  rowwise() %>% 
  do({
    train_dat <- as.data.frame(.$train)
    y_train <- .$ytrain
    fit <- sda::sda(as.matrix(train_dat), y_train, verbose = FALSE)
    tibble::tibble(fit = list(fit))
  })

print(model_fits)
## # A tibble: 5 × 1
## # Rowwise: 
##   fit   
##   <list>
## 1 <sda> 
## 2 <sda> 
## 3 <sda> 
## 4 <sda> 
## 5 <sda>
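Fitting is only half the story: each fold's fit must then be scored on its held-out data, and per-fold accuracy reduces to comparing predicted against true labels. A toy base-R illustration with hypothetical pred/truth vectors:

```r
# Per-fold accuracy from predicted and true labels (toy vectors)
truth <- factor(c("a", "b", "a", "b", "a"))
pred  <- factor(c("a", "b", "b", "b", "a"))

fold_accuracy <- mean(pred == truth)
fold_accuracy   # 0.8
```

For the sda fits above, one would extract each fold's test data the same way the training data was extracted, generate predictions from the fitted model, and average the per-fold accuracies.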

Best Practices

When choosing a cross-validation strategy, consider:

  1. Data Structure
    • Use blocked CV when data has clear run/session boundaries
    • Consider sequential CV when temporal order matters
    • Use bootstrap blocked CV for more stable estimates
  2. Sample Size
    • For small datasets, bootstrap blocked CV can help
    • For large datasets, simple blocked CV may suffice
  3. Temporal Dependencies
    • Always respect the temporal structure of fMRI data
    • Avoid mixing training/testing data from the same run
    • Consider temporal autocorrelation
  4. Computation Time
    • Bootstrap and sequential methods require more computation
    • Two-fold CV offers quick preliminary results
    • Balance estimation stability with computational cost

Summary

The rMVPA package provides a comprehensive suite of cross-validation strategies for neuroimaging data. Key points:

  • Choose the appropriate CV strategy based on your data structure
  • Consider temporal dependencies in fMRI data
  • Use bootstrap methods for more stable estimates when needed
  • Implement custom schemes for specialized needs

For implementation details, refer to the source code in crossval.R.

Integration with Regional and Searchlight Analyses

Cross-validation strategies in rMVPA are designed to work seamlessly with both regional and searchlight analyses. Here we’ll demonstrate how to incorporate different cross-validation schemes into these analyses.

Regional Analysis Example

Let’s perform a regional MVPA analysis using different cross-validation strategies:

# Generate a sample dataset
data_out <- gen_sample_dataset(D = c(6,6,6), nobs = 80, blocks = 4, nlevels = 2)

# Create a region mask
mask <- data_out$dataset$mask
nvox <- sum(mask)
region_mask <- neuroim2::NeuroVol(
  sample(1:3, size = nvox, replace = TRUE), 
  space(mask), 
  indices = which(mask > 0)
)

# Create MVPA dataset
dset <- mvpa_dataset(data_out$dataset$train_data, mask = data_out$dataset$mask)

# Load the classification model
mod <- load_model("sda_notune")
tune_grid <- data.frame(lambda = 0.01, diagonal = FALSE)

# Example 1: Using Blocked Cross-Validation
blocked_cv <- blocked_cross_validation(data_out$design$block_var)
mvpa_mod_blocked <- mvpa_model(
  mod, 
  dataset = dset, 
  design = data_out$design,
  crossval = blocked_cv,
  tune_grid = tune_grid
)

# Run regional analysis with blocked CV
results_blocked <- run_regional(mvpa_mod_blocked, region_mask)
## INFO [2025-02-13 23:41:14] 
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Example 2: Using Bootstrap Blocked Cross-Validation
bootstrap_cv <- bootstrap_blocked_cross_validation(
  data_out$design$block_var,
  nreps = 10
)
mvpa_mod_boot <- mvpa_model(
  mod,
  dataset = dset,
  design = data_out$design,
  crossval = bootstrap_cv,
  tune_grid = tune_grid
)

# Run regional analysis with bootstrap CV
results_boot <- run_regional(mvpa_mod_boot, region_mask)
## INFO [2025-02-13 23:41:16] 
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Compare performance between CV strategies
cat("Blocked CV Performance:\n")
## Blocked CV Performance:
print(results_blocked$performance_table)
## # A tibble: 3 × 3
##   roinum Accuracy    AUC
##    <int>    <dbl>  <dbl>
## 1      1    0.6   0.0975
## 2      2    0.55  0.0931
## 3      3    0.575 0.114
cat("\nBootstrap CV Performance:\n")
## 
## Bootstrap CV Performance:
print(results_boot$performance_table)
## # A tibble: 3 × 3
##   roinum Accuracy    AUC
##    <int>    <dbl>  <dbl>
## 1      1    0.575 0.0569
## 2      2    0.512 0.0750
## 3      3    0.588 0.09

Searchlight Analysis Example

We can also use different cross-validation strategies in searchlight analysis:

# Example 3: Searchlight with Sequential Blocked Cross-Validation
seq_cv <- sequential_blocked_cross_validation(
  data_out$design$block_var,
  nfolds = 2,
  nreps = 4
)

mvpa_mod_seq <- mvpa_model(
  mod,
  dataset = dset,
  design = data_out$design,
  crossval = seq_cv,
  tune_grid = tune_grid
)

# Run searchlight analysis
results_searchlight <- run_searchlight(
  mvpa_mod_seq,
  radius = 2,
  method = "standard"
)

Key Considerations

When integrating cross-validation with regional or searchlight analyses:

  1. Memory Usage
    • Bootstrap and sequential methods generate more folds
    • Consider reducing nreps for large searchlight analyses
    • Monitor memory usage with large datasets
  2. Computation Time
    • Searchlight analysis multiplies computation across voxels
    • Two-fold CV might be preferable for initial searchlight exploration
    • Use parallel processing when available
  3. Result Stability
    • Bootstrap methods provide confidence intervals but require more computation
    • Consider the trade-off between estimation stability and computational cost
    • For searchlight analyses, simpler CV schemes might be sufficient
  4. Performance Metrics
    • Different CV schemes might produce slightly different performance estimates
    • Bootstrap methods generally provide more conservative estimates
    • Consider averaging results across repetitions for bootstrap and sequential CV

Example: Comparing CV Strategies in Regional Analysis

Here’s a more detailed comparison of different CV strategies:

# Create different CV schemes
cv_schemes <- list(
  blocked = blocked_cross_validation(data_out$design$block_var),
  bootstrap = bootstrap_blocked_cross_validation(data_out$design$block_var, nreps = 10),
  twofold = twofold_blocked_cross_validation(data_out$design$block_var, nreps = 5)
)

# Run regional analysis with each CV scheme
results <- lapply(cv_schemes, function(cv) {
  mvpa_mod <- mvpa_model(
    mod,
    dataset = dset,
    design = data_out$design,
    crossval = cv,
    tune_grid = tune_grid
  )
  run_regional(mvpa_mod, region_mask)
})
## INFO [2025-02-13 23:41:16] 
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
## INFO [2025-02-13 23:41:18] 
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
## INFO [2025-02-13 23:41:19] 
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Compare performance across CV schemes
performance_comparison <- lapply(names(results), function(name) {
  perf <- results[[name]]$performance_table
  perf$cv_scheme <- name
  perf
})

# Combine results
all_performance <- do.call(rbind, performance_comparison)
print(all_performance)
## # A tibble: 9 × 4
##   roinum Accuracy    AUC cv_scheme
##    <int>    <dbl>  <dbl> <chr>    
## 1      1    0.6   0.0975 blocked  
## 2      2    0.55  0.0931 blocked  
## 3      3    0.575 0.114  blocked  
## 4      1    0.6   0.0675 bootstrap
## 5      2    0.55  0.0663 bootstrap
## 6      3    0.6   0.0700 bootstrap
## 7      1    0.588 0.114  twofold  
## 8      2    0.538 0.0813 twofold  
## 9      3    0.525 0.0806 twofold
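With the combined table in this long format, average performance per scheme is a one-liner with aggregate(). A sketch on a toy data frame mirroring the accuracy values printed above:

```r
# Toy performance table with the same columns as all_performance
perf <- data.frame(
  roinum    = rep(1:3, times = 3),
  Accuracy  = c(0.600, 0.550, 0.575, 0.600, 0.550, 0.600, 0.588, 0.538, 0.525),
  cv_scheme = rep(c("blocked", "bootstrap", "twofold"), each = 3)
)

# Mean accuracy per cross-validation scheme
aggregate(Accuracy ~ cv_scheme, data = perf, FUN = mean)
```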

As these examples show, each cross-validation strategy plugs directly into the broader MVPA analysis framework, letting researchers match the validation scheme to the structure of their data and analysis.

For more details on regional and searchlight analyses, refer to their respective vignettes and the implementation in regional.R and searchlight.R.