Cross-Validation Strategies in rMVPA
Bradley Buchsbaum
2025-02-13
CrossValidation.Rmd
Introduction
Cross-validation is a critical component in evaluating the performance and generalizability of MVPA models. The rMVPA package provides several cross-validation strategies specifically designed for neuroimaging data, where temporal structure and run/block organization must be carefully considered.
This vignette covers:
- The importance of cross-validation in MVPA
- Available cross-validation schemes
- How to implement each scheme
- Best practices and considerations
- Working examples
Cross-Validation Fundamentals
Why Cross-Validation Matters in MVPA
In neuroimaging analyses, cross-validation serves multiple purposes:
- Performance Estimation: Provides approximately unbiased estimates of out-of-sample performance
- Generalizability: Tests how well decoded patterns generalize to new data
- Overfitting Detection: Reveals when a model has fit noise rather than reproducible signal
- Temporal Independence: Run-based schemes keep training and test sets temporally independent
Available Cross-Validation Schemes
The rMVPA package implements several cross-validation strategies:
- Blocked Cross-Validation: Uses scanning runs as natural validation blocks
- K-Fold Cross-Validation: Randomly partitions data into k folds
- Bootstrap Blocked Cross-Validation: Resamples within blocks with replacement
- Sequential Blocked Cross-Validation: Creates sequential folds within blocks
- Two-Fold Blocked Cross-Validation: Splits blocks into two groups
- Custom Cross-Validation: Allows user-defined validation schemes
Blocked Cross-Validation
Overview
Blocked cross-validation is the most common approach for fMRI data. It respects the temporal structure of the data by using scanning runs as natural validation blocks.
# Create a simple blocked structure: 5 runs with 20 trials each
block_var <- rep(1:5, each = 20)
cval <- blocked_cross_validation(block_var)
print(cval)
##
## █▀▀ Blocked Cross-Validation ▀▀█
##
## ├─ Dataset Information
## │ ├─ Observations: 100
## │ └─ Number of Folds: 5
## └─ Block Information
## ├─ Total Blocks: 5
## ├─ Mean Block Size: 20 (SD: 0 )
## └─ Block Sizes: 1: 20, 2: 20, 3: 20, 4: 20, 5: 20
Implementation Example
# Generate example data
set.seed(123)
dat <- data.frame(
  x1 = rnorm(100),  # 100 trials total
  x2 = rnorm(100),
  x3 = rnorm(100)
)
y <- factor(rep(letters[1:5], length.out = 100)) # 5 conditions
# Generate cross-validation samples
samples <- crossval_samples(cval, dat, y)
print(samples)
## # A tibble: 5 × 5
## ytrain ytest train test .id
## <named list> <named list> <named list> <named list> <chr>
## 1 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 1
## 2 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 2
## 3 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 3
## 4 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 4
## 5 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 5
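Each train and test entry above is a lazy resample object rather than a materialized data frame; call as.data.frame() on it when you need the actual rows (as the model-training example later in this vignette does). A minimal sketch using the first fold:
# Materialize the first fold (assumes `samples` from above)
fold1_train <- as.data.frame(samples$train[[1]])  # 80 x 3 data frame
fold1_ytrain <- samples$ytrain[[1]]               # factor of length 80
dim(fold1_train)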
Bootstrap Blocked Cross-Validation
Overview
This method combines blocking with bootstrap resampling, providing more stable performance estimates while respecting the run structure.
# Create bootstrap blocked CV with 20 repetitions
boot_cval <- bootstrap_blocked_cross_validation(block_var, nreps = 20)
print(boot_cval)
##
## █▀▀ Bootstrap Blocked Cross-Validation ▀▀█
##
## ├─ Configuration
## │ ├─ Observations: 100
## │ └─ Bootstrap Repetitions: 20
## ├─ Block Information
## │ ├─ Total Blocks: 5
## │ ├─ Mean Block Size: 20 (SD: 0 )
## │ └─ Block Sizes: 1: 20, 2: 20, 3: 20, 4: 20, 5: 20
## └─ Sampling Weights
## └─ Status: None (uniform sampling)
# Generate samples
boot_samples <- crossval_samples(boot_cval, dat, y)
print(boot_samples)
## # A tibble: 100 × 5
## ytrain ytest train test .id
## <list> <list> <list> <list> <chr>
## 1 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 001
## 2 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 002
## 3 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 003
## 4 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 004
## 5 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 005
## 6 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 006
## 7 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 007
## 8 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 008
## 9 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 009
## 10 <fct [80]> <fct [20]> <resample [80 x 3]> <resample [20 x 3]> 010
## # ℹ 90 more rows
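Note that the 100 rows correspond to 20 repetitions of the 5-fold leave-one-block-out scheme; within each repetition, training observations are resampled with replacement inside their blocks.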
Optional Weighted Sampling
You can provide weights to influence the sampling probability:
# Create weights (e.g., based on motion parameters)
weights <- runif(length(block_var))
weighted_boot_cval <- bootstrap_blocked_cross_validation(
  block_var,
  nreps = 20,
  weights = weights
)
print(weighted_boot_cval)
##
## █▀▀ Bootstrap Blocked Cross-Validation ▀▀█
##
## ├─ Configuration
## │ ├─ Observations: 100
## │ └─ Bootstrap Repetitions: 20
## ├─ Block Information
## │ ├─ Total Blocks: 5
## │ ├─ Mean Block Size: 20 (SD: 0 )
## │ └─ Block Sizes: 1: 20, 2: 20, 3: 20, 4: 20, 5: 20
## └─ Sampling Weights
## ├─ Status: Present
## ├─ Range: [0.001, 0.020]
## └─ Non-zero Weights: 100 ( 100.0% )
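In practice, one might down-weight high-motion trials so they are sampled less often. A minimal sketch, where fd is a hypothetical vector of per-trial framewise displacement values (simulated here, not derived from real motion parameters):
# Hypothetical: down-weight high-motion trials; `fd` stands in for
# real per-trial framewise displacement estimates
fd <- runif(length(block_var), 0, 2)
motion_weights <- 1 / (1 + fd)  # higher motion -> lower sampling weight
motion_cval <- bootstrap_blocked_cross_validation(
  block_var,
  nreps = 20,
  weights = motion_weights
)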
Sequential Blocked Cross-Validation
Overview
This method creates sequential folds within each block, useful when temporal order matters.
# Create sequential blocked CV with 2 folds and 4 repetitions
seq_cval <- sequential_blocked_cross_validation(
  block_var,
  nfolds = 2,
  nreps = 4
)
print(seq_cval)
## $block_var
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [75] 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
##
## $nfolds
## [1] 2
##
## $nreps
## [1] 4
##
## $block_ind
## [1] 1 2 3 4 5
##
## attr(,"class")
## [1] "sequential_blocked_cross_validation" "cross_validation"
## [3] "list"
Two-Fold Blocked Cross-Validation
Overview
This approach randomly splits blocks into two groups, useful for rapid performance estimation.
# Create two-fold blocked CV with 10 repetitions
twofold_cval <- twofold_blocked_cross_validation(block_var, nreps = 10)
print(twofold_cval)
##
## █▀▀ Two-Fold Blocked Cross-Validation ▀▀█
##
## ├─ Configuration
## │ ├─ Observations: 100
## │ ├─ Number of Folds: 2
## │ └─ Repetitions: 10
## └─ Block Information
## ├─ Total Blocks: 5
## ├─ Mean Block Size: 20 (SD: 0 )
## └─ Block Sizes: 1: 20, 2: 20, 3: 20, 4: 20, 5: 20
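As with the other schemes, crossval_samples() generates the concrete splits; with 10 repetitions of a two-fold split of the blocks we would expect 20 train/test rows (reusing dat and y from above):
# Generate the two-fold splits (expected: nreps * 2 = 20 rows)
twofold_samples <- crossval_samples(twofold_cval, dat, y)
nrow(twofold_samples)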
Custom Cross-Validation
Overview
For specialized validation schemes, you can define custom training/testing splits:
# Define custom splits
custom_splits <- list(
  list(train = 1:60, test = 61:100),
  list(train = 1:40, test = 41:100),
  list(train = 1:80, test = 81:100)
)
# Create custom CV
custom_cval <- custom_cross_validation(custom_splits)
print(custom_cval)
## $sample_set
## $sample_set[[1]]
## $sample_set[[1]]$train
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60
##
## $sample_set[[1]]$test
## [1] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## [20] 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
## [39] 99 100
##
##
## $sample_set[[2]]
## $sample_set[[2]]$train
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##
## $sample_set[[2]]$test
## [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
## [20] 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## [39] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## [58] 98 99 100
##
##
## $sample_set[[3]]
## $sample_set[[3]]$train
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## [76] 76 77 78 79 80
##
## $sample_set[[3]]$test
## [1] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## [20] 100
##
##
##
## $nfolds
## [1] 3
##
## attr(,"class")
## [1] "custom_cross_validation" "cross_validation"
## [3] "list"
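Custom splits can also be built programmatically. For instance, a hand-rolled leave-one-block-out scheme (equivalent in spirit to blocked_cross_validation, and assuming the block_var defined earlier):
# Hand-rolled leave-one-block-out splits derived from block_var
lobo_splits <- lapply(unique(block_var), function(b) {
  list(train = which(block_var != b), test = which(block_var == b))
})
lobo_cval <- custom_cross_validation(lobo_splits)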
Practical Example: Model Training
Here’s a complete example using blocked cross-validation with an SDA classifier:
# Setup cross-validation
block_var <- rep(1:5, each = 20)
cval <- blocked_cross_validation(block_var)
# Generate data
set.seed(123)
dat <- data.frame(matrix(rnorm(100 * 10), 100, 10))
y <- factor(rep(letters[1:5], 20))
# Generate CV samples
samples <- crossval_samples(cval, dat, y)
# Train models for each fold
model_fits <- samples %>%
  rowwise() %>%
  do({
    train_dat <- as.data.frame(.$train)
    y_train <- .$ytrain
    fit <- sda::sda(as.matrix(train_dat), y_train, verbose = FALSE)
    tibble::tibble(fit = list(fit))
  })
print(model_fits)
## # A tibble: 5 × 1
## # Rowwise:
## fit
## <list>
## 1 <sda>
## 2 <sda>
## 3 <sda>
## 4 <sda>
## 5 <sda>
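To close the loop, each fitted model can be scored on its held-out fold. A minimal sketch, assuming sda's predict method returns a list with a class element (as in current versions of the sda package):
# Score each fold's fit on the corresponding held-out data
accuracies <- sapply(seq_len(nrow(samples)), function(i) {
  test_dat <- as.matrix(as.data.frame(samples$test[[i]]))
  pred <- predict(model_fits$fit[[i]], test_dat, verbose = FALSE)
  mean(pred$class == samples$ytest[[i]])
})
mean(accuracies)  # cross-validated accuracy (near chance for this random data)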
Best Practices
When choosing a cross-validation strategy, consider:
- Data Structure
  - Use blocked CV when data has clear run/session boundaries
  - Consider sequential CV when temporal order matters
  - Use bootstrap blocked CV for more stable estimates
- Sample Size
  - For small datasets, bootstrap blocked CV can help stabilize estimates
  - For large datasets, simple blocked CV may suffice
- Temporal Dependencies
  - Always respect the temporal structure of fMRI data
  - Avoid mixing training and testing data from the same run
  - Account for temporal autocorrelation between nearby trials
- Computation Time
  - Bootstrap and sequential methods require more computation
  - Two-fold CV offers quick preliminary results
  - Balance estimation stability against computational cost
Summary
The rMVPA package provides a comprehensive suite of cross-validation strategies for neuroimaging data. Key points:
- Choose the appropriate CV strategy based on your data structure
- Consider temporal dependencies in fMRI data
- Use bootstrap methods for more stable estimates when needed
- Implement custom schemes for specialized needs
For implementation details, refer to the source code in crossval.R.
Integration with Regional and Searchlight Analyses
Cross-validation strategies in rMVPA are designed to work seamlessly with both regional and searchlight analyses. Here we'll demonstrate how to incorporate different cross-validation schemes into these analyses.
Regional Analysis Example
Let’s perform a regional MVPA analysis using different cross-validation strategies:
# Generate a sample dataset
data_out <- gen_sample_dataset(D = c(6,6,6), nobs = 80, blocks = 4, nlevels = 2)
# Create a region mask
mask <- data_out$dataset$mask
nvox <- sum(mask)
region_mask <- neuroim2::NeuroVol(
  sample(1:3, size = nvox, replace = TRUE),
  neuroim2::space(mask),  # qualify space() so the snippet runs without attaching neuroim2
  indices = which(mask > 0)
)
# Create MVPA dataset
dset <- mvpa_dataset(data_out$dataset$train_data, mask = data_out$dataset$mask)
# Load the classification model
mod <- load_model("sda_notune")
tune_grid <- data.frame(lambda = 0.01, diagonal = FALSE)
# Example 1: Using Blocked Cross-Validation
blocked_cv <- blocked_cross_validation(data_out$design$block_var)
mvpa_mod_blocked <- mvpa_model(
  mod,
  dataset = dset,
  design = data_out$design,
  crossval = blocked_cv,
  tune_grid = tune_grid
)
# Run regional analysis with blocked CV
results_blocked <- run_regional(mvpa_mod_blocked, region_mask)
## INFO [2025-02-13 23:41:14]
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Example 2: Using Bootstrap Blocked Cross-Validation
bootstrap_cv <- bootstrap_blocked_cross_validation(
  data_out$design$block_var,
  nreps = 10
)
mvpa_mod_boot <- mvpa_model(
  mod,
  dataset = dset,
  design = data_out$design,
  crossval = bootstrap_cv,
  tune_grid = tune_grid
)
# Run regional analysis with bootstrap CV
results_boot <- run_regional(mvpa_mod_boot, region_mask)
## INFO [2025-02-13 23:41:16]
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Compare performance between CV strategies
cat("Blocked CV Performance:\n")
## Blocked CV Performance:
print(results_blocked$performance_table)
## # A tibble: 3 × 3
## roinum Accuracy AUC
## <int> <dbl> <dbl>
## 1 1 0.6 0.0975
## 2 2 0.55 0.0931
## 3 3 0.575 0.114
cat("\nBootstrap CV Performance:\n")
##
## Bootstrap CV Performance:
print(results_boot$performance_table)
## # A tibble: 3 × 3
## roinum Accuracy AUC
## <int> <dbl> <dbl>
## 1 1 0.575 0.0569
## 2 2 0.512 0.0750
## 3 3 0.588 0.09
Searchlight Analysis Example
We can also use different cross-validation strategies in searchlight analysis:
# Example 3: Searchlight with Sequential Blocked Cross-Validation
seq_cv <- sequential_blocked_cross_validation(
  data_out$design$block_var,
  nfolds = 2,
  nreps = 4
)
mvpa_mod_seq <- mvpa_model(
  mod,
  dataset = dset,
  design = data_out$design,
  crossval = seq_cv,
  tune_grid = tune_grid
)
# Run searchlight analysis
results_searchlight <- run_searchlight(
  mvpa_mod_seq,
  radius = 2,
  method = "standard"
)
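The call above visits every voxel in the mask. In the rMVPA versions we've seen, run_searchlight() also accepts method = "randomized" together with an niter argument, which samples sphere centers rather than exhaustively scanning them and can make a cheaper first pass; treat the exact arguments as an assumption and confirm against your installed version:
# Hypothetical randomized searchlight -- confirm via ?run_searchlight
results_rand <- run_searchlight(
  mvpa_mod_seq,
  radius = 2,
  method = "randomized",
  niter = 8
)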
Key Considerations
When integrating cross-validation with regional or searchlight analyses:
- Memory Usage
  - Bootstrap and sequential methods generate more folds
  - Consider reducing nreps for large searchlight analyses
  - Monitor memory usage with large datasets
- Computation Time
  - Searchlight analysis multiplies computation across voxels
  - Two-fold CV might be preferable for initial searchlight exploration
  - Use parallel processing when available (see the sketch after this list)
- Result Stability
  - Bootstrap methods provide confidence intervals but require more computation
  - Consider the trade-off between estimation stability and computational cost
  - For searchlight analyses, simpler CV schemes might be sufficient
- Performance Metrics
  - Different CV schemes might produce slightly different performance estimates
  - Bootstrap methods generally provide more conservative estimates
  - Consider averaging results across repetitions for bootstrap and sequential CV
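As flagged above, parallel processing can substantially reduce searchlight runtime. In the rMVPA versions we've seen, heavy loops are dispatched through the future framework, so setting a future plan before running the analysis should enable parallelism; treat this as an assumption and verify against your installed version.
# Hypothetical parallel setup via the future framework; whether
# run_searchlight() honors this plan depends on your rMVPA version
library(future)
plan(multisession, workers = 4)  # four local R worker processes
results_par <- run_searchlight(mvpa_mod_seq, radius = 2, method = "standard")
plan(sequential)                 # restore the default sequential backend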
Example: Comparing CV Strategies in Regional Analysis
Here’s a more detailed comparison of different CV strategies:
# Create different CV schemes
cv_schemes <- list(
  blocked = blocked_cross_validation(data_out$design$block_var),
  bootstrap = bootstrap_blocked_cross_validation(data_out$design$block_var, nreps = 10),
  twofold = twofold_blocked_cross_validation(data_out$design$block_var, nreps = 5)
)
# Run regional analysis with each CV scheme
results <- lapply(cv_schemes, function(cv) {
  mvpa_mod <- mvpa_model(
    mod,
    dataset = dset,
    design = data_out$design,
    crossval = cv,
    tune_grid = tune_grid
  )
  run_regional(mvpa_mod, region_mask)
})
## INFO [2025-02-13 23:41:16]
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
## INFO [2025-02-13 23:41:18]
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
## INFO [2025-02-13 23:41:19]
## ✨ MVPA Iteration Complete
## ├─ Total ROIs: 3
## ├─ Processed: 3
## └─ Skipped: 0
# Compare performance across CV schemes
performance_comparison <- lapply(names(results), function(name) {
  perf <- results[[name]]$performance_table
  perf$cv_scheme <- name
  perf
})
# Combine results
all_performance <- do.call(rbind, performance_comparison)
print(all_performance)
## # A tibble: 9 × 4
## roinum Accuracy AUC cv_scheme
## <int> <dbl> <dbl> <chr>
## 1 1 0.6 0.0975 blocked
## 2 2 0.55 0.0931 blocked
## 3 3 0.575 0.114 blocked
## 4 1 0.6 0.0675 bootstrap
## 5 2 0.55 0.0663 bootstrap
## 6 3 0.6 0.0700 bootstrap
## 7 1 0.588 0.114 twofold
## 8 2 0.538 0.0813 twofold
## 9 3 0.525 0.0806 twofold
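To distill the comparison, aggregate across ROIs, e.g. with dplyr (already implied by the %>% pipelines above):
# Mean performance per CV scheme, averaged over the three ROIs
library(dplyr)
all_performance %>%
  group_by(cv_scheme) %>%
  summarise(mean_accuracy = mean(Accuracy), mean_auc = mean(AUC))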
This integration demonstrates how different cross-validation strategies can be easily incorporated into the broader MVPA analysis framework, allowing researchers to choose the most appropriate validation approach for their specific analysis needs.
For more details on regional and searchlight analyses, refer to their respective vignettes and the implementation in regional.R and searchlight.R.