Feature Selection in rMVPA
Bradley Buchsbaum
2025-09-28
FeatureSelection.Rmd
Introduction
Neuroimaging datasets often contain far more features than informative signal. Feature selection reduces dimensionality, improves interpretability, and can boost predictive performance, provided it is performed inside the cross-validation loop to avoid selection bias. In rMVPA, you configure selection with a feature-selector object. The package supports an ANOVA-based F-test (“FTest”) and correlation-adjusted t-scores (“catscore”, via sda.ranking) to rank features, together with simple cutoffs: keep the top k features or the top proportion p (e.g., p = 0.1 keeps the top 10%).
Creating a Feature Selector Object
To create a feature selector in rMVPA, use the feature_selector() function. For example, to construct a feature selector using the FTest method with a top_k cutoff (selecting the top 10 features):
suppressPackageStartupMessages(library(rMVPA))
# Create a feature selector using FTest with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
fsel
## Feature Selector Object
## -----------------------
## Method: FTest
## Cutoff Type: top_k
## Cutoff Value: 10
Similarly, you can create a feature selector that keeps a proportion of the features using the top_p option. In the example below, we select the top 10% of features based on the FTest ranking:
# Create a feature selector using FTest with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
fsel
## Feature Selector Object
## -----------------------
## Method: FTest
## Cutoff Type: top_p
## Cutoff Value: 0.1
Applying Feature Selection to Data
The select_features() function applies the feature selection process to a feature matrix X and a response variable Y. It returns a logical vector with TRUE for selected features and FALSE otherwise. Below is an example using simulated data:
# Simulate a response variable (categorical)
Y <- factor(rep(letters[1:4], each = 25))
# Simulate a feature matrix with 100 samples and 100 features
X <- matrix(rnorm(100 * 100), nrow = 100, ncol = 100)
# Apply feature selection using the FTest method with top_k cutoff
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
selected_features <- select_features(fsel, X, Y)
# The number of selected features should be equal to the cutoff value (10)
cat("Number of selected features (top_k):", sum(selected_features), "\n")
## Number of selected features (top_k): 10
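Because select_features() returns a plain logical vector, the feature matrix can be reduced with ordinary column indexing before model fitting (X_reduced is just an illustrative name):
# Keep only the selected columns of the feature matrix
X_reduced <- X[, selected_features, drop = FALSE]
dim(X_reduced)
## [1] 100  10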
Now, let’s use the top_p option, which selects a proportion of the features rather than a fixed number. With a cutoff value of 0.1, the top 10% of features will be selected:
# Apply feature selection using the FTest method with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
selected_features <- select_features(fsel, X, Y)
# Calculate the proportion of features selected
the_proportion <- sum(selected_features) / ncol(X)
cat("Proportion of features selected (top_p):", the_proportion, "\n")
## Proportion of features selected (top_p): 0.1
Using the catscore Method
Alternatively, you can use the catscore method, which computes a correlation-adjusted t-score (CAT score) for each feature and thereby accounts for correlations among features when ranking them. Here’s an example:
# Create a feature selector using catscore with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "catscore", cutoff_type = "top_k", cutoff_value = 10)
# Simulate a response variable and feature matrix
Y <- factor(rep(letters[1:3], length.out = 90))
X <- matrix(rnorm(90 * 50), nrow = 90, ncol = 50)
# Apply feature selection using catscore
selected_features <- select_features(fsel, X, Y, ranking.score = "entropy")
cat("Number of features selected using catscore (top_k):", sum(selected_features), "\n")
## Number of features selected using catscore (top_k): 10
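The ranking.score argument is passed along to sda.ranking. Here is a brief sketch with an alternative score, assuming (as the example above suggests) that this option is forwarded unchanged; “avg” and “max” are the other scores sda.ranking accepts:
# Re-rank with an alternative score; "avg" is an sda.ranking option,
# assumed here to be forwarded by select_features()
selected_avg <- select_features(fsel, X, Y, ranking.score = "avg")
cat("Number of features selected (avg score):", sum(selected_avg), "\n")
## Number of features selected (avg score): 10
The count is unchanged because the top_k cutoff fixes how many features are kept; the score only affects which features rank highest.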
Summary
Feature selection is a powerful tool for reducing dimensionality in high-dimensional datasets, especially in neuroimaging applications. In rMVPA, the integration of feature selection into cross-validation workflows helps ensure that models are built on unbiased, relevant subsets of the data. You can choose between ranking methods (FTest or catscore) and cutoff strategies (top_k vs. top_p) based on your specific analysis needs.
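To make the bias point concrete, here is a minimal sketch of fold-wise selection written with plain base-R folds (the 5-fold split and the omitted classifier step are illustrative assumptions, not rMVPA's internal machinery):
set.seed(1)
Y <- factor(rep(letters[1:4], each = 25))
X <- matrix(rnorm(100 * 100), nrow = 100, ncol = 100)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
# Assign each of the 100 samples to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(X)))
for (k in 1:5) {
  train <- folds != k
  # Rank and threshold features using the training folds only
  keep <- select_features(fsel, X[train, , drop = FALSE], Y[train])
  # A classifier would then be fit on X[train, keep] and evaluated on
  # X[!train, keep]; the held-out fold never influences which features are kept.
}
In a full rMVPA analysis you would not write this loop yourself: the feature selector is supplied to the package's cross-validation workflow, which applies it within each training fold.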