Feature Selection in rMVPA
Bradley Buchsbaum
2025-02-13
FeatureSelection.Rmd
Introduction
In modern neuroimaging and machine learning analyses, datasets often
contain a large number of features (e.g., voxels). Many of these
features can be noisy or irrelevant for the prediction task at hand.
Feature selection is a critical step to reduce dimensionality, improve
model interpretability, and potentially enhance predictive performance.
It is essential to integrate feature selection within the cross-validation framework to avoid introducing selection bias. In the rMVPA package, feature selection can be applied automatically as part of the analysis pipeline.
In rMVPA, a feature selector object specifies the method and criteria for feature selection. Two primary methods are available:
- FTest: Performs a one-way ANOVA F-test on each feature and uses the resulting p-values for selection.
- catscore: Computes correlation-adjusted t-scores (using the sda.ranking function from the sda package) to rank features.
Additionally, two types of cutoff criteria are supported:
- Top_k (or “topk”): Selects the top k features based on their ranking.
- Top_p (or “topp”): Selects a proportion p of the features (e.g., setting p to 0.1 selects the top 10% of features).
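Conceptually, the FTest method with a top_k cutoff amounts to ranking features by their one-way ANOVA p-values and keeping the k smallest. The base-R sketch below illustrates that idea; it is an assumption about the general approach, not rMVPA's actual implementation:

```r
# Illustrative sketch (not rMVPA internals): rank features by one-way
# ANOVA F-test p-values and keep the top k.
set.seed(42)
Y <- factor(rep(letters[1:4], each = 25))   # 4 classes, 100 samples
X <- matrix(rnorm(100 * 20), nrow = 100)    # 20 features

# p-value of a one-way ANOVA F-test for each feature
pvals <- apply(X, 2, function(x) {
  summary(aov(x ~ Y))[[1]][["Pr(>F)"]][1]
})

# top_k selection: keep the k features with the smallest p-values
k <- 5
keep <- rank(pvals, ties.method = "first") <= k
sum(keep)   # 5
```

A top_p cutoff works the same way, except k is derived from a proportion of the total feature count.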
Creating a Feature Selector Object
To create a feature selector in rMVPA, use the feature_selector() function. For example, to construct a feature selector using the FTest method with a top_k cutoff (selecting the top 10 features):
suppressPackageStartupMessages(library(rMVPA))
# Create a feature selector using FTest with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
fsel
## $cutoff_type
## [1] "top_k"
##
## $cutoff_value
## [1] 10
##
## attr(,"class")
## [1] "FTest" "feature_selector" "list"
Similarly, you can create a feature selector that selects a proportion of features using the top_p option. In the example below, we select the top 10% of features based on the FTest ranking:
# Create a feature selector using FTest with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
fsel
## $cutoff_type
## [1] "top_p"
##
## $cutoff_value
## [1] 0.1
##
## attr(,"class")
## [1] "FTest" "feature_selector" "list"
Applying Feature Selection to Data
The select_features() function applies the feature selection process to a given feature matrix X and a response variable Y. It returns a logical vector with TRUE for selected features and FALSE otherwise. Below is an example using simulated data:
# Simulate a response variable (categorical)
Y <- factor(rep(letters[1:4], each = 25))
# Simulate a feature matrix with 100 samples and 100 features
X <- matrix(rnorm(100 * 100), nrow = 100, ncol = 100)
# Apply feature selection using the FTest method with top_k cutoff
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
selected_features <- select_features(fsel, X, Y)
# The number of selected features should be equal to the cutoff value (10)
cat("Number of selected features (top_k):", sum(selected_features), "\n")
## Number of selected features (top_k): 10
Now, let’s use the top_p option. This will select a proportion of the features. For example, with a cutoff value of 0.1, the top 10% of features will be selected:
# Apply feature selection using the FTest method with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
selected_features <- select_features(fsel, X, Y)
# Calculate the proportion of features selected
the_proportion <- sum(selected_features) / ncol(X)
cat("Proportion of features selected (top_p):", the_proportion, "\n")
## Proportion of features selected (top_p): 0.1
Using the catscore Method
Alternatively, you can use the catscore method to perform feature selection. The catscore method computes a correlation-adjusted t-score for each feature. Here’s an example:
# Create a feature selector using catscore with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "catscore", cutoff_type = "top_k", cutoff_value = 10)
# Simulate a response variable and feature matrix
Y <- factor(rep(letters[1:3], length.out = 90))
X <- matrix(rnorm(90 * 50), nrow = 90, ncol = 50)
# Apply feature selection using catscore
selected_features <- select_features(fsel, X, Y, ranking.score = "entropy")
cat("Number of features selected using catscore (top_k):", sum(selected_features), "\n")
## Number of features selected using catscore (top_k): 10
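The correlation-adjusted t-score (CAT score) of Zuber and Strimmer decorrelates ordinary two-sample t-scores by the inverse square root of the feature correlation matrix, so that correlated features do not all inherit the same apparent importance. The base-R sketch below illustrates the idea for a two-class problem; it is a conceptual approximation, not the sda package's implementation:

```r
# Conceptual CAT-score sketch for two classes (illustration only).
set.seed(1)
n <- 40; p <- 8
X <- matrix(rnorm(n * p), n, p)
Y <- factor(rep(c("a", "b"), each = n / 2))

g1 <- X[Y == "a", ]; g2 <- X[Y == "b", ]
n1 <- nrow(g1); n2 <- nrow(g2)

# Pooled standard deviations and ordinary two-sample t-scores
sp <- sqrt(((n1 - 1) * apply(g1, 2, var) +
            (n2 - 1) * apply(g2, 2, var)) / (n1 + n2 - 2))
tscore <- (colMeans(g1) - colMeans(g2)) / (sp * sqrt(1 / n1 + 1 / n2))

# Inverse square root of the pooled feature correlation matrix
R <- cor(rbind(scale(g1, scale = FALSE), scale(g2, scale = FALSE)))
e <- eigen(R, symmetric = TRUE)
R_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

# Decorrelated t-scores; rank by squared CAT score and keep the top 3
cat_scores <- as.vector(R_inv_sqrt %*% tscore)
top3 <- order(cat_scores^2, decreasing = TRUE)[1:3]
```

In practice, sda.ranking also applies shrinkage estimation of the correlation matrix, which this sketch omits for clarity.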
Summary
Feature selection is a powerful tool for reducing dimensionality in high-dimensional datasets, especially in neuroimaging applications. In rMVPA, the integration of feature selection into cross-validation workflows helps ensure that models are built on unbiased, relevant subsets of the data. You can choose between different methods (FTest or catscore) and cutoff strategies (top_k vs. top_p) based on your specific analysis needs.
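To see why selection must live inside the cross-validation loop, the base-R sketch below re-runs a simple ranking step within each training fold, so the held-out data never influences which features are kept. The fold construction and the t-test ranking are illustrative choices, not rMVPA's machinery:

```r
# Illustrative sketch: per-fold feature selection avoids selection bias.
set.seed(7)
Y <- factor(rep(c("a", "b"), each = 30))
X <- matrix(rnorm(60 * 40), nrow = 60)        # 60 samples, 40 features
folds <- sample(rep(1:5, length.out = 60))    # 5-fold assignment

for (f in 1:5) {
  train <- folds != f
  # Rank features using training data only (two-sample t-test here)
  pvals <- apply(X[train, ], 2, function(x) t.test(x ~ Y[train])$p.value)
  keep  <- rank(pvals, ties.method = "first") <= 10
  # A classifier would now be fit on X[train, keep] and evaluated on
  # X[!train, keep]; the held-out fold never influences 'keep'.
  stopifnot(sum(keep) == 10)
}
```

Selecting features once on the full dataset and then cross-validating would leak information from the test folds into the selection step, inflating accuracy estimates; rMVPA's pipeline performs the selection per fold for exactly this reason.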