Introduction

Neuroimaging and machine learning datasets often contain a large number of features (e.g., voxels), many of which are noisy or irrelevant to the prediction task. Feature selection reduces dimensionality, improves model interpretability, and can enhance predictive performance. To avoid selection bias, feature selection must be performed within the cross-validation framework: if features are chosen using the full dataset, information from the test samples leaks into the model and performance estimates become optimistically biased. In the rMVPA package, feature selection can be applied automatically as part of the analysis pipeline.

In rMVPA, a feature selector object is used to specify the method and criteria for feature selection. Two primary methods are available:

  • FTest: Performs a one-way ANOVA F-test on each feature and uses the resulting p-values for selection.
  • catscore: Computes correlation-adjusted t-scores (using the sda.ranking function from the sda package) to rank features.
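
To make the FTest idea concrete, here is a minimal base R sketch of a per-feature one-way ANOVA F-test. It is an illustration only and not rMVPA's internal implementation; the helper name ftest_pvalues and the simulated data are assumptions introduced for this example.

```r
# Hypothetical helper (not part of rMVPA): one-way ANOVA p-value for each column of X
ftest_pvalues <- function(X, Y) {
  apply(X, 2, function(x) {
    # var.equal = TRUE gives the classical one-way ANOVA F-test
    oneway.test(x ~ Y, var.equal = TRUE)$p.value
  })
}

set.seed(1)
Y <- factor(rep(letters[1:4], each = 25))
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
X[, 3] <- X[, 3] + as.integer(Y)  # give feature 3 a strong group effect

pvals <- ftest_pvalues(X, Y)
which.min(pvals)  # the informative feature (column 3) has the smallest p-value
```

Features with small p-values differ most between classes, so a selector built on this test would favour them.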

Additionally, two types of cutoff criteria are supported:

  • top_k (or “topk”): Selects the k highest-ranked features.
  • top_p (or “topp”): Selects a proportion p of the features (e.g., setting p to 0.1 selects the top 10% of features).
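
The two cutoff rules can be sketched in base R as follows. The helpers select_top_k and select_top_p are hypothetical and are not rMVPA functions. In particular, how rMVPA rounds a fractional feature count for top_p is not stated here, so the round-up rule below is an assumption.

```r
# Hypothetical helpers (not rMVPA functions); a higher score means a more relevant feature
select_top_k <- function(scores, k) {
  k <- min(k, length(scores))
  mask <- logical(length(scores))
  mask[order(scores, decreasing = TRUE)[seq_len(k)]] <- TRUE
  mask
}

select_top_p <- function(scores, p) {
  # Assumption: round up and always keep at least one feature
  k <- max(1, ceiling(p * length(scores)))
  select_top_k(scores, k)
}

scores <- c(0.2, 3.1, 0.7, 5.4, 1.9, 0.1, 2.8, 4.0, 0.3, 1.2)
which(select_top_k(scores, 3))    # 2 4 8
which(select_top_p(scores, 0.5))  # 2 4 5 7 8
```

Both helpers return a logical mask with one element per feature, so the two cutoff types differ only in how the number of retained features is determined.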

Creating a Feature Selector Object

To create a feature selector in rMVPA, use the feature_selector() function. The following example constructs a feature selector that uses the FTest method with a top_k cutoff, selecting the top 10 features:

suppressPackageStartupMessages(library(rMVPA))
# Create a feature selector using FTest with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
fsel
## $cutoff_type
## [1] "top_k"
## 
## $cutoff_value
## [1] 10
## 
## attr(,"class")
## [1] "FTest"            "feature_selector" "list"

Similarly, you can create a feature selector that selects a proportion of features using the top_p option. In the example below, we select the top 10% of features based on the FTest ranking:

# Create a feature selector using FTest with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
fsel
## $cutoff_type
## [1] "top_p"
## 
## $cutoff_value
## [1] 0.1
## 
## attr(,"class")
## [1] "FTest"            "feature_selector" "list"

Applying Feature Selection to Data

The select_features() function applies the feature selection process to a feature matrix X (rows are samples, columns are features) and a response variable Y. It returns a logical vector with one element per feature: TRUE for selected features and FALSE otherwise.

Below is an example using simulated data:

# Simulate a response variable (categorical)
Y <- factor(rep(letters[1:4], each = 25))

# Simulate a feature matrix with 100 samples and 100 features
set.seed(42)  # fix the random seed so the simulated data are reproducible
X <- matrix(rnorm(100 * 100), nrow = 100, ncol = 100)

# Apply feature selection using the FTest method with top_k cutoff
fsel <- feature_selector(method = "FTest", cutoff_type = "top_k", cutoff_value = 10)
selected_features <- select_features(fsel, X, Y)

# The number of selected features should be equal to the cutoff value (10)
cat("Number of selected features (top_k):", sum(selected_features), "\n")
## Number of selected features (top_k): 10
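
Because the result is a logical vector over columns, it can be used directly to subset the feature matrix. The sketch below uses a hand-made stand-in mask rather than a real select_features() result, so that it runs without rMVPA; the chosen column indices are arbitrary.

```r
set.seed(1)
X <- matrix(rnorm(100 * 100), nrow = 100, ncol = 100)

# Stand-in for the logical vector returned by select_features();
# here 10 arbitrary columns are simply marked as selected.
selected_features <- seq_len(ncol(X)) %in% c(3, 8, 15, 21, 34, 47, 55, 62, 78, 91)

selected_idx <- which(selected_features)           # column indices of the kept features
X_reduced <- X[, selected_features, drop = FALSE]  # keep only the selected columns
dim(X_reduced)                                     # 100 10
```

Using drop = FALSE keeps the result a matrix even when only a single feature is selected.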

The top_p option instead selects a proportion of the features. With a cutoff value of 0.1 and the 100 features simulated above, the 10 highest-ranked features (10%) are selected:

# Apply feature selection using the FTest method with top_p cutoff (select top 10% of features)
fsel <- feature_selector(method = "FTest", cutoff_type = "top_p", cutoff_value = 0.1)
selected_features <- select_features(fsel, X, Y)

# Calculate the proportion of features selected
the_proportion <- sum(selected_features) / ncol(X)
cat("Proportion of features selected (top_p):", the_proportion, "\n")
## Proportion of features selected (top_p): 0.1

Using the catscore Method

Alternatively, you can use the catscore method, which ranks features by their correlation-adjusted t-scores (CAT scores). The example below selects the top 10 of 50 features for a three-class response:

# Create a feature selector using catscore with top_k cutoff (select top 10 features)
fsel <- feature_selector(method = "catscore", cutoff_type = "top_k", cutoff_value = 10)

# Simulate a response variable and feature matrix
set.seed(42)  # fix the random seed so the simulated data are reproducible
Y <- factor(rep(letters[1:3], length.out = 90))
X <- matrix(rnorm(90 * 50), nrow = 90, ncol = 50)

# Apply feature selection using catscore
selected_features <- select_features(fsel, X, Y, ranking.score = "entropy")

cat("Number of features selected using catscore (top_k):", sum(selected_features), "\n")
## Number of features selected using catscore (top_k): 10
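
For reference, the underlying ranking can also be inspected by calling sda.ranking() from the sda package directly. This is a sketch under assumptions: it presumes the sda package is installed, and it presumes that the ranking.score argument passed to select_features() above corresponds to the sda.ranking() argument of the same name, which the text does not state explicitly. The exact options rMVPA passes internally may differ.

```r
if (requireNamespace("sda", quietly = TRUE)) {
  set.seed(1)
  Y <- factor(rep(letters[1:3], length.out = 90))
  X <- matrix(rnorm(90 * 50), nrow = 90, ncol = 50)

  # fdr = FALSE skips the local FDR computation; verbose = FALSE silences progress output
  ranking <- sda::sda.ranking(X, Y, ranking.score = "entropy",
                              fdr = FALSE, verbose = FALSE)

  # Rows are ordered from most to least relevant; "idx" holds the original column index
  top10 <- ranking[seq_len(10), "idx"]
  mask <- seq_len(ncol(X)) %in% top10
  sum(mask)  # 10 features flagged
}
```

Building the logical mask by hand mirrors the kind of vector that select_features() returns, which makes it easy to compare the two.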

Summary

Feature selection reduces dimensionality in high-dimensional datasets, which is especially useful in neuroimaging applications. In rMVPA, feature selection is integrated into cross-validation workflows, which helps keep performance estimates free of selection bias. You can choose between two ranking methods (FTest or catscore) and two cutoff strategies (top_k or top_p) based on your analysis needs.
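
The cross-validation point can be made concrete with a hand-rolled base R sketch in which features are re-selected inside every fold using the training rows only. This illustrates the general principle and is not how rMVPA wires selection into its pipeline; the helper top_k_by_ftest, the fold assignment, and the noise-only data are all assumptions made for the example.

```r
set.seed(1)
n <- 100; p <- 200; k <- 10
Y <- factor(rep(letters[1:4], each = 25))
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # pure noise: no feature is truly informative

# Assign each sample to one of 5 folds, balanced within each class
folds <- ave(seq_len(n), Y, FUN = function(i) sample(rep(1:5, length.out = length(i))))

# Hypothetical helper: indices of the k features with the smallest ANOVA p-values
top_k_by_ftest <- function(X, Y, k) {
  pvals <- apply(X, 2, function(x) oneway.test(x ~ Y, var.equal = TRUE)$p.value)
  order(pvals)[seq_len(k)]
}

selected_per_fold <- lapply(1:5, function(f) {
  train <- folds != f
  # Selection sees only the training rows; the held-out fold plays no part
  top_k_by_ftest(X[train, , drop = FALSE], Y[train], k)
})

# One set of k feature indices per fold
lengths(selected_per_fold)
```

A classifier would then be fit on the selected training columns of each fold and evaluated on the same columns of the held-out fold, so the test samples never influence which features are chosen.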