Feature Selection

Feature selection is used to select a subset of variables to include in a classification or regression model. This can be useful when one has many irrelevant features and wants to quickly reduce the problem to a more manageable size. Alternatively, if one wants to impose a limit on the total number of included features (say, 1000) across a set of classification tasks (e.g. over a group of subjects), it can be useful to apply feature selection before running the analysis. Note that to avoid bias, feature selection must be incorporated into the cross-validation procedure, such that feature selection is carried out separately for each training fold (a manual sketch of this per-fold procedure appears at the end of this section). This is done automatically in the analysis pipelines provided in rMVPA.

Creating a feature_selector object

To use feature selection in rMVPA we must construct a feature_selector instance, which specifies the feature selection method and the criterion for variable inclusion. There are two ways of specifying a cutoff value for variable inclusion: the first, called top_k, selects the k most important features according to a feature importance measure; the second, called top_p, selects a proportion p of all variables after ranking them by feature importance. Thus, if p is .1, we include the top 10% of all features.
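
Concretely (the variable names below are illustrative, not part of the rMVPA API), a top_p cutoff translates into a feature count as follows:

## illustrative only: how a proportion cutoff maps to a feature count
p <- 0.1
n_features <- 100
p * n_features  ## top_p with p = .1 on 100 features retains 10, matching top_k with k = 10
## [1] 10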

First, we demonstrate creating a feature_selector with the top_k method, using k=10 and the univariate F-test for feature ranking:

fsel <- feature_selector(method="FTest",cutoff_type = "top_k", cutoff_value = 10)
fsel
## $cutoff_type
## [1] "top_k"
## 
## $cutoff_value
## [1] 10
## 
## attr(,"class")
## [1] "FTest"            "feature_selector" "list"

To apply feature selection to some data, we proceed as follows:

## create a dependent variable.
Y <- factor(rep(letters[1:4], 25))

## create a feature matrix
X <- matrix(rnorm(100*100), 100, 100)

## select_features returns a logical vector, with selected features marked TRUE and all others FALSE
res <- select_features(fsel, X, Y)

## the total number of TRUE values should equal the cutoff_value of 10.
print(sum(res))
## [1] 10
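
The logical vector can then be used directly to reduce the feature matrix, e.g. before fitting a model of one's choosing:

## retain only the selected columns
X_reduced <- X[, res, drop = FALSE]
dim(X_reduced)
## [1] 100  10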

Now we select features using a proportion cutoff via the top_p option.

fsel <- feature_selector(method="FTest",cutoff_type = "top_p", cutoff_value = .1)
res <- select_features(fsel,X,Y)
## selecting features via FTest
## cutoff type top_p
## cutoff value 0.1
## retaining 10 features in matrix with 100 columns
## the total number of TRUE values should equal the cutoff_value of 10.
print(sum(res)/ncol(X))
## [1] 0.1
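
Finally, returning to the point about bias raised above: rMVPA's analysis pipelines perform feature selection within each cross-validation fold automatically, but a hand-rolled sketch of the idea (the fold construction here is illustrative, not rMVPA's API) looks like this:

## illustrative per-fold feature selection; rMVPA pipelines do this for you
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(X)))

for (f in unique(folds)) {
  train <- folds != f
  ## rank and select features on the training fold only,
  ## so the held-out fold never influences the selection
  keep <- select_features(fsel, X[train, , drop = FALSE], Y[train])
  ## a classifier would then be trained on X[train, keep]
  ## and evaluated on X[!train, keep]
}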