Feature Selection
Bradley Buchsbaum
2024-04-23
FeatureSelection.Rmd
Feature Selection
Feature selection is used to select a subset of variables to include
into a classification or regression model. This can be useful when one
has many irrelevant features and one wants to quickly reduce the problem
to a more manageable size. Alternatively, if one wants to impose a limit
on the total number of included features (say, 1000) across a set of
classification tasks (e.g. over a group of subjects) then it can be
useful to use a feature selection before running the analysis. Note that
to avoid bias, feature selection must be incorported into
cross-validation procedure such that feature selection is carried out
for each training fold. This is automatically done in the analysis
pileplines provided in rMVPA
.
Creating a feature_selector
object
To use feature selection in rMVPA
we must construct a
feature_selector
instance which specifies the feature
selection method and the criteria for variable inclusion. There are two
ways of specificying a threshold of cutoff
value for
variable inclusion: the first, called top_k
, selects
features by including the top k most important features
according to a feature importance algorithm; the second, called
top_p
selects feature by including a proportion p
of all variables after ranking them by feature importance. Thus, if
p is .1, we would include the top 10% of all features.
First, we demonstrate setting creatinga feature_selector
using the top_k
method, using k=10, and the univariate
F-test for feature ranking:
fsel <- feature_selector(method="FTest",cutoff_type = "top_k", cutoff_value = 10)
fsel
## $cutoff_type
## [1] "top_k"
##
## $cutoff_value
## [1] 10
##
## attr(,"class")
## [1] "FTest" "feature_selector" "list"
To apply feature selection to some data, we do as follows:
## create a dependent variable.
Y <- factor(rep(letters[1:4], 25))
## create a feature matrix
X <- matrix(rnorm(100*100), 100, 100)
## select_features returns a logical vector with selected features as TRUE and otherwise FALSE
res <- select_features(fsel,X,Y)
## the total number of TRUE values should equal the cutoff_value of 10.
print(sum(res))
## [1] 10
Now we select feature using a proprtion cutoff using the
top_p
option.
fsel <- feature_selector(method="FTest",cutoff_type = "top_p", cutoff_value = .1)
res <- select_features(fsel,X,Y)
## selecting features via FTest
## cutoff type top_p
## cutoff value 0.1
## retaining 10 features in matrix with 100 columns
## [1] 0.1