
The Goal: Unified Dimensionality Reduction

Multivariate data analysis often involves reducing dimensionality or transforming data using techniques like Principal Component Analysis (PCA), Partial Least Squares (PLS), Contrastive PCA (cPCA), Nyström approximation for Kernel PCA, or representing data in a specific basis (e.g., Fourier, splines). While each method has unique mathematical underpinnings, they share common operational needs:

  • Fitting the model to training data.
  • Extracting key components (scores, loadings/coefficients).
  • Projecting new data points into the reduced/transformed space.
  • Reconstructing approximations of the original data from the reduced space.
  • Integrating these steps with pre-processing (like centering or scaling).
  • Comparing or tuning models using cross-validation.

Handling these tasks consistently across different algorithms can lead to repetitive code and complex workflows. The multivarious package aims to simplify this by providing a unified interface centered around the concept of a bi_projector.

The bi_projector: A Two-Way Map

The bi_projector class is the cornerstone of multivarious. It represents a linear transformation (or an approximation thereof) that provides a two-way mapping:

  1. Samples (Rows) ↔︎ Scores: Maps data points from the original high-dimensional space to a lower-dimensional latent space (scores), and potentially back.
  2. Variables (Columns) ↔︎ Components/Loadings: Maps original variables to their representation in the latent space (loadings/components), and potentially back.

Think of it as encapsulating the core results of a dimensionality reduction technique (like the U, S, V components of an SVD, or the scores and loadings of PCA/PLS) along with any necessary pre-processing information.
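For intuition, here is a minimal base-R sketch of that two-way map built directly from svd(), independent of multivarious:

# Base-R sketch of the two-way map: for column-centered X, X = U D V'
set.seed(1)
Xc <- scale(matrix(rnorm(40), nrow = 10, ncol = 4), scale = FALSE)
sv <- svd(Xc)
sample_scores <- sv$u %*% diag(sv$d) # samples (rows) -> latent space
var_loadings  <- sv$v                # variables (columns) -> latent space
max(abs(Xc - sample_scores %*% t(var_loadings))) # ~ 0: the map runs both ways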

Crucially, many functions within multivarious (e.g., pca(), pls(), cPCAplus(), nystrom_approx(), regress()) return objects that inherit from bi_projector.

Key Actions with a bi_projector

Because different methods return a bi_projector, you can perform common tasks using a consistent set of verbs:

  • scores(model): Get the scores (latent space representation) of the training data.
  • coef(model) or loadings(model): Get the loadings or coefficients mapping variables to components.
  • project(model, newdata): Project new samples (rows of newdata) into the latent space defined by the model.
  • reconstruct(model, ...): Reconstruct an approximation of the original data from the latent space (either from training scores or provided new scores/coefficients).
  • truncate(model, ncomp): Reduce the number of components kept in the model.
  • summary(model): Get a concise summary of the model dimensions.

This consistent API simplifies writing generic analysis code and makes it easier to swap between different dimensionality reduction methods.
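As a sketch of what this buys, a small helper (describe_fit is a made-up name for illustration) can be written once against these verbs and reused with any bi_projector-returning fit:

# Sketch: a method-agnostic helper relying only on the shared verbs
describe_fit <- function(model, newdata) {
  list(
    n_components = ncol(scores(model)),     # latent dimensionality
    new_scores   = project(model, newdata)  # new samples in latent space
  )
}
# describe_fit(pca_fit, newdata) and describe_fit(pls_fit, newdata)
# would run unchanged.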

Example: PCA Workflow

Let’s demonstrate a typical workflow using PCA on the classic iris dataset.

# Load iris dataset and select numeric columns
data(iris)
X <- as.matrix(iris[, 1:4])

# 1. Define a pre-processor (center the data)
preproc <- center()

# 2. Fit PCA, keeping 3 components (an SVD-based fit).
#    The pre-processor is applied internally.
fit <- pca(X, ncomp = 3, preproc = preproc)

# The result 'fit' is a bi_projector
print(fit)
#> PCA object  -- derived from SVD
#> 
#> Data: 150 observations x 4 variables
#> Components retained: 3
#> 
#> Variance explained (per component):
#>   Comp 1: 92.95%  (cumulative:  92.95%)
#>   Comp 2:  5.33%  (cumulative:  98.28%)
#>   Comp 3:  1.72%  (cumulative: 100.00%)

# 3. Access results
iris_scores <- scores(fit) # Scores of the centered training data (150 x 3)
iris_loadings <- loadings(fit) # Loadings (4 x 3)
cat("\nDimensions of Scores:", dim(iris_scores), "\n")
#> 
#> Dimensions of Scores: 150 3
cat("Dimensions of Loadings:", dim(iris_loadings), "\n")
#> Dimensions of Loadings: 4 3

# 4. Project new data
# Create some new iris-like samples (5 samples, 4 variables)
set.seed(123)
# rnorm() recycles the length-4 mean/sd vectors; filling with byrow = TRUE
# aligns each draw with its intended column.
new_iris_data <- matrix(rnorm(5 * 4, mean = colMeans(X), sd = apply(X, 2, sd)),
                        nrow = 5, byrow = TRUE)

# Project the new data into the PCA space defined by 'fit'
# Pre-processing (centering using training data means) is applied automatically.
projected_new_scores <- project(fit, new_iris_data)
cat("\nDimensions of Projected New Data Scores:", dim(projected_new_scores), "\n")
#> 
#> Dimensions of Projected New Data Scores: 5 3
print(head(projected_new_scores))
#>            [,1]       [,2]        [,3]
#> [1,] -2.2172144  0.8590909 -0.44924532
#> [2,] -0.3270495 -0.5478369  0.07965279
#> [3,] -1.7602954  0.9106117 -0.52932939
#> [4,]  0.2367242 -0.3204326 -0.50433574
#> [5,] -1.1529598  0.5426518  0.85478044
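
# Sanity check (an illustrative aside, not part of the original workflow):
# projecting the training data should recover the stored scores up to
# numerical error, because the same training-data centering is re-applied.
max(abs(project(fit, X) - scores(fit))) # expect a value near 0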

# 5. Reconstruct approximated original data from scores
# Reconstruct all samples from the retained components
reconstructed_X_approx <- reconstruct(fit, comp = 1:3) # uses scores(fit) by default
cat("\nReconstructed Approximation of Original Data (first rows):\n")
#> 
#> Reconstructed Approximation of Original Data (first rows):
print(head(reconstructed_X_approx))
#>          [,1]     [,2]     [,3]      [,4]
#> [1,] 5.099286 3.500723 1.401086 0.1982949
#> [2,] 4.868758 3.031661 1.447517 0.1253679
#> [3,] 4.693700 3.206384 1.309582 0.1849507
#> [4,] 4.623843 3.075837 1.463736 0.2569583
#> [5,] 5.019326 3.580414 1.370606 0.2461680
#> [6,] 5.407635 3.892262 1.688387 0.4182392

print(head(X)) # Original data for comparison
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]          5.1         3.5          1.4         0.2
#> [2,]          4.9         3.0          1.4         0.2
#> [3,]          4.7         3.2          1.3         0.2
#> [4,]          4.6         3.1          1.5         0.2
#> [5,]          5.0         3.6          1.4         0.2
#> [6,]          5.4         3.9          1.7         0.4

This example shows how fitting (pca), accessing results (scores, loadings), and applying the model to new data (project) follow a consistent pattern, regardless of whether the underlying method was PCA, PLS, or another technique returning a bi_projector.
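
One verb from the list above that the walkthrough did not exercise is truncate(). A brief sketch, reusing the fitted object from above (output not shown):

# Keep only the first two components of the fitted PCA model
fit_2 <- truncate(fit, ncomp = 2)

# 'fit_2' is again a bi_projector, so the same verbs apply
dim(project(fit_2, new_iris_data)) # expected: 5 2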

Beyond Basic Projection: The multivarious Ecosystem

The unified bi_projector interface enables several powerful features within the package:

  • Pre-processing Pipelines: Define reusable pre-processing steps (see vignette("PreProcessing")).
  • Model Composition: Chain multiple bi_projector steps together (e.g., pre-processing → PCA → rotation) into a single composite projector (see vignette("Composing_Projectors")).
  • Cross-Validation: Easily perform cross-validation to select hyperparameters (like the number of components) using helpers that understand the bi_projector structure (see vignette("CrossValidation")).
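
The package's cross-validation helpers are covered in vignette("CrossValidation"). Purely for intuition, here is a hand-rolled sketch (not that API) that sweeps the number of components and tracks in-sample reconstruction error, using only the verbs introduced above and the X matrix from the PCA example:

# Hand-rolled sweep (illustrative only; not the package's CV helpers)
errs <- sapply(1:4, function(k) {
  f <- pca(X, ncomp = k, preproc = center())
  mean((X - reconstruct(f, comp = 1:k))^2) # in-sample reconstruction error
})
round(errs, 4)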

Projecting Variables (project_vars)

While project() operates on new samples (rows), the bi_projector also supports projecting new variables (columns) into the component space defined by the model’s scores (U vectors in SVD terms). This is done using project_vars().

# Using the 'fit' object from the PCA example above

# Create a new variable (column) with the same number of samples as the original data
set.seed(456)
new_variable <- rnorm(nrow(X))

# Project this new variable into the component space defined by the PCA scores (fit$s)
# Result shows how the new variable relates to the principal components.
projected_variable_loadings <- project_vars(fit, new_variable)
#> Warning in sweep(X, 2, cm, "-"): STATS is longer than the extent of
#> 'dim(x)[MARGIN]'
cat("\nProjection of new variable onto components:", projected_variable_loadings, "\n")
#> 
#> Projection of new variable onto components: 0.0003296036 -0.0005098876 0.004254116

Conclusion

The multivarious package provides a consistent and extensible framework for common dimensionality reduction and related linear transformation tasks. By leveraging the bi_projector class, it offers a unified API for fitting models, projecting new data, reconstruction, and accessing key model components. This simplifies workflows, promotes code reuse, and facilitates integration with pre-processing, model composition, and cross-validation tools within the package ecosystem.