An R package for Generalized Principal Component Analysis (GPCA) and related matrix decompositions when data is observed in non-Euclidean inner-product spaces.

What is Generalized PCA?

Standard PCA assumes all observations and variables are equally important and that Euclidean distance is the appropriate similarity measure. However, many real-world datasets violate these assumptions:

  • Weighted observations: Survey data where rows represent different population sizes
  • Variable precision: Measurements with different accuracies or importance
  • Correlated features: Spatial/temporal data with known dependency structures
  • Domain-specific geometry: Functional data, shape analysis, or other specialized metrics

GPCA extends standard PCA by incorporating row and column metrics (M and A) that encode prior knowledge about data structure, following the framework of Allen, Grosenick & Taylor (2014). The package also implements generalized PLS methods building on ideas from Beaton et al. (2016).
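
To make the role of the two metrics concrete, here is a minimal base-R sketch of one standard way to compute a generalized decomposition: whiten the data as M^{1/2} X A^{1/2}, take an ordinary SVD, and map the factors back. It assumes diagonal, strictly positive metrics, and `gpca_whiten` is a hypothetical helper written for illustration; it is not a function of this package and need not match how genpca computes its results.

```r
# Hypothetical helper (not part of genpca): GPCA via explicit whitening,
# assuming diagonal metrics given as strictly positive weight vectors m and a.
gpca_whiten <- function(X, m, a, k) {
  # X_tilde = M^{1/2} X A^{1/2}: scale row i by sqrt(m[i]), column j by sqrt(a[j])
  X_tilde <- sweep(sqrt(m) * X, 2, sqrt(a), `*`)
  s <- svd(X_tilde, nu = k, nv = k)
  list(
    d = s$d[seq_len(k)],
    u = s$u / sqrt(m),  # M^{-1/2} U_tilde, so t(u) %*% M %*% u = I
    v = s$v / sqrt(a)   # A^{-1/2} V_tilde, so t(v) %*% A %*% v = I
  )
}

set.seed(1)
n <- 30; p <- 8; k <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
m <- runif(n, 0.5, 1.5)  # diagonal of the row metric M
a <- runif(p, 0.5, 2.0)  # diagonal of the column metric A
fit <- gpca_whiten(X, m, a, k)

crossprod(fit$u, m * fit$u)  # approximately the k x k identity
crossprod(fit$v, a * fit$v)  # approximately the k x k identity
```

The factors are orthonormal in the M and A inner products rather than the Euclidean one, which is the defining difference from ordinary PCA.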

Key Features

Core Functionality

  • Generalized PCA (genpca): Decomposition with row metric M and column metric A
  • Covariance-based GPCA (genpca_cov): Direct analysis of pre-computed covariance matrices
  • Multiple computational backends:
    • eigen: Direct eigendecomposition for small-to-medium problems
    • spectra: Matrix-free C++ implementation for large/sparse data
    • randomized: Block-sketch approximation for wide low-rank settings
    • deflation: Sequential extraction for memory-constrained scenarios
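
As a rough illustration of what "sequential extraction" means, the sketch below finds one singular triplet at a time by power iteration and subtracts it from the residual before looking for the next. It is a generic textbook scheme in base R with identity metrics; `deflation_svd` and its arguments are hypothetical names, and the package's own deflation backend may differ.

```r
# Hypothetical helper (not part of genpca): rank-one deflation with power iteration.
deflation_svd <- function(X, k, n_iter = 200, tol = 1e-10) {
  R <- X  # residual matrix, deflated after each component
  d <- numeric(k)
  U <- matrix(0, nrow(X), k)
  V <- matrix(0, ncol(X), k)
  for (j in seq_len(k)) {
    v <- rnorm(ncol(X))
    v <- v / sqrt(sum(v^2))
    for (it in seq_len(n_iter)) {
      u <- R %*% v
      u <- u / sqrt(sum(u^2))
      v_new <- crossprod(R, u)
      sigma <- sqrt(sum(v_new^2))
      v_new <- v_new / sigma
      converged <- sqrt(sum((v_new - v)^2)) < tol
      v <- v_new
      if (converged) break
    }
    d[j] <- sigma
    U[, j] <- u
    V[, j] <- v
    R <- R - sigma * (u %*% t(v))  # remove the extracted component
  }
  list(d = d, u = U, v = V)
}

# Test matrix with known, well-separated singular values
set.seed(4)
Q1 <- qr.Q(qr(matrix(rnorm(20 * 5), 20, 5)))
Q2 <- qr.Q(qr(matrix(rnorm(12 * 5), 12, 5)))
X <- Q1 %*% diag(c(10, 5, 2, 1, 0.5)) %*% t(Q2)
fit <- deflation_svd(X, k = 3)
fit$d  # compare with svd(X)$d[1:3]
```

Only the residual and k vectors are held at once, which is why this style suits memory-constrained runs; convergence slows when neighboring singular values are close.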

Advanced Methods

  • Generalized PLS/PLS-SVD (genpls/genplsc): Two-block analysis with metrics
  • Operator-level computations (gplssvd_op): Efficient PLS without materializing whitened matrices
  • Constraint handling: Automatic validation and repair of metric matrices (PSD enforcement)
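
For intuition about PSD enforcement, one common repair is to symmetrize the matrix and clip negative eigenvalues to zero. The sketch below shows that approach in base R; `repair_psd` is a hypothetical name, and the package's own validation and repair logic is not documented here and may work differently.

```r
# Hypothetical helper (not part of genpca): project a square matrix onto the
# PSD cone by symmetrizing and clipping negative eigenvalues to zero.
repair_psd <- function(S) {
  S <- (S + t(S)) / 2                  # enforce symmetry
  e <- eigen(S, symmetric = TRUE)
  vals <- pmax(e$values, 0)            # clip negative eigenvalues
  e$vectors %*% (vals * t(e$vectors))  # V diag(vals) V^T
}

S <- matrix(c(1, 2, 2, 1), 2, 2)  # symmetric with eigenvalues 3 and -1
S_fixed <- repair_psd(S)
eigen(S_fixed, symmetric = TRUE)$values
```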

Integration

  • Full compatibility with the multivarious package ecosystem
  • Unified interface for preprocessing, projection, reconstruction, and transfer learning
  • Support for sparse matrices via the Matrix package

Backend Selection Guide

Use method = "auto" unless you have a strong reason to pin a backend.

Method | Best for | Pros | Cons
eigen | Small/medium problems, exact reference runs | Most stable reference behavior | Builds larger intermediate matrices; for very large sparse constraints may rely on truncated eigensolve (maxeig) and become slower/approximate
spectra | Large diagonal/sparse-friendly problems | Matrix-free iterative solve; lower memory | Accuracy/performance can depend on conditioning and iteration behavior
randomized | Wide (p >> n), sparse-metric, low-rank workloads | Often fastest in wide settings; block GEMM/SpMM path | Approximate by design; tune oversample, n_power, n_polish
deflation | Few components with limited memory | Low memory, component-by-component extraction | Can converge slowly; C++ path currently expects sparse metrics
auto | Default production usage | Chooses among eigen/spectra/randomized heuristically | Heuristics may not be optimal for every hardware/data regime
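
To show what tuning knobs such as oversample and n_power typically control, here is a generic randomized SVD sketch in base R (random range finder plus power iterations, with identity metrics). `rsvd_sketch` is a hypothetical helper; it borrows the parameter names from the table for readability but is an assumption about their meaning, not the package's randomized backend.

```r
# Hypothetical helper (not part of genpca): basic randomized SVD.
rsvd_sketch <- function(X, k, oversample = 10, n_power = 2) {
  l <- min(k + oversample, min(dim(X)))      # sketch width
  Omega <- matrix(rnorm(ncol(X) * l), ncol(X), l)
  Q <- qr.Q(qr(X %*% Omega))                 # orthonormal basis for the sketch
  for (i in seq_len(n_power)) {              # power iterations sharpen the basis
    Z <- qr.Q(qr(crossprod(X, Q)))
    Q <- qr.Q(qr(X %*% Z))
  }
  B <- crossprod(Q, X)                       # small l x p projected problem
  s <- svd(B, nu = k, nv = k)
  list(d = s$d[seq_len(k)], u = Q %*% s$u, v = s$v)
}

# Wide, exactly rank-5 matrix (p >> n)
set.seed(5)
X <- matrix(rnorm(40 * 5), 40, 5) %*% matrix(rnorm(5 * 300), 5, 300)
fit <- rsvd_sketch(X, k = 5)
d_exact <- svd(X)$d[1:5]
max(abs(fit$d - d_exact) / d_exact)  # relative error of the leading singular values
```

A larger oversample widens the sketch and more power iterations help when singular values decay slowly; both trade time for accuracy.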

Installation

# install.packages("devtools")
devtools::install_github("bbuchsbaum/genpca")

You’ll also want these runtime dependencies installed:

install.packages(c("Matrix", "RSpectra", "multivarious"))
# Optional for some utilities / tests
install.packages(c("irlba", "knitr", "rmarkdown"))

Quick Start

Basic Usage

library(genpca)
set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)

# Standard PCA (identity metrics by default)
fit <- genpca(X, ncomp = 5, preproc = multivarious::center())
fit$sdev                             # singular values
head(multivarious::scores(fit))      # scores (n × k)
head(multivarious::components(fit))  # loadings (p × k)

Weighted GPCA

# Example: Survey data with population weights
library(Matrix)
pop_weights <- runif(nrow(X), 0.5, 1.5)  # population sizes
M <- Diagonal(nrow(X), x = pop_weights)  # row metric

# Variable importance weights  
var_importance <- c(rep(2, 10), rep(1, 30), rep(0.5, 10))
A <- Diagonal(ncol(X), x = var_importance)  # column metric

fit_weighted <- genpca(X, M = M, A = A, ncomp = 5, 
                      preproc = multivarious::center())

Covariance-based GPCA

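# When you have a pre-computed weighted cross-product C = X'MX
Xc <- scale(X, center = TRUE, scale = FALSE)  # center columns first, as in fit_weighted
C <- crossprod(Xc, M %*% Xc)
fit_cov <- genpca_cov(C, R = A, ncomp = 5, method = "gmd")
# Should agree with fit_weighted (up to sign) only when C is built from the same centered X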
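
Why a cross-product matrix is enough can be checked in base R: with diagonal metrics, the singular values of the whitened data M^{1/2} X A^{1/2} are the square roots of the eigenvalues of A^{1/2} C A^{1/2}, where C = X'MX. The sketch below verifies that identity numerically; it does not call genpca_cov and makes no claim about that function's internals.

```r
set.seed(3)
n <- 40; p <- 6
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
m <- runif(n, 0.5, 1.5)  # diagonal of M
a <- runif(p, 0.5, 2.0)  # diagonal of A

C <- crossprod(X, m * X)             # X' M X
C_w <- outer(sqrt(a), sqrt(a)) * C   # A^{1/2} C A^{1/2} for diagonal A
d_from_cov <- sqrt(eigen(C_w, symmetric = TRUE)$values)

# Direct route: SVD of the whitened data matrix
d_direct <- svd(sweep(sqrt(m) * X, 2, sqrt(a), `*`))$d

all.equal(d_from_cov, d_direct)
```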

Generalized PLS

# Two-block analysis with canonical PLS
Y <- matrix(rnorm(200 * 20), 200, 20)
pls <- genpls(X, Y, ncomp = 3, 
              preproc_x = multivarious::center(), 
              preproc_y = multivarious::center())
pls$d                    # singular values (strength of each latent X/Y association)
dim(pls$vx); dim(pls$vy) # X/Y weight matrices

Understanding Metrics in GPCA

The metrics M and A define inner products and distances in the observation and variable spaces:

Row Metric M (n × n)

  • Defines relationships between observations
  • Inner product: <x, y>_M = x^T M y, with squared norm ||x||_M^2 = x^T M x
  • Distance: d_M(x,y)^2 = (x−y)^T M (x−y)
  • Common choices:
    • Identity: Standard equal weighting
    • Diagonal: Population/sample weights
    • Precision matrix: Account for observation correlations
    • Kernel matrices: Encode similarity structures
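
The quantities above are simple to compute directly. The following base-R sketch evaluates the M-weighted inner product, norm, and distance for a small diagonal metric; `m_inner` and `m_dist` are hypothetical helper names used only for illustration.

```r
# Hypothetical helpers (not part of genpca)
m_inner <- function(x, y, M) drop(crossprod(x, M %*% y))     # x^T M y
m_dist  <- function(x, y, M) sqrt(m_inner(x - y, x - y, M))  # d_M(x, y)

M <- diag(c(2, 1, 0.5))
x <- c(1, 0, 2)
y <- c(0, 1, 0)

m_inner(x, x, M)     # squared M-norm of x: 2*1 + 1*0 + 0.5*4 = 4
m_dist(x, y, M)      # sqrt(2*1 + 1*1 + 0.5*4) = sqrt(5)
m_dist(x, y, diag(3))  # identity metric gives the Euclidean distance, sqrt(6)
```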

Column Metric A (p × p)

  • Defines relationships between variables
  • Inner product: <v, w>_A = v^T A w, with squared norm ||v||_A^2 = v^T A v
  • Distance: d_A(v,w)^2 = (v−w)^T A (v−w)
  • Common choices:
    • Identity: Standard equal importance
    • Diagonal: Variable weights/importance
    • Covariance/precision: Variable dependencies
    • Graph Laplacian: Spatial/temporal smoothness
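
As an example of a structured column metric, the sketch below builds the graph Laplacian of a simple chain (each variable linked to its neighbor) in base R and shows that the quadratic form v^T L v equals the sum of squared differences between neighbors. `chain_laplacian` is a hypothetical helper. Note that a Laplacian is positive semi-definite but singular (constant vectors have zero norm), so whether it can be passed as A unmodified is an assumption to check against the package documentation.

```r
# Hypothetical helper (not part of genpca): Laplacian of a chain graph on p nodes.
chain_laplacian <- function(p) {
  Adj <- matrix(0, p, p)
  Adj[cbind(1:(p - 1), 2:p)] <- 1  # link each node to its right neighbor
  Adj <- Adj + t(Adj)              # make the adjacency symmetric
  diag(rowSums(Adj)) - Adj         # L = D - Adj
}

L <- chain_laplacian(5)
quad <- function(v, A) drop(crossprod(v, A %*% v))

v_smooth <- c(1, 2, 3, 4, 5)
v_rough  <- c(1, 5, 1, 5, 1)
quad(v_smooth, L)  # 4 edges, each squared difference 1, total 4
quad(v_rough, L)   # 4 edges, each squared difference 16, total 64
```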

When M = I and A = I, GPCA reduces to standard PCA.
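
The identity-metric case can be checked against base R alone: the SVD of the centered data reproduces prcomp. This sketch uses only stats functions and does not call genpca, so it illustrates the underlying identity rather than the package's output format.

```r
set.seed(2)
X <- matrix(rnorm(50 * 6), 50, 6)
Xc <- scale(X, center = TRUE, scale = FALSE)

s  <- svd(Xc)  # identity metrics: plain SVD of the centered data
pc <- prcomp(X, center = TRUE, scale. = FALSE)

# Singular values rescaled by sqrt(n - 1) match prcomp's standard deviations
all.equal(s$d / sqrt(nrow(X) - 1), pc$sdev)

# Loadings agree up to the sign of each column
max(abs(abs(s$v) - abs(unname(pc$rotation))))
```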

Documentation

  • Vignettes
    • genpca: Generalized PCA and Related Decompositions (overview)
    • Generalized PLS‑SVD: Explicit Whitening Reference (operator vs explicit whitening)

Build locally:

devtools::build_vignettes()
browseVignettes("genpca")

Testing and guarantees

  • Eigen vs Spectra: unit tests assert tight agreement on modest problems (sdev within 1e‑6, scores within 1e‑5 up to sign).
  • Deflation vs Eigen: additional tests (n≈60, p≈40, k=8) assert:
    • sdev within 1e‑4,
    • subspace agreement via principal angles,
    • scores/components close after Procrustes alignment (≈ 1e‑3 relative).
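
The comparisons listed above rely on three standard alignment tools, since singular vectors are only defined up to sign (and up to rotation within near-degenerate subspaces). The base-R sketch below shows one way to implement each; the function names are hypothetical and the package's test suite may implement these checks differently.

```r
# Hypothetical helpers (not part of genpca)

# Flip each column of est so it points the same way as the matching column of ref
align_signs <- function(ref, est) {
  sgn <- sign(colSums(ref * est))
  sgn[sgn == 0] <- 1
  sweep(est, 2, sgn, `*`)
}

# Principal angles (radians) between the column spaces of A and B
principal_angles <- function(A, B) {
  s <- svd(crossprod(qr.Q(qr(A)), qr.Q(qr(B))))$d
  acos(pmin(pmax(s, -1), 1))
}

# Orthogonal Procrustes: rotate est to best match ref in the least-squares sense
procrustes_align <- function(ref, est) {
  s <- svd(crossprod(est, ref))
  est %*% s$u %*% t(s$v)
}

set.seed(6)
V <- qr.Q(qr(matrix(rnorm(30), 10, 3)))      # reference basis
flipped <- sweep(V, 2, c(1, -1, -1), `*`)    # same basis with two signs flipped
Rot <- qr.Q(qr(matrix(rnorm(9), 3, 3)))      # random orthogonal 3 x 3 matrix
rotated <- V %*% Rot                         # same subspace, rotated basis

max(abs(align_signs(V, flipped) - V))        # near zero
max(principal_angles(V, rotated))            # near zero
max(abs(procrustes_align(V, rotated) - V))   # near zero
```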

Run tests locally:

library(testthat)
library(pkgload)
pkgload::load_all()
testthat::test_dir("tests/testthat")

References

The methods in this package are based on:

  • Allen, G. I., Grosenick, L., & Taylor, J. (2014). A generalized least-square matrix decomposition. Journal of the American Statistical Association, 109(505), 145-159. doi:10.1080/01621459.2013.852978

  • Beaton, D., ADNI, et al. (2016). Generalized partial least squares: A framework for simultaneously capturing common and individual variation. NeuroImage, 141, 346-363. doi:10.1016/j.neuroimage.2016.07.034

For additional theoretical background on generalized decompositions, see:

  • Beaton, D. (2020). Generalized eigen, singular value, and partial least squares decompositions: The GSVD package. arXiv preprint arXiv:2010.14734.

License

MIT (see LICENSE).

Contributing

Issues and PRs welcome. Please open a ticket with a minimal example, your R session info, and (if relevant) a pointer to the metric matrices that reproduce the behavior.

Albers theme

This package uses the albersdown theme. Existing vignette theme hooks are replaced so albers.css and local albers.js render consistently on CRAN and GitHub Pages. The defaults are configured via params$family and params$preset (family = ‘teal’, preset = ‘homage’). The pkgdown site uses template: { package: albersdown } together with generated pkgdown/extra.css and pkgdown/extra.js so the theme is linked and activated on site pages.