Penalized MFA with Clusterwise Spatial Smoothness Constraints
Source: R/penalized_mfa_clusterwise.R
penalized_mfa_clusterwise.Rd

This function implements a spatially regularized Multiple Factor Analysis for multi-subject neuroimaging or spatial cluster data. Unlike standard MFA, which encourages similarity among block loadings globally, this method uses the spatial coordinates of clusters (e.g., brain regions, spatial locations) to construct a graph and enforces smoothness of loadings among spatially adjacent clusters via a graph Laplacian penalty.
Usage
penalized_mfa_clusterwise(
data_list,
coords_list,
ncomp = 2L,
lambda = 1,
adjacency_opts = list(),
max_iter = 10L,
nsteps_inner = 1L,
learning_rate = 0.01,
optimizer = c("gradient", "adam"),
beta1 = 0.9,
beta2 = 0.999,
adam_epsilon = 1e-08,
tol_obj = 1e-07,
tol_inner = NULL,
verbose = FALSE,
preproc = NULL,
memory_budget_mb = 1024,
normalized_laplacian = TRUE
)
pmfa_cluster(
data_list,
coords_list,
ncomp = 2L,
lambda = 1,
adjacency_opts = list(),
max_iter = 10L,
nsteps_inner = 1L,
learning_rate = 0.01,
optimizer = c("gradient", "adam"),
beta1 = 0.9,
beta2 = 0.999,
adam_epsilon = 1e-08,
tol_obj = 1e-07,
tol_inner = NULL,
verbose = FALSE,
preproc = NULL,
memory_budget_mb = 1024,
normalized_laplacian = TRUE
)Arguments
- data_list
A list of length \(S\) containing numeric matrices. Each element \(\mathbf{X}_s\) is an \(n_s \times k_s\) matrix where \(n_s\) is the number of observations and \(k_s\) is the number of clusters/features for subject \(s\). Data should be column-centered (mean zero per cluster) for proper reconstruction.
- coords_list
A list of length \(S\) containing coordinate matrices. Each element is a \(k_s \times 3\) matrix of (x, y, z) spatial coordinates for the clusters in the corresponding data block. Must have exactly 3 columns.
- ncomp
Integer number of latent components to extract per block (default: 2). When a block has fewer than `ncomp` clusters, the effective rank is automatically reduced to match the number of clusters.
- lambda
Non-negative scalar controlling the spatial smoothness penalty strength (default: 1). Larger values enforce stronger smoothness among spatially adjacent clusters. When \(\lambda = 0\), the method reduces to independent PCA per block. Typical range: 0.1 to 10.
- adjacency_opts
A named list of options passed to the internal `spatial_constraints` function for adjacency matrix construction (default: empty list). Supported options:
- k_nn: Number of nearest neighbors (default: 6)
- max_iter
Maximum number of outer block-coordinate descent iterations (default: 10). More iterations allow better convergence but increase computation time. Typical range: 10 to 100.
- nsteps_inner
Number of gradient update steps per block in each outer iteration (default: 1). Higher values perform more thorough optimization of each block before moving to the next. For stability, starting with 1 is recommended.
- learning_rate
Step size for the optimizer (default: 0.01). Controls the magnitude of updates in the gradient direction. Too large may cause divergence; too small slows convergence. Typical range: 0.001 to 0.1.
- optimizer
Character string specifying the optimization algorithm (default: "gradient"). Options are:
"gradient": Standard gradient descent with fixed learning rate"adam": Adaptive Moment Estimation with momentum and adaptive learning rates
Adam is generally more robust but gradient descent offers more control.
- beta1
Adam hyperparameter for first moment decay (default: 0.9). Controls the exponential decay rate for gradient moving average. Typical range: 0.8 to 0.95. Only used when `optimizer = "adam"`.
- beta2
Adam hyperparameter for second moment decay (default: 0.999). Controls the exponential decay rate for squared gradient moving average. Typical range: 0.99 to 0.9999. Only used when `optimizer = "adam"`.
- adam_epsilon
Small constant added to denominator in Adam for numerical stability (default: 1e-8). Prevents division by zero. Only used when `optimizer = "adam"`.
- tol_obj
Numeric tolerance for outer loop convergence based on relative change in objective function (default: 1e-7). The algorithm stops when \(|f^{(t+1)} - f^{(t)}| / (|f^{(t)}| + \epsilon) < \text{tol_obj}\). Smaller values require more precise convergence. Typical range: 1e-8 to 1e-5.
- tol_inner
Optional numeric tolerance for inner loop convergence based on Frobenius norm of change in loading matrix (default: NULL, no early stopping). When specified, inner loop stops if \(\|\mathbf{V}_s^{\text{new}} - \mathbf{V}_s\|_F < \text{tol_inner}\).
- verbose
Logical indicating whether to print iteration progress (default: FALSE). When TRUE, displays detailed information about objective values, convergence, memory usage, and potential issues using the `cli` package.
- preproc
Preprocessing specification (default: NULL, which forces centering). Can be:
- NULL: Automatically applies centering to each block
- A single `pre_processor` object: Applied independently to each block
- A list of `pre_processor` objects: One per block
**Critical:** Preprocessing must preserve the number of columns to maintain spatial correspondence.
- memory_budget_mb
Numeric specifying the maximum memory in megabytes allocated per block for precomputing \(\mathbf{X}_s^\top \mathbf{X}_s\) (default: 1024). If a block's Gram matrix would exceed this budget (\(k_s^2 \times 8 / 1024^2\) MB), gradients are computed on-the-fly instead. Increase for better speed when memory is plentiful; decrease if memory is constrained.
- normalized_laplacian
Logical indicating whether to use the normalized graph Laplacian (default: TRUE). When TRUE, uses \(\mathbf{L}_{\text{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}\) for scale-free smoothness that is independent of node degree. When FALSE, uses the unnormalized Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{A}\). Normalized is recommended for most applications.
Value
A `multiblock_projector` object of class `"penalized_mfa_clusterwise"` with the following components:
- v: Concatenated loading matrix (total_clusters × ncomp) formed by vertically stacking all block-specific loading matrices. Padded with zeros for variable-rank blocks.
- preproc: A pass-through preprocessor (identity transformation), since this function expects pre-processed input.
- block_indices: Named list indicating which rows of `v` correspond to each block/subject. Each element is an integer vector of row indices.
Additional information stored as named elements (accessible via `$`):
- V_list: List of length S containing the final orthonormal loading matrices for each block. Each element is a \(k_s \times r_s\) matrix where \(r_s = \min(\text{ncomp}, k_s)\).
- Sadj: The (normalized or unnormalized) graph Laplacian matrix used for the spatial smoothness penalty. Sparse matrix of dimension \(\sum_s k_s \times \sum_s k_s\).
- LV: The matrix product \(\mathbf{L} \mathbf{V}\) from the final iteration, useful for computing spatial penalty contributions.
- obj_values: Numeric vector of objective function values at each outer iteration, including the initial value. Length is (iterations_run + 1).
- lambda: The spatial smoothness penalty weight used.
- precompute_info: Logical vector of length S indicating which blocks used precomputed \(\mathbf{X}_s^\top \mathbf{X}_s\) (TRUE) vs. on-the-fly computation (FALSE).
- iterations_run: Integer indicating how many outer iterations were completed before convergence or reaching max_iter.
- ncomp_block: Integer vector of length S containing the effective number of components extracted for each block (may differ from `ncomp` for blocks with fewer clusters).
Details
## Optimization Problem
For \(S\) subjects/blocks with data matrices \(\mathbf{X}_s \in \mathbb{R}^{n_s \times k_s}\) (where \(k_s\) is the number of clusters for subject \(s\)), we estimate orthonormal loading matrices \(\mathbf{V}_s \in \mathbb{R}^{k_s \times r_s}\) by minimizing:
$$ \min_{\{\mathbf{V}_s\}} \sum_{s=1}^S \|\mathbf{X}_s - \mathbf{X}_s \mathbf{V}_s \mathbf{V}_s^\top\|_F^2 + \lambda \,\text{tr}(\mathbf{V}^\top \mathbf{L} \mathbf{V}) $$
where:
- The first term is the reconstruction error (sum of squared residuals)
- \(\mathbf{V}\) is the vertical concatenation of all \(\mathbf{V}_s\)
- \(\mathbf{L} = \mathbf{D} - \mathbf{A}\) is the graph Laplacian
- \(\mathbf{A}\) is the adjacency matrix constructed from spatial coordinates
- \(\mathbf{D}\) is the degree matrix (\(D_{ii} = \sum_j A_{ij}\))
- \(\lambda \geq 0\) controls the spatial smoothness strength
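To make the two terms concrete, here is a minimal sketch (illustrative names, not the package's internal code) that evaluates the objective directly from the formula above, assuming all blocks share a common rank so the loadings stack by rows:

```r
# Sketch: evaluate the penalized objective from its definition.
# data_list: centered X_s matrices; V_list: orthonormal V_s; L: Laplacian
# over all clusters; lambda: penalty weight. All names are illustrative.
pmfa_objective <- function(data_list, V_list, L, lambda) {
  # Reconstruction term: sum_s || X_s - X_s V_s V_s' ||_F^2
  recon <- sum(vapply(seq_along(data_list), function(s) {
    Xs <- data_list[[s]]
    Vs <- V_list[[s]]
    sum((Xs - Xs %*% Vs %*% t(Vs))^2)
  }, numeric(1)))
  # Spatial term: lambda * tr(V' L V), with V the stacked loadings;
  # sum((L %*% V) * V) equals the trace without forming V' L V explicitly
  V <- do.call(rbind, V_list)
  recon + lambda * sum((L %*% V) * V)
}
```

Note that the penalty computation mirrors the `sum(LV * v)` expression shown in Example 9 below.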
## Spatial Smoothness Penalty
The Laplacian penalty \(\text{tr}(\mathbf{V}^\top \mathbf{L} \mathbf{V})\) can be rewritten as: $$ \frac{1}{2} \sum_{i,j} A_{ij} \|\mathbf{v}_i - \mathbf{v}_j\|_2^2 $$
This penalizes the squared Euclidean distance between loadings of adjacent clusters. When two clusters \(i\) and \(j\) are connected (\(A_{ij} = 1\)), their loading vectors \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are encouraged to be similar.
**Normalized vs. Unnormalized Laplacian:**
- Unnormalized (`normalized_laplacian = FALSE`): \(\mathbf{L} = \mathbf{D} - \mathbf{A}\). Smoothness is scaled by node degree; high-degree nodes have stronger smoothness constraints.
- Normalized (`normalized_laplacian = TRUE`, default): \(\mathbf{L}_{\text{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}\). Scale-free smoothness that is independent of node degree. Recommended for most applications.
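The quadratic-form identity above is easy to verify numerically. The following self-contained sketch builds a small random symmetric adjacency matrix and compares both sides:

```r
# Sketch: check tr(V' L V) == 0.5 * sum_ij A_ij * ||v_i - v_j||^2
set.seed(1)
k <- 6
A <- matrix(rbinom(k * k, 1, 0.4), k, k)
A <- 1 * ((A + t(A)) > 0)                    # symmetric 0/1 adjacency
diag(A) <- 0                                 # no self-loops
D <- diag(rowSums(A))
L <- D - A                                   # unnormalized Laplacian
V <- qr.Q(qr(matrix(rnorm(k * 2), k, 2)))    # orthonormal loadings
lhs <- sum(diag(t(V) %*% L %*% V))
rhs <- 0.5 * sum(A * as.matrix(dist(V))^2)   # dist() gives ||v_i - v_j||
all.equal(lhs, rhs)                          # TRUE
```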
## Graph Construction
The spatial adjacency graph is built using k-nearest neighbors (k-NN) in 3D space:
1. Cluster coordinates from all subjects are pooled
2. For each cluster, find its k nearest neighbors (default: k = 6)
3. Create edges between each cluster and its neighbors
4. Symmetrize the adjacency matrix
The number of neighbors can be controlled via `adjacency_opts = list(k_nn = k)`; a base-R version of this construction is sketched below.
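For intuition, here is an illustrative base-R sketch of these steps (not the internal `spatial_constraints` code):

```r
# Sketch: symmetric k-NN adjacency from pooled 3D cluster coordinates.
# k_nn mirrors adjacency_opts$k_nn; all names are illustrative.
knn_adjacency <- function(coords, k_nn = 6) {
  Dmat <- as.matrix(dist(coords))             # pairwise Euclidean distances
  k <- nrow(coords)
  A <- matrix(0, k, k)
  for (i in seq_len(k)) {
    nbrs <- order(Dmat[i, ])[2:(k_nn + 1)]    # skip self at position 1
    A[i, nbrs] <- 1
  }
  1 * ((A + t(A)) > 0)                        # symmetrize: keep edge if
}                                             # present in either direction

coords <- matrix(runif(30 * 3, 0, 10), 30, 3) # pooled (x, y, z) positions
A <- knn_adjacency(coords, k_nn = 6)
```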
## Optimization Algorithm
The method uses block-coordinate descent (BCD) with Riemannian optimization:
1. **Initialization**: Initialize each \(\mathbf{V}_s\) via SVD of \(\mathbf{X}_s\)
2. **Outer loop** (max_iter iterations): Cycle through all subjects/blocks
3. **Inner loop** (nsteps_inner steps per block):
   - Compute reconstruction gradient: \(\nabla_{\mathbf{V}_s} = -2 \mathbf{X}_s^\top \mathbf{X}_s \mathbf{V}_s\)
   - Compute spatial gradient: \(2\lambda \mathbf{L}_{ss} \mathbf{V}_s + 2\lambda \mathbf{L}_{s,-s} \mathbf{V}_{-s}\)
   - Project combined gradient onto the Stiefel manifold tangent space
   - Update via gradient descent or Adam
   - Retract to the manifold via QR decomposition
4. **Convergence check**: Stop when relative change in objective < `tol_obj`

A single projected-gradient step of this kind is sketched below.
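As a hedged illustration of the inner-loop geometry (again, not the package's exact implementation), one gradient step with tangent-space projection and QR retraction could be written as:

```r
# Sketch: one Riemannian gradient step on the Stiefel manifold.
# Xs, Vs, Lss, lambda, learning_rate are illustrative inputs; the
# cross-block term L_{s,-s} V_{-s} is omitted here for brevity.
stiefel_step <- function(Xs, Vs, Lss, lambda, learning_rate) {
  # Euclidean gradient: reconstruction part plus within-block spatial part
  G <- -2 * crossprod(Xs) %*% Vs + 2 * lambda * (Lss %*% Vs)
  # Project onto the tangent space at Vs: G - Vs * sym(Vs' G)
  VtG <- crossprod(Vs, G)
  G_tan <- G - Vs %*% ((VtG + t(VtG)) / 2)
  # Gradient step, then retract to orthonormal columns via QR
  qr.Q(qr(Vs - learning_rate * G_tan))
}
```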
## Memory Management
For large datasets, computing and storing \(\mathbf{X}_s^\top \mathbf{X}_s\) for all blocks may exceed available memory. The `memory_budget_mb` parameter controls this tradeoff:
- **Precomputed mode** (if \(k_s^2 \times 8 / 1024^2 <\) `memory_budget_mb`): Store \(\mathbf{X}_s^\top \mathbf{X}_s\) for faster gradient computation
- **On-the-fly mode** (otherwise): Compute gradients as needed, saving memory at the cost of computation time
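The decision rule is just the size of a dense \(k_s \times k_s\) double-precision matrix; a quick check mirroring the documented formula (the block widths here are made up):

```r
# Sketch: would each block's X'X fit within the per-block budget?
# A dense k_s x k_s double matrix costs k_s^2 * 8 bytes.
memory_budget_mb <- 1024
k_s <- c(2000, 15000, 800)               # illustrative block widths
gram_mb <- k_s^2 * 8 / 1024^2
precompute <- gram_mb < memory_budget_mb # TRUE -> store X_s' X_s
data.frame(k_s, gram_mb = round(gram_mb, 1), precompute)
# compare with res$precompute_info after fitting
```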
## Variable-Rank Loadings
When subjects have different numbers of clusters, each block can have a different effective rank. The algorithm automatically sets \(r_s = \min(\text{ncomp}, k_s)\) for each block and pads the concatenated loading matrix with zeros for consistency.
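A minimal sketch of the padding step (a hypothetical helper, not the package's internal function):

```r
# Sketch: zero-pad a k_s x r_s loading matrix to ncomp columns so blocks
# of different rank can be stacked into one concatenated matrix `v`.
pad_loadings <- function(Vs, ncomp) {
  if (ncol(Vs) < ncomp) {
    Vs <- cbind(Vs, matrix(0, nrow(Vs), ncomp - ncol(Vs)))
  }
  Vs
}
# e.g. a block with k_s = 3 clusters when ncomp = 5 is requested: r_s = 3
pad_loadings(diag(3), ncomp = 5)   # 3 x 5, last two columns all zero
```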
## Preprocessing
Data preprocessing is crucial for reconstruction-based methods. The function defaults to centering each block if no preprocessing is specified. **Important:** Preprocessing must preserve the number of columns (clusters) to maintain correspondence with spatial coordinates.
## When Lambda = 0
When \(\lambda = 0\), the spatial penalty is disabled and the method reduces to independent PCA on each block. This provides a useful baseline for comparison.
## Practical Considerations
- Choosing lambda
Start with \(\lambda = 0\) (no smoothness) and gradually increase. Typical range: 0.1 to 10. Monitor the objective function components (reconstruction vs. smoothness) to balance the tradeoff.
- Number of neighbors
More neighbors (larger k-NN) create a denser graph with stronger smoothness constraints. Default k=6 works well for 3D spatial data.
- Convergence
The method may converge slowly for large \(\lambda\). Increase `max_iter` or adjust `learning_rate` if needed.
- Optimizer choice
Adam is generally more robust and converges faster. Use gradient descent for better control or when Adam's adaptive behavior is undesired.
Engineering Improvements
This implementation includes several optimizations:
- **Mathematical**: Normalized Laplacian default, \(\lambda = 0\) optimization skips
- **Numerical**: Sparse matrix operations, in-block QR retraction, fast orthogonalization
- **Stability**: Variable-rank loadings, robust parameter validation, Adam state management
- **Architecture**: Modular helper functions, comprehensive logging, clear error messages
- **Memory**: Adaptive memory budget, efficient gradient computation strategies
- **API**: Shorter `pmfa_cluster()` alias, consistent naming conventions
Examples
if (FALSE) { # \dontrun{
# Example 1: Basic usage with simulated spatial cluster data
set.seed(123)
S <- 3 # 3 subjects
# Generate random data (purely illustrative; no built-in spatial structure)
data_list <- lapply(1:S, function(s) {
n <- 50 # 50 observations
k <- 20 # 20 clusters
matrix(rnorm(n * k), n, k)
})
# Generate 3D spatial coordinates for clusters
coords_list <- lapply(1:S, function(s) {
matrix(runif(20 * 3, 0, 10), 20, 3) # Random 3D positions
})
# Fit model with spatial smoothness
res <- penalized_mfa_clusterwise(
data_list, coords_list,
ncomp = 3,
lambda = 1,
max_iter = 20,
verbose = TRUE
)
print(res)
# Plot convergence
plot(res$obj_values, type = 'b', xlab = 'Iteration', ylab = 'Objective',
main = 'Convergence of Spatially-Regularized MFA')
# Example 2: Using the shorter alias
res2 <- pmfa_cluster(data_list, coords_list, ncomp = 2, lambda = 0.5)
# Example 3: Compare different lambda values
lambdas <- c(0, 0.1, 0.5, 1, 2, 5)
results <- lapply(lambdas, function(lam) {
fit <- pmfa_cluster(data_list, coords_list, ncomp = 2, lambda = lam,
verbose = FALSE)
list(
lambda = lam,
final_obj = tail(fit$obj_values, 1),
iterations = fit$iterations_run
)
})
# Example 4: Using Adam optimizer for faster convergence
res_adam <- pmfa_cluster(
data_list, coords_list,
ncomp = 3,
lambda = 1,
optimizer = "adam",
learning_rate = 0.05,
max_iter = 30,
verbose = TRUE
)
# Example 5: Controlling k-NN graph construction
res_dense <- pmfa_cluster(
data_list, coords_list,
ncomp = 2,
lambda = 1,
adjacency_opts = list(k_nn = 10), # More neighbors = denser graph
verbose = TRUE
)
# Example 6: Using unnormalized Laplacian
res_unnorm <- pmfa_cluster(
data_list, coords_list,
ncomp = 2,
lambda = 1,
normalized_laplacian = FALSE
)
# Example 7: Memory-constrained settings
# For large datasets, reduce memory budget
res_mem <- pmfa_cluster(
data_list, coords_list,
ncomp = 2,
lambda = 1,
memory_budget_mb = 256, # Limit memory per block
verbose = TRUE
)
# Check which blocks used precomputed gradients
print(res_mem$precompute_info)
# Example 8: Extracting block-specific loadings
V_list <- res$V_list
# Loadings for first subject
V1 <- V_list[[1]]
dim(V1) # k_s x ncomp
# Visualize spatial smoothness of first loading vector
library(scatterplot3d)
scatterplot3d(
coords_list[[1]],
color = rank(V1[, 1]),
main = "Spatial Pattern of First Loading (Subject 1)",
xlab = "X", ylab = "Y", zlab = "Z"
)
# Example 9: Examining the spatial penalty contribution
# Extract Laplacian and loadings
L <- res$Sadj
LV <- res$LV
v <- res$v
# Compute spatial penalty: tr(V' L V)
spatial_penalty <- sum(LV * v)
cat("Spatial penalty term:", spatial_penalty, "\n")
# Example 10: Variable-rank loadings (blocks with different sizes)
# Simulate data with varying cluster numbers
data_var <- list(
matrix(rnorm(50 * 15), 50, 15), # 15 clusters
matrix(rnorm(50 * 20), 50, 20), # 20 clusters
matrix(rnorm(50 * 10), 50, 10) # 10 clusters
)
coords_var <- list(
matrix(runif(15 * 3), 15, 3),
matrix(runif(20 * 3), 20, 3),
matrix(runif(10 * 3), 10, 3)
)
res_var <- pmfa_cluster(
data_var, coords_var,
ncomp = 12, # Capped per block at k_s: effective ranks are 12, 12, 10
lambda = 1,
verbose = TRUE
)
# Check effective ranks per block
print(res_var$ncomp_block)
# Example 11: Custom preprocessing per block
library(multivarious)
preproc_list <- list(
center(), # Just center first block
standardize(), # Center and scale second block
center() # Just center third block
)
res_preproc <- pmfa_cluster(
data_list, coords_list,
ncomp = 2,
lambda = 1,
preproc = preproc_list
)
# Example 12: Lambda selection via cross-validation style approach
# (Simplified - would need proper CV in practice)
lambda_grid <- 10^seq(-1, 1, length.out = 10)
cv_results <- data.frame(
lambda = lambda_grid,
recon_error = NA,
spatial_penalty = NA,
total_obj = NA
)
for (i in seq_along(lambda_grid)) {
fit <- pmfa_cluster(data_list, coords_list, ncomp = 2,
lambda = lambda_grid[i], verbose = FALSE)
cv_results$total_obj[i] <- tail(fit$obj_values, 1)
# Could decompose into reconstruction and spatial components
}
# Plot objective vs lambda
plot(cv_results$lambda, cv_results$total_obj, log = "x",
type = "b", xlab = "Lambda (log scale)", ylab = "Final Objective",
main = "Objective Function vs. Smoothness Penalty")
} # }