Assign pseudolabels based on sparse similarity matrix clustering

Takes a sparse similarity matrix and identifies clusters of highly similar samples. Focuses on finding diverse, high-confidence cluster representatives.

Usage

assign_pseudolabels(
  sim_matrix,
  min_clusters = 10,
  max_clusters = 100,
  sim_threshold = NULL,
  min_cluster_size = 2,
  max_cluster_size = Inf,
  diversity_weight = 0.3,
  adaptive_threshold_quantile = 0.8,
  seed = NULL,
  use_advanced_diversity = TRUE,
  verbose = FALSE
)

Arguments

sim_matrix: Sparse similarity matrix (dgCMatrix or similar). Should be symmetric with values between 0 and 1, where higher values indicate greater similarity.
min_clusters: Minimum number of clusters to find (default: 10)
max_clusters: Maximum number of clusters to find (default: 100)
sim_threshold: Minimum similarity threshold for cluster membership. If NULL (default), the threshold is determined adaptively from the distribution of non-zero similarities (see adaptive_threshold_quantile).
min_cluster_size: Minimum number of samples required to form a cluster (default: 2)
max_cluster_size: Maximum number of samples allowed in a cluster (default: Inf)
diversity_weight: Weight in [0, 1] for promoting diversity among cluster representatives. Controls the trade-off between cluster confidence and diversity: 0 = select the largest/most confident clusters, 1 = maximize diversity (minimize pairwise similarities between representatives), 0.5 = balanced (default: 0.3)
adaptive_threshold_quantile: Quantile of the non-zero similarities to use as the adaptive threshold when sim_threshold is NULL (default: 0.8)
seed: Random seed for reproducibility (default: NULL)
use_advanced_diversity: Whether to use advanced farthest-first traversal for diversity selection on large candidate sets (default: TRUE)
verbose: Whether to print progress information (default: FALSE)
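
Because sim_matrix is expected to be symmetric with values in [0, 1], it can help to check an input before clustering. The following is a minimal sketch, not part of the package: check_sim_matrix is a hypothetical helper, and it assumes a numeric sparse class such as dgCMatrix that stores its values in the x slot.

```r
library(Matrix)

# Hypothetical helper (an assumption, not a package function): reports
# whether a sparse matrix meets the documented expectations for sim_matrix.
check_sim_matrix <- function(sim_matrix) {
  vals <- sim_matrix@x  # stored entries of a numeric sparse matrix
  c(
    square    = nrow(sim_matrix) == ncol(sim_matrix),
    symmetric = isSymmetric(sim_matrix),
    in_range  = length(vals) == 0 || (min(vals) >= 0 && max(vals) <= 1)
  )
}

# Small symmetric example with unit self-similarity
m <- sparseMatrix(
  i = c(1, 2, 1, 2, 3),
  j = c(2, 1, 1, 2, 3),
  x = c(0.5, 0.5, 1, 1, 1),
  dims = c(3, 3)
)
checks <- check_sim_matrix(m)
```

All three entries of checks should be TRUE before the matrix is passed to assign_pseudolabels().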

Value

A list containing:

labels: Factor vector of pseudolabels, with NA for unassigned samples
representatives: Integer vector of row indices serving as cluster representatives
cluster_info: Data frame with cluster statistics including cluster ID, representative index, size, and average similarity
threshold_used: The similarity threshold that was applied
n_clusters: Number of clusters found
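
As an illustration of how the returned list might be used, the sketch below works on a mock object shaped like the documented return value. The values are made up, and cluster_info is left out because its column names are not specified here.

```r
# Mock object shaped like the documented return value; a real one would
# come from assign_pseudolabels(). All values are invented for illustration.
result <- list(
  labels = factor(c("anchor_001", "anchor_001", NA,
                    "anchor_002", "anchor_002", NA)),
  representatives = c(1L, 4L),
  threshold_used = 0.4,
  n_clusters = 2L
)

# Row indices of samples that received a pseudolabel
assigned <- which(!is.na(result$labels))

# Cluster sizes (table() drops NA entries by default)
cluster_sizes <- table(result$labels)

# Pseudolabel carried by each representative
rep_labels <- result$labels[result$representatives]
```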

Details

The algorithm works in several steps:

1. Determine similarity threshold (adaptive or user-specified)
2. Build a graph of high-similarity connections
3. Find connected components as initial clusters
4. Select diverse representatives from clusters
5. Optionally merge small clusters or split large ones
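
The first three steps can be sketched in a few lines of R. This is an illustrative approximation, not the package's implementation: sketch_components is a hypothetical name, the quantile is taken over all stored non-zero entries (an assumption), and connected components are found with a simple min-label propagation rather than a graph library.

```r
library(Matrix)

# Hypothetical sketch of steps 1-3 (threshold, graph, connected components).
sketch_components <- function(sim_matrix, sim_threshold = NULL,
                              adaptive_threshold_quantile = 0.8) {
  n <- nrow(sim_matrix)
  trip <- methods::as(sim_matrix, "TsparseMatrix")  # triplet form, 0-based
  i <- trip@i + 1L
  j <- trip@j + 1L
  x <- trip@x

  # Step 1: adaptive threshold from the non-zero similarities (assumed to
  # include every stored entry)
  if (is.null(sim_threshold)) {
    sim_threshold <- unname(
      stats::quantile(x[x != 0], adaptive_threshold_quantile)
    )
  }

  # Step 2: keep high-similarity edges, ignoring self-similarity
  keep <- x >= sim_threshold & i != j
  ei <- i[keep]
  ej <- j[keep]

  # Step 3: connected components; every node repeatedly adopts the
  # smallest label found among itself and its neighbours
  labels <- seq_len(n)
  changed <- length(ei) > 0
  while (changed) {
    edge_min <- pmin(labels[ei], labels[ej])
    best <- tapply(c(edge_min, edge_min), c(ei, ej), min)
    idx <- as.integer(names(best))
    new_labels <- labels
    new_labels[idx] <- pmin(labels[idx], as.vector(best))
    changed <- !identical(new_labels, labels)
    labels <- new_labels
  }

  list(
    threshold = sim_threshold,
    component = match(labels, unique(labels))  # relabel as 1, 2, 3, ...
  )
}

# Two groups joined only by one weak similarity, plus an isolated sample
i <- c(1, 2, 1, 4, 1)
j <- c(2, 3, 3, 5, 4)
x <- c(0.9, 0.9, 0.9, 0.8, 0.1)
m <- sparseMatrix(i = c(i, j), j = c(j, i), x = c(x, x), dims = c(6, 6))
res <- sketch_components(m, sim_threshold = 0.5)
```

With a threshold of 0.5 the weak 0.1 link is dropped, so samples 1 to 3 and samples 4 to 5 fall into separate components and sample 6 remains on its own. Size filtering (min_cluster_size, max_cluster_size) would then be applied to these components.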

The diversity constraint helps ensure that cluster representatives are spread across the similarity space, making them better anchors for domain alignment.
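
One way to picture the diversity constraint is a greedy farthest-first selection that blends a confidence score with the distance to the representatives already chosen. The sketch below is an assumption about how such a trade-off could look, not the package's scoring rule: select_diverse, the confidence vector, and the linear blend are all hypothetical.

```r
# Hypothetical greedy farthest-first selection. `sim` is a dense similarity
# matrix between candidate representatives, `confidence` a score per
# candidate (for example a normalised cluster size), `k` the number to pick.
select_diverse <- function(sim, confidence, k, diversity_weight = 0.3) {
  stopifnot(nrow(sim) == length(confidence), k <= length(confidence))
  selected <- which.max(confidence)  # start from the most confident candidate
  while (length(selected) < k) {
    # Similarity of each candidate to its closest selected representative
    nearest <- apply(sim[, selected, drop = FALSE], 1, max)
    # Blend confidence with dissimilarity to the current selection
    score <- (1 - diversity_weight) * confidence +
      diversity_weight * (1 - nearest)
    score[selected] <- -Inf  # never pick the same candidate twice
    selected <- c(selected, which.max(score))
  }
  selected
}

# Four candidates: 1 and 2 are near-duplicates, 3 and 4 lie elsewhere
sim <- matrix(c(1.0, 0.9, 0.1, 0.2,
                0.9, 1.0, 0.1, 0.2,
                0.1, 0.1, 1.0, 0.3,
                0.2, 0.2, 0.3, 1.0), nrow = 4, byrow = TRUE)
confidence <- c(1.0, 0.9, 0.5, 0.4)

by_confidence <- select_diverse(sim, confidence, k = 2, diversity_weight = 0)
by_diversity  <- select_diverse(sim, confidence, k = 2, diversity_weight = 1)
```

With diversity_weight = 0 the two most confident candidates (1 and 2) are chosen even though they are near-duplicates; with diversity_weight = 1 the second pick moves to candidate 3, the one least similar to candidate 1.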

Examples

# \donttest{
# Create example sparse similarity matrix
library(Matrix)
n <- 1000
# Generate random sparse matrix with values in [0,1]
sim_matrix <- Matrix::rsparsematrix(n, n, density = 0.05, rand.x = runif)
sim_matrix <- (sim_matrix + Matrix::t(sim_matrix)) / 2  # Make symmetric
Matrix::diag(sim_matrix) <- 1  # Self-similarity = 1

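# Find pseudolabels. Random similarities carry no group structure, so the
# thresholded graph collapses into a single connected component (see output).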
result <- assign_pseudolabels(sim_matrix, min_clusters = 20, verbose = TRUE)
#> Processing 1000 samples with 97248 non-zero similarities
#> Adaptive threshold: 0.4069 ( 0.8 quantile of 97248 similarities)
#> After thresholding: 19450 connections remain
#> Found 1 initial clusters
#> Found 1 valid clusters after size filtering
#> Warning: Final number of representatives (1) is below min_clusters (20). Consider lowering sim_threshold or min_cluster_size.
#> Final result: 1 clusters, 1000 samples assigned
table(result$labels, useNA = "always")
#> 
#> anchor_001       <NA> 
#>       1000          0 
# }
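
A matrix with planted group structure is a more informative input than purely random similarities. The sketch below builds one; the group sizes, similarity range, and argument values are arbitrary choices for illustration, and the call to assign_pseudolabels() is guarded so the block also runs when the package is not loaded. Roughly one cluster per planted group would be expected, though that has not been verified against the package.

```r
library(Matrix)
set.seed(1)

n_groups <- 5
group_size <- 20
n <- n_groups * group_size
group <- rep(seq_len(n_groups), each = group_size)

# All within-group pairs (i < j); no similarity is stored between groups
same_group <- outer(group, group, "==") & upper.tri(diag(n))
pairs <- which(same_group, arr.ind = TRUE)
x <- runif(nrow(pairs), min = 0.7, max = 1)

# Fill both triangles explicitly and set self-similarity to 1
sim_blocks <- sparseMatrix(
  i = c(pairs[, 1], pairs[, 2], seq_len(n)),
  j = c(pairs[, 2], pairs[, 1], seq_len(n)),
  x = c(x, x, rep(1, n)),
  dims = c(n, n)
)

# Only runs when assign_pseudolabels() is available in the session
if (exists("assign_pseudolabels")) {
  result_blocks <- assign_pseudolabels(
    sim_blocks,
    min_clusters = 5,
    sim_threshold = 0.7
  )
}
```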