Provides a pseudolabeling system for unsupervised domain adaptation with KEMA. It identifies high-confidence anchor samples that guide domain alignment.
Details
The pseudolabeling system addresses unsupervised domain adaptation by identifying reliable correspondences between samples from different domains. The approach uses similarity-based clustering, diversity-aware selection, adaptive thresholding, and quality control filtering.
Main functions:
assign_pseudolabels(): General-purpose pseudolabeling from sparse similarity matrices
high_sim_pseudolabels(): Specialized function for multi-domain data using cosine similarity
create_synthetic_similarity_matrix(): Generate synthetic data for testing
evaluate_pseudolabeling(): Evaluate pseudolabeling performance against ground truth
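To make "performance against ground truth" concrete, the sketch below computes two simple summaries in base R: coverage (the share of samples that received a pseudolabel) and purity (the share of labeled samples that belong to the majority true class of their pseudolabel). The helper `pseudolabel_purity_sketch()` and the toy vectors are hypothetical and are not the package's evaluate_pseudolabeling(), which may report different metrics.

```r
# Hypothetical helper for illustration only; not part of the package.
pseudolabel_purity_sketch <- function(pseudo, truth) {
  keep <- !is.na(pseudo)                    # samples that received a pseudolabel
  tab <- table(pseudo[keep], truth[keep])   # pseudolabel x true class counts
  list(
    coverage = mean(keep),                          # fraction of samples labeled
    purity   = sum(apply(tab, 1, max)) / sum(tab)   # majority-class share
  )
}

# Toy data (made up): six samples, one left unlabeled
pseudo <- c("anchor_001", "anchor_001", "anchor_002", "anchor_002", NA, "anchor_001")
truth  <- c("a", "a", "b", "a", "b", "a")

res <- pseudolabel_purity_sketch(pseudo, truth)
res$coverage
res$purity
```

High purity with low coverage usually means the anchors are reliable but few; the reverse suggests the similarity threshold is too permissive.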
Integration with KEMA:
# Generate pseudolabels
plabs <- assign_pseudolabels(similarity_matrix, min_clusters = 20)
# Use with KEMA
fit <- kema.hyperdesign(
  data = strata,
  y = plabs$labels,
  u = 0.8,        # Trust geometry over pseudolabels
  dweight = 0.2,  # Mild class separation
  simfun = function(lab) binary_label_matrix(lab, type = "s"),
  disfun = function(lab) binary_label_matrix(lab, type = "d")
)
Key parameters:
sim_threshold: Controls which similarities are considered "high". Can be adaptive.
diversity_weight: Balances cluster coherence against the diversity of the selected representatives.
min_clusters/max_clusters: Control the number of anchor points.
min_cluster_size: Ensures clusters are large enough to be reliable.
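As an illustration of what an adaptive sim_threshold could look like, the sketch below sets the threshold to a quantile of the nonzero off-diagonal similarities, so that a fixed share of pairs counts as "high" regardless of the scale of the matrix. This is an assumption about one reasonable strategy, not a description of how the package computes its threshold; `adaptive_threshold_sketch()` and its `prob` argument are hypothetical.

```r
library(Matrix)

# Hypothetical helper, not part of the package: choose a threshold as a
# quantile of the positive off-diagonal similarities.
adaptive_threshold_sketch <- function(sim, prob = 0.9) {
  upper <- Matrix::triu(as(sim, "CsparseMatrix"), k = 1)  # strict upper triangle
  vals <- upper@x
  vals <- vals[vals > 0]
  if (length(vals) == 0) stop("no positive off-diagonal similarities")
  unname(stats::quantile(vals, probs = prob))
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
n <- 100
sim <- Matrix::rsparsematrix(n, n, density = 0.1, rand.x = runif)
sim <- (sim + Matrix::t(sim)) / 2
Matrix::diag(sim) <- 1

thr <- adaptive_threshold_sketch(sim, prob = 0.9)
thr
```

If assign_pseudolabels() accepts a numeric sim_threshold, a value computed this way could be passed to it directly; check the function's argument documentation before relying on that.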
Examples
# \donttest{
library(Matrix)
# Create synthetic similarity matrix
n <- 100
sim_matrix <- Matrix::rsparsematrix(n, n, density = 0.1, rand.x = runif)
sim_matrix <- (sim_matrix + Matrix::t(sim_matrix)) / 2
Matrix::diag(sim_matrix) <- 1
# Assign pseudolabels
result <- assign_pseudolabels(sim_matrix, min_clusters = 5)
#> Warning: Final number of representatives (1) is below min_clusters (5). Consider lowering sim_threshold or min_cluster_size.
table(result$labels, useNA = "always")
#>
#> anchor_001       <NA>
#>         98          2
# }
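The warning in the example above can be understood by looking at the similarity values themselves. Symmetrizing with (A + t(A)) / 2 halves every entry that is nonzero in only one direction, so those pairs end up below 0.5, and only pairs that are nonzero in both directions can exceed it. The sketch below checks this with the Matrix package alone. The seed is arbitrary, and the link to the warning is a plausible reading that assumes the default sim_threshold is fairly high, which this page does not state.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
n <- 100
raw <- Matrix::rsparsematrix(n, n, density = 0.1, rand.x = runif)
sim <- (raw + Matrix::t(raw)) / 2

# Positive off-diagonal similarities after symmetrization
off <- Matrix::triu(sim, k = 1)@x
off <- off[off > 0]

summary(off)                  # distribution of pairwise similarities
frac_high <- mean(off > 0.5)  # share of pairs with similarity above 0.5
frac_high
```

When only a small share of pairs is above the threshold, few clusters can form, which is consistent with the advice in the warning to lower sim_threshold or min_cluster_size.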