Getting Started with delarr
delarr team
2026-02-16
Source: vignettes/delarr-getting-started.Rmd

Overview
delarr is a lightweight delayed matrix type for R
designed for datasets that are too large to fit comfortably in memory.
Rather than loading an entire matrix at once, delarr keeps
data on disk (or in any external storage) and streams it through your
analysis pipeline in manageable chunks. The package keeps its surface
area deliberately small—one S3 class plus a handful of verbs—while still
supporting fused transformations, streaming reductions, and pluggable
backends. Operations are lazy: they build up a pipeline of work that
only executes when you call collect(). This lets you chain
multiple transformations without intermediate allocations. This vignette
walks through the essentials: creating delayed arrays, building lazy
pipelines, streaming to and from HDF5 files, and writing custom backends
for your own storage formats.
Installation
delarr is under active development. Install the latest
development version by cloning the repository and loading it with
pkgload or devtools:
# install.packages("pkgload")
pkgload::load_all("/path/to/delarr")

Once installed, load the package:

library(delarr)
A first lazy pipeline
The core constructor delarr() wraps an existing matrix
(or a seed backend). All transformation verbs—d_map(),
d_center(), d_scale(), etc.—return a new
delarr without actually computing anything. The operations
are recorded and fused together, then executed in a single pass when you
call collect().
This lazy evaluation has two major benefits:
- Memory efficiency: Intermediate results are never materialised. A chain of ten transformations uses the same memory as one.
- Streaming: When backed by on-disk storage, data flows through the pipeline in chunks, so you can process datasets larger than RAM.
set.seed(1)
mat <- matrix(rnorm(20), 5, 4)
arr <- delarr(mat)
result <- arr |>
d_center(dim = "rows", na.rm = FALSE) |>
d_map(~ .x * 0.5) |>
d_reduce(mean, dim = "rows") |>
collect()
result
#> [1] -1.561251e-17 6.938894e-18 0.000000e+00 1.387779e-17 0.000000e+00

The pipeline above:
- d_center(dim = "rows"): subtracts each row’s mean (operates across columns within each row)
- d_map(~ .x * 0.5): multiplies every element by 0.5 (elementwise, no row/column orientation)
- d_reduce(mean, dim = "rows"): computes the mean of each row, collapsing to a single column
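The near-zero result makes sense: centring each row zeroes its mean, and halving zeros leaves zeros. The pipeline can be cross-checked in base R (a sketch using the same `mat` and seed as above, independent of delarr):

```r
# Base-R equivalent of the delarr pipeline above:
# centre each row, halve every element, then take row means.
set.seed(1)
mat <- matrix(rnorm(20), 5, 4)

centred <- sweep(mat, 1, rowMeans(mat))  # subtract each row's mean
halved  <- centred * 0.5                 # elementwise scaling
res     <- rowMeans(halved)              # row-wise reduction

# Row-centring removes each row's mean, so the result is zero
# up to floating-point rounding.
all(abs(res) < 1e-12)
```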
Broadcasting and binary operations
Binary operations (d_map2() or arithmetic operators)
remain lazy and support broadcasting of scalars and row/column vectors.
When you add a vector to a matrix, delarr automatically
determines whether it should broadcast across rows or columns based on
the vector length. This makes common operations like subtracting row
means or scaling by column standard deviations simple and
memory-efficient—the bias/scale vectors are applied on-the-fly as each
chunk streams through.
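The length-based orientation rule can be illustrated in base R. The sketch below (the helper `broadcast_sub()` is hypothetical, not part of delarr's API) shows the decision described above: a vector of length nrow broadcasts down the rows, one of length ncol across the columns.

```r
# Sketch of length-based broadcasting (illustrative, not delarr internals):
# a vector matching nrow(m) holds one value per row; a vector matching
# ncol(m) holds one value per column.
broadcast_sub <- function(m, v) {
  if (length(v) == nrow(m)) {
    sweep(m, 1, v)        # subtract v[i] from every element of row i
  } else if (length(v) == ncol(m)) {
    sweep(m, 2, v)        # subtract v[j] from every element of column j
  } else {
    stop("vector length matches neither dimension")
  }
}

m <- matrix(1:6, nrow = 2)     # 2 x 3
broadcast_sub(m, rowMeans(m))  # subtracts row means: each row now sums to 0
```

Note that for square matrices the two lengths coincide, so a production implementation would also need an explicit orientation argument to break the tie.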
Streaming to HDF5
HDF5 is a popular format for storing large numerical arrays on disk.
The delarr_hdf5() function creates a delayed array backed
by an HDF5 dataset, opening the file only when collect()
executes and closing it immediately after.
To write results back to disk without loading the full matrix into
memory, pair your pipeline with hdf5_writer(). This creates
a streaming sink that receives data chunk-by-chunk. The
compression parameter controls gzip compression level (0-9,
with 4 as a good default balance of speed and size).
# Create test data in an HDF5 file
tf_in <- tempfile(fileext = ".h5")
input <- matrix(runif(30), 5, 6)
write_hdf5(input, tf_in, "X")
# Load lazily, transform, and stream to a new file
X <- delarr_hdf5(tf_in, "X")
centred <- X |> d_center("cols")
tf_out <- tempfile(fileext = ".h5")
writer <- hdf5_writer(tf_out, "X_centered", ncol = ncol(centred))
collect(centred, into = writer)
# Verify the result
read_hdf5(tf_out, "X_centered")
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.05913912 -0.301175319 0.38311389 -0.23447141 -0.127991653
#> [2,] 0.44272897 -0.084369772 -0.23615866 -0.01394294 0.372375760
#> [3,] 0.01961660 0.117992784 -0.07069631 0.27382248 -0.120381101
#> [4,] -0.17368323 0.261363598 -0.19736736 -0.40824127 -0.133289660
#> [5,] -0.34780146 0.006188708 0.12110843 0.38283314 0.009286654
#> [,6]
#> [1,] 0.1153051266
#> [2,] 0.0874462614
#> [3,] -0.3869036657
#> [4,] 0.0004274896
#> [5,] 0.1837247880
# Clean up
unlink(c(tf_in, tf_out))

Custom backends with delarr_seed()
The real power of delarr comes from its pluggable
backend system. Any storage layer—databases, cloud storage,
memory-mapped files, or custom binary formats—can become a
delarr backend by implementing a simple contract:
- Dimensions: the total number of rows and columns
- Pull function: given row and column indices, return the corresponding submatrix
The delarr_seed() function wraps these two pieces into a
seed object that delarr() understands. The pull function is
called lazily during collect(), receiving only the indices
needed for the current chunk.
random_backend <- list(
pull = function(rows = NULL, cols = NULL) {
rows <- if (is.null(rows)) seq_len(100) else rows
cols <- if (is.null(cols)) seq_len(50) else cols
matrix(rnorm(length(rows) * length(cols)), length(rows), length(cols))
}
)
seed <- delarr_seed(
nrow = 100,
ncol = 50,
pull = function(rows, cols) random_backend$pull(rows, cols)
)
rand_arr <- delarr(seed)
rand_arr |>
d_map(~ .x^2) |>
d_reduce(mean, dim = "cols") |>
collect()
#> [1] 0.7233715 1.1830402 0.9029285 1.0262304 1.3119029 1.0274333 1.2002124
#> [8] 1.2593355 1.0879169 1.0533032 1.0211869 1.0731659 1.0520222 0.9345241
#> [15] 1.0942047 1.0145886 1.3023160 1.0513637 1.2534753 1.1267768 1.1987305
#> [22] 1.0589216 0.9288278 0.9502513 1.1073139 0.7505165 1.0521073 1.1741118
#> [29] 1.0906651 1.3899871 1.0702889 1.1941535 1.1789617 0.9774898 1.0885509
#> [36] 0.9973437 1.2122340 1.1195303 0.9233984 0.8966970 0.8562727 1.0275834
#> [43] 1.2038752 0.7201439 1.1070323 0.8734731 0.9997249 1.0055618 1.0938935
#> [50] 0.8017712

Available verbs
delarr provides a focused set of transformation verbs:

- d_map() / d_map2(): Elementwise transformations using formula or function syntax
- d_center() / d_scale() / d_zscore(): Common preprocessing operations with row or column orientation
- d_detrend(): Remove linear trends from rows or columns
- d_where(): Masked updates, applying changes only where a condition holds
- d_reduce(): Row-wise or column-wise reductions (sum, mean, min, max, or custom functions)
All verbs return a new delarr, so you can chain them
freely. The operations fuse together and execute in a single streaming
pass.
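The masked-update idea behind d_where() corresponds to a familiar base-R pattern. The sketch below shows the semantics only (it does not use delarr's API): elements where the condition holds are replaced, everything else passes through unchanged.

```r
# Base-R equivalent of a masked update: clamp negatives to zero.
m <- matrix(c(-1, 2, -3, 4), 2, 2)
mask <- m < 0          # condition evaluated elementwise
m[mask] <- 0           # update applied only where the mask is TRUE
m                      # negatives replaced, other elements untouched
```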
How delarr stays efficient
delarr uses several strategies to minimise memory usage and maximise throughput:

Operation fusion. When you
chain multiple transformations, delarr doesn’t create
intermediate matrices. A pipeline like
d_center() |> d_map(~ .x * 2) |> d_scale() records
three operations but executes them in a single pass: each chunk is
centred, transformed, and scaled in one traversal before the next chunk
is read.
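Fusion can be pictured as function composition. The sketch below is illustrative, not delarr's actual internals: each recorded elementwise op is a closure, and the fused pipeline applies all of them to a chunk in one pass, so intermediates exist only at chunk granularity, never for the full matrix.

```r
# Sketch of operation fusion as closure composition (not delarr internals).
ops <- list(
  function(x) x - 1,    # like d_map(~ .x - 1)
  function(x) x * 0.5,  # like d_map(~ .x * 0.5)
  function(x) x^2       # like d_map(~ .x^2)
)

# Apply every recorded op to one chunk with direct function calls,
# mirroring the tight loop described below.
fused <- function(chunk) {
  for (f in ops) chunk <- f(chunk)
  chunk
}

fused(c(1, 2, 3))  # ((x - 1) * 0.5)^2 applied elementwise
```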
Chunked streaming. Data flows through the pipeline
in column chunks (you control the size via chunk_size in
collect()). Only one chunk lives in memory at a time, so a
10 GB matrix can be processed with just a few hundred megabytes of
RAM.
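The explicit chunking loop that collect() spares you looks roughly like this in base R (a sketch; delarr's real loop also threads fused ops and streaming sinks through it):

```r
# Manual column-chunked processing: compute column sums while holding
# only chunk_size columns in memory at a time.
col_sums_chunked <- function(m, chunk_size = 2) {
  out <- numeric(ncol(m))
  for (start in seq(1, ncol(m), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, ncol(m))
    chunk <- m[, idx, drop = FALSE]   # only this slice is "live"
    out[idx] <- colSums(chunk)
  }
  out
}

m <- matrix(1:12, 3, 4)
col_sums_chunked(m)   # same result as colSums(m)
```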
Lazy file access. On-disk backends like
delarr_hdf5() and delarr_mmap() open files
only during collect() and read only the columns needed for
the current chunk. The file handle is closed immediately after, avoiding
resource leaks.
Minimal dispatch overhead. delarr uses
simple S3 classes and closures rather than S4 machinery. The core
execution loop is a tight for over column chunks with
direct function calls—no method lookup per element.
Streaming reductions. Operations like
d_reduce(mean, dim = "cols") compute running accumulators
as chunks flow through, never materialising the full matrix. The final
result requires only O(ncol) or O(nrow) memory, not O(nrow × ncol).
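A streaming reduction can be sketched in base R with running accumulators (illustrative only, not delarr's code). Here a row-wise mean is built from column chunks using just O(nrow) state: a running sum per row plus a column count.

```r
# Streaming row means: accumulate per-row sums and a column count as
# column chunks arrive; the full matrix is never needed at once.
m <- matrix(1:12, 3, 4)
acc   <- numeric(nrow(m))   # running row sums, O(nrow) memory
count <- 0                  # columns seen so far
for (start in seq(1, ncol(m), by = 2)) {
  idx   <- start:min(start + 1, ncol(m))
  chunk <- m[, idx, drop = FALSE]
  acc   <- acc + rowSums(chunk)
  count <- count + length(idx)
}
acc / count   # matches rowMeans(m)
```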
When to use delarr
delarr is a good fit when:
- Your data is larger than available RAM
- You need to apply the same pipeline to many datasets
- You want streaming I/O without writing explicit chunking loops
- You prefer a simple, pipe-friendly API over S4 complexity
For sparse matrices, GPU acceleration, or parallel distributed
computing, consider specialised packages. delarr focuses on
doing one thing well: efficient lazy operations on dense, out-of-core
matrices.