SLURM script submission and monitoring from R
Overview
parade provides a comprehensive suite of tools to submit any R script to SLURM and monitor it interactively from within R—no shell access required. This approach gives you the full power of SLURM job scheduling while maintaining the convenience of R-based workflow management.
Key monitoring capabilities
- Live resource monitoring: Real-time CPU usage, memory consumption, and job status
- Interactive dashboards: Single-job (script_top()) and multi-job (jobs_top()) monitoring
- Log streaming: View live output from running jobs (script_tail())
- Status checking: Quick job state queries (script_status())
- Job management: Cancel, wait for, and track multiple jobs
- Error handling: Automatic detection of failed jobs and error diagnostics
Quick start example
library(parade)
# Recommended (HPC): one-command setup
parade_init_hpc(persist = TRUE)
# Manual alternative:
# paths_init(profile = "hpc")
# parade_doctor(create = TRUE)
# Optional: Configure site defaults
# Note: mem=NA means use cluster default memory allocation
slurm_defaults_set(partition="general", time="2h", cpus_per_task=16, mem=NA, persist=TRUE) # Saves your preferred SLURM resources for future submissions
slurm_template_set("registry://templates/parade-slurm.tmpl") # Points parade at the batchtools template stored under the registry alias
# Submit a script file
job <- submit_slurm("scripts/train.R", args = c("--fold", "1"))
# Or submit a function directly
job <- slurm_call(
function(fold) {
# Your training code here
message("Training fold ", fold)
model <- train_model(fold)
return(model)
},
fold = 1,
name = "train-fold-1",
write_result = "artifacts://models/fold1.rds"
)
# Quick status check
script_status(job)
# Unified dashboard (summary; use action="top" for interactive monitoring)
parade_dashboard(job)
# View recent log output
script_tail(job, 80)
# Launch interactive monitor
script_top(job, refresh = 2, nlog = 40)
Uniform verbs with function submissions
A jobset is a collection of one or more jobs that
can be managed together. By default, slurm_call() returns a
single job object, but you can request a jobset (a one-element
collection) to use the same workflow verbs that work with multiple
jobs:
# Submit a single function job, returned as a one-element jobset
jobs <- slurm_call(
function(file) {
# Pretend to do useful work
message("Processing ", basename(file))
Sys.sleep(1)
read.csv(file)[1:5, ]
},
file = "data/example.csv",
name = "proc-example",
write_result = path$artifacts("results/{run}/{stem}.rds"),
.as_jobset = TRUE # Return as jobset instead of single job
)
# Now you can use jobset verbs:
# - progress(): Show a progress bar while jobs run
# - collect(): Wait for completion and retrieve results
# - status(): Check job states
# - cancel(): Cancel running jobs
jobs |> progress() |> collect()
# Open logs if needed (no-ops for local engine)
open_logs(jobs, selection = "all")
For multiple inputs, the pattern is identical with slurm_map():
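The snippet below is a minimal sketch, assuming slurm_map() takes the vector of inputs followed by the function to apply and returns a jobset (check ?slurm_map for the exact signature in your installed version):
files <- c("data/a.csv", "data/b.csv", "data/c.csv")   # hypothetical input files
jobs <- slurm_map(
  files,
  function(file) {
    message("Processing ", basename(file))
    read.csv(file)[1:5, ]
  }
)
# The same jobset verbs apply
jobs |> progress() |> collect()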
Core monitoring functions
Single job monitoring: script_top()
The script_top() function provides a real-time,
interactive dashboard for monitoring a single SLURM job:
job <- submit_slurm("analysis.R")
script_top(job, refresh = 2, nlog = 30, clear = TRUE)
Features displayed:
- Job identification: Name, SLURM job ID, assigned node
- Resource usage: CPU percentage with visual progress bar, allocated vs. used CPUs
- Memory statistics: Average and maximum RSS (Resident Set Size), virtual memory usage
- Timing information: Elapsed time, CPU time used, uptime since monitoring started
- Live log output: Most recent log lines from the job (configurable number)
- Status tracking: Automatically detects when jobs complete or fail
Parameters:
- refresh: Update interval in seconds (default: 2)
- nlog: Number of recent log lines to display (default: 30)
- clear: Whether to clear the screen between updates for smoother display (default: TRUE)
Multi-job dashboard: jobs_top()
Monitor multiple jobs simultaneously with a tabular overview plus detailed logs from running jobs:
# Submit multiple jobs
job1 <- submit_slurm("preprocess.R", args = c("--dataset", "A"))
job2 <- submit_slurm("preprocess.R", args = c("--dataset", "B"))
job3 <- submit_slurm("model_train.R")
# Monitor all jobs together
jobs_top(list(job1, job2, job3), refresh = 3, nlog = 20)
Display format:
- Summary line: Count of jobs in each state (PENDING=1, RUNNING=2, etc.)
- Job table: Compact view with name, job ID, state, CPU%, allocated CPUs, max memory, elapsed time, node
- Live log tail: Recent output from the first running job
Flexible input formats:
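The variants below are an assumption based on the examples elsewhere on this page (a plain list of job objects and a jobset); consult ?jobs_top to confirm which input types your version accepts:
# A plain list of job objects (as in the example above)
jobs_top(list(job1, job2, job3))
# A jobset returned by slurm_call(.as_jobset = TRUE) or slurm_map() (assumed to be accepted directly)
jobs_top(jobs)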
Essential job management functions
Status checking: script_status()
Get current job state without launching a full monitor:
status <- script_status(job)
print(status)
# # A tibble: 1 × 5
# pending started running done error
# <int> <int> <int> <int> <int>
# 0 0 1 0 0
# Detailed view includes full batchtools information
detailed <- script_status(job, detail = TRUE)
Log viewing: script_tail()
Display recent log output from a job:
# Show last 50 lines
script_tail(job, n = 50)
# Quick check of recent output
script_tail(job) # Default: 200 lines
Resource metrics: script_metrics()
Get detailed resource usage statistics:
metrics <- script_metrics(job)
print(metrics)
# $job_id
# [1] "12345"
#
# $state
# [1] "RUNNING"
#
# $cpu_pct
# [1] 87.3
#
# $max_rss
# [1] 1024000000 # bytes
Job completion: script_done()
Check if a job has finished (successfully or with errors):
if (script_done(job)) {
cat("Job completed!\n")
# Process results...
} else {
cat("Job still running...\n")
}
Advanced job management
Waiting for completion: script_await()
Block execution until job completes:
# Wait indefinitely
script_await(job)
# Wait with timeout (5 minutes)
script_await(job, timeout = 300)
# Custom polling interval
script_await(job, timeout = 600, poll = 30) # Check every 30 seconds
Finding recent jobs: script_find_latest()
Locate recently submitted jobs when you don’t have the job object:
# Find 5 most recent jobs
recent <- script_find_latest(n = 5)
print(recent)
# Load a job from its registry path
job <- script_load(recent$registry[1])
Function submission with slurm_call()
New in parade 0.12.0, slurm_call() allows you to submit
R functions directly to SLURM without creating script files. This is
ideal for interactive development, parameter sweeps, and functional
programming workflows.
Basic function submission
# Submit a simple function
job <- slurm_call(
function(x, y) {
result <- x^2 + y^2
message("Computed: ", result)
return(result)
},
x = 3,
y = 4,
name = "pythagorean"
)
# Monitor just like any other job
script_top(job)
Resource profiles
Parade 0.12.0 introduces resource profiles for easier resource management:
# Use built-in profiles
job <- slurm_call(my_function, x = 1, resources = "gpu")
job <- slurm_call(my_function, x = 1, resources = "highmem")
# Create custom profiles with chaining
my_profile <- profile() %>%
res_time("8:00:00") %>%
mem("32G") %>%
cpus(16) %>%
partition("compute")
job <- slurm_call(my_function, x = 1, resources = my_profile)
# Register profiles for reuse
profile_register("ml_training",
profile() %>%
res_time("24:00:00") %>%
mem("64G") %>%
cpus(32) %>%
gpus(2)
)
# Use registered profile by name
job <- slurm_call(train_model, data = data, resources = "ml_training")
Saving results to artifacts
Use write_result to persist function output:
# Results are automatically saved to the specified path
job <- slurm_call(
function(size) {
matrix(runif(size * size), nrow = size)
},
size = 1000,
write_result = "artifacts://matrices/random_1000.rds",
resources = list(mem = "8G")
)
# Wait for job to complete
script_await(job)
# Check job status and load result safely
status <- script_status(job, detail = TRUE)
if (status$done == 1 && status$error == 0) {
# Job succeeded - load the result
if (!is.null(job$result_path) && file.exists(job$result_path)) {
mat <- readRDS(job$result_path)
dim(mat) # [1] 1000 1000
} else {
warning("Result file not found at expected path")
}
} else {
# Job failed - check logs for errors
warning("Job failed - check logs with script_tail(job)")
}
Parameter sweeps
Combine with mapply() (or lapply()) for parallel exploration of a parameter grid:
# Submit multiple function calls with different parameters
parameters <- expand.grid(
alpha = c(0.01, 0.1, 1.0),
beta = c(0.5, 1.0, 2.0)
)
jobs <- mapply(function(a, b) {
slurm_call(
function(alpha, beta) {
# Your model fitting code here
result <- fit_model(alpha, beta)
return(list(alpha = alpha, beta = beta, score = result$score))
},
alpha = a,
beta = b,
name = sprintf("model-a%.2f-b%.2f", a, b),
write_result = sprintf("artifacts://models/fit_a%.2f_b%.2f.rds", a, b)
)
}, parameters$alpha, parameters$beta, SIMPLIFY = FALSE)
# Monitor all parameter sweep jobs
jobs_top(jobs)
Closures and captured variables
slurm_call() serializes the function’s environment, so
closures work naturally:
# Configuration captured in closure
config <- list(
iterations = 1000,
tolerance = 1e-6,
method = "newton"
)
optimizer <- function(data) {
# config is available here
optimize_with_config(data, config)
}
job <- slurm_call(
optimizer,
data = my_dataset,
name = "optimization",
resources = list(time = "4:00:00")
)
Important considerations
- Serialization size: The function and its environment are serialized with saveRDS(). Large captured objects increase overhead (see the sketch after this list).
- Working directory: Functions execute in a temporary staging directory. Use parade’s path system for data access:
slurm_call(
  function() {
    data <- readRDS(resolve_path("data://input.rds"))
    result <- process(data)
    saveRDS(result, resolve_path("artifacts://output.rds"))
  }
)
- Comparison with submit_slurm():
  - Use submit_slurm() for: existing scripts, complex workflows, shell integration
  - Use slurm_call() for: interactive development, parameter sweeps, functional pipelines
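As noted under serialization size, one way to keep the serialized payload small is to pass a lightweight path argument and load the large object inside the function; a minimal sketch reusing the resolve_path() pattern above (the data:// path is hypothetical):
job <- slurm_call(
  function(input) {
    data <- readRDS(resolve_path(input))   # load the large object on the compute node
    summary(data)
  },
  input = "data://large_input.rds",        # pass a small string, not the data itself
  name = "summarize-input"
)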
Practical monitoring scenarios
Scenario 1: Long-running training job
# Submit training job with generous time limit
job <- submit_slurm("train_model.R",
resources = list(time = "24:00:00", mem = "32G"))
# Quick status check
if (script_status(job)$running > 0) {
cat("Training started successfully\n")
# Monitor for a few minutes, then leave it running
script_top(job, refresh = 5, nlog = 20)
} else {
cat("Job may be queued or failed\n")
script_tail(job, 100) # Check for error messages
}
Scenario 2: Batch processing pipeline
# Submit preprocessing jobs for multiple datasets
datasets <- c("dataset_A", "dataset_B", "dataset_C")
prep_jobs <- lapply(datasets, function(d) {
submit_slurm("preprocess.R", args = c("--input", d))
})
# Monitor all preprocessing
jobs_top(prep_jobs, refresh = 5)
# Wait for all to complete
lapply(prep_jobs, script_await)
# Submit analysis job that depends on preprocessing
analysis_job <- submit_slurm("analyze_results.R")
script_top(analysis_job)
Scenario 3: Troubleshooting failed jobs
job <- submit_slurm("problematic_script.R")
# Check if job completed
if (script_done(job)) {
status <- script_status(job)
if (status$error > 0) {
cat("Job failed! Checking logs...\n")
# View full log output for debugging
script_tail(job, n = 500)
# Get log file paths for detailed analysis
logs <- script_logs(job)
cat("Log files:", logs$path, "\n")
} else {
cat("Job completed successfully\n")
}
}
Tips for efficient monitoring
Resource optimization
- Monitor CPU usage: Jobs using lower-than-expected CPU% may be hitting I/O bottlenecks
- Track memory patterns: MaxRSS shows peak memory usage; compare it against the requested memory (see the sketch below)
- Watch for memory leaks: Steadily increasing AveRSS over time may indicate memory management issues
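For example, to act on the first two points, you can compare observed peak memory against the requested allocation; a minimal sketch assuming script_metrics() reports max_rss in bytes (as shown earlier) and that the job requested mem = "32G":
m <- script_metrics(job)
requested_gb <- 32                       # whatever was passed via resources = list(mem = "32G")
used_gb <- m$max_rss / 1024^3            # max_rss is reported in bytes
cat(sprintf("Peak memory: %.1f of %d GB requested (%.0f%%)\n",
            used_gb, requested_gb, 100 * used_gb / requested_gb))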
Interactive monitoring best practices
- Use appropriate refresh rates: Fast updates (1-2s) for active debugging, slower (5-10s) for long jobs
- Adjust log lines: More lines (nlog = 100) for detailed debugging, fewer (nlog = 10) for an overview
- Background monitoring: Use clear = FALSE when capturing output or running non-interactively (see the sketch below)
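For non-interactive use, a simple polling loop built from the status and log helpers documented above can stand in for the interactive monitor; a minimal sketch:
# Poll every 60 seconds until the job finishes, logging a brief status line each time
while (!script_done(job)) {
  s <- script_status(job)
  cat(format(Sys.time()), "- pending:", s$pending, "running:", s$running, "\n")
  Sys.sleep(60)
}
# Show the final log lines once the job has finished
script_tail(job, n = 50)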
Multi-job management
- Group related jobs: Monitor job families together with jobs_top()
- Stagger job submission: Avoid overwhelming the scheduler with simultaneous submissions
- Use descriptive names: Job names appear in monitoring displays; make them informative
Error handling and troubleshooting
Common monitoring issues
“Cannot fetch metrics” error:
- Ensure SLURM commands (squeue, sstat, sacct) are available
- Check that the job ID is valid and the job hasn’t been purged from SLURM records
- Verify SLURM permissions and cluster connectivity
Empty log output:
- Job may not have started writing output yet
- Check the job status; it may still be pending in the queue
- Verify output redirection in the SLURM template
Memory metrics showing NA:
- Some metrics are unavailable until the job starts running
- SLURM accounting may not be enabled on the cluster
- Try script_metrics() directly to see the raw data
Recovery strategies
Lost job objects:
# Find recent jobs
recent <- script_find_latest(pattern = "train")
job <- script_load(recent$registry[1])
Monitor jobs from different R sessions:
# Jobs persist across R sessions via registry
job_path <- "registry://script-abc123"
job <- script_load(job_path)
script_top(job)
This comprehensive monitoring system makes SLURM job management as convenient as local R execution while providing the scalability and resource management benefits of cluster computing.