The YAML file is the durable contract for the non-GUI pipeline: it is the single specification shared by prepare_firstlevel(), prepare_pls(), run_pls(), the plscli command line, and the Shiny import/export path.

This article is a reference for that contract. It focuses on structure, required fields, and common variants. For the scripted workflow itself, see vignette("scripted-workflows").

The intended starting point is a scaffold, not a blank file:

plscli template --out study.yml

What sections does the spec contain?

The scaffold written by write_pipeline_template() gives the intended shape.

required_sections
#> [1] "dataset"     "design"      "first_level" "pls"         "execution"  
#> [6] "outputs"

Those sections have distinct roles:

  • dataset: where the BIDS data live and how subjects/tasks are selected
  • design: how first-level regressors are built
  • first_level: what first-level outputs to write
  • pls: how those outputs are mapped into a PLS analysis
  • execution: local versus array execution settings
  • outputs: where artifacts are written
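
Putting those six roles together, a full-shape skeleton looks like the following. The field values are illustrative placeholders assembled from the examples in this article, not a validated spec:

```yaml
dataset:
  bids_dir: /path/to/bids
  task: stroop
design:
  formula: onset ~ hrf(condition, basis = 'spmg1')
first_level:
  output:
    type: estimates
    statistics: [estimate]
pls:
  method: task
  nperm: 1000
  nboot: 500
execution:
  mode: array
  parallelism: 8
outputs:
  root: plscli-out
```

Each section is covered in detail below.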

What is the smallest valid spec?

The minimum viable spec is deliberately small. You need:

  • a BIDS directory
  • at least one task label
  • a first-level design formula
  • a PLS method
  • an output root

cat(as.yaml(minimal_spec))
#> dataset:
#>   bids_dir: /tmp/RtmpKpXFeS/plsrri-yaml-2900d81dad9/bids
#>   task: stroop
#> design:
#>   formula: onset ~ hrf(condition, basis = 'spmg1')
#> first_level:
#>   output:
#>     type: estimates
#>     statistics: estimate
#> pls:
#>   method: task
#>   nperm: 0
#>   nboot: 0
#> outputs:
#>   root: /tmp/RtmpKpXFeS/plsrri-yaml-2900d81dad9/out

That small object already validates and picks up defaults for execution.mode, execution.parallelism, first_level.strategy, and pls.input.
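
As a sketch of how small that object is, the same structure can be built as a nested list in R and round-tripped through the yaml package. The paths are placeholders, and this uses only the yaml package, not plsrri's own readers:

```r
library(yaml)

# Minimal spec as a nested list, mirroring the YAML above.
# All field values are illustrative placeholders.
minimal_spec <- list(
  dataset = list(bids_dir = "/path/to/bids", task = "stroop"),
  design = list(formula = "onset ~ hrf(condition, basis = 'spmg1')"),
  first_level = list(output = list(type = "estimates", statistics = "estimate")),
  pls = list(method = "task", nperm = 0, nboot = 0),
  outputs = list(root = "/path/to/out")
)

# Round-trip through YAML text; the section names survive unchanged.
parsed <- yaml.load(as.yaml(minimal_spec))
names(parsed)
#> [1] "dataset"     "design"      "first_level" "pls"         "outputs"
```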

How should you read the top-level sections?

dataset

dataset answers: what study is this, and which task/run space should be used?

Common fields:

  • bids_dir
  • task
  • space
  • group_column
  • optional subject/session/run filters

dataset:
  bids_dir: /path/to/bids
  task: stroop
  space: MNI152NLin2009cAsym
  group_column: group

design

design defines the first-level model, not the PLS contrast. The key field is formula.

design:
  formula: onset ~ hrf(condition, basis = 'spmg1')
  block: ~ run

That formula is where you choose single-df HRFs versus basis expansions such as FIR or tent-style models.
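
For example, a FIR expansion (using the same argument names as the basis-expanded example later in this article) might look like:

```yaml
design:
  formula: onset ~ hrf(condition, basis = 'fir', K = 4)
  block: ~ run
```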

first_level

first_level controls how the GLM stage is run and what maps are written.

first_level:
  strategy: runwise
  nchunks: 1
  output:
    type: estimates
    statistics: [estimate]

The important choice is the output type:

  • type = estimates: write condition-level beta-like maps
  • type = contrasts: write named contrast maps
  • type = F: write F-statistic outputs
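
As an illustration, switching to contrast maps only changes the output block. Note that any fields naming the specific contrasts are not shown here and would follow the package's own conventions:

```yaml
first_level:
  output:
    type: contrasts
```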

pls

pls tells the second stage how to interpret first-level outputs.

pls:
  method: task
  input:
    type: estimates
    statistic: estimate
  nperm: 1000
  nboot: 500

The method field maps onto supported plsrri methods such as:

  • task
  • task_nonrotated
  • behavior
  • behavior_nonrotated
  • multiblock
  • multiblock_nonrotated
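
Switching between methods is a one-line change in the spec. Note that behavior-based methods also need behavioral data wired in, which is not shown in this fragment:

```yaml
pls:
  method: behavior
  nperm: 1000
  nboot: 500
```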

execution

execution controls how staged commands are scheduled, not the statistical analysis itself.

execution:
  mode: array
  parallelism: 8
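
For a single-machine run, the array settings reduce to something like the following. The mode value here is an assumption based on the local-versus-array distinction described above:

```yaml
execution:
  mode: local
  parallelism: 1
```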

outputs

outputs.root is the artifact root that the CLI, scripted R API, report layer, and Shiny attach mode all reuse.

outputs:
  root: plscli-out

How do basis-expanded first-level outputs fit into the spec?

The YAML needs two pieces of information when first-level output labels encode basis functions such as FIR bins or tent functions:

  1. how first-level labels are written
  2. how PLS should fold those labels back into a basis-aware manifest

cat(as.yaml(basis_spec))
#> dataset:
#>   bids_dir: /tmp/RtmpKpXFeS/plsrri-yaml-2900d81dad9/bids
#>   task: stroop
#> design:
#>   formula: onset ~ hrf(condition, basis = 'fir', K = 4)
#> first_level:
#>   output:
#>     type: estimates
#>     statistics: estimate
#>     basis_pattern: ^(.*)_bin([0-9]+)$
#>     basis_order:
#>     - bin1
#>     - bin2
#>     - bin3
#>     - bin4
#> pls:
#>   method: task
#>   input:
#>     type: estimates
#>     statistic: estimate
#>     basis_pattern: ^(.*)_bin([0-9]+)$
#>     condition_group: 1
#>     basis_group: 2
#>     basis_order:
#>     - '1'
#>     - '2'
#>     - '3'
#>     - '4'
#>   nperm: 0
#>   nboot: 0
#> outputs:
#>   root: /tmp/RtmpKpXFeS/plsrri-yaml-2900d81dad9/fir-out

The important point is that basis handling belongs in both places:

  • design.formula determines what the first-level model estimates
  • pls.input.* tells the PLS stage how to reinterpret those basis-labelled maps
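
To make the pattern fields concrete, the following base-R sketch shows what a basis_pattern like the one above expresses: capture group 1 (condition_group: 1) is the condition, capture group 2 (basis_group: 2) is the basis index. This is an illustration only, not plsrri's implementation; the labels are hypothetical:

```r
# Hypothetical first-level output labels that encode FIR bins.
labels  <- c("congruent_bin1", "congruent_bin2", "incongruent_bin1")
pattern <- "^(.*)_bin([0-9]+)$"

# regexec() + regmatches() return, per label: full match, group 1, group 2.
m <- regmatches(labels, regexec(pattern, labels))
condition <- vapply(m, `[`, character(1), 2)  # capture group 1 (condition_group: 1)
basis     <- vapply(m, `[`, character(1), 3)  # capture group 2 (basis_group: 2)

condition
#> [1] "congruent"   "congruent"   "incongruent"
basis
#> [1] "1" "2" "1"
```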

How does the spec map to CLI stages?

The YAML is consumed incrementally. Not every stage needs every section.

CLI stage               Main sections used
plscli validate         all
plscli discover         dataset, outputs
plscli firstlevel-plan  dataset, design, first_level, outputs
plscli firstlevel-run   first-level plan artifacts plus execution
plscli pls-plan         pls, outputs
plscli pls-run          pls, planned manifests, outputs
plscli report           outputs or an existing artifact root

That split is why the same YAML can drive:

  • a local end-to-end run
  • an HPC array workflow
  • a Shiny-exported pipeline configuration

What should you treat as stable?

The stable contract is the structure documented in this article: the six top-level sections (dataset, design, first_level, pls, execution, outputs) and the required fields within them.

Where should you go next?

Use vignette("scripted-workflows") for the staged R and CLI workflow, and ?write_pipeline_template / ?read_pipeline_spec for the corresponding help pages.