Config Reference

The current YAML schema is schema_version: 2. The loader is strict: unknown fields are errors at parse time. See MIGRATION.md for the schema-versioning policy and v1 → v2 migration history.

`runpod-deploy run` invocation modes

Mode	What runs	Use case
(default)	Full lifecycle: `runpodctl pod create`, SSH wait, rsync push, remote commands, artifact pull, `runpodctl pod stop`. Real money.	Production runs.
`--dry-run`	Read-only external calls run for real (`runpodctl datacenter list`, plus GraphQL pricing if `--max-gpu-price` is set); all side-effecting commands are mocked and logged.	“Will my config find a GPU in stock right now?”
`--offline-dry-run`	No external calls at all; synthetic GPU/DC selection + mocked side-effecting commands. Implies `--dry-run`.	CI tests, offline config iteration, validation against the lifecycle without a RunPod account.

For the per-phase breakdown of what runs in each mode, see lifecycle.md.

Annotated minimal config

A working v2 config with every required section and the most common optional fields. Every other field defaults to a sane value (see sections below for full details).

schema_version: 2                     # required (current schema)
name: my-job                          # required; pod's --name is the *rendered* form (v0.4.0)
run_id_prefix: my-job                 # optional; defaults to `name`
state_file: ~/.runpod-my-job-current  # optional; tracks active pod_id for `stop`

local:                                # optional block; controls local validation + paths
  project_root: ../..                 # relative to THIS yaml's directory
  required_paths:                     # validate fails if any are missing locally
    - pyproject.toml

pod:                                  # required block
  image: runpod/pytorch:2.4.0-py3.13-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EU-RO-1, US-CA-2]     # ordered failover list
  gpu_order:                          # ordered failover list
    - NVIDIA H100 80GB HBM3
    - NVIDIA A100-SXM4-80GB
  cloud_type: SECURE                  # SECURE | COMMUNITY (default SECURE; required for network volumes)
  # python_version: "3.13.5"          # OPTIONAL (v0.5.0); see "Pod field: python_version"

storage:                              # required block
  mode: ephemeral                     # ephemeral | network_volume
  volume_gb: 20

ssh:                                  # optional block
  key_path: ~/.ssh/id_ed25519

setup:                                # optional list; runs BEFORE staging
  - command: "which rsync || apt-get install -y rsync"
    timeout_sec: 300

staging:                              # optional list; rsync push local → pod
  - label: source
    source: "{project_root}/"
    destination: "/workspace/repo/"
    excludes_default: true            # v0.4.0: opt in to .git/.venv/caches preset
    excludes_extra: ["evals/", "artifacts/"]

preflight:                            # optional list; runs AFTER staging, BEFORE the run script
  - command: "cd /workspace/repo && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true

run:                                  # required block
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-job] DONE"
  body: |
    set -euo pipefail
    cd /workspace/repo
    uv run python -m my_package.main

artifacts:                            # optional list; pulled after the run
  - label: results
    remote_path: "/workspace/repo/results/"
    local_path: "{project_root}/results/"
    required: true

lifecycle:                            # optional; defaults shown below
  on_success: delete                  # release volume disk on success
  on_failure: stop                    # preserve paused pod for SSH forensics

`lifecycle:` actions

on_success accepts one of four strings; on_failure accepts the first three (recycle on the failure path is rejected at validation):

delete — call runpodctl pod delete. Tears down both compute and volume disk. Default for on_success. Use this on the failure path too if you don’t need SSH forensics.
stop — call runpodctl pod stop. Pauses compute, but volume disk continues billing at ~$0.10/GB·month indefinitely until released. Default for on_failure. Pair with runpod-deploy cleanup --state-file <path> --mode delete once forensics is complete. See lifecycle.md §7b “Cost discipline”.
preserve — no-op. Pod keeps running and bills full GPU rate. Rare; useful only if you intend to SSH in immediately after the job completes.
recycle — success-path only. Same wire call as stop (pauses the pod), but the state-file is intentionally preserved so the next runpod-deploy run with the same state_file: finds the paused pod and resumes it via runpodctl pod start <id> instead of provisioning a fresh one. Saves image-pull + cold-boot cost per recurring run. Drift detection (image/GPU/datacenter mismatch) falls through to fresh-create with a WARNING. Bypass for one invocation with runpod-deploy run --force-fresh. See recipes/recycle-pod-for-fast-iteration.md.

The legacy bool form (stop: {on_success: true, on_failure: true}) is still accepted with a deprecation warning. See migration-v3.md for the mapping and migration path.
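The validation rules above can be illustrated with a minimal Python sketch. The function name `validate_lifecycle` and the error messages are hypothetical, not the loader's actual API; only the accepted strings, the defaults, and the rejection of recycle on the failure path come from this page.

```python
# Hypothetical sketch of the lifecycle validation rules described above.
ON_SUCCESS_ACTIONS = ("delete", "stop", "preserve", "recycle")
ON_FAILURE_ACTIONS = ("delete", "stop", "preserve")  # recycle is success-path only


def validate_lifecycle(lifecycle: dict) -> dict:
    """Return the lifecycle block with defaults filled in, or raise ValueError."""
    unknown = set(lifecycle) - {"on_success", "on_failure"}
    if unknown:
        # The loader is strict: unknown fields are errors.
        raise ValueError(f"unknown lifecycle fields: {sorted(unknown)}")

    resolved = {
        "on_success": lifecycle.get("on_success", "delete"),
        "on_failure": lifecycle.get("on_failure", "stop"),
    }
    if resolved["on_success"] not in ON_SUCCESS_ACTIONS:
        raise ValueError(f"on_success must be one of {ON_SUCCESS_ACTIONS}")
    if resolved["on_failure"] not in ON_FAILURE_ACTIONS:
        raise ValueError(f"on_failure must be one of {ON_FAILURE_ACTIONS}")
    return resolved
```

For example, `validate_lifecycle({"on_success": "recycle"})` passes and fills in `on_failure: stop`, while `validate_lifecycle({"on_failure": "recycle"})` raises.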

Required top-level fields

schema_version: 2
name
pod
storage
run

Common optional fields:

run_id_prefix
state_file
local
ssh
budget
remote_env
setup
preflight
staging
artifacts
lifecycle (or the deprecated legacy stop block)
variables

`budget:` block — runtime + cost ceilings

budget:
  cost_cap_usd: 10.0           # default; pod is stopped when the projected cost reaches this cap
  assumed_hourly_rate_usd: 1.65 # default; used in cost projections + budget eta
  max_runtime_minutes: null     # default; no explicit cap, so the limit is derived from cost_cap_usd / assumed_hourly_rate_usd
  poll_interval_sec: 60         # default — cadence for the run-monitor tail loop
  ssh_ready_timeout_sec: 900    # default — deadline for SSH info to populate post `pod create`

ssh_ready_timeout_sec is the wait between runpodctl pod create returning a pod_id and the pod’s SSH proxy publishing a usable {host, port}. The default 900 s covers cold-pull of large pytorch/cudnn-devel images (~6–12 GB) on first-touch GPUs. For one-off bumps without editing YAML, use runpod-deploy run --ssh-ready-timeout-sec <N>. If the wait exceeds 60 s, an INFO heartbeat with status, ssh.error, and uptimeSeconds is emitted every 30 s. On timeout, the orchestrator deletes the orphaned pod before raising (see lifecycle.md and troubleshooting.md for the failure workflow).
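To make the relationship between the cost cap and the runtime ceiling concrete, here is a small arithmetic sketch in Python. The function name is hypothetical, and combining an explicit `max_runtime_minutes` with the cost-derived limit via `min` is an assumption; this page only states that the limit is derived from the cost cap and the hourly rate when no explicit cap is set.

```python
from typing import Optional


def effective_max_runtime_minutes(
    cost_cap_usd: float = 10.0,
    assumed_hourly_rate_usd: float = 1.65,
    max_runtime_minutes: Optional[float] = None,
) -> float:
    """Runtime ceiling in minutes implied by the budget block (illustrative only)."""
    if assumed_hourly_rate_usd <= 0:
        raise ValueError("assumed_hourly_rate_usd must be positive")
    # Hours of runtime the cost cap can buy, converted to minutes.
    derived = cost_cap_usd / assumed_hourly_rate_usd * 60.0
    if max_runtime_minutes is None:
        return derived
    # Assumption: an explicit cap cannot extend past the cost-derived limit.
    return min(float(max_runtime_minutes), derived)
```

With the defaults shown above, a $10.00 cap at $1.65/hour works out to roughly 364 minutes (about 6 hours) of runtime.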

Template variables

Every string field in the config can reference template variables via Python str.format syntax ({name}). Built-ins:

Variable	What it is	Example use
`{config_dir}`	Directory containing this YAML file (absolute).	`local.project_root: "{config_dir}/../.."`
`{project_root}`	Resolved `local.project_root` (absolute).	`staging.source: "{project_root}/"`
`{run_dir}`	This run’s local artifacts dir (e.g. `<project_root>/artifacts/runpod/20260515T120000Z`).	`artifacts.local_path: "{run_dir}/data"`
`{run_id}`	The rendered prefix + timestamp (`<run_id_prefix>-<ts>`).	(used internally by `runpodctl pod create --name`)
`{job_name}`	The rendered `name` field.	`run.body` `echo "starting {job_name}"`
`{volume_mount}`	`storage.volume_mount` (defaults `/workspace`).	`run.script_path: "{volume_mount}/run.sh"`

Custom variables are declared under variables: (YAML) or --var KEY=VALUE / --vars-file PATH (CLI). YAML and CLI variables can reference both built-ins and earlier-defined variables (the rendering is two-pass — see recipes/multi-config-sweep.md for the sweep pattern).

variables:
  hf_dataset: "anthropic/persuasion"
  data_dir: "{project_root}/data"   # references built-in
  run_label: "{job_name}-{hf_dataset}"  # references built-in + earlier var

CLI:

runpod-deploy run --config foo.yaml \
  --var seed=42 \
  --var out_dir="{project_root}/seed42"  # CLI vars can reference built-ins too

CLI --var overrides YAML variables: on key collision.
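The resolution order described above can be sketched in Python with plain `str.format`. This is one plausible reading of the two-pass rendering, not the tool's actual code: the function names `render_variables` and `render_field` are hypothetical, and the built-in values are made up for the example.

```python
def render_variables(builtins: dict, yaml_vars: dict, cli_vars: dict) -> dict:
    """Pass 1: resolve custom variables in order, each against built-ins + earlier vars."""
    # CLI values win on key collision; a colliding key keeps its YAML position.
    merged = {**yaml_vars, **cli_vars}
    context = dict(builtins)
    rendered = {}
    for key, raw in merged.items():
        value = str(raw).format(**context)  # KeyError if it references an undefined name
        rendered[key] = value
        context[key] = value  # visible to later variables
    return rendered


def render_field(template: str, builtins: dict, variables: dict) -> str:
    """Pass 2: render a config string field against built-ins + resolved variables."""
    return template.format(**{**builtins, **variables})


builtins = {"project_root": "/repo", "job_name": "my-job"}  # example values
yaml_vars = {
    "hf_dataset": "anthropic/persuasion",
    "data_dir": "{project_root}/data",
    "run_label": "{job_name}-{hf_dataset}",
}
cli_vars = {"seed": "42", "out_dir": "{project_root}/seed42"}

variables = render_variables(builtins, yaml_vars, cli_vars)
command = render_field("cd {data_dir} && echo {seed}", builtins, variables)
```

Because the syntax is `str.format`, a literal brace in a rendered string has to be doubled (`{{` and `}}`); that matters for shell snippets in `run.body` that use `${VAR}`.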

Storage Modes

Network volume:

storage:
  mode: network_volume
  volume_name: pid-workspace-100gb
  volume_mount: /workspace

Ephemeral volume:

storage:
  mode: ephemeral
  volume_gb: 200
  volume_mount: /workspace

Commands

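setup and preflight are lists of commands ({remote_repo} in the examples below is not a built-in; it is assumed to be a custom variable declared under variables:):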

preflight:
  - command: "cd {remote_repo} && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true

with_env: true prepends remote_env.source_files and remote_env.exports.
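As a rough illustration of what "prepends" could mean in practice, the sketch below builds a single shell command line from `remote_env` and a command. The helper name `wrap_with_env`, the `source`/`export` form, and the ordering (source files first, then exports) are assumptions for illustration; the orchestrator's real wrapping may differ.

```python
import shlex


def wrap_with_env(command: str, source_files: list, exports: dict) -> str:
    """Prefix a remote command with env setup (illustrative, not the real implementation)."""
    parts = [f"source {shlex.quote(path)}" for path in source_files]
    parts += [f"export {key}={shlex.quote(str(value))}" for key, value in exports.items()]
    parts.append(command)
    # Chain with && so a missing source file aborts before the command runs.
    return " && ".join(parts)


wrapped = wrap_with_env(
    "cd /workspace/repo && uv sync --extra dev",
    source_files=["/workspace/secrets/env"],
    exports={"UV_LINK_MODE": "copy"},
)
```

Quoting each value with `shlex.quote` keeps exports that contain spaces or shell metacharacters from being split or interpreted on the pod.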

Recommended `remote_env.exports` for uv-driven pods on /workspace

If your preflight: or run.body runs uv sync and your storage.volume_mount is /workspace (RunPod’s distributed FUSE filesystem), pin uv’s write-heavy working directories to the pod’s overlay disk and export UV_LINK_MODE=copy. On FUSE, three distinct uv/HF-Trainer code paths can stall on MooseFS F_SETLKW exclusive-lock acquisition: uv git resolution, uv install-phase atomic writes, and HF Trainer atomic checkpoint save. See troubleshooting.md “uv sync hangs silently” and the two follow-up sections for failure signatures.

remote_env:
  source_files:
    - /workspace/secrets/env       # if you stage secrets via the secrets: block
  exports:
    HF_HOME: /workspace/hf_cache              # FUSE OK — HF caches are read-mostly
    HUGGINGFACE_HUB_CACHE: /workspace/hf_cache
    UV_CACHE_DIR: /root/uv_cache              # overlay disk, NOT FUSE — git ops + atomic writes
    UV_PROJECT_ENVIRONMENT: /root/.venv       # overlay disk, NOT FUSE — install atomic writes
    UV_LINK_MODE: copy                         # defense-in-depth for residual /workspace touchpoints

/root is the container’s overlay disk (verify with df -hT /root: the type should be overlay, NOT fuse). POSIX locks work normally there. The uv cache and venv are ephemeral anyway (re-populated on each run), so putting them on /root sacrifices nothing.

If your training framework writes checkpoints (HF Trainer’s save_strategy, PyTorch Lightning’s ModelCheckpoint, etc.) and the default output_dir is under /workspace/, the atomic-save protocol can stall on the same FUSE bug. Pin checkpoint output_dir to /root/checkpoints/ and rsync back to the volume as a run.body trailer for orchestrator artifact pull. See troubleshooting.md “HF Trainer checkpoint save hangs”.

For genuinely separate network-stall symptoms (single wheel download stuck mid-stream rather than a stalled stat()/flock()), also add UV_HTTP_TIMEOUT: "120" (defense against Fastly/CDN HTTP stalls) and optionally UV_CONCURRENT_DOWNLOADS: "4" (caps concurrency; default 50 amplifies head-of-line blocking on stalled sockets).

Pod field: `python_version` (optional, default: unset)

pod:
  image: runpod/pytorch:2.4.0
  datacenters: [EU-RO-1]
  gpu_order: ["NVIDIA H100 80GB HBM3"]
  python_version: "3.13.5"   # ← pin via uv

When set, the orchestrator auto-injects a preflight step that runs uv python install <ver> && cd <first-staging-destination> && uv python pin <ver>. This installs the requested CPython interpreter on the pod and writes a .python-version file into the staged project directory so subsequent uv sync invocations honor the pin.

Format: 3.MINOR or 3.MINOR.PATCH (e.g. "3.13" or "3.13.5"). Pre-release suffixes are rejected — the field exists for reproducibility, not for chasing alphas.

Failure mode: a non-zero exit from the install or pin aborts the run before the user’s preflight or run-body executes. This surfaces the issue cheaply (~30s of pod time) rather than letting a later uv sync fall back to the base-image interpreter.

Why preflight, not setup: the pin must write .python-version into the staged repo dir, which doesn’t exist until after _push_workspace runs. Injecting at preflight[0] is the correctness-preserving placement.

See docs/recipes/reproducibility.md for the trade-offs.
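The format rule and the injected command described in this section can be sketched as follows. The regex and the helper name `python_pin_command` are hypothetical; only the accepted formats (3.MINOR or 3.MINOR.PATCH, no pre-release suffix) and the shape of the command come from this page.

```python
import re
import shlex

# 3.MINOR or 3.MINOR.PATCH; pre-release suffixes such as "rc1" do not match.
_PYTHON_VERSION_RE = re.compile(r"3\.\d+(\.\d+)?")


def python_pin_command(python_version: str, first_staging_destination: str) -> str:
    """Build the preflight command that installs and pins the interpreter (illustrative)."""
    if not _PYTHON_VERSION_RE.fullmatch(python_version):
        raise ValueError(
            f"python_version must be 3.MINOR or 3.MINOR.PATCH, got {python_version!r}"
        )
    ver = shlex.quote(python_version)
    dest = shlex.quote(first_staging_destination)
    return f"uv python install {ver} && cd {dest} && uv python pin {ver}"


pin_cmd = python_pin_command("3.13.5", "/workspace/repo/")
```

Here `python_pin_command("3.13", ...)` is accepted, while `"3.14.0rc1"` is rejected before any pod time is spent.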

Staging (rsync push)

staging is a list of local-to-remote rsync transfers:

staging:
  - label: source
    source: "{project_root}/"
    destination: "{remote_repo}/"
    excludes_default: true              # opt in to the hygiene preset
    excludes_extra: ["evals/", "artifacts/"]
    delete: true

Per-entry fields:

label (required)
source (required) — local path; template variables rendered
destination (required) — remote path; template variables rendered
excludes (optional) — explicit list of rsync --exclude patterns
excludes_default (optional, default false) — when true, prepend the hygiene preset (.git/, .venv/, **/__pycache__/, **/*.pyc, .pytest_cache/, .ruff_cache/, .mypy_cache/). Saves repeating these across configs that share repo conventions.
excludes_extra (optional) — additional patterns appended after excludes_default + excludes. Useful for project-specific add-ons like evals/, artifacts/, data/.
delete (optional, default true)

The effective --exclude list passed to rsync is the concatenation: DEFAULT_STAGING_EXCLUDES (if excludes_default) + excludes + excludes_extra, in that order. Existing configs that set only excludes are unaffected.
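The concatenation order above can be written as a short Python sketch. The preset contents are copied from the excludes_default description; the function names and the `--exclude=PATTERN` argument form are illustrative assumptions rather than the tool's actual internals.

```python
# Preset copied from the excludes_default field description above.
DEFAULT_STAGING_EXCLUDES = [
    ".git/",
    ".venv/",
    "**/__pycache__/",
    "**/*.pyc",
    ".pytest_cache/",
    ".ruff_cache/",
    ".mypy_cache/",
]


def effective_excludes(entry: dict) -> list:
    """Preset (if opted in) + excludes + excludes_extra, in that order."""
    patterns = []
    if entry.get("excludes_default", False):
        patterns += DEFAULT_STAGING_EXCLUDES
    patterns += entry.get("excludes", [])
    patterns += entry.get("excludes_extra", [])
    return patterns


def rsync_exclude_args(entry: dict) -> list:
    """Turn the effective list into rsync arguments (hypothetical helper)."""
    return [f"--exclude={pattern}" for pattern in effective_excludes(entry)]


entry = {"excludes_default": True, "excludes_extra": ["evals/", "artifacts/"]}
args = rsync_exclude_args(entry)
```

An entry that sets only `excludes` yields exactly that list, which matches the statement that existing configs are unaffected.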

Artifacts

Artifacts are pulled after a successful run, and best-effort after failure:

artifacts:
  - label: models
    remote_path: "{remote_repo}/artifacts/models/"
    local_path: "{project_root}/artifacts/models"
    excludes: ["**/_trainer/checkpoint-*"]
    required: true
    delete: true