Config Reference#

The current YAML schema is schema_version: 2. The loader is strict: unknown fields are errors at parse time. See MIGRATION.md for the schema-versioning policy and v1 → v2 migration history.

runpod-deploy run invocation modes#

| Mode | What runs | Use case |
| --- | --- | --- |
| (default) | Full lifecycle: runpodctl pod create, SSH wait, rsync push, remote commands, artifact pull, runpodctl pod stop. Real money. | Production runs. |
| --dry-run | Read-only externals (runpodctl datacenter list, GraphQL pricing if --max-gpu-price) + all side-effecting commands are mocked + logged. | “Will my config find a GPU in stock right now?” |
| --offline-dry-run | No external calls at all; synthetic GPU/DC selection + mocked side-effecting commands. Implies --dry-run. | CI tests, offline config iteration, validation against the lifecycle without a RunPod account. |

For the per-phase breakdown of what runs in each mode, see lifecycle.md.
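The relationship between the three modes can be sketched as two switches: whether read-only external calls run, and whether side-effecting commands run for real. This is an illustrative sketch only; `RunMode` and `resolve_mode` are hypothetical names, not the tool's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMode:
    """Hypothetical summary of what a given invocation is allowed to do."""
    external_reads: bool  # read-only calls such as `runpodctl datacenter list`
    side_effects: bool    # pod create, rsync push, remote commands, pod stop


def resolve_mode(dry_run: bool = False, offline_dry_run: bool = False) -> RunMode:
    # --offline-dry-run implies --dry-run and additionally disables external reads.
    if offline_dry_run:
        return RunMode(external_reads=False, side_effects=False)
    if dry_run:
        return RunMode(external_reads=True, side_effects=False)
    return RunMode(external_reads=True, side_effects=True)
```

Passing both flags resolves to the offline mode, which matches the "Implies --dry-run" note in the table.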

Annotated minimal config#

A working v2 config with every required section and the most common optional fields. Every other field defaults to a sane value (see sections below for full details).

schema_version: 2                     # required (current schema)
name: my-job                          # required; pod's --name is the *rendered* form (v0.4.0)
run_id_prefix: my-job                 # optional; defaults to `name`
state_file: ~/.runpod-my-job-current  # optional; tracks active pod_id for `stop`

local:                                # optional block; controls local validation + paths
  project_root: ../..                 # relative to THIS yaml's directory
  required_paths:                     # validate fails if any are missing locally
    - pyproject.toml

pod:                                  # required block
  image: runpod/pytorch:2.4.0-py3.13-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EU-RO-1, US-CA-2]     # ordered failover list
  gpu_order:                          # ordered failover list
    - NVIDIA H100 80GB HBM3
    - NVIDIA A100-SXM4-80GB
  cloud_type: SECURE                  # SECURE | COMMUNITY (default SECURE; required for network volumes)
  # python_version: "3.13.5"          # OPTIONAL (v0.5.0); see "Pod field: python_version"

storage:                              # required block
  mode: ephemeral                     # ephemeral | network_volume
  volume_gb: 20

ssh:                                  # optional block
  key_path: ~/.ssh/id_ed25519

setup:                                # optional list; runs BEFORE staging
  - command: "which rsync || apt-get install -y rsync"
    timeout_sec: 300

staging:                              # optional list; rsync push local → pod
  - label: source
    source: "{project_root}/"
    destination: "/workspace/repo/"
    excludes_default: true            # v0.4.0: opt in to .git/.venv/caches preset
    excludes_extra: ["evals/", "artifacts/"]

preflight:                            # optional list; runs AFTER staging, BEFORE the run script
  - command: "cd /workspace/repo && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true

run:                                  # required block
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-job] DONE"
  body: |
    set -euo pipefail
    cd /workspace/repo
    uv run python -m my_package.main

artifacts:                            # optional list; pulled after the run
  - label: results
    remote_path: "/workspace/repo/results/"
    local_path: "{project_root}/results/"
    required: true

lifecycle:                            # optional; defaults shown below
  on_success: delete                  # release volume disk on success
  on_failure: stop                    # preserve paused pod for SSH forensics

lifecycle: actions#

on_success accepts one of four strings; on_failure accepts the first three (recycle on the failure path is rejected at validation):

  • delete — call runpodctl pod delete. Tears down both compute and volume disk. Default for on_success. Use this on the failure path too if you don’t need SSH forensics.

  • stop — call runpodctl pod stop. Pauses compute, but volume disk continues billing at ~$0.10/GB·month indefinitely until released. Default for on_failure. Pair with runpod-deploy cleanup --state-file <path> --mode delete once forensics is complete. See lifecycle.md §7b “Cost discipline”.

  • preserve — no-op. Pod keeps running and bills full GPU rate. Rare; useful only if you intend to SSH in immediately after the job completes.

  • recycle — success-path only. Same wire call as stop (pauses the pod), but the state-file is intentionally preserved so the next runpod-deploy run with the same state_file: finds the paused pod and resumes it via runpodctl pod start <id> instead of provisioning a fresh one. Saves image-pull + cold-boot cost per recurring run. Drift detection (image/GPU/datacenter mismatch) falls through to fresh-create with a WARNING. Bypass for one invocation with runpod-deploy run --force-fresh. See recipes/recycle-pod-for-fast-iteration.md.

The legacy bool form (stop: {on_success: true, on_failure: true}) is still accepted with a deprecation warning. See migration-v3.md for the mapping and migration path.
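The accepted values and defaults described above can be expressed as a small validator. This is a minimal sketch, assuming the lifecycle block has already been parsed into a dict; `validate_lifecycle` and the two constants are hypothetical names, and the real loader's error messages may differ.

```python
# Hypothetical validator; names are illustrative, not the tool's internals.
ON_SUCCESS_ACTIONS = ("delete", "stop", "preserve", "recycle")
ON_FAILURE_ACTIONS = ("delete", "stop", "preserve")  # recycle is success-path only


def validate_lifecycle(lifecycle: dict) -> dict:
    """Apply the documented defaults and reject unsupported actions."""
    on_success = lifecycle.get("on_success", "delete")  # documented default
    on_failure = lifecycle.get("on_failure", "stop")    # documented default
    if on_success not in ON_SUCCESS_ACTIONS:
        raise ValueError(
            f"on_success must be one of {ON_SUCCESS_ACTIONS}, got {on_success!r}"
        )
    if on_failure not in ON_FAILURE_ACTIONS:
        raise ValueError(
            f"on_failure must be one of {ON_FAILURE_ACTIONS}, got {on_failure!r}"
        )
    return {"on_success": on_success, "on_failure": on_failure}
```

For example, `validate_lifecycle({"on_success": "recycle"})` passes, while `validate_lifecycle({"on_failure": "recycle"})` raises `ValueError`.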

Required top-level fields#

  • schema_version: 2

  • name

  • pod

  • storage

  • run

Common optional fields:

  • run_id_prefix

  • state_file

  • local

  • ssh

  • budget

  • remote_env

  • setup

  • preflight

  • staging

  • artifacts

  • lifecycle (the legacy stop block is still accepted with a deprecation warning)

  • variables

budget: block — runtime + cost ceilings#

budget:
  cost_cap_usd: 10.0            # default — pod is stopped when projected spend (hourly rate × runtime) reaches the cap
  assumed_hourly_rate_usd: 1.65 # default — used in cost projections + budget ETA
  max_runtime_minutes: null     # default — no explicit cap; the effective limit is derived from cost_cap_usd / assumed_hourly_rate_usd
  poll_interval_sec: 60         # default — cadence for the run-monitor tail loop
  ssh_ready_timeout_sec: 900    # default — deadline for SSH info to populate post `pod create`

ssh_ready_timeout_sec is the wait between runpodctl pod create returning a pod_id and the pod’s SSH proxy publishing a usable {host, port}. The default 900 s covers cold-pull of large pytorch/cudnn-devel images (~6–12 GB) on first-touch GPUs. For one-off bumps without editing YAML, use runpod-deploy run --ssh-ready-timeout-sec <N>. If the wait exceeds 60 s, an INFO heartbeat with status, ssh.error, and uptimeSeconds is emitted every 30 s. On timeout, the orchestrator deletes the orphaned pod before raising (see lifecycle.md and troubleshooting.md for the failure workflow).
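To make the cost ceiling concrete, the runtime limit implied by the budget defaults can be worked out by hand. The sketch below assumes the limit is simply cost_cap_usd divided by assumed_hourly_rate_usd, and that an explicit max_runtime_minutes takes effect only when it is tighter than that; both are assumptions about the orchestrator, and `effective_runtime_minutes` is a hypothetical helper.

```python
from typing import Optional


def effective_runtime_minutes(
    cost_cap_usd: float = 10.0,
    assumed_hourly_rate_usd: float = 1.65,
    max_runtime_minutes: Optional[float] = None,
) -> float:
    """Runtime ceiling in minutes under the assumed budget rule."""
    # Minutes of runtime that the cost cap buys at the assumed hourly rate.
    cost_limit = cost_cap_usd / assumed_hourly_rate_usd * 60.0
    if max_runtime_minutes is None:
        return cost_limit
    # Assumption: whichever ceiling is reached first wins.
    return min(float(max_runtime_minutes), cost_limit)
```

With the defaults, $10.00 at $1.65/hour works out to roughly 364 minutes (about 6 hours) before the cap is reached.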

Template variables#

Every string field in the config can reference template variables via Python str.format syntax ({name}). Built-ins:

| Variable | What it is | Example use |
| --- | --- | --- |
| {config_dir} | Directory containing this YAML file (absolute). | local.project_root: "{config_dir}/../.." |
| {project_root} | Resolved local.project_root (absolute). | staging.source: "{project_root}/" |
| {run_dir} | This run’s local artifacts dir (e.g. <project_root>/artifacts/runpod/20260515T120000Z). | artifacts.local_path: "{run_dir}/data" |
| {run_id} | The rendered prefix + timestamp (<run_id_prefix>-<ts>). | (used internally by runpodctl pod create --name) |
| {job_name} | The rendered name field. | run.body: echo "starting {job_name}" |
| {volume_mount} | storage.volume_mount (defaults /workspace). | run.script_path: "{volume_mount}/run.sh" |

Custom variables are declared under variables: (YAML) or --var KEY=VALUE / --vars-file PATH (CLI). YAML and CLI variables can reference both built-ins and earlier-defined variables (the rendering is two-pass — see recipes/multi-config-sweep.md for the sweep pattern).

variables:
  hf_dataset: "anthropic/persuasion"
  data_dir: "{project_root}/data"   # references built-in
  run_label: "{job_name}-{hf_dataset}"  # references built-in + earlier var

CLI:

runpod-deploy run --config foo.yaml \
  --var seed=42 \
  --var out_dir="{project_root}/seed42"  # CLI vars can reference built-ins too

On key collision, a CLI --var overrides the value from the YAML variables: block.
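The variable rules above (built-ins first, custom variables in declaration order, CLI winning on collision) can be sketched with plain str.format. This is one plausible reading of the two-pass rendering, not the tool's actual implementation; `render_variables` and `render_field` are hypothetical names.

```python
def render_variables(builtins: dict, yaml_vars: dict, cli_vars: dict) -> dict:
    """Pass 1: resolve custom variables against built-ins and earlier variables."""
    merged = dict(yaml_vars)
    merged.update(cli_vars)  # CLI --var wins on key collision
    context = dict(builtins)
    for key, template in merged.items():
        # Each variable may reference built-ins and anything defined before it.
        context[key] = str(template).format(**context)
    return context


def render_field(template: str, context: dict) -> str:
    """Pass 2: render a config string field against the full context."""
    return template.format(**context)


context = render_variables(
    builtins={"project_root": "/repo", "job_name": "my-job"},
    yaml_vars={
        "hf_dataset": "anthropic/persuasion",
        "data_dir": "{project_root}/data",
        "run_label": "{job_name}-{hf_dataset}",
    },
    cli_vars={"seed": "42", "hf_dataset": "other/ds"},
)
output_path = render_field("{data_dir}/seed{seed}", context)
```

Here the CLI value replaces hf_dataset before run_label is rendered, so run_label picks up the overridden value. A reference to a variable that is not yet defined raises KeyError in this sketch.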

Storage Modes#

Network volume:

storage:
  mode: network_volume
  volume_name: pid-workspace-100gb
  volume_mount: /workspace

Ephemeral volume:

storage:
  mode: ephemeral
  volume_gb: 200
  volume_mount: /workspace

Commands#

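setup and preflight are lists of commands. In the example below, {remote_repo} is not one of the built-in template variables; it is assumed to be a custom variable declared under variables:.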

preflight:
  - command: "cd {remote_repo} && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true

with_env: true prepends remote_env.source_files and remote_env.exports to the command, so the command runs with that environment loaded.
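One way to picture the with_env prefix is as a chain of shell statements placed in front of the command. The sketch below assumes remote_env.source_files is a list of paths and remote_env.exports is a key/value mapping, and that the pieces are joined with `&&`; the actual shapes and join strategy are not specified here, and `with_env_prefix` is a hypothetical helper.

```python
import shlex


def with_env_prefix(command: str, source_files: list, exports: dict) -> str:
    """Build a single shell line: source files, then exports, then the command."""
    parts = [f"source {shlex.quote(path)}" for path in source_files]
    parts += [
        f"export {key}={shlex.quote(str(value))}" for key, value in exports.items()
    ]
    parts.append(command)
    # Assumption: stop at the first failing step rather than run with a partial env.
    return " && ".join(parts)


line = with_env_prefix(
    "cd /workspace/repo && uv sync --extra dev",
    source_files=["/workspace/.env"],
    exports={"HF_HOME": "/workspace/hf"},
)
```

Values are passed through shlex.quote so that an export containing spaces or shell metacharacters does not break the command line.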

Pod field: python_version (optional, default: unset)#

pod:
  image: runpod/pytorch:2.4.0
  datacenters: [EU-RO-1]
  gpu_order: ["NVIDIA H100 80GB HBM3"]
  python_version: "3.13.5"   # ← pin via uv

When set, the orchestrator auto-injects a preflight step that runs uv python install <ver> && cd <first-staging-destination> && uv python pin <ver>. This installs the requested CPython interpreter on the pod and writes a .python-version file into the staged project directory so subsequent uv sync invocations honor the pin.

Format: 3.MINOR or 3.MINOR.PATCH (e.g. "3.13" or "3.13.5"). Pre-release suffixes are rejected — the field exists for reproducibility, not for chasing alphas.

Failure mode: a non-zero exit from the install or pin aborts the run before the user’s preflight or run body executes. This surfaces the issue cheaply (~30 s of pod time) rather than letting a later uv sync fall back to the base-image interpreter.

Why preflight, not setup: the pin must write .python-version into the staged repo dir, which doesn’t exist until after _push_workspace runs. Injecting at preflight[0] is the correctness-preserving placement.

See docs/recipes/reproducibility.md for the trade-offs.
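The format rule and the injected command described in this section can be sketched together. The regex below is one reading of "3.MINOR or 3.MINOR.PATCH with no pre-release suffix", and `python_pin_command` is a hypothetical helper; the command string itself follows the uv python install / uv python pin sequence given above.

```python
import re

# 3.MINOR or 3.MINOR.PATCH; anything after the numeric parts (rc1, a2, ...) is rejected.
_PYTHON_VERSION_RE = re.compile(r"3\.\d+(\.\d+)?")


def python_pin_command(version: str, first_staging_destination: str) -> str:
    """Return the shell line for the auto-injected preflight step."""
    if not _PYTHON_VERSION_RE.fullmatch(version):
        raise ValueError(
            f"python_version must be 3.MINOR or 3.MINOR.PATCH, got {version!r}"
        )
    return (
        f"uv python install {version} "
        f"&& cd {first_staging_destination} "
        f"&& uv python pin {version}"
    )
```

For example, `python_pin_command("3.13.5", "/workspace/repo/")` yields the install, cd, and pin steps as one line, while `"3.14.0rc1"` raises `ValueError`.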

Staging (rsync push)#

staging is a list of local-to-remote rsync transfers:

staging:
  - label: source
    source: "{project_root}/"
    destination: "{remote_repo}/"
    excludes_default: true              # opt in to the hygiene preset
    excludes_extra: ["evals/", "artifacts/"]
    delete: true

Per-entry fields:

  • label (required)

  • source (required) — local path; template variables rendered

  • destination (required) — remote path; template variables rendered

  • excludes (optional) — explicit list of rsync --exclude patterns

  • excludes_default (optional, default false) — when true, prepend the hygiene preset (.git/, .venv/, **/__pycache__/, **/*.pyc, .pytest_cache/, .ruff_cache/, .mypy_cache/). Saves repeating these across configs that share repo conventions.

  • excludes_extra (optional) — additional patterns appended after excludes_default + excludes. Useful for project-specific add-ons like evals/, artifacts/, data/.

  • delete (optional, default true)

The effective --exclude list passed to rsync is the concatenation: DEFAULT_STAGING_EXCLUDES (if excludes_default) + excludes + excludes_extra, in that order. Existing configs that set only excludes are unaffected.
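The concatenation order can be sketched as follows. The preset contents are copied from the field list above; `effective_excludes` and `rsync_exclude_args` are hypothetical helper names, and the real orchestrator may build its rsync arguments differently.

```python
# Preset patterns as listed under excludes_default.
DEFAULT_STAGING_EXCLUDES = [
    ".git/",
    ".venv/",
    "**/__pycache__/",
    "**/*.pyc",
    ".pytest_cache/",
    ".ruff_cache/",
    ".mypy_cache/",
]


def effective_excludes(entry: dict) -> list:
    """Preset (if opted in) + excludes + excludes_extra, in that order."""
    patterns = []
    if entry.get("excludes_default", False):
        patterns.extend(DEFAULT_STAGING_EXCLUDES)
    patterns.extend(entry.get("excludes", []))
    patterns.extend(entry.get("excludes_extra", []))
    return patterns


def rsync_exclude_args(entry: dict) -> list:
    """Turn the effective pattern list into rsync --exclude arguments."""
    return [f"--exclude={pattern}" for pattern in effective_excludes(entry)]
```

An entry that sets only excludes produces exactly that list, which is why existing configs are unaffected by the two newer fields.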

Artifacts#

Artifacts are pulled after a successful run, and best-effort after failure:

artifacts:
  - label: models
    remote_path: "{remote_repo}/artifacts/models/"
    local_path: "{project_root}/artifacts/models"
    excludes: ["**/_trainer/checkpoint-*"]
    required: true
    delete: true