# Config Reference

The current YAML schema is `schema_version: 2`. The loader is strict: unknown fields are errors at parse time. See [`MIGRATION.md`](https://github.com/brandon-behring/runpod-deploy/blob/main/MIGRATION.md) for the schema-versioning policy and v1 → v2 migration history.

## `runpod-deploy run` invocation modes

| Mode | What runs | Use case |
|---|---|---|
| (default) | Full lifecycle: `runpodctl pod create`, SSH wait, rsync push, remote commands, artifact pull, `runpodctl pod stop`. Real money. | Production runs. |
| `--dry-run` | Read-only externals (`runpodctl datacenter list`, GraphQL pricing if `--max-gpu-price`) **+** all side-effecting commands are mocked + logged. | "Will my config find a GPU in stock right now?" |
| `--offline-dry-run` | **No external calls at all**; synthetic GPU/DC selection + mocked side-effecting commands. Implies `--dry-run`. | CI tests, offline config iteration, validation against the lifecycle without a RunPod account. |

For the per-phase breakdown of what runs in each mode, see [`lifecycle.md`](lifecycle.md).

## Annotated minimal config

A working v2 config with every required section and the most common optional fields. Every other field defaults to a sane value (see sections below for full details).

```yaml
schema_version: 2  # required (current schema)
name: my-job  # required; pod's --name is the *rendered* form (v0.4.0)
run_id_prefix: my-job  # optional; defaults to `name`
state_file: ~/.runpod-my-job-current  # optional; tracks active pod_id for `stop`

local:  # optional block; controls local validation + paths
  project_root: ../..  # relative to THIS yaml's directory
  required_paths:  # validate fails if any are missing locally
    - pyproject.toml

pod:  # required block
  image: runpod/pytorch:2.4.0-py3.13-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EU-RO-1, US-CA-2]  # ordered failover list
  gpu_order:  # ordered failover list
    - NVIDIA H100 80GB HBM3
    - NVIDIA A100-SXM4-80GB
  cloud_type: SECURE  # SECURE | COMMUNITY (default SECURE; required for network volumes)
  # python_version: "3.13.5"  # OPTIONAL (v0.5.0); see "Pod field: python_version"

storage:  # required block
  mode: ephemeral  # ephemeral | network_volume
  volume_gb: 20

ssh:  # optional block
  key_path: ~/.ssh/id_ed25519

setup:  # optional list; runs BEFORE staging
  - command: "which rsync || apt-get install -y rsync"
    timeout_sec: 300

staging:  # optional list; rsync push local → pod
  - label: source
    source: "{project_root}/"
    destination: "/workspace/repo/"
    excludes_default: true  # v0.4.0: opt in to .git/.venv/caches preset
    excludes_extra: ["evals/", "artifacts/"]

preflight:  # optional list; runs AFTER staging, BEFORE the run script
  - command: "cd /workspace/repo && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true

run:  # required block
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-job] DONE"
  body: |
    set -euo pipefail
    cd /workspace/repo
    uv run python -m my_package.main

artifacts:  # optional list; pulled after the run
  - label: results
    remote_path: "/workspace/repo/results/"
    local_path: "{project_root}/results/"
    required: true

lifecycle:  # optional; defaults shown below
  on_success: delete  # release volume disk on success
  on_failure: stop  # preserve paused pod for SSH forensics
```

### `lifecycle:` actions

`on_success` accepts one of four strings; `on_failure` accepts the first three (`recycle` on the failure path is rejected at validation):

- `delete` — call `runpodctl pod delete`. Tears down both compute and volume disk. **Default for `on_success`.** Use this on the failure path too if you don't need SSH forensics.
- `stop` — call `runpodctl pod stop`. Pauses compute, but **volume disk continues billing at ~$0.10/GB·month indefinitely** until released. **Default for `on_failure`.** Pair with `runpod-deploy cleanup --state-file --mode delete` once forensics is complete. See [`lifecycle.md` §7b "Cost discipline"](lifecycle.md#7b-cost-discipline-cleaning-up-after-forensics).
- `preserve` — no-op. Pod keeps running and bills full GPU rate. Rare; useful only if you intend to SSH in immediately after the job completes.
- `recycle` — **success-path only.** Same wire call as `stop` (pauses the pod), but the state-file is intentionally preserved so the next `runpod-deploy run` with the same `state_file:` finds the paused pod and resumes it via `runpodctl pod start ` instead of provisioning a fresh one. Saves image-pull + cold-boot cost per recurring run. Drift detection (image/GPU/datacenter mismatch) falls through to fresh-create with a WARNING. Bypass for one invocation with `runpod-deploy run --force-fresh`. See [`recipes/recycle-pod-for-fast-iteration.md`](recipes/recycle-pod-for-fast-iteration.md).

The legacy bool form (`stop: {on_success: true, on_failure: true}`) is still accepted with a deprecation warning. See [`migration-v3.md`](migration-v3.md) for the mapping and migration path.
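
The acceptance rules above can be sketched as a small validator. This is a minimal illustration, not the loader's actual code: `validate_lifecycle`, `SUCCESS_ACTIONS`, `FAILURE_ACTIONS`, and the error messages are hypothetical names chosen for the example.

```python
# Hypothetical sketch of the lifecycle validation rules described above.
# None of these names come from runpod-deploy itself.
SUCCESS_ACTIONS = {"delete", "stop", "preserve", "recycle"}
FAILURE_ACTIONS = SUCCESS_ACTIONS - {"recycle"}  # recycle is success-path only
DEFAULTS = {"on_success": "delete", "on_failure": "stop"}


def validate_lifecycle(block=None):
    """Return the lifecycle block with defaults applied, or raise ValueError."""
    block = dict(block or {})

    # Strict loading: any key other than on_success / on_failure is an error.
    unknown = set(block) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown lifecycle fields: {sorted(unknown)}")

    resolved = {**DEFAULTS, **block}
    if resolved["on_success"] not in SUCCESS_ACTIONS:
        raise ValueError(f"invalid on_success: {resolved['on_success']!r}")
    if resolved["on_failure"] not in FAILURE_ACTIONS:
        raise ValueError(f"invalid on_failure: {resolved['on_failure']!r}")
    return resolved
```

Under these assumptions, `validate_lifecycle({"on_success": "recycle"})` passes and keeps `on_failure` at its default of `stop`, while `validate_lifecycle({"on_failure": "recycle"})` raises.
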
## Required top-level fields

- `schema_version: 2`
- `name`
- `pod`
- `storage`
- `run`

Common optional fields:

- `run_id_prefix`
- `state_file`
- `local`
- `ssh`
- `budget`
- `remote_env`
- `setup`
- `preflight`
- `staging`
- `artifacts`
- `lifecycle`
- `stop` (legacy bool form of `lifecycle`; accepted with a deprecation warning)
- `variables`

## `budget:` block — runtime + cost ceilings

```yaml
budget:
  cost_cap_usd: 10.0  # default — pod is stopped when projected $/runtime reaches the cap
  assumed_hourly_rate_usd: 1.65  # default — used in cost projections + budget eta
  max_runtime_minutes: null  # default — uncapped; derived from cost_cap_usd / hourly_rate
  poll_interval_sec: 60  # default — cadence for the run-monitor tail loop
  ssh_ready_timeout_sec: 900  # default — deadline for SSH info to populate post `pod create`
```

`ssh_ready_timeout_sec` is the wait between `runpodctl pod create` returning a pod_id and the pod's SSH proxy publishing a usable `{host, port}`. The default 900 s covers cold-pull of large pytorch/cudnn-devel images (~6–12 GB) on first-touch GPUs. For one-off bumps without editing YAML, use `runpod-deploy run --ssh-ready-timeout-sec `.

If the wait exceeds 60 s, an INFO heartbeat with `status`, `ssh.error`, and `uptimeSeconds` is emitted every 30 s. On timeout, the orchestrator deletes the orphaned pod before raising (see [`lifecycle.md`](lifecycle.md) and [`troubleshooting.md`](troubleshooting.md) for the failure workflow).

## Template variables

Every string field in the config can reference template variables via Python `str.format` syntax (`{name}`). Built-ins:

| Variable | What it is | Example use |
|---|---|---|
| `{config_dir}` | Directory containing this YAML file (absolute). | `local.project_root: "{config_dir}/../.."` |
| `{project_root}` | Resolved `local.project_root` (absolute). | `staging.source: "{project_root}/"` |
| `{run_dir}` | This run's local artifacts dir (e.g. `/artifacts/runpod/20260515T120000Z`). | `artifacts.local_path: "{run_dir}/data"` |
| `{run_id}` | The rendered prefix + timestamp (`-`). | (used internally by `runpodctl pod create --name`) |
| `{job_name}` | The rendered `name` field. | `run.body` `echo "starting {job_name}"` |
| `{volume_mount}` | `storage.volume_mount` (defaults `/workspace`). | `run.script_path: "{volume_mount}/run.sh"` |

Custom variables are declared under `variables:` (YAML) or `--var KEY=VALUE` / `--vars-file PATH` (CLI). YAML and CLI variables can reference both built-ins and earlier-defined variables (the rendering is two-pass — see [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md) for the sweep pattern).

```yaml
variables:
  hf_dataset: "anthropic/persuasion"
  data_dir: "{project_root}/data"  # references built-in
  run_label: "{job_name}-{hf_dataset}"  # references built-in + earlier var
```

CLI:

```sh
runpod-deploy run --config foo.yaml \
  --var seed=42 \
  --var out_dir="{project_root}/seed42"  # CLI vars can reference built-ins too
```

CLI `--var` overrides YAML `variables:` on key collision.

## Storage Modes

Network volume:

```yaml
storage:
  mode: network_volume
  volume_name: pid-workspace-100gb
  volume_mount: /workspace
```

Ephemeral volume:

```yaml
storage:
  mode: ephemeral
  volume_gb: 200
  volume_mount: /workspace
```

## Commands

`setup` and `preflight` are lists:

```yaml
preflight:
  - command: "cd {remote_repo} && uv sync --extra dev"
    timeout_sec: 1800
    with_env: true
```

`with_env: true` prepends `remote_env.source_files` and `remote_env.exports`.

### Recommended `remote_env.exports` for uv-driven pods on /workspace

If your `preflight:` or `run.body` runs `uv sync` and your `storage.volume_mount` is `/workspace` (RunPod's distributed FUSE filesystem), pin uv's write-heavy working directories to the pod's overlay disk and export `UV_LINK_MODE=copy`.
On FUSE, three distinct uv/HF-Trainer code paths can stall on MooseFS `F_SETLKW` exclusive-lock acquisition (uv git resolution, uv install-phase atomic writes, and HF Trainer atomic checkpoint save) — see [troubleshooting.md "uv sync hangs silently"](troubleshooting.md#uv-sync-hangs-silently-with-venv-partially-populated) + the two follow-up sections for failure signatures.

```yaml
remote_env:
  source_files:
    - /workspace/secrets/env  # if you stage secrets via the secrets: block
  exports:
    HF_HOME: /workspace/hf_cache  # FUSE OK — HF caches are read-mostly
    HUGGINGFACE_HUB_CACHE: /workspace/hf_cache
    UV_CACHE_DIR: /root/uv_cache  # overlay disk, NOT FUSE — git ops + atomic writes
    UV_PROJECT_ENVIRONMENT: /root/.venv  # overlay disk, NOT FUSE — install atomic writes
    UV_LINK_MODE: copy  # defense-in-depth for residual /workspace touchpoints
```

`/root` is the container's overlay disk (verify with `df -hT /root` — type `overlay`, NOT `fuse`). POSIX locks work normally there. The uv cache and venv are ephemeral anyway (re-populated on each run), so putting them on `/root` sacrifices nothing.

**If your training framework writes checkpoints** (HF Trainer's `save_strategy`, PyTorch Lightning's `ModelCheckpoint`, etc.) and the default `output_dir` is under `/workspace/`, the atomic-save protocol can stall on the same FUSE bug. Pin checkpoint `output_dir` to `/root/checkpoints/` and rsync back to the volume as a `run.body` trailer for orchestrator artifact pull. See [troubleshooting.md "HF Trainer checkpoint save hangs"](troubleshooting.md#hf-trainer-checkpoint-save-hangs-on-fuse-backed-output_dir).

For genuinely separate network-stall symptoms (single wheel download stuck mid-stream rather than a stalled `stat()`/`flock()`), also add `UV_HTTP_TIMEOUT: "120"` (defense against Fastly/CDN HTTP stalls) and optionally `UV_CONCURRENT_DOWNLOADS: "4"` (caps concurrency; default 50 amplifies head-of-line blocking on stalled sockets).
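
As a rough mental model for `with_env: true` (described under Commands), the sketch below builds the shell line a command might expand to when `remote_env.source_files` and `remote_env.exports` are prepended. `wrap_with_env` is a hypothetical helper, and the quoting, ordering, and `&&` joining are assumptions rather than the orchestrator's actual behavior.

```python
import shlex


def wrap_with_env(command, source_files=(), exports=None):
    """Prefix `command` with source and export statements (illustrative only)."""
    # Assumption: files are sourced first, so explicit exports can override them.
    parts = [f". {shlex.quote(path)}" for path in source_files]
    parts += [
        f"export {key}={shlex.quote(str(value))}"
        for key, value in (exports or {}).items()
    ]
    parts.append(command)  # the user's command runs last, unmodified
    return " && ".join(parts)


line = wrap_with_env(
    "cd /workspace/repo && uv sync --extra dev",
    source_files=["/workspace/secrets/env"],
    exports={"UV_CACHE_DIR": "/root/uv_cache", "UV_LINK_MODE": "copy"},
)
print(line)
```

Under these assumptions the printed line sources `/workspace/secrets/env`, exports the two uv variables, and then runs the original command, so a preflight step with `with_env: true` would see `UV_CACHE_DIR` and `UV_LINK_MODE` already set.
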
## Pod field: `python_version` (optional, default: unset)

```yaml
pod:
  image: runpod/pytorch:2.4.0
  datacenters: [EU-RO-1]
  gpu_order: ["NVIDIA H100 80GB HBM3"]
  python_version: "3.13.5"  # ← pin via uv
```

When set, the orchestrator auto-injects a preflight step that runs `uv python install && cd && uv python pin `. This installs the requested CPython interpreter on the pod and writes a `.python-version` file into the staged project directory so subsequent `uv sync` invocations honor the pin.

**Format**: `3.MINOR` or `3.MINOR.PATCH` (e.g. `"3.13"` or `"3.13.5"`). Pre-release suffixes are rejected — the field exists for reproducibility, not for chasing alphas.

**Failure mode**: a non-zero exit from the install or pin aborts the run before the user's `preflight` or run-body executes. Surfaces the issue cheaply (~30s of pod time) rather than letting a later `uv sync` fall back to the base-image interpreter.

**Why preflight, not setup**: the pin must write `.python-version` into the staged repo dir, which doesn't exist until after `_push_workspace` runs. Injecting at preflight[0] is the correctness-preserving placement.

See [`docs/recipes/reproducibility.md`](recipes/reproducibility.md) for the trade-offs.

## Staging (rsync push)

`staging` is a list of local-to-remote rsync transfers:

```yaml
staging:
  - label: source
    source: "{project_root}/"
    destination: "{remote_repo}/"
    excludes_default: true  # opt in to the hygiene preset
    excludes_extra: ["evals/", "artifacts/"]
    delete: true
```

Per-entry fields:

- `label` (required)
- `source` (required) — local path; template variables rendered
- `destination` (required) — remote path; template variables rendered
- `excludes` (optional) — explicit list of rsync `--exclude` patterns
- `excludes_default` (optional, default `false`) — when `true`, prepend the hygiene preset (`.git/`, `.venv/`, `**/__pycache__/`, `**/*.pyc`, `.pytest_cache/`, `.ruff_cache/`, `.mypy_cache/`). Saves repeating these across configs that share repo conventions.
- `excludes_extra` (optional) — additional patterns appended after `excludes_default` + `excludes`. Useful for project-specific add-ons like `evals/`, `artifacts/`, `data/`.
- `delete` (optional, default `true`)

The effective `--exclude` list passed to rsync is the concatenation: `DEFAULT_STAGING_EXCLUDES` (if `excludes_default`) + `excludes` + `excludes_extra`, in that order. Existing configs that set only `excludes` are unaffected.

## Artifacts

Artifacts are pulled after a successful run, and best-effort after failure:

```yaml
artifacts:
  - label: models
    remote_path: "{remote_repo}/artifacts/models/"
    local_path: "{project_root}/artifacts/models"
    excludes: ["**/_trainer/checkpoint-*"]
    required: true
    delete: true
```
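
To make the exclude ordering from the Staging section concrete, here is a minimal sketch of the concatenation rule. The preset contents are copied from the `excludes_default` description; `effective_excludes` and `rsync_exclude_args` are hypothetical helpers, and the real orchestrator may assemble its rsync command differently.

```python
# Preset as listed under `excludes_default` in the Staging section.
DEFAULT_STAGING_EXCLUDES = [
    ".git/", ".venv/", "**/__pycache__/", "**/*.pyc",
    ".pytest_cache/", ".ruff_cache/", ".mypy_cache/",
]


def effective_excludes(entry):
    """Concatenate preset + excludes + excludes_extra, in that order."""
    patterns = []
    if entry.get("excludes_default", False):  # default is false
        patterns += DEFAULT_STAGING_EXCLUDES
    patterns += entry.get("excludes", [])
    patterns += entry.get("excludes_extra", [])
    return patterns


def rsync_exclude_args(entry):
    """Render the patterns as rsync `--exclude=PATTERN` arguments."""
    return [f"--exclude={pattern}" for pattern in effective_excludes(entry)]


entry = {
    "label": "source",
    "excludes_default": True,
    "excludes_extra": ["evals/", "artifacts/"],
}
args = rsync_exclude_args(entry)
```

With this entry, `args` starts with `--exclude=.git/` and ends with `--exclude=artifacts/`. An entry that sets only `excludes` yields exactly those patterns, which matches the statement that existing configs are unaffected.
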