The runpod-deploy run lifecycle

What actually happens between runpod-deploy run --config foo.yaml and the artifacts landing in artifacts/runpod/<ts>/. This doc maps every YAML section to its phase, the side effects that happen at each step, and the inputs/outputs the manifest captures.

The flow is linear (no DAG, no retry-in-process) and runs once per runpod-deploy run invocation. For parallel sweeps, the consumer runs multiple invocations from a bash/Make driver (see recipes/multi-config-sweep.md).

Phase overview

┌─────────────┐   ┌─────────────┐   ┌────────────┐   ┌─────────────┐
│ 1. Validate │ → │ 2. Provision│ → │ 3. SSH wait│ → │ 4. Setup    │
└─────────────┘   └─────────────┘   └────────────┘   └─────────────┘
                                                            │
                                                            ▼
┌─────────────┐   ┌─────────────┐   ┌────────────┐   ┌─────────────┐
│ 8. Manifest │ ← │ 7. Lifecycle│ ← │ 6. Pull    │ ← │ 5. Stage    │
└─────────────┘   └─────────────┘   │  artifacts │   │ + preflight │
                                    └────────────┘   │ + launch    │
                                                     │ + monitor   │
                                                     └─────────────┘

Each phase corresponds to a function in src/runpod_deploy/orchestrator.py and a YAML config section.

1. Validate

Entrypoint: runpod-deploy validate --config foo.yaml (optionally --all for the heavy checks).

What runs:

load_job_spec(path) parses the YAML strictly. Unknown root keys raise; missing required fields raise. No template rendering at this step (raw values are stored).
build_job_context(spec, path) resolves local.project_root (relative to the config file’s parent dir) and computes the run-dir path. Raises if project_root resolves to $HOME (the v0.1.x typo guard from issue #5a).
validate_local_paths(ctx) (when --check-local or --all): verifies each local.required_paths entry exists locally.
preflight.check_gpu_availability(ctx) (--check-availability): live-queries runpodctl datacenter list to assert at least one pod.gpu_order entry is in-stock in pod.datacenters. Closes the failure mode from issue #1.
preflight.scan_consumer_pyproject(ctx) + .scan_staged_payloads_for_absolute_paths(ctx) (--scan-consumer): scans the consumer’s pyproject for runpod-deploy declared as a runtime dep (issue #4) and the staged source for hardcoded /Users/... or /home/... paths.

The validate command is purely read-only — no pod is created.

YAML inputs: every section.

Outputs: exit 0 on pass; raises ValueError / FileNotFoundError / RuntimeError with a diagnostic message on failure.
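The project_root resolution and $HOME guard described for build_job_context can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `resolve_project_root`, its `home` parameter, and the error text are hypothetical, and the real function may check more cases.

```python
from pathlib import Path


def resolve_project_root(config_path: Path, project_root: str, home: Path) -> Path:
    """Resolve project_root relative to the config file's parent directory.

    Hypothetical helper: name, signature, and error message are assumptions.
    """
    candidate = Path(project_root)
    if not candidate.is_absolute():
        candidate = config_path.parent / candidate
    resolved = candidate.resolve()
    # Guard against a typo such as "../.." that lands on the home directory,
    # which would make staging try to push the whole home tree.
    if resolved == home.resolve():
        raise ValueError(f"local.project_root resolves to the home directory: {resolved}")
    return resolved
```

Passing `home` explicitly (instead of calling `Path.home()` inside) keeps the guard easy to exercise in tests.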

2. Provision

Entrypoint: triggered by runpod-deploy run (no separate CLI).

What runs:

_capture_deploy_metadata(spec, ctx): snapshots local git SHA + dirty flag + uv.lock hash. Written to the manifest later.
_resolve_gpu_id_and_dc(spec, offline, on_failover, max_gpu_price_usd) picks (gpu_id, datacenter_id) from pod.gpu_order × pod.datacenters, walking failover events through on_failover for telemetry capture. Honors --max-gpu-price <float> via the GraphQL prices fetched from pricing.fetch_gpu_prices.
provider.provision_pod(ctx, volume_id, gpu_id, datacenter_id, dry_run, ssh_ready_timeout_sec): builds the runpodctl pod create argv via provider._build_pod_create_argv (gates --spot / --min-vcpu-count / --min-memory-in-gb via the v0.3.2 feature-detection probe), shells out, then polls runpodctl pod get until the pod publishes a usable ssh.{ip, port}. Returns a PodConnection with host, port, pod_id. Bounded by budget.ssh_ready_timeout_sec (default 900 s; also overridable per-run via runpod-deploy run --ssh-ready-timeout-sec <N>). On timeout, the orphaned pod is deleted before re-raising (see troubleshooting.md for the failure workflow). Waits longer than 60 s emit a periodic INFO heartbeat with status, ssh.error, and uptimeSeconds so operators don’t stare at a silent terminal.

The pod’s --name is set to ctx.run_id — which is rendered (v0.4.0 PR-C), so a YAML with name: demo-{seed} produces demo-42-<ts> when invoked with --var seed=42.
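As a rough illustration of that rendering, the sketch below expands a name template and appends a timestamp. It assumes `str.format`-style placeholders and a hypothetical timestamp string; `render_run_id` is not a function from the library, and the real `<ts>` format is not specified here.

```python
def render_run_id(name_template: str, variables: dict[str, str], timestamp: str) -> str:
    """Expand {placeholders} in the name, then append the run timestamp.

    Hypothetical helper: the real renderer and timestamp format may differ.
    """
    rendered = name_template.format(**variables)
    return f"{rendered}-{timestamp}"


# Mirrors `--var seed=42` with `name: demo-{seed}`.
print(render_run_id("demo-{seed}", {"seed": "42"}, "20260517T120000"))
```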

YAML inputs: pod, storage, budget, top-level name / run_id_prefix.

Outputs: provisioned pod with a public SSH port; pod_id stored in spec.resolved_state_file for later runpod-deploy cleanup --state-file recovery.
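The GPU and datacenter selection in this phase can be pictured as an ordered walk over pod.gpu_order × pod.datacenters. The sketch below is a simplified model, not the body of `_resolve_gpu_id_and_dc`: the function name, the GPU-major iteration order, the reason strings, and the treatment of GPUs with unknown prices (allowed through) are all assumptions.

```python
from typing import Callable, Optional


def pick_gpu_and_datacenter(
    gpu_order: list[str],
    datacenters: list[str],
    in_stock: set[tuple[str, str]],
    prices: dict[str, float],
    on_failover: Callable[[str, str, str], None],
    max_gpu_price_usd: Optional[float] = None,
) -> tuple[str, str]:
    """Return the first (gpu_id, datacenter_id) pair that is affordable and in stock."""
    for gpu_id in gpu_order:
        price = prices.get(gpu_id)
        over_cap = (
            max_gpu_price_usd is not None
            and price is not None
            and price > max_gpu_price_usd
        )
        for datacenter_id in datacenters:
            if over_cap:
                # Report every skipped candidate so telemetry can record it.
                on_failover(gpu_id, datacenter_id, "over_price_cap")
            elif (gpu_id, datacenter_id) not in in_stock:
                on_failover(gpu_id, datacenter_id, "out_of_stock")
            else:
                return gpu_id, datacenter_id
    raise RuntimeError("no pod.gpu_order entry is available in pod.datacenters")
```

Routing skipped candidates through a callback keeps the selection logic free of logging concerns, which matches the telemetry-capture role the text gives to on_failover.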

3. SSH wait

_wait_for_sshd(runner) polls the pod’s SSH port until ready (default ~10 min timeout). The pod’s runpod/pytorch base image usually boots in 30–90 seconds. Failures here raise RuntimeError, and the pod is then handled per lifecycle.on_failure (default stop, which pauses the pod for later SSH forensics; set it to preserve to leave the pod running).
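A bounded poll of this kind can be sketched as below. The helper names (`tcp_probe`, `wait_until_ready`), the polling interval, and the use of a plain TCP connect as the readiness check are assumptions for illustration; the real `_wait_for_sshd` may probe differently.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, connect_timeout_sec: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout_sec):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_sec: float,
    interval_sec: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll probe() until it returns True; raise RuntimeError once the deadline passes."""
    deadline = clock() + timeout_sec
    while True:
        if probe():
            return
        if clock() >= deadline:
            raise RuntimeError(f"SSH port not ready after {timeout_sec} s")
        sleep(interval_sec)
```

A caller would combine the two, for example `wait_until_ready(lambda: tcp_probe(host, port), 600, 5)`. Injecting `clock` and `sleep` makes the timeout path testable without real waiting.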

4. Setup commands

_run_commands(runner, ctx, spec.setup, label="setup") runs each setup[*].command over SSH in order. Each command’s stdout is logged at INFO; non-zero exits raise RemoteRunError and abort the run.

Typical contents:

Install missing system packages (rsync, git)
Install uv if the base image doesn’t include it
mkdir -p {remote_repo} for the staging destination

YAML inputs: setup, remote_env (when with_env: true on a command).
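The run-in-order, abort-on-first-failure behaviour of the setup phase can be sketched as follows. The `run` callable stands in for the SSH runner, and the `RemoteRunError` class here is a local stand-in for the error named above; the function name and the `(exit_code, stdout)` return shape are assumptions.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("lifecycle-sketch")


class RemoteRunError(RuntimeError):
    """Local stand-in for the library's error of the same name."""


def run_commands_in_order(
    commands: Sequence[str],
    run: Callable[[str], tuple[int, str]],
    label: str,
) -> list[str]:
    """Run each command in order; stop at the first non-zero exit code."""
    outputs = []
    for index, command in enumerate(commands):
        exit_code, stdout = run(command)
        logger.info("[%s %d] %s", label, index, stdout)
        if exit_code != 0:
            # Later commands are never attempted once one fails.
            raise RemoteRunError(f"{label}[{index}] exited {exit_code}: {command}")
        outputs.append(stdout)
    return outputs
```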

Note: setup runs before staging. That’s why the v0.5 pod.python_version pin is auto-injected at preflight (phase 5c) rather than at setup: it needs the staged project dir to write .python-version into. See recipes/reproducibility.md.

5. Stage + secrets + preflight + launch + monitor

The “hot path” — five sub-phases that happen back-to-back inside the try: block in run_job.

5a. Stage secrets

_stage_secrets(runner, ctx) walks spec.secrets. For each entry:

source_env: reads the named local env var, writes KEY=value lines to a tempfile, rsyncs to destination with --chmod=F<mode>.
source_file: rsyncs the local file directly.

Values are never logged. See issue #2 for the ephemeral-storage motivation.
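The local half of the source_env path (collecting values into a KEY=value tempfile before the rsync) might look like the sketch below. `write_env_file` is a hypothetical name, and the rsync step with --chmod is deliberately left out; only key names, never values, appear in error messages.

```python
import os
import tempfile
from typing import Mapping, Sequence


def write_env_file(keys: Sequence[str], environ: Mapping[str, str], mode: int = 0o600) -> str:
    """Write KEY=value lines for the named env vars to a private tempfile; return its path."""
    missing = [key for key in keys if key not in environ]
    if missing:
        # Report names only so secret values never reach logs or tracebacks.
        raise KeyError(f"missing local env vars: {missing}")
    fd, path = tempfile.mkstemp(prefix="secrets-", suffix=".env")
    with os.fdopen(fd, "w") as handle:
        for key in keys:
            handle.write(f"{key}={environ[key]}\n")
    os.chmod(path, mode)
    return path
```

In real use the caller would pass `os.environ` and remove the tempfile once the transfer finishes.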

5b. Push workspace

_push_workspace(runner, ctx) walks spec.staging. For each RsyncPushSpec:

Renders source and destination through ctx.render (template variables expand).
Computes effective_excludes: DEFAULT_STAGING_EXCLUDES (when excludes_default: true) + excludes + excludes_extra, in that order. New in v0.4.0 — see config-reference.md.
Each pattern in the effective list is also rendered through ctx.render, so excludes_extra: ["{job_name}-tmp/"] works.
Shells out to rsync via RsyncTransfer.
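The exclude-list composition above can be sketched in a few lines. The function below is illustrative only: the default patterns passed in the example are placeholders (the actual contents of DEFAULT_STAGING_EXCLUDES are not listed here), and `render` stands in for ctx.render.

```python
from typing import Callable, Sequence


def effective_excludes(
    defaults: Sequence[str],
    excludes: Sequence[str],
    excludes_extra: Sequence[str],
    excludes_default: bool,
    render: Callable[[str], str],
) -> list[str]:
    """Concatenate defaults (if enabled) + excludes + excludes_extra, then render each pattern."""
    patterns: list[str] = []
    if excludes_default:
        patterns.extend(defaults)
    patterns.extend(excludes)
    patterns.extend(excludes_extra)
    return [render(pattern) for pattern in patterns]


# Placeholder defaults; a str.format-based renderer is assumed for the example.
print(
    effective_excludes(
        [".git/", "__pycache__/"],
        ["data/raw/"],
        ["{job_name}-tmp/"],
        True,
        lambda pattern: pattern.format(job_name="demo"),
    )
)
```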

5c. Preflight commands

_run_commands(runner, ctx, _build_python_pin_preflight(spec) + spec.preflight, label="preflight").

When pod.python_version is set (v0.5.0 PR-G), the orchestrator auto-prepends a single command to the preflight tuple:

uv python install <ver> && cd <first-staging-destination> && uv python pin <ver>

This runs after staging (so .python-version lands in the staged project dir) and before the user’s own preflight commands.
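Building that auto-prepended command string could look like the sketch below. `build_python_pin_command` is a hypothetical name, and the shell quoting is an assumption added for safety; the real `_build_python_pin_preflight` may construct the command differently.

```python
import shlex


def build_python_pin_command(version: str, first_staging_destination: str) -> str:
    """Return the install-then-pin command for the given Python version and project dir."""
    ver = shlex.quote(version)
    dest = shlex.quote(first_staging_destination)
    # Install the interpreter, enter the staged project, then write .python-version there.
    return f"uv python install {ver} && cd {dest} && uv python pin {ver}"


print(build_python_pin_command("3.11", "/workspace/repo"))
```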

User preflight then runs (typical contents: uv sync --extra dev, data-availability checks, environment fingerprinting).

5d. Launch remote job

_launch_remote_job(runner, ctx):

Writes run.script_path on the pod with the rendered run.body.
Detaches the script via nohup ... &; the SSH command returns immediately so the orchestrator can poll the log without holding the connection.
If --print-run-dir was set, emits RUN_DIR=<ctx.run_dir> on stdout before the SSH call (v0.4.0 PR-B). Parallel-sweep drivers grep this line to know which artifacts/runpod/<ts>/ dir belongs to this shard.
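A sweep driver written in Python (rather than bash) could pick the RUN_DIR= line out of captured stdout along these lines. `parse_run_dir` is a hypothetical helper for the consumer side, not part of runpod-deploy.

```python
from typing import Optional

RUN_DIR_PREFIX = "RUN_DIR="


def parse_run_dir(stdout: str) -> Optional[str]:
    """Return the value of the first RUN_DIR=<path> line, or None if absent."""
    for line in stdout.splitlines():
        if line.startswith(RUN_DIR_PREFIX):
            return line[len(RUN_DIR_PREFIX):].strip()
    return None
```

A driver would typically capture stdout per shard and call `parse_run_dir` once the first lines arrive, so it can map each shard to its artifacts directory before the run finishes.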

5e. Monitor remote log

_monitor_remote_log(runner, ctx, tel=tel) polls the pod’s log file (run.log_path) for either:

run.success_marker → run-ok, exit poll.
Any string in run.failure_markers → raise.
Timeout (budget.timeout_sec) → raise.

Telemetry samples (nvidia-smi, pip freeze) run at the configured interval throughout the poll.
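The three exit conditions of the monitor loop can be modelled as below. This is a sketch under assumptions: the function names are hypothetical, `read_log` stands in for fetching run.log_path over SSH, telemetry sampling is omitted, and checking the success marker before the failure markers simply mirrors the list order above (the real precedence is not stated).

```python
import time
from typing import Callable, Sequence


def classify_log(log_text: str, success_marker: str, failure_markers: Sequence[str]) -> str:
    """Return "success", "failure", or "running" for the current log contents."""
    if success_marker in log_text:
        return "success"
    if any(marker in log_text for marker in failure_markers):
        return "failure"
    return "running"


def monitor_log(
    read_log: Callable[[], str],
    success_marker: str,
    failure_markers: Sequence[str],
    timeout_sec: float,
    interval_sec: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the log until a marker appears; raise on a failure marker or on timeout."""
    deadline = clock() + timeout_sec
    while True:
        state = classify_log(read_log(), success_marker, failure_markers)
        if state == "success":
            return
        if state == "failure":
            raise RuntimeError("failure marker found in remote log")
        if clock() >= deadline:
            raise TimeoutError(f"no marker within {timeout_sec} s")
        sleep(interval_sec)
```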

6. Artifact pull

After the run script exits (success or failure), _pull_artifacts_and_log walks spec.artifacts:

For each entry: rsync from remote_path to local_path with excludes honored.
required: true failures raise; required: false failures log a WARNING and continue.

The pod’s run log (run.log_path) is also pulled to <run_dir>/run.log regardless of success/failure (as long as the run script started — i.e., not preflight-failure).
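The required/optional distinction can be sketched as a loop over artifact entries. In this illustration `pull` stands in for the rsync transfer, the entry keys mirror the YAML field names mentioned above, and the status strings and the default of `required` when omitted are assumptions.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("lifecycle-sketch")


def pull_artifacts(
    entries: list[dict[str, Any]],
    pull: Callable[[str, str], None],
) -> list[dict[str, str]]:
    """Pull each artifact; re-raise for required entries, warn and continue otherwise."""
    results = []
    for entry in entries:
        try:
            pull(entry["remote_path"], entry["local_path"])
        except Exception as exc:
            if entry.get("required", True):  # assumed default when the key is absent
                raise
            logger.warning("optional artifact %r failed: %s", entry["label"], exc)
            results.append({"label": entry["label"], "status": "failed"})
        else:
            results.append({"label": entry["label"], "status": "ok"})
    return results
```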

7. Lifecycle action (cleanup)

After the run completes (or fails), the orchestrator calls cleanup_pod(pod_id, action, dry_run, state_file, volume_in_gb) with one of four actions, controlled by the YAML lifecycle: block:

lifecycle:
  on_success: delete    # default — release volume disk on success
  on_failure: stop      # default — preserve for SSH forensics

Action semantics:

action	runpodctl call	compute billing	volume disk billing	state file	next run
`preserve`	(none)	continues at GPU rate	continues at volume rate	preserved	—
`stop`	`pod stop <id>`	stops	continues at ~$0.10/GB·month indefinitely	preserved	—
`delete`	`pod delete <id>`	stops	stops	unlinked	fresh pod
`recycle`	`pod stop <id>`	stops	continues at ~$0.10/GB·month (kept on purpose)	preserved	resumes paused pod via `pod start <id>`

recycle is the success-path-only “warm cache” action: the pod is paused (same wire call as stop) AND the state-file is preserved so the next runpod-deploy run with the same state_file: finds the paused pod, validates image/GPU/datacenter compatibility, and calls runpodctl pod start <id> instead of pod create. Saves the image-pull + cold-boot cost per recurring run (typically 3–5 min). See recipes/recycle-pod-for-fast-iteration.md.
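The action table translates naturally into a small dispatch function. The sketch below encodes only what the table states (which runpodctl call is made and whether the state file survives); `CleanupPlan`, `plan_cleanup`, and `choose_action` are hypothetical names, and the real `cleanup_pod` takes more parameters.

```python
from typing import NamedTuple, Optional


class CleanupPlan(NamedTuple):
    argv: Optional[list[str]]  # runpodctl command to run, or None for no call
    keep_state_file: bool      # False means the state file is unlinked


def choose_action(failed: bool, on_success: str = "delete", on_failure: str = "stop") -> str:
    """Pick the lifecycle action for the run outcome (defaults as documented)."""
    return on_failure if failed else on_success


def plan_cleanup(action: str, pod_id: str) -> CleanupPlan:
    """Map a lifecycle action to its runpodctl call and state-file handling."""
    if action == "preserve":
        return CleanupPlan(None, True)
    if action in ("stop", "recycle"):
        # Same wire call; recycle differs only in how the next run treats the paused pod.
        return CleanupPlan(["runpodctl", "pod", "stop", pod_id], True)
    if action == "delete":
        return CleanupPlan(["runpodctl", "pod", "delete", pod_id], False)
    raise ValueError(f"unknown lifecycle action: {action!r}")
```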

The defaults on_success: delete and on_failure: stop encode the operational discipline that prevents storage leaks while preserving forensic access:

Success path → delete releases the volume disk. The run is done, all artifacts have been pulled, the manifest is written. There is nothing left worth paying storage for.

Failure path → stop keeps the pod paused so the operator can runpodctl pod start <id> later and SSH in for post-mortem. The orchestrator emits a multi-line WARNING with the exact release command so the operator never has to remember the cleanup syntax:

[lifecycle] pod 'abc123' stopped for forensics.
  Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
  When done investigating, release with:
      runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
  Or audit all stale pods:
      runpod-deploy ls-stale

If you don’t want SSH-forensics for failed runs, set lifecycle: {on_failure: delete} to release disk on every run regardless of outcome.

7b. Cost discipline: cleaning up after forensics

The on_failure: stop default trades a small ongoing storage cost (~$0.17/day per 50 GB pod) for the option to SSH in after a failure. This trade-off is only sustainable if you *actually* clean up the preserved pods. **Stopped pods continue billing the volume disk indefinitely** at RunPod's $0.10/GB·month preserved-volume rate.
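The per-pod figures quoted here follow from simple arithmetic on the $0.10/GB·month rate. The sketch below reproduces them, assuming a 30-day month for the daily figure; `idle_volume_cost` is a hypothetical helper.

```python
def idle_volume_cost(
    volume_gb: float,
    rate_per_gb_month: float = 0.10,
    days_per_month: int = 30,  # assumption used to derive the daily figure
) -> tuple[float, float]:
    """Return (monthly_usd, daily_usd) for a stopped pod's volume disk."""
    monthly = volume_gb * rate_per_gb_month
    return monthly, monthly / days_per_month


monthly, daily = idle_volume_cost(50)
print(f"50 GB: ${monthly:.2f}/mo, ${daily:.2f}/day")
```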

Why this section exists

On 2026-05-17, this repo had 76 EXITED pods totaling 3,930 GB — nothing actively running, but **$1.10/hr (~$26/day, ~$393/month)** of idle storage burn. The leak existed because `stop_pod` only paused pods (never deleted them) and operators assumed "stop" meant "terminated". After releasing those 76 pods, idle burn dropped 110× to $0.01/hr. The current schema (lifecycle.on_success: delete by default + the failure-path WARNING + ls-stale audit + cleanup --all-stopped bulk-release) is the structural fix that prevents the same gap reopening.

The forensics-then-cleanup workflow

After a failed run that preserved a pod:

Audit: runpod-deploy ls-stale lists every EXITED pod on the account with its volume size and estimated daily/monthly cost, plus a TOTAL footer. Read-only; safe to run anytime.
Investigate: start with the pulled manifest in artifacts/runpod/<ts>/runpod_pull_manifest.json; most failure modes are visible in the captured logs and telemetry without needing SSH. If you do need SSH: runpodctl pod start <id>, then runpodctl pod ssh <id>.
Release (single pod):
```
runpod-deploy cleanup --state-file <path> --mode delete
```
The state file path is logged by the per-run WARNING. The default --mode is delete — you almost always want to release disk after forensics. --mode stop re-pauses (rarely useful); --mode preserve is a no-op.
Release (bulk, after a sweep of failed runs):
```
runpod-deploy cleanup --all-stopped         # interactive y/N prompt
runpod-deploy cleanup --all-stopped --yes   # non-interactive
```
Equivalent to runpodctl pod list -a --status EXITED -o json | jq -r '.[].id' | xargs -I{} runpodctl pod delete {} but ships in the SDK with failure collection (one bad delete doesn’t abort the rest).
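The "failure collection" idea in the bulk release (keep going after a bad delete and report at the end) can be sketched as follows. `bulk_delete` and the injected `delete_pod` callable are hypothetical; the real command's internals are not shown here.

```python
from typing import Callable, Iterable


def bulk_delete(pod_ids: Iterable[str], delete_pod: Callable[[str], None]) -> dict[str, str]:
    """Try to delete every pod; return {pod_id: error message} for the ones that failed."""
    failures: dict[str, str] = {}
    for pod_id in pod_ids:
        try:
            delete_pod(pod_id)
        except Exception as exc:
            # Record and continue so one bad delete does not abort the rest.
            failures[pod_id] = str(exc)
    return failures
```

The contrast with the xargs pipeline is that the caller gets one consolidated failure report rather than having to scan interleaved output.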

Recommended hygiene

Add runpod-deploy ls-stale to a weekly cron / CI job to detect drift before it becomes a leak. See recipes/stale-pod-audit.md.
When iterating on a workflow that fails repeatedly during development, set lifecycle: {on_failure: delete} in that config so failed runs don’t accumulate. Revert to on_failure: stop only when you genuinely want SSH access on failure.
For workflows where the staging payload is the slow part of every run (large repos, repeated rsyncs), consider switching to storage.mode: network_volume so the volume persists across pods and rsync is incremental. See recipes/payload-reuse-via-network-volume.md.

8. Manifest write

write_pull_manifest(ctx, failed, pod, datacenter_id, deploy_metadata, artifact_results, telemetry_files, wall_time_sec, gpu_price_per_hour_usd, gpu_price_source, pod_final_state) serializes everything to <run_dir>/runpod_deploy_pull_manifest.json (schema v2).

Captured fields:

Provenance: job_name, run_id, schema_version=v2, pod_id, gpu_id, datacenter_id, image, storage_mode.
Cost/timing: wall_time_sec, gpu_price_per_hour_usd, gpu_price_source (e.g. pod_describe), estimated_cost_usd, cost_cap_usd.
Pod final state: pod_final_state (e.g. EXITED, pod_killed_unexpected).
Deploy metadata: local_git_sha, local_git_dirty, payload_lockfile.
Artifacts: list of {label, status, duration_sec, bytes_transferred}.
Telemetry files: list of telemetry-snapshot filenames.

This manifest is the source of truth for forensic queries (runpod-deploy ls-runs, manifest-summary, compare-runs, events-query).
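For ad-hoc inspection outside those commands, a loaded manifest can be summarized with a few lines of Python. The sketch below is a guess at a convenient shape, not the library's query code: the `artifacts` key name, the `"ok"` status value, and the cost formula (wall time in hours times hourly price) are assumptions; only the field names listed above come from this document.

```python
from typing import Any


def summarize_manifest(manifest: dict[str, Any]) -> dict[str, Any]:
    """Extract a short summary from a manifest dict already loaded from JSON."""
    hours = manifest["wall_time_sec"] / 3600
    price = manifest.get("gpu_price_per_hour_usd")
    # Assumed formula; compare against estimated_cost_usd in the manifest itself.
    recomputed = round(hours * price, 4) if price is not None else None
    failed = [
        artifact["label"]
        for artifact in manifest.get("artifacts", [])  # key name is an assumption
        if artifact.get("status") != "ok"              # status value is an assumption
    ]
    return {
        "run_id": manifest.get("run_id"),
        "pod_final_state": manifest.get("pod_final_state"),
        "recomputed_cost_usd": recomputed,
        "failed_artifacts": failed,
    }
```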

Where each YAML section maps

YAML section	Phase	Function
`schema_version`, `name`, `run_id_prefix`, `state_file`	All	`build_job_context`
`local`	1	`validate_local_paths`, `build_job_context`
`pod` (incl. `python_version`)	2	`provision_pod`, `_build_python_pin_preflight`
`storage`	2	`resolve_volume`, `provision_pod`
`ssh`	3	`RemoteRunner` construction
`budget`	2 + 5e	cost cap + monitor timeout
`remote_env`	4 + 5c	`_remote_env_prefix` (when `with_env: true`)
`setup`	4	`_run_commands(label="setup")`
`staging`	5b	`_push_workspace`
`secrets`	5a	`_stage_secrets`
`preflight`	5c	`_run_commands(label="preflight")`
`run`	5d + 5e	`_launch_remote_job`, `_monitor_remote_log`
`artifacts`	6	`_pull_artifacts_and_log`
`lifecycle`	7	`cleanup_pod`
`telemetry`	5d + 5e	`telemetry.start_session`, `tel.start_sampling()`
`variables` (+ `--var` / `--vars-file`)	1 (resolution)	`build_job_context` two-pass render

Failure handling

The orchestrator’s try/except/finally block is the canonical failure flow:

Exception before run_started = True (phases 3–5c): no artifact pulls (the run script never executed). Act on pod per lifecycle.on_failure. Log a WARNING.
Exception after run_started = True (phases 5d/5e/6): best-effort artifact pull (suppresses second-order exceptions) + tel.capture_end(). Act on pod per lifecycle.on_failure.
The manifest is always written in the finally block; an exception raised during the write itself is suppressed and logged.
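The control flow described in this list can be condensed into a small model. This is a simplified sketch, not run_job itself: the two callables stand in for phases 3–5c and 5d–5e, events are recorded as strings instead of performing real work, and the function returns the failure flag where the real orchestrator presumably propagates the exception.

```python
from typing import Callable


def run_job_sketch(
    pre_launch: Callable[[], None],          # phases 3-5c
    launch_and_monitor: Callable[[], None],  # phases 5d-5e
    on_success: str = "delete",
    on_failure: str = "stop",
) -> tuple[bool, list[str]]:
    """Return (failed, ordered list of actions taken)."""
    events: list[str] = []
    run_started = False
    failed = False
    try:
        pre_launch()
        run_started = True
        launch_and_monitor()
        events.append("pull_artifacts")
        events.append(f"lifecycle:{on_success}")
    except Exception:
        failed = True
        if run_started:
            # The run script executed, so logs and partial artifacts may exist.
            events.append("best_effort_pull")
        events.append(f"lifecycle:{on_failure}")
    finally:
        # Written on every path, success or failure.
        events.append("write_manifest")
    return failed, events
```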

For deeper failure-mode debugging see troubleshooting.md.

`--dry-run` vs `--offline-dry-run`

Both flags walk the lifecycle without provisioning, but they differ in whether external read-only queries are made:

Flag	External calls?	Use case
`--offline-dry-run`	None — no `runpodctl`, no SSH, no rsync. GPU/DC selection uses synthetic sentinels.	CI tests, fast config iteration, validation when you’re offline or don’t have a RunPod account
`--dry-run`	Read-only queries only: `runpodctl datacenter list` is queried so live GPU stock info is reflected; `runpod-deploy gpu-prices` is queried if `--max-gpu-price` is set. Pod create / SSH / rsync are mocked.	“Will this config actually find a GPU in stock right now?” without provisioning

In code: --offline-dry-run implies dry_run=True in run_job, and additionally passes offline=True to _resolve_gpu_id_and_dc and _resolve_volume_id. The CLI gates the external calls via that offline flag.
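The relationship between the two flags reduces to a pair of booleans. The sketch below shows that mapping only; `resolve_dry_run_flags` is a hypothetical helper, and how the CLI actually wires the values into run_job is not shown in this document.

```python
def resolve_dry_run_flags(dry_run: bool, offline_dry_run: bool) -> tuple[bool, bool]:
    """Return (dry_run, offline): --offline-dry-run implies dry_run and sets offline."""
    return (dry_run or offline_dry_run, offline_dry_run)


# --offline-dry-run alone: no provisioning, no external queries.
print(resolve_dry_run_flags(dry_run=False, offline_dry_run=True))
```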