The runpod-deploy run lifecycle
What actually happens between runpod-deploy run --config foo.yaml
and the artifacts landing in artifacts/runpod/<ts>/. This doc maps
every YAML section to its phase, the side effects that happen at each
step, and the inputs/outputs the manifest captures.
The flow is linear (no DAG, no retry-in-process) and runs once per
runpod-deploy run invocation. For parallel sweeps, the consumer
runs multiple invocations from a bash/Make driver
(see recipes/multi-config-sweep.md).
Phase overview
┌─────────────┐   ┌──────────────┐   ┌─────────────┐   ┌──────────────┐
│ 1. Validate │ → │ 2. Provision │ → │ 3. SSH wait │ → │ 4. Setup     │
└─────────────┘   └──────────────┘   └─────────────┘   └──────────────┘
                                                              │
                                                              ▼
┌─────────────┐   ┌──────────────┐   ┌─────────────┐   ┌──────────────┐
│ 8. Manifest │ ← │ 7. Stop pod  │ ← │ 6. Pull     │ ← │ 5. Stage     │
└─────────────┘   └──────────────┘   │  artifacts  │   │  + preflight │
                                     └─────────────┘   │  + launch    │
                                                       │  + monitor   │
                                                       └──────────────┘
Each phase corresponds to a function in
src/runpod_deploy/orchestrator.py
and a YAML config section.
1. Validate
Entrypoint: runpod-deploy validate --config foo.yaml
(optionally --all for the heavy checks).
What runs:
- `load_job_spec(path)` parses the YAML strictly. Unknown root keys raise; missing required fields raise. No template rendering happens at this step (raw values are stored).
- `build_job_context(spec, path)` resolves `local.project_root` (relative to the config file’s parent dir) and computes the run-dir path. Raises if `project_root` resolves to `$HOME` (the v0.1.x typo guard from issue #5a).
- `validate_local_paths(ctx)` (when `--check-local` or `--all`): verifies that each `local.required_paths` entry exists locally.
- `preflight.check_gpu_availability(ctx)` (`--check-availability`): live-queries `runpodctl datacenter list` to assert that at least one `pod.gpu_order` entry is in stock in `pod.datacenters`. Closes the failure mode from issue #1.
- `preflight.scan_consumer_pyproject(ctx)` + `.scan_staged_payloads_for_absolute_paths(ctx)` (`--scan-consumer`): scans for `runpod-deploy` declared as a runtime dep (issue #4) and for hardcoded `/Users/...` / `/home/...` paths in staged source.
The validate command is purely read-only — no pod is created.
YAML inputs: every section.
Outputs: exit 0 on pass; raises ValueError / FileNotFoundError
/ RuntimeError with a diagnostic message on failure.
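The `project_root` resolution and `$HOME` guard described above can be illustrated with a minimal sketch. `resolve_project_root` is a hypothetical helper written for this example, not the body of `build_job_context`; only the two behaviours stated above (resolve relative to the config file’s parent dir, raise if the result is the home directory) are modelled.

```python
from pathlib import Path


def resolve_project_root(config_path: Path, project_root: str) -> Path:
    """Resolve project_root relative to the config file's parent directory.

    Hypothetical helper for illustration only.
    """
    # A relative project_root is joined onto the config's parent dir;
    # an absolute one replaces it (standard pathlib join semantics).
    resolved = (config_path.resolve().parent / project_root).resolve()
    # Typo guard: a project_root that lands on $HOME would stage the
    # whole home directory, which is almost never intended.
    if resolved == Path.home().resolve():
        raise ValueError(f"local.project_root resolves to $HOME ({resolved})")
    return resolved
```

For example, a config at `configs/foo.yaml` with `project_root: ..` resolves to the directory that contains `configs/`.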
2. Provision
Entrypoint: triggered by runpod-deploy run (no separate CLI).
What runs:
- `_capture_deploy_metadata(spec, ctx)`: snapshots the local git SHA + dirty flag + `uv.lock` hash. Written to the manifest later.
- `_resolve_gpu_id_and_dc(spec, offline, on_failover, max_gpu_price_usd)` picks `(gpu_id, datacenter_id)` from `pod.gpu_order` × `pod.datacenters`, walking failover events through `on_failover` for telemetry capture. Honors `--max-gpu-price <float>` via the GraphQL prices fetched from `pricing.fetch_gpu_prices`.
- `provider.provision_pod(ctx, volume_id, gpu_id, datacenter_id, dry_run, ssh_ready_timeout_sec)`: builds the `runpodctl pod create` argv via `provider._build_pod_create_argv` (gates `--spot` / `--min-vcpu-count` / `--min-memory-in-gb` via the v0.3.2 feature-detection probe), shells out, then polls `runpodctl pod get` until the pod publishes a usable `ssh.{ip, port}`. Returns a `PodConnection` with `host`, `port`, `pod_id`. Bounded by `budget.ssh_ready_timeout_sec` (default 900 s; also overridable per run via `runpod-deploy run --ssh-ready-timeout-sec <N>`). On timeout, the orphaned pod is deleted before re-raising (see `troubleshooting.md` for the failure workflow). Waits longer than 60 s emit a periodic INFO heartbeat with `status`, `ssh.error`, and `uptimeSeconds` so operators don’t stare at a silent terminal.
The pod’s --name is set to ctx.run_id — which is rendered
(v0.4.0 PR-C), so a YAML with name: demo-{seed} produces
demo-42-<ts> when invoked with --var seed=42.
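The `demo-{seed}` to `demo-42-<ts>` rendering can be sketched as follows. This is an illustration under assumptions, not the orchestrator’s code: `render_run_id` is a hypothetical helper, `str.format` stands in for the real template renderer, and the timestamp format is invented because the doc only shows `<ts>`.

```python
from datetime import datetime, timezone
from typing import Optional


def render_run_id(name_template: str, variables: dict[str, str],
                  now: Optional[datetime] = None) -> str:
    """Expand {var} placeholders in the name, then append a timestamp suffix."""
    now = now or datetime.now(timezone.utc)
    rendered = name_template.format(**variables)  # stand-in for the real renderer
    # Assumed timestamp format; the actual <ts> layout is not specified here.
    return f"{rendered}-{now.strftime('%Y%m%dT%H%M%SZ')}"
```

With `--var seed=42` mapped to `{"seed": "42"}`, the rendered prefix is `demo-42`, followed by the timestamp suffix.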
YAML inputs: pod, storage, budget, top-level name /
run_id_prefix.
Outputs: provisioned pod with a public SSH port; pod_id stored
in spec.resolved_state_file for later runpod-deploy cleanup --state-file
recovery.
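The GPU × datacenter walk with failover events and a price cap might look roughly like the sketch below. Everything here is an assumption made for illustration: `pick_gpu_and_dc`, the `in_stock` set, the `prices` mapping, the iteration order, and the `(gpu_id, datacenter_id, reason)` callback signature are hypothetical and are not taken from `_resolve_gpu_id_and_dc`.

```python
from typing import Callable, Optional


def pick_gpu_and_dc(
    gpu_order: list[str],
    datacenters: list[str],
    in_stock: set[tuple[str, str]],
    prices: dict[str, float],
    max_gpu_price_usd: Optional[float] = None,
    on_failover: Optional[Callable[[str, str, str], None]] = None,
) -> tuple[str, str]:
    """Return the first (gpu_id, datacenter_id) that is in stock and under the cap."""
    for gpu_id in gpu_order:  # preference order comes from the config
        price = prices.get(gpu_id)
        over_cap = (
            max_gpu_price_usd is not None
            and price is not None
            and price > max_gpu_price_usd
        )
        for dc in datacenters:
            if over_cap:
                reason = "over_price_cap"
            elif (gpu_id, dc) not in in_stock:
                reason = "out_of_stock"
            else:
                return gpu_id, dc
            # Report every skipped candidate so telemetry can record the failover.
            if on_failover is not None:
                on_failover(gpu_id, dc, reason)
    raise RuntimeError("no (gpu, datacenter) candidate satisfies stock and price cap")
```

Passing the callback in, rather than logging inside the loop, keeps the selection logic free of telemetry concerns.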
3. SSH wait
_wait_for_sshd(runner) polls the pod’s SSH port until ready (default
~10 min timeout). The pod’s runpod/pytorch base image usually boots
in 30–90 seconds. Failures here raise RuntimeError and the pod is
acted on per lifecycle.on_failure (default stop; set to preserve
to keep the pod for SSH forensics).
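A bounded readiness poll of this kind can be sketched as below. This is a simplified stand-in for `_wait_for_sshd`, not its implementation: `tcp_port_open` and `wait_until_ready` are hypothetical names, and an open TCP port is only an approximation of "sshd is ready".

```python
import socket
import time
from typing import Callable


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_sec: float,
                     interval_sec: float = 5.0) -> float:
    """Call probe() until it returns True; raise RuntimeError after timeout_sec."""
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start  # seconds spent waiting
        if time.monotonic() - start >= timeout_sec:
            raise RuntimeError(f"SSH not ready after {timeout_sec:.0f} s")
        time.sleep(interval_sec)
```

A caller would pass something like `lambda: tcp_port_open(host, port)` as the probe; keeping the probe injectable makes the loop testable without a network.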
4. Setup commands
_run_commands(runner, ctx, spec.setup, label="setup") runs each
setup[*].command over SSH in order. Each command’s stdout is logged
at INFO; non-zero exits raise RemoteRunError and abort the run.
Typical contents:
- Install missing system packages (`rsync`, `git`)
- Install `uv` if the base image doesn’t include it
- `mkdir -p {remote_repo}` for the staging destination
YAML inputs: setup, remote_env (when with_env: true on a
command).
Note: setup runs before staging. That’s why the v0.5
pod.python_version pin gets auto-injected at preflight (phase 5),
not setup — it needs the staged project dir to write .python-version
into. See recipes/reproducibility.md.
5. Stage + secrets + preflight + launch + monitor
The “hot path” — five sub-phases that happen back-to-back inside the
try: block in run_job.
5a. Stage secrets
_stage_secrets(runner, ctx) walks spec.secrets. For each entry:
- `source_env`: reads the named local env var, writes `KEY=value` lines to a tempfile, and rsyncs it to `destination` with `--chmod=F<mode>`.
- `source_file`: rsyncs the local file directly.
Values are never logged. See issue #2 for the ephemeral-storage motivation.
5b. Push workspace
_push_workspace(runner, ctx) walks spec.staging. For each
RsyncPushSpec:
1. Renders `source` and `destination` through `ctx.render` (template variables expand).
2. Computes `effective_excludes`: `DEFAULT_STAGING_EXCLUDES` (when `excludes_default: true`) + `excludes` + `excludes_extra`, in that order. New in v0.4.0; see `config-reference.md`.
3. Renders each pattern in the effective list through `ctx.render` as well, so `excludes_extra: ["{job_name}-tmp/"]` works.
4. Shells out to `rsync` via `RsyncTransfer`.
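The exclude-list composition above can be sketched in a few lines. This is an illustration, not the project’s code: the contents of `DEFAULT_STAGING_EXCLUDES` shown here are placeholder values, `effective_excludes` is written as a free function for the example, and `str.format` stands in for `ctx.render`.

```python
# Placeholder defaults for illustration; the real tuple lives in the package.
DEFAULT_STAGING_EXCLUDES = (".git/", "__pycache__/", ".venv/")


def effective_excludes(excludes_default: bool, excludes: list[str],
                       excludes_extra: list[str],
                       variables: dict[str, str]) -> list[str]:
    """Concatenate defaults + excludes + excludes_extra, then render each pattern."""
    patterns: list[str] = []
    if excludes_default:
        patterns.extend(DEFAULT_STAGING_EXCLUDES)
    patterns.extend(excludes)
    patterns.extend(excludes_extra)
    # str.format is a stand-in renderer; patterns with literal braces
    # would need escaping under this simplification.
    return [pattern.format(**variables) for pattern in patterns]
```

Each resulting pattern would then be passed to `rsync` as an `--exclude` argument.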
5c. Preflight commands
_run_commands(runner, ctx, _build_python_pin_preflight(spec) + spec.preflight, label="preflight").
When pod.python_version is set (v0.5.0 PR-G), the orchestrator
auto-prepends a single command to the preflight tuple:
uv python install <ver> && cd <first-staging-destination> && uv python pin <ver>
This runs after staging (so .python-version lands in the staged
project dir) and before the user’s own preflight commands.
User preflight then runs (typical contents: uv sync --extra dev,
data-availability checks, environment fingerprinting).
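The auto-prepended pin command can be sketched as a small builder. `build_python_pin_command` is a hypothetical stand-in for `_build_python_pin_preflight`: it returns a tuple of command strings rather than whatever command objects the orchestrator uses, and the shell quoting is an addition of this sketch.

```python
import shlex
from typing import Optional


def build_python_pin_command(python_version: Optional[str],
                             first_staging_destination: str) -> tuple[str, ...]:
    """Return the single pin command, or an empty tuple when no version is set."""
    if python_version is None:
        return ()
    ver = shlex.quote(python_version)
    dest = shlex.quote(first_staging_destination)
    # Install the interpreter, then pin it inside the staged project dir
    # so .python-version lands next to the project files.
    return (f"uv python install {ver} && cd {dest} && uv python pin {ver}",)
```

Concatenating this tuple in front of the user’s preflight commands preserves the "pin first, then user preflight" ordering described above.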
5d. Launch remote job
_launch_remote_job(runner, ctx):
- Writes `run.script_path` on the pod with the rendered `run.body`.
- Detaches the script via `nohup ... &`; the SSH command returns immediately so the orchestrator can poll the log without holding the connection.
- If `--print-run-dir` was set, emits `RUN_DIR=<ctx.run_dir>` on stdout before the SSH call (v0.4.0 PR-B). Parallel-sweep drivers grep this line to know which `artifacts/runpod/<ts>/` dir belongs to this shard.
5e. Monitor remote log
_monitor_remote_log(runner, ctx, tel=tel) polls the pod’s log file
(run.log_path) for either:
- `run.success_marker` → run-ok, exit poll.
- Any string in `run.failure_markers` → raise.
- Timeout (`budget.timeout_sec`) → raise.
Telemetry samples (nvidia-smi, pip freeze) run at the configured
interval throughout the poll.
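The three exit conditions of the poll can be sketched as a loop over an injected log reader. This is a minimal illustration, not `_monitor_remote_log`: `monitor_log` and `read_log` are hypothetical, telemetry sampling is omitted, and checking the success marker before the failure markers when both are present is an assumption.

```python
import time
from typing import Callable, Sequence


def monitor_log(read_log: Callable[[], str], success_marker: str,
                failure_markers: Sequence[str], timeout_sec: float,
                poll_interval_sec: float = 10.0) -> None:
    """Poll the log until the success marker appears; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_sec
    while True:
        text = read_log()  # e.g. the current contents of run.log_path
        if success_marker in text:
            return  # run-ok, exit poll
        for marker in failure_markers:
            if marker in text:
                raise RuntimeError(f"failure marker found in log: {marker!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no success marker after {timeout_sec:.0f} s")
        time.sleep(poll_interval_sec)
```

In practice `read_log` would fetch the log over SSH; re-reading the whole file on each poll is the simplest form and could be replaced by tailing from an offset.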
6. Artifact pull
After the run script exits (success or failure), _pull_artifacts_and_log
walks spec.artifacts:
- For each entry: rsync from `remote_path` to `local_path` with `excludes` honored.
- `required: true` failures raise; `required: false` failures log a WARNING and continue.
The pod’s run log (run.log_path) is also pulled to
<run_dir>/run.log regardless of success/failure (as long as the
run script started — i.e., not preflight-failure).
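The `required` semantics can be sketched with an injected transfer function. This is an illustration only: `pull_artifacts` and its `pull` callback are hypothetical, artifact entries are modelled as plain dicts, and the `"ok"` / `"failed"` status strings are assumed values rather than the ones the manifest uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("artifact_pull_sketch")


def pull_artifacts(artifacts: list[dict],
                   pull: Callable[[str, str], None]) -> list[dict]:
    """Pull each artifact; required failures propagate, optional ones are logged."""
    results = []
    for art in artifacts:
        try:
            pull(art["remote_path"], art["local_path"])
            status = "ok"
        except Exception as exc:
            if art["required"]:
                raise  # a required artifact failing aborts the pull
            logger.warning("optional artifact %s failed: %s", art["label"], exc)
            status = "failed"
        results.append({"label": art["label"], "status": status})
    return results
```

The returned list has the same per-artifact shape (label plus status) that a manifest writer could consume.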
7. Lifecycle action (cleanup)
After the run completes (or fails), the orchestrator calls
cleanup_pod(pod_id, action, dry_run, state_file, volume_in_gb) with
one of four actions, controlled by the YAML lifecycle: block:
    lifecycle:
      on_success: delete   # default: release volume disk on success
      on_failure: stop     # default: preserve for SSH forensics
Action semantics:
| action | runpodctl call | compute billing | volume disk billing | state file | next run |
|---|---|---|---|---|---|
| `preserve` | (none) | continues at GPU rate | continues at volume rate | preserved | n/a |
| `stop` | | stops | continues at ~$0.10/GB·month indefinitely | preserved | n/a |
| `delete` | `runpodctl pod delete <id>` | stops | stops | unlinked | fresh pod |
| `recycle` | same wire call as `stop` | stops | continues at ~$0.10/GB·month (kept on purpose) | preserved | resumes paused pod via `runpodctl pod start <id>` |
recycle is the success-path-only “warm cache” action: the pod is
paused (same wire call as stop) AND the state-file is preserved so
the next runpod-deploy run with the same state_file: finds the
paused pod, validates image/GPU/datacenter compatibility, and calls
runpodctl pod start <id> instead of pod create. Saves the
image-pull + cold-boot cost per recurring run (typically 3–5 min).
See recipes/recycle-pod-for-fast-iteration.md.
The defaults on_success: delete and on_failure: stop encode the
operational discipline that prevents storage leaks while preserving
forensic access:
- Success path → `delete` releases the volume disk. The run is done, all artifacts have been pulled, the manifest is written. There is nothing left worth paying storage for.
- Failure path → `stop` keeps the pod paused so the operator can `runpodctl pod start <id>` later and SSH in for post-mortem. The orchestrator emits a multi-line WARNING with the exact release command so the operator never has to remember the cleanup syntax:

      [lifecycle] pod 'abc123' stopped for forensics.
      Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
      When done investigating, release with:
        runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
      Or audit all stale pods:
        runpod-deploy ls-stale

If you don’t want SSH-forensics for failed runs, set
lifecycle: {on_failure: delete} to release disk on every run
regardless of outcome.
7b. Cost discipline: cleaning up after forensics
The `on_failure: stop` default trades a small ongoing storage cost
(~$0.17/day per 50 GB pod) for the option to SSH in after a failure.
This trade-off is only sustainable if you *actually* clean up the
preserved pods. **Stopped pods continue billing the volume disk
indefinitely** at RunPod's $0.10/GB·month preserved-volume rate.
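The per-pod figures follow directly from the quoted rate, as this small worked sketch shows. The $0.10/GB·month rate is the one stated in this doc; the 30-day month used for the daily figure and the helper name `stopped_pod_storage_cost` are assumptions of the example.

```python
VOLUME_RATE_USD_PER_GB_MONTH = 0.10  # preserved-volume rate quoted in this doc
DAYS_PER_MONTH = 30                  # assumption used for the daily estimate


def stopped_pod_storage_cost(volume_gb: float) -> tuple[float, float]:
    """Return (daily_usd, monthly_usd) for a stopped pod's volume disk."""
    monthly = volume_gb * VOLUME_RATE_USD_PER_GB_MONTH
    daily = monthly / DAYS_PER_MONTH
    return round(daily, 2), round(monthly, 2)


# A 50 GB stopped pod: 50 * 0.10 = 5.00 USD/month, 5.00 / 30 ≈ 0.17 USD/day.
print(stopped_pod_storage_cost(50))
```

The cost scales linearly with volume size and with the number of stopped pods, which is why a sweep of failed runs adds up quickly.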
Why this section exists
On 2026-05-17, this repo had 76 EXITED pods totaling 3,930 GB:
nothing actively running, but **$1.10/hr (~$26/day, ~$393/month)** of
idle storage burn. The leak existed because `stop_pod` only paused
pods (never deleted them) and operators assumed "stop" meant
"terminated". After releasing those 76 pods, idle burn dropped 110×
to $0.01/hr. The current schema (`lifecycle.on_success: delete` by
default + the failure-path WARNING + the `ls-stale` audit + the
`cleanup --all-stopped` bulk release) is the structural fix that
prevents the same gap from reopening.
The forensics-then-cleanup workflow
After a failed run that preserved a pod:
1. Audit: `runpod-deploy ls-stale` lists every EXITED pod on the account with its volume size and estimated daily/monthly cost, plus a TOTAL footer. Read-only; safe to run anytime.
2. Investigate: inspect the pulled manifest in `artifacts/runpod/<ts>/runpod_pull_manifest.json` for most root-cause analysis (most failure modes are visible in the captured logs and telemetry without needing SSH). If you do need SSH: `runpodctl pod start <id>`, then `runpodctl pod ssh <id>`.
3. Release (single pod):

       runpod-deploy cleanup --state-file <path> --mode delete

   The state file path is logged by the per-run WARNING. The default `--mode` is `delete`; you almost always want to release disk after forensics. `--mode stop` re-pauses (rarely useful); `--mode preserve` is a no-op.
4. Release (bulk, after a sweep of failed runs):

       runpod-deploy cleanup --all-stopped        # interactive y/N prompt
       runpod-deploy cleanup --all-stopped --yes  # non-interactive

   Equivalent to `runpodctl pod list -a --status EXITED -o json | jq -r '.[].id' | xargs -I{} runpodctl pod delete {}` but ships in the SDK with failure collection (one bad delete doesn’t abort the rest).
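The "failure collection" behaviour of the bulk release can be sketched as a loop that records errors instead of aborting. This is a minimal illustration, not the SDK’s implementation: `delete_all_collecting_failures` and the injected `delete_pod` callable are hypothetical.

```python
from typing import Callable, Iterable


def delete_all_collecting_failures(
    pod_ids: Iterable[str],
    delete_pod: Callable[[str], None],
) -> dict[str, str]:
    """Try to delete every pod; return {pod_id: error message} for the failures."""
    failures: dict[str, str] = {}
    for pod_id in pod_ids:
        try:
            delete_pod(pod_id)
        except Exception as exc:
            # Record the failure and keep going so one bad delete
            # does not leave the remaining pods billing.
            failures[pod_id] = str(exc)
    return failures
```

An empty result means every pod was released; a non-empty one gives the operator the exact IDs to retry.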
Recommended hygiene
- Add `runpod-deploy ls-stale` to a weekly cron / CI job to detect drift before it becomes a leak. See `recipes/stale-pod-audit.md`.
- When iterating on a workflow that fails repeatedly during development, set `lifecycle: {on_failure: delete}` in that config so failed runs don’t accumulate. Revert to `on_failure: stop` only when you genuinely want SSH access on failure.
- For workflows where the staging payload is the slow part of every run (large repos, repeated rsyncs), consider switching to `storage.mode: network_volume` so the volume persists across pods and `rsync` is incremental. See `recipes/payload-reuse-via-network-volume.md`.
8. Manifest write
write_pull_manifest(ctx, failed, pod, datacenter_id, deploy_metadata, artifact_results, telemetry_files, wall_time_sec, gpu_price_per_hour_usd, gpu_price_source, pod_final_state)
serializes everything to
<run_dir>/runpod_deploy_pull_manifest.json (schema v2).
Captured fields:
- Provenance: `job_name`, `run_id`, `schema_version=v2`, `pod_id`, `gpu_id`, `datacenter_id`, `image`, `storage_mode`.
- Cost/timing: `wall_time_sec`, `gpu_price_per_hour_usd`, `gpu_price_source` (e.g. `pod_describe`), `estimated_cost_usd`, `cost_cap_usd`.
- Pod final state: `pod_final_state` (e.g. `EXITED`, `pod_killed_unexpected`).
- Deploy metadata: `local_git_sha`, `local_git_dirty`, `payload_lockfile`.
- Artifacts: list of `{label, status, duration_sec, bytes_transferred}`.
- Telemetry files: list of telemetry-snapshot filenames.
This manifest is the source of truth for forensic queries
(runpod-deploy ls-runs, manifest-summary, compare-runs,
events-query).
Where each YAML section maps
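| YAML section | Phase | Function |
|---|---|---|
| | All | |
| | 1 | |
| | 2 | |
| | 2 | |
| | 3 | |
| `budget` | 2 + 5e | cost cap + monitor timeout |
| | 4 + 5c | |
| `setup` | 4 | `_run_commands` |
| `staging` | 5b | `_push_workspace` |
| `secrets` | 5a | `_stage_secrets` |
| `preflight` | 5c | `_run_commands` |
| `run` | 5d + 5e | `_launch_remote_job`, `_monitor_remote_log` |
| `artifacts` | 6 | `_pull_artifacts_and_log` |
| `lifecycle` | 7 | `cleanup_pod` |
| | 5d + 5e | |
| | 1 (resolution) | |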
Failure handling
The orchestrator’s try/except/finally block is the canonical
failure flow:
- Exception before `run_started = True` (phases 3–5c): no artifact pulls (the run script never executed). Act on pod per `lifecycle.on_failure`. Log a WARNING.
- Exception after `run_started = True` (phases 5d/5e/6): best-effort artifact pull (suppresses second-order exceptions) + `tel.capture_end()`. Act on pod per `lifecycle.on_failure`.
- Manifest always writes in the `finally` block (an exception during the write itself is suppressed and only logged).
For deeper failure-mode debugging see
troubleshooting.md.
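The shape of that `try`/`except`/`finally` flow can be sketched with injected callables. This is a simplified skeleton, not `run_job` itself: all five parameters are hypothetical stand-ins for the phases, telemetry capture is omitted, and the lifecycle action is reduced to passing `"on_success"` or `"on_failure"`.

```python
import contextlib
import logging
from typing import Callable

logger = logging.getLogger("run_job_sketch")


def run_with_failure_handling(
    prepare: Callable[[], None],             # phases 3-5c
    launch_and_monitor: Callable[[], None],  # phases 5d/5e
    pull_artifacts: Callable[[], None],      # phase 6
    act_on_pod: Callable[[str], None],       # phase 7
    write_manifest: Callable[[bool], None],  # phase 8
) -> None:
    run_started = False
    failed = False
    try:
        prepare()
        run_started = True
        launch_and_monitor()
        pull_artifacts()
        act_on_pod("on_success")
    except Exception:
        failed = True
        if run_started:
            # Best-effort pull: second-order exceptions are suppressed.
            with contextlib.suppress(Exception):
                pull_artifacts()
        act_on_pod("on_failure")
        raise
    finally:
        try:
            write_manifest(failed)  # always attempted
        except Exception:
            logger.exception("manifest write failed")
```

The `run_started` flag is what separates "nothing to pull" failures from failures where partial artifacts may exist on the pod.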
--dry-run vs --offline-dry-run
Both flags walk the lifecycle without provisioning, but they differ in whether external read-only queries are made:
| Flag | External calls? | Use case |
|---|---|---|
| `--offline-dry-run` | None | CI tests, fast config iteration, validation when you’re offline or don’t have a RunPod account |
| `--dry-run` | Read-only queries only | “Will this config actually find a GPU in stock right now?” without provisioning |
In code: --offline-dry-run implies dry_run=True in run_job,
and additionally passes offline=True to
_resolve_gpu_id_and_dc and _resolve_volume_id. The CLI gates the
external calls via that offline flag.
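The flag implication can be sketched with `argparse`. This is an illustration of the relationship only: `parse_dry_run_flags` is a hypothetical helper, and the real CLI’s parser and option wiring may differ.

```python
import argparse


def parse_dry_run_flags(argv: list[str]) -> tuple[bool, bool]:
    """Return (dry_run, offline) derived from the two CLI flags."""
    parser = argparse.ArgumentParser(prog="runpod-deploy run")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--offline-dry-run", action="store_true")
    args = parser.parse_args(argv)
    offline = args.offline_dry_run
    # --offline-dry-run implies dry_run=True and additionally sets offline=True.
    dry_run = args.dry_run or offline
    return dry_run, offline
```

Downstream code would then skip provisioning whenever `dry_run` is true and skip the read-only external queries only when `offline` is also true.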