# The runpod-deploy run lifecycle

What actually happens between `runpod-deploy run --config foo.yaml` and the artifacts landing in `artifacts/runpod//`. This doc maps every YAML section to its phase, the side effects that happen at each step, and the inputs/outputs the manifest captures.

The flow is **linear** (no DAG, no retry-in-process) and runs once per `runpod-deploy run` invocation. For parallel sweeps, the consumer runs multiple invocations from a bash/Make driver (see [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md)).

---

## Phase overview

```
┌─────────────┐   ┌─────────────┐   ┌────────────┐   ┌───────────┐
│ 1. Validate │ → │ 2. Provision│ → │ 3. SSH wait│ → │ 4. Setup  │
└─────────────┘   └─────────────┘   └────────────┘   └───────────┘
                                                           │
                                                           ▼
┌─────────────┐   ┌─────────────┐   ┌────────────┐   ┌───────────┐
│ 8. Manifest │ ← │ 7. Stop pod │ ← │ 6. Pull    │ ← │ 5. Stage  │
└─────────────┘   └─────────────┘   │  artifacts │   │+ preflight│
                                    └────────────┘   │+ launch   │
                                                     │+ monitor  │
                                                     └───────────┘
```

Each phase corresponds to a function in [`src/runpod_deploy/orchestrator.py`](https://github.com/brandon-behring/runpod-deploy/blob/main/src/runpod_deploy/orchestrator.py) and a YAML config section.

---

## 1. Validate

**Entrypoint**: `runpod-deploy validate --config foo.yaml` (optionally `--all` for the heavy checks).

What runs:

- `load_job_spec(path)` parses the YAML strictly. Unknown root keys raise; missing required fields raise. No template rendering at this step (raw values are stored).
- `build_job_context(spec, path)` resolves `local.project_root` (relative to the config file's parent dir) and computes the run-dir path. Raises if `project_root` resolves to `$HOME` (the v0.1.x typo guard from issue #5a).
- `validate_local_paths(ctx)` (when `--check-local` or `--all`): verifies each `local.required_paths` entry exists locally.
- `preflight.check_gpu_availability(ctx)` (`--check-availability`): live-queries `runpodctl datacenter list` to assert at least one `pod.gpu_order` entry is in-stock in `pod.datacenters`. Closes the failure mode from issue #1.
- `preflight.scan_consumer_pyproject(ctx)` + `.scan_staged_payloads_for_absolute_paths(ctx)` (`--scan-consumer`): scans for `runpod-deploy` declared as a runtime dep (issue #4) and `/Users/...` / `/home/...` hardcoded paths in staged source.

The `validate` command is purely read-only: no pod is created.

**YAML inputs**: every section.

**Outputs**: exit 0 on pass; raises `ValueError` / `FileNotFoundError` / `RuntimeError` with a diagnostic message on failure.

---

## 2. Provision

**Entrypoint**: triggered by `runpod-deploy run` (no separate CLI).

What runs:

- `_capture_deploy_metadata(spec, ctx)`: snapshots local git SHA + dirty flag + `uv.lock` hash. Written to the manifest later.
- `_resolve_gpu_id_and_dc(spec, offline, on_failover, max_gpu_price_usd)` picks `(gpu_id, datacenter_id)` from `pod.gpu_order` × `pod.datacenters`, walking failover events through `on_failover` for telemetry capture. Honors `--max-gpu-price ` via the GraphQL prices fetched from `pricing.fetch_gpu_prices`.
- `provider.provision_pod(ctx, volume_id, gpu_id, datacenter_id, dry_run, ssh_ready_timeout_sec)`: builds the `runpodctl pod create` argv via `provider._build_pod_create_argv` (gates `--spot` / `--min-vcpu-count` / `--min-memory-in-gb` via the v0.3.2 feature-detection probe), shells out, then polls `runpodctl pod get` until the pod publishes a usable `ssh.{ip, port}`. Returns a `PodConnection` with `host`, `port`, `pod_id`. Bounded by `budget.ssh_ready_timeout_sec` (default 900 s; also overridable per-run via `runpod-deploy run --ssh-ready-timeout-sec `). On timeout, the orphaned pod is deleted before re-raising (see [`troubleshooting.md`](troubleshooting.md) for the failure workflow).
Waits longer than 60 s emit a periodic INFO heartbeat with `status`, `ssh.error`, and `uptimeSeconds` so operators don't stare at a silent terminal.

The pod's `--name` is set to `ctx.run_id`, which is *rendered* (v0.4.0 PR-C), so a YAML with `name: demo-{seed}` produces `demo-42-` when invoked with `--var seed=42`.

**YAML inputs**: `pod`, `storage`, `budget`, top-level `name` / `run_id_prefix`.

**Outputs**: provisioned pod with a public SSH port; pod_id stored in `spec.resolved_state_file` for later `runpod-deploy cleanup --state-file` recovery.

---

## 3. SSH wait

`_wait_for_sshd(runner)` polls the pod's SSH port until ready (default ~10 min timeout). The pod's `runpod/pytorch` base image usually boots in 30–90 seconds.

Failures here raise `RuntimeError` and the pod is acted on per `lifecycle.on_failure` (default `stop`; set to `preserve` to keep the pod for SSH forensics).

---

## 4. Setup commands

`_run_commands(runner, ctx, spec.setup, label="setup")` runs each `setup[*].command` over SSH in order. Each command's stdout is logged at INFO; non-zero exits raise `RemoteRunError` and abort the run.

Typical contents:

- Install missing system packages (`rsync`, `git`)
- Install `uv` if the base image doesn't include it
- `mkdir -p {remote_repo}` for the staging destination

**YAML inputs**: `setup`, `remote_env` (when `with_env: true` on a command).

**Note**: setup runs *before* staging. That's why the v0.5 `pod.python_version` pin gets auto-injected at preflight (phase 5), not setup: it needs the staged project dir to write `.python-version` into. See [`recipes/reproducibility.md`](recipes/reproducibility.md).

---

## 5. Stage + secrets + preflight + launch + monitor

The "hot path": five sub-phases that happen back-to-back inside the `try:` block in `run_job`.

### 5a. Stage secrets

`_stage_secrets(runner, ctx)` walks `spec.secrets`.
For each entry:

- `source_env`: reads the named local env var, writes `KEY=value` lines to a tempfile, rsyncs to `destination` with `--chmod=F`.
- `source_file`: rsyncs the local file directly.

Values are never logged. See [issue #2](https://github.com/brandon-behring/runpod-deploy/issues/2) for the ephemeral-storage motivation.

### 5b. Push workspace

`_push_workspace(runner, ctx)` walks `spec.staging`. For each `RsyncPushSpec`:

- Renders `source` and `destination` through `ctx.render` (template variables expand).
- Computes `effective_excludes`: `DEFAULT_STAGING_EXCLUDES` (when `excludes_default: true`) + `excludes` + `excludes_extra`, in that order. New in v0.4.0; see [`config-reference.md`](config-reference.md#staging-rsync-push).
- Each pattern in the effective list is also rendered through `ctx.render`, so `excludes_extra: ["{job_name}-tmp/"]` works.
- Shells out to `rsync` via `RsyncTransfer`.

### 5c. Preflight commands

`_run_commands(runner, ctx, _build_python_pin_preflight(spec) + spec.preflight, label="preflight")`.

When `pod.python_version` is set (v0.5.0 PR-G), the orchestrator **auto-prepends a single command** to the preflight tuple:

```
uv python install && cd && uv python pin
```

This runs *after* staging (so `.python-version` lands in the staged project dir) and *before* the user's own preflight commands.

User preflight then runs (typical contents: `uv sync --extra dev`, data-availability checks, environment fingerprinting).

### 5d. Launch remote job

`_launch_remote_job(runner, ctx)`:

- Writes `run.script_path` on the pod with the rendered `run.body`.
- Detaches the script via `nohup ... &`; the SSH command returns immediately so the orchestrator can poll the log without holding the connection.
- If `--print-run-dir` was set, emits `RUN_DIR=` on stdout *before* the SSH call (v0.4.0 PR-B). Parallel-sweep drivers grep this line to know which `artifacts/runpod//` dir belongs to this shard.

### 5e. Monitor remote log

`_monitor_remote_log(runner, ctx, tel=tel)` polls the pod's log file (`run.log_path`) until one of the following occurs:

- `run.success_marker` → run-ok, exit poll.
- Any string in `run.failure_markers` → raise.
- Timeout (`budget.timeout_sec`) → raise.

Telemetry samples (`nvidia-smi`, `pip freeze`) run at the configured interval throughout the poll.

---

## 6. Artifact pull

After the run script exits (success or failure), `_pull_artifacts_and_log` walks `spec.artifacts`:

- For each entry: rsync from `remote_path` to `local_path` with `excludes` honored.
- `required: true` failures raise; `required: false` failures log a WARNING and continue.

The pod's run log (`run.log_path`) is also pulled to `/run.log` regardless of success/failure (as long as the run script started, i.e., the run did not fail in preflight).

---

## 7. Lifecycle action (cleanup)

After the run completes (or fails), the orchestrator calls `cleanup_pod(pod_id, action, dry_run, state_file, volume_in_gb)` with one of four actions, controlled by the YAML `lifecycle:` block:

```yaml
lifecycle:
  on_success: delete   # default: release volume disk on success
  on_failure: stop     # default: preserve for SSH forensics
```

Action semantics:

| action | runpodctl call | compute billing | volume disk billing | state file | next run |
| --- | --- | --- | --- | --- | --- |
| `preserve` | _(none)_ | continues at GPU rate | continues at volume rate | preserved | n/a |
| `stop` | `pod stop ` | stops | **continues at ~$0.10/GB·month indefinitely** | preserved | n/a |
| `delete` | `pod delete ` | stops | stops | unlinked | fresh pod |
| `recycle` | `pod stop ` | stops | continues at ~$0.10/GB·month (kept on purpose) | preserved | **resumes paused pod** via `pod start ` |

`recycle` is the success-path-only "warm cache" action: the pod is paused (same wire call as `stop`) AND the state-file is preserved so the next `runpod-deploy run` with the same `state_file:` finds the paused pod,
validates image/GPU/datacenter compatibility, and calls `runpodctl pod start ` instead of `pod create`. This saves the image-pull and cold-boot cost on each recurring run (typically 3–5 min). See [`recipes/recycle-pod-for-fast-iteration.md`](recipes/recycle-pod-for-fast-iteration.md).

The defaults `on_success: delete` and `on_failure: stop` encode the operational discipline that prevents storage leaks while preserving forensic access:

- **Success path** → `delete` releases the volume disk. The run is done, all artifacts have been pulled, the manifest is written. There is nothing left worth paying storage for.
- **Failure path** → `stop` keeps the pod paused so the operator can `runpodctl pod start ` later and SSH in for post-mortem. The orchestrator emits a multi-line WARNING with the exact release command so the operator never has to remember the cleanup syntax:

  ```
  [lifecycle] pod 'abc123' stopped for forensics.
  Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
  When done investigating, release with:
      runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
  Or audit all stale pods:
      runpod-deploy ls-stale
  ```

If you don't want SSH-forensics for failed runs, set `lifecycle: {on_failure: delete}` to release disk on every run regardless of outcome.

---

## 7b. Cost discipline: cleaning up after forensics

The `on_failure: stop` default trades a small ongoing storage cost (~$0.17/day per 50 GB pod) for the option to SSH in after a failure. This trade-off is only sustainable if you *actually* clean up the preserved pods. **Stopped pods continue billing the volume disk indefinitely** at RunPod's $0.10/GB·month preserved-volume rate.

### Why this section exists

On 2026-05-17, this repo had **76 EXITED pods totaling 3,930 GB**: nothing actively running, but **$1.10/hr (~$26/day, ~$393/month)** of idle storage burn. The leak existed because `stop_pod` only paused pods (never deleted them) and operators assumed "stop" meant "terminated".
After releasing those 76 pods, idle burn dropped 110× to $0.01/hr. The current schema (`lifecycle.on_success: delete` by default + the failure-path WARNING + `ls-stale` audit + `cleanup --all-stopped` bulk-release) is the structural fix that prevents the same gap from reopening.

### The forensics-then-cleanup workflow

After a failed run that preserved a pod:

1. **Audit**: `runpod-deploy ls-stale` lists every EXITED pod on the account with its volume size and estimated daily/monthly cost, plus a TOTAL footer. Read-only; safe to run anytime.
2. **Investigate**: start with the pulled manifest in `artifacts/runpod//runpod_pull_manifest.json`. Most failure modes are visible in the captured logs and telemetry without needing SSH. If you do need SSH: `runpodctl pod start `, then `runpodctl pod ssh `.
3. **Release** (single pod):

   ```bash
   runpod-deploy cleanup --state-file --mode delete
   ```

   The state file path is logged by the per-run WARNING. The default `--mode` is `delete`, since you almost always want to release disk after forensics. `--mode stop` re-pauses (rarely useful); `--mode preserve` is a no-op.
4. **Release** (bulk, after a sweep of failed runs):

   ```bash
   runpod-deploy cleanup --all-stopped        # interactive y/N prompt
   runpod-deploy cleanup --all-stopped --yes  # non-interactive
   ```

   Equivalent to `runpodctl pod list -a --status EXITED -o json | jq -r '.[].id' | xargs -I{} runpodctl pod delete {}` but ships in the SDK with failure collection (one bad delete doesn't abort the rest).

### Recommended hygiene

- Add `runpod-deploy ls-stale` to a weekly cron / CI job to detect drift before it becomes a leak. See [`recipes/stale-pod-audit.md`](recipes/stale-pod-audit.md).
- When iterating on a workflow that fails repeatedly during development, set `lifecycle: {on_failure: delete}` in that config so failed runs don't accumulate. Revert to `on_failure: stop` only when you genuinely want SSH access on failure.
- For workflows where the staging payload is the slow part of every run (large repos, repeated rsyncs), consider switching `storage.mode: network_volume` so the volume persists across pods and `rsync` is incremental. See [`recipes/payload-reuse-via-network-volume.md`](recipes/payload-reuse-via-network-volume.md).

---

## 8. Manifest write

`write_pull_manifest(ctx, failed, pod, datacenter_id, deploy_metadata, artifact_results, telemetry_files, wall_time_sec, gpu_price_per_hour_usd, gpu_price_source, pod_final_state)` serializes everything to `/runpod_deploy_pull_manifest.json` (schema v2).

Captured fields:

- **Provenance**: `job_name`, `run_id`, `schema_version=v2`, `pod_id`, `gpu_id`, `datacenter_id`, `image`, `storage_mode`.
- **Cost/timing**: `wall_time_sec`, `gpu_price_per_hour_usd`, `gpu_price_source` (e.g. `pod_describe`), `estimated_cost_usd`, `cost_cap_usd`.
- **Pod final state**: `pod_final_state` (e.g. `EXITED`, `pod_killed_unexpected`).
- **Deploy metadata**: `local_git_sha`, `local_git_dirty`, `payload_lockfile`.
- **Artifacts**: list of `{label, status, duration_sec, bytes_transferred}`.
- **Telemetry files**: list of telemetry-snapshot filenames.

This manifest is the source of truth for forensic queries (`runpod-deploy ls-runs`, `manifest-summary`, `compare-runs`, `events-query`).

---

## Where each YAML section maps

| YAML section | Phase | Function |
|---|---|---|
| `schema_version`, `name`, `run_id_prefix`, `state_file` | All | `build_job_context` |
| `local` | 1 | `validate_local_paths`, `build_job_context` |
| `pod` (incl. `python_version`) | 2 | `provision_pod`, `_build_python_pin_preflight` |
| `storage` | 2 | `resolve_volume`, `provision_pod` |
| `ssh` | 3 | `RemoteRunner` construction |
| `budget` | 2 + 5e | cost cap + monitor timeout |
| `remote_env` | 4 + 5c | `_remote_env_prefix` (when `with_env: true`) |
| `setup` | 4 | `_run_commands(label="setup")` |
| `staging` | 5b | `_push_workspace` |
| `secrets` | 5a | `_stage_secrets` |
| `preflight` | 5c | `_run_commands(label="preflight")` |
| `run` | 5d + 5e | `_launch_remote_job`, `_monitor_remote_log` |
| `artifacts` | 6 | `_pull_artifacts_and_log` |
| `stop` | 7 | `stop_pod` |
| `telemetry` | 5d + 5e | `telemetry.start_session`, `tel.start_sampling()` |
| `variables` (+ `--var` / `--vars-file`) | 1 (resolution) | `build_job_context` two-pass render |

---

## Failure handling

The orchestrator's `try`/`except`/`finally` block is the canonical failure flow:

- **Exception before `run_started = True`** (phases 3–5c): no artifact pulls (the run script never executed). Act on pod per `lifecycle.on_failure`. Log a WARNING.
- **Exception after `run_started = True`** (phases 5d/5e/6): best-effort artifact pull (suppresses second-order exceptions) + `tel.capture_end()`. Act on pod per `lifecycle.on_failure`.
- **Manifest always writes** in the `finally` block (an exception during the write itself is suppressed and only logged).

For deeper failure-mode debugging see [`troubleshooting.md`](troubleshooting.md).

---

## `--dry-run` vs `--offline-dry-run`

Both flags walk the lifecycle without provisioning, but they differ in whether external read-only queries are made:

| Flag | External calls? | Use case |
|---|---|---|
| `--offline-dry-run` | **None**: no `runpodctl`, no SSH, no rsync. GPU/DC selection uses synthetic sentinels. | CI tests, fast config iteration, validation when you're offline or don't have a RunPod account |
| `--dry-run` | **Read-only**: `runpodctl datacenter list` is queried so live GPU stock info is reflected; `runpod-deploy gpu-prices` is queried if `--max-gpu-price` is set. Pod create / SSH / rsync are mocked. | "Will this config actually find a GPU in stock right now?" without provisioning |

In code: `--offline-dry-run` implies `dry_run=True` in `run_job`, and additionally passes `offline=True` to `_resolve_gpu_id_and_dc` and `_resolve_volume_id`. The CLI gates the external calls via that `offline` flag.
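
The relationship between the two flags can be sketched in a few lines of Python. This is an illustration only, not the SDK's code: `RunMode`, `resolve_run_mode`, `allowed_external_calls`, and the call-class labels are hypothetical names chosen for the example. The only behavior taken from the text above is that `--offline-dry-run` implies `dry_run=True` and additionally sets `offline=True`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMode:
    """Hypothetical container for the two booleans described above."""

    dry_run: bool  # pod create / SSH / rsync are mocked
    offline: bool  # read-only runpodctl queries are skipped as well


def resolve_run_mode(dry_run_flag: bool, offline_dry_run_flag: bool) -> RunMode:
    """Map the two CLI flags to the booleans passed down the call chain."""
    # --offline-dry-run implies dry_run=True and additionally sets offline=True.
    return RunMode(
        dry_run=dry_run_flag or offline_dry_run_flag,
        offline=offline_dry_run_flag,
    )


def allowed_external_calls(mode: RunMode) -> set:
    """Return the classes of external call a mode permits (illustrative labels)."""
    if mode.offline:
        return set()  # no runpodctl, no SSH, no rsync
    if mode.dry_run:
        return {"read_only_query"}  # e.g. the datacenter stock listing
    return {"read_only_query", "pod_create", "ssh", "rsync"}


if __name__ == "__main__":
    for flags in [(False, False), (True, False), (False, True)]:
        mode = resolve_run_mode(*flags)
        print(flags, mode, sorted(allowed_external_calls(mode)))
```

Keeping `offline` as a separate boolean, rather than a third mode value, matches the description above: the dry-run mocking and the read-only query gate are independent switches, and the offline flag only ever tightens what a dry run is allowed to do.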