# Troubleshooting

When something goes wrong. Each entry: **Symptom** (what you see) → **Diagnosis** (what's happening underneath) → **Fix** (what to change). Organized by phase of the lifecycle.

Cross-reference [`lifecycle.md`](lifecycle.md) for the full pipeline; this doc is strictly "things have broken, what now."

---

## Provisioning failures

### `runpodctl pod create` fails with `unknown flag: --min-vcpu-count`

**Symptom**: pod create errors with `{"error":"unknown flag: --min-vcpu-count"}` (or `--spot`, or `--min-memory-in-gb`).

**Diagnosis**: your installed `runpodctl` version doesn't support the flag. Modern `runpod-deploy` (any current release) probes `runpodctl pod create --help` once per process and *skips* unsupported flags with a WARNING rather than emitting them blindly. If you're on an older `runpod-deploy` paired with a fresh `runpodctl` install, or on a newer `runpod-deploy` paired with an old `runpodctl`, the probe narrows down the mismatch — but the underlying constraint is whatever `runpodctl pod create --help` advertises.

**Fix**: check your tooling versions first. `runpodctl version` shows the locally-installed binary; `pip show runpod-deploy` shows the Python package. Then either upgrade `runpod-deploy` to get the auto-skip behavior, or upgrade `runpodctl` itself to gain the underlying flag.

When the probe skips a flag, the WARNING reads: `runpodctl pod create does not support -- in the locally-installed version; skipping ...` — the pod still launches without the flag. If a flag is essential for your workload (e.g., `--spot`), upgrading `runpodctl` is the only path; the runpod-deploy probe can detect absence but can't synthesize the underlying feature.

---

### `RuntimeError: pod did not become SSH-ready`

**Symptom**: `runpodctl pod create` succeeds, the pod transitions to `RUNNING`, but the SSH proxy never publishes a host/port within the deadline.
The trimmed error message looks like:

```
RuntimeError: pod abc123 did not become SSH-ready in 900s; last={'desiredStatus': 'RUNNING', 'ssh': {'error': 'pod not ready', 'status': 'RUNNING'}, 'uptimeSeconds': 0}
```

**Diagnosis**: image pull/extract on a cold host (no local cache) takes longer than `budget.ssh_ready_timeout_sec`. Common with cudnn-devel pytorch images (~6–12 GB) in datacenters or on GPU classes you don't use often. If the wait was longer than 60 s, you should also see periodic heartbeat INFO logs like `[pod] abc123 waiting for SSH; T=120s status='RUNNING' ssh.error='pod not ready' uptimeSeconds=0` — that confirms the diagnosis is "still pulling, not stuck".

**Fix (persistent)** — bump the timeout in YAML:

```yaml
budget:
  ssh_ready_timeout_sec: 1500  # 25 min; default is 900
```

**Fix (one-off debugging)** — use the CLI flag without editing YAML:

```bash
runpod-deploy run --config foo.yaml --ssh-ready-timeout-sec 1500
```

**Safety**: when the timeout expires, the orchestrator deletes the orphaned pod before re-raising (see PR #89 / `cleanup_pod` orphan hook). The longer timeout does not leak billing — it just fails the run later.

---

### Pod stuck at `uptimeSeconds: 0` forever — phantom image tag

**Symptom**: `runpodctl pod create` succeeds, the pod transitions to `desiredStatus: RUNNING`, but `uptimeSeconds` stays at 0 indefinitely and `ssh.error` is `pod not ready` for the full timeout window. The trimmed error eventually surfaces as the previous entry's `RuntimeError: pod did not become SSH-ready`.

**Diagnosis**: RunPod's API accepts any image string at pod-create time without validating against the registry. A typo'd or stale `pod.image` tag (e.g. a phantom version like `runpod/pytorch:2.5.0-py3.13-cuda12.4.1-cudnn9-devel` that does not exist on Docker Hub) provisions a pod whose container runtime then sits in image-pull-backoff forever. The SSH proxy never publishes because the container never starts.
Consumer evidence (2026-05-17): two pods burned ~$0.62 before diagnosis on a tag that didn't exist.

**Diagnostic command**:

```bash
runpod-deploy validate --config foo.yaml --check-image-registry
```

This HEAD-checks `pod.image` against Docker Hub's tag API. A 404 prints a loud WARNING with the exact phantom-tag string. Wired into `--all` so the full pre-flight catches this:

```bash
runpod-deploy validate --config foo.yaml --all
```

**Fix**: open Docker Hub and confirm the tag exists at `https://hub.docker.com/r///tags`. Update `pod.image` to a tag that does exist.

**Offline / CI workflows**: pass `--skip-registry-check` to suppress the HEAD lookup when Docker Hub is unreachable.

**Limits**: only Docker Hub is checked. Non-Docker-Hub registries (ghcr.io, quay.io, private registries with a `/` prefix) are detected and skipped silently — we have no portable cross-registry tag API.

---

### `no configured GPU is available` post-provision

**Symptom**: the orchestrator emits `RuntimeError: no configured GPU is available in EU-RO-1; observed={...}` after the pod create call but before SSH waits.

**Diagnosis**: one of two things:

1. **Name mismatch** — your YAML's `gpu_order` lists `NVIDIA RTX 4090`, but the actual RunPod API name is `NVIDIA GeForce RTX 4090`. The live datacenter dict doesn't have your key.
2. **Real stock-out** — every entry in `pod.gpu_order` is empty-stock in every datacenter in `pod.datacenters`.

**Fix**:

- Run `runpod-deploy validate --check-availability` (or `--all`) before `run` — it surfaces the mismatch + stock state upfront.
- Use `runpod-deploy gpu-list --datacenter EU-RO-1` to see exact names + current stock + per-hour prices.
- Widen `pod.gpu_order` to span more classes (the failover walks them in order); widen `pod.datacenters` for DC-level stock-out resilience.

---

### `Permission denied (publickey,password)` on SSH

**Symptom**: pod creates, but `_wait_for_sshd` retries indefinitely or fails with auth errors.

**Diagnosis**: `runpodctl doctor`'s `ssh_key.synced_to_cloud: true` only means *some* ed25519 key is synced — not necessarily your local `~/.ssh/id_ed25519`. The pod's `authorized_keys` is populated from the account-wide registered keys, and existing pods don't pick up newly-added keys.

**Fix**:

- `runpodctl ssh list-keys` and match the pubkey content against `cat ~/.ssh/id_ed25519.pub`.
- If absent: `runpodctl ssh add-key --key-file ~/.ssh/id_ed25519.pub`, then **stop the current pod** and `runpod-deploy run` again. New pods get the updated keys.

---

### Network volume not mountable

**Symptom**: pod creation succeeds but no `/workspace/` directory.

**Diagnosis**: `storage.mode: network_volume` requires `pod.cloud_type: SECURE`. Community pods can't mount network volumes. Also, network volumes pin the pod to *one* datacenter — failover across `pod.datacenters` is effectively single-element when you're using a network volume.

**Fix**:

- Switch to `cloud_type: SECURE`, OR
- Switch to `storage.mode: ephemeral` and stage your data via `staging:` instead.
- `runpod-deploy validate` warns when `network_volume` is paired with `len(pod.datacenters) > 1`.

---

## Staging failures

### `Distribution not found at: file:///workspace/runpod-deploy`

**Symptom**: pod-side `uv sync` errors trying to install `runpod-deploy` as a dep.

**Diagnosis**: the consumer pyproject lists `runpod-deploy` in `[project.dependencies]` (often with `[tool.uv.sources]` pointing at a local path). But `runpod-deploy` is a **local-only orchestrator** — the pod runs the consumer's code, not the orchestrator. The pod doesn't need it.

**Fix**:

- Remove `runpod-deploy` from `[project.dependencies]` and any matching `[tool.uv.sources]` entry in the consumer pyproject.
- `runpod-deploy validate --scan-consumer` (or `--all`) catches this statically before the pod runs.

---

### `FileNotFoundError: /Users//...` on pod

**Symptom**: pod runs, but immediately fails reading a file at a path like `/Users/brandonbehring/foo/bar.yaml`.

**Diagnosis**: consumer code has hardcoded a local absolute path. Works on the dev machine; breaks on every pod.

**Fix**:

- Refactor to use `Path(__file__).parent / "..."` or an explicit `staging:` entry that pushes the data file under `{remote_repo}`.
- `runpod-deploy validate --scan-consumer` (or `--all`) greps the staged payload for `/Users/`, `/home/`, `C:\Users\` patterns and WARNs before the pod runs.

---

### `project_root resolved to $HOME — this would stage your entire home directory`

**Symptom**: `validate` raises `ValueError: project_root resolved to $HOME (...)`.

**Diagnosis**: `local.project_root: ../../..` (one `..` too many) when the YAML lives at `/configs/runpod/.yaml`. The correct value is `../..` — one to escape `runpod/`, one to escape `configs/`.

**Fix**:

- Set `local.project_root: ../..` for the standard `/configs/runpod/.yaml` layout.
- The guard prevents a catastrophic `rsync -a $HOME/ pod:/workspace/repo/` from running.

---

### Stock `runpod/pytorch:*` images have no `rsync`

**Symptom**: first staging step errors with `bash: rsync: command not found`.

**Diagnosis**: RunPod's stock PyTorch images ship without `rsync`. runpod-deploy uses `rsync --info=progress2` for the staging push; if the binary is missing the SSH command fails.

**Fix**: install rsync in a `setup:` command before any staging:

```yaml
setup:
  - command: |
      which rsync >/dev/null 2>&1 || {
        apt-get update -qq && \
        DEBIAN_FRONTEND=noninteractive apt-get install -y -qq rsync
      }
    timeout_sec: 300
```

---

### Default rsync excludes silently drop data files

**Symptom**: pod-side code looks for `evals/` or `artifacts/` data, file not found.

**Diagnosis**: `staging[].excludes_default: true` (v0.4.0) adds the hygiene preset (`.git/`, `.venv/`, caches).
It does NOT exclude `evals/`, `artifacts/`, or data dirs — but if you ALSO set `excludes_extra: ["evals/", "artifacts/"]` you'll drop those.

**Fix**: review the merged exclude list at the entry's `effective_excludes` property. Move data dirs out of `excludes_extra` or use a stricter glob (`evals/tmp/` instead of `evals/`).

---

## Setup failures

### `CUDA initialization: NVIDIA driver too old`

**Symptom**: pod runs, `nvidia-smi` works, but `torch.cuda.is_available()` returns False with `UserWarning: CUDA initialization: The NVIDIA driver on your system is too old`.

**Diagnosis**: `uv sync` installed a torch wheel that requires a newer CUDA than the pod's driver provides. As of 2026-05, RunPod ships drivers supporting CUDA up to 12.8; default PyPI torch wheels may require CUDA 12.9+.

**Fix**: pin torch to a CUDA-specific wheel index in the **consumer** `pyproject.toml`:

```toml
[tool.uv]
environments = ["sys_platform == 'linux'"]

[tool.uv.sources]
torch = { index = "pytorch-cu128" }

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
```

When RunPod ships pods with newer drivers, bump the index URL.

---

### `uv python install` fails on the pod

**Symptom**: with `pod.python_version` set (v0.5.0), the auto-injected preflight step exits non-zero. The run aborts before user preflight.

**Diagnosis** (per the v0.5.0 PR-G "fail-fast" decision): one of

- `uv` itself isn't installed on the base image (most likely — install it in your `setup:` before the python_version pin tries to use it).
- Network blip during the interpreter download.
- The requested version doesn't exist (e.g., `python_version: "3.99"`).

**Fix**:

- Ensure `setup:` includes `curl -LsSf https://astral.sh/uv/install.sh | sh` before staging happens.
- Run `uv python list` locally to confirm the version string exists.
- For transient network failures, simply retry the `runpod-deploy run`.

---

### `uv sync` hangs silently with `.venv` partially populated

**Symptom**: pod-side `uv sync` hangs after starting wheel installation. `ps` shows the uv PID alive at 0% CPU; `/workspace/.venv` is frozen at a few MB (not growing); `lsof -p ` reveals open file descriptors under `/workspace/uv_cache/.tmp*` unpack dirs. No error message; no stack trace; preflight times out after the configured `timeout_sec` and the orchestrator aborts.

**Diagnosis**: RunPod mounts `/workspace` as a distributed FUSE filesystem (confirm with `df -hT /workspace` — returns `mfs#.runpod.net:9421 type fuse`). uv's default `--link-mode=hardlink` triggers `Stale file handle (os error 116)` errors when installing many wheels onto this FS in tight loops. uv either retries silently or stalls on a stat() call. The hang is indistinguishable from a slow network read in `ps`/`lsof`.

**Fix**: add `UV_LINK_MODE=copy` to your `remote_env.exports`:

```yaml
remote_env:
  exports:
    UV_LINK_MODE: copy  # avoid stale-file-handle on FUSE-mounted /workspace
```

uv falls back to full-file copy mode (adds ~10-30s to a typical venv populate; eliminates the hardlink hang).

**If `UV_LINK_MODE=copy` alone is insufficient**, two FUSE-related failure modes can hit before or after the wheel-install phase that copy-mode does NOT address:

1. `uv sync` stalled in `git reset --hard` during resolution (see ["`uv sync` hangs in `git reset --hard`"](#uv-sync-hangs-in-git-reset---hard-during-resolution-phase) below) — fix: pin `UV_CACHE_DIR` to `/root/uv_cache` (overlay disk).
2. HF Trainer checkpoint save stalled (see ["HF Trainer checkpoint save hangs"](#hf-trainer-checkpoint-save-hangs-on-fuse-backed-output_dir) below) — fix: put `output_dir` on `/root` and rsync checkpoints back in your `run.body` trailer.

Both fixes pin write-heavy directories to the overlay disk (where POSIX locks work normally) rather than fighting FUSE's F_SETLKW behavior.
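
To check programmatically which directories sit on a FUSE mount (the same question `df -hT` answers by hand), a minimal sketch is to resolve a path against `/proc/mounts`-style text by longest matching mount point. The helper names `fs_type_for` and `is_fuse` and the sample mount table are hypothetical and purely illustrative; real mount tables on a pod will differ.

```python
import os
from typing import Optional


def fs_type_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the longest mount point containing `path`.

    `mounts_text` is expected in /proc/mounts format:
    "<device> <mount_point> <fs_type> <options> <dump> <pass>" per line.
    """
    target = os.path.normpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or target.startswith(prefix):
            if len(mount_point) > best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type


def is_fuse(fs_type: Optional[str]) -> bool:
    """True for 'fuse' and subtypes such as 'fuse.something'."""
    return fs_type is not None and fs_type.startswith("fuse")


# Illustrative mount table only; not captured from a real pod.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
proc /proc proc rw 0 0
mfs#example:9421 /workspace fuse rw,nosuid,nodev 0 0
"""

for candidate in ("/workspace/uv_cache", "/root/uv_cache"):
    kind = fs_type_for(candidate, SAMPLE_MOUNTS)
    print(candidate, kind, "FUSE" if is_fuse(kind) else "ok")
```

On a pod, the same function could be fed `open("/proc/mounts").read()` to flag a cache or output directory that still resolves to the FUSE mount.
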
See [`uv#17801`](https://github.com/astral-sh/uv/issues/17801), [MooseFS discussion #380](https://github.com/moosefs/moosefs/discussions/380), and the [Linux kernel `request_wait_answer` hang patch (2025-12-23)](https://lkml.org/lkml/2025/12/23/264) for upstream context.

For genuinely separate network-stall symptoms (single wheel download stuck mid-stream rather than a stalled `stat()` / `flock()`), also add `UV_HTTP_TIMEOUT=120` (bounds any single HTTP read at 120s) and optionally `UV_CONCURRENT_DOWNLOADS=4` (caps concurrent downloads; default 50 amplifies head-of-line blocking on stalled sockets).

---

### `uv sync` hangs in `git reset --hard` during resolution phase

**Symptom**: `uv sync` hangs BEFORE installing wheels, while resolving `git+https://...` dependencies. `ps` shows two stuck processes: a `git reset --hard ` PID in `D` (uninterruptible) state with WCHAN `request_wait_answer`, and the parent `uv sync` PID in `futex_wait_queue`. `lsof` shows the git PID holding `/workspace/uv_cache/git-v0/checkouts/.../.git/index.lock`. uv has not yet started populating `.venv/` — the hang is during resolution, not install.

**Diagnosis**: with `UV_CACHE_DIR: /workspace/uv_cache`, uv clones `git+https://...` deps (e.g. consumer-side `[project.optional-dependencies] dev` that references your own toolkits via `git+https`) into the FUSE-backed cache. Each clone runs `git reset --hard ` to materialize the pinned revision; `git` acquires an `flock()` exclusive lock on `.git/index.lock` via `F_SETLKW`. MooseFS's F_SETLKW path is unreliable on FUSE (see [MooseFS discussion #380](https://github.com/moosefs/moosefs/discussions/380)) and the syscall stalls indefinitely in `request_wait_answer`. This happens BEFORE the wheel-install phase, so `UV_LINK_MODE=copy` (which only affects the install-phase hardlink path) does not prevent it.

**Fix**: move `UV_CACHE_DIR` off `/workspace` onto the pod's overlay disk:

```yaml
remote_env:
  exports:
    UV_CACHE_DIR: /root/uv_cache  # overlay disk; not FUSE
    UV_LINK_MODE: copy            # still good as defense-in-depth
```

`/root` is the container's overlay disk (verify with `df -hT /root` — type `overlay`, NOT `fuse`). POSIX locks work normally there. uv_cache is ephemeral anyway (re-populated each fire); putting it on `/root` sacrifices nothing.

---

### HF Trainer checkpoint save hangs on FUSE-backed `output_dir`

**Symptom**: Hugging Face Trainer completes a training step successfully, then hangs in `model.save_pretrained()` or `Trainer._save()`. tqdm bar shows `Writing model shards: 0%|`. The main `.safetensors` shard may write successfully (large file, ~300 MB), but subsequent small files (`optimizer.pt`, `scheduler.pt`, `tokenizer.json`, `trainer_state.json`, `config.json`) never appear. `ps` shows the trainer PID alive at moderate CPU (50-90%) with one thread on WCHAN `request_wait_answer`. The hang typically resolves within 10 minutes (FUSE eventually grants the lock) or times out the run.

**Diagnosis**: same MooseFS F_SETLKW class as the git-resolution and install-phase hangs, but here the lock holder is HF Trainer's atomic-save protocol. `Trainer._save()` writes each checkpoint file to a tempname then atomically renames into place, with intermediate `flock()` / POSIX locks for crash-consistency. On FUSE-backed `output_dir`, the lock acquisition stalls.

**Fix**: keep checkpoint `output_dir` on the pod's overlay disk too. Two options depending on whether you want to ship checkpoints back:

1. **Train on `/root`, rsync checkpoints back as a `run.body` trailer** — set the Trainer's `output_dir` (configurable via your training script or `TrainingArguments.output_dir`) to e.g.
   `/root/checkpoints/`, then in `run.body` after the training command:

   ```bash
   uv run python scripts/train.py --output-dir /root/checkpoints
   rsync -az /root/checkpoints/ /workspace//
   ```

   Best of both worlds: locks work during training; final checkpoints land on the volume for orchestrator artifact pull.

2. **Disable per-epoch checkpoint save entirely** — set `save_strategy: "no"` in your `TrainingArguments`. Only viable if predictions parquets are your real analysis input and you don't need re-runnable checkpoints.

Predictions parquets written via custom callbacks usually don't trigger this because (a) they're written as a single `pq.write_table()` call rather than a multi-file atomic-rename dance, and (b) they're small enough that any FUSE-write race resolves before the next epoch starts.

---

## Run failures

### Secrets unavailable on ephemeral pods

**Symptom**: pod runs, code that needs `HF_TOKEN` or `OPENAI_API_KEY` exits with auth errors. `network_volume` configs work; `ephemeral` configs don't.

**Diagnosis**: pre-v0.2.x patterns staged secrets to `/workspace/secrets/env` on the network volume. With ephemeral storage the volume is fresh every run, so the secret file isn't there.

**Fix**: use the explicit `secrets:` block (shipped in the v0.2.x cycle):

```yaml
secrets:
  - name: hf
    source_env: HF_TOKEN  # read from your local env var
    destination: /workspace/secrets/env
    var_name: HF_TOKEN
    mode: "0600"
```

The orchestrator reads the named local env var, writes `HF_TOKEN=value` to a tempfile, and rsyncs it to the pod with restrictive perms. Never logged. See [`config-reference.md`](config-reference.md).

---

### `ValueError: flash_attention_2 is not supported`

**Symptom**: transformer scorer loads cleanly on H100; fails on A6000 / RTX A4000 with this error.

**Diagnosis**: `flash_attention_2` isn't supported across all GPU generations.
With `pod.gpu_order` listing multiple classes (the recommended pattern for stock-out resilience), some shards land on GPUs that don't support it.

**Fix**: wrap the model load with a try/except per [`recipes/flash-attention-fallback.md`](recipes/flash-attention-fallback.md):

```python notest
try:
    encoder = AutoModel.from_pretrained(
        model_id, attn_implementation="flash_attention_2", ...
    )
except (ValueError, ImportError):
    encoder = AutoModel.from_pretrained(model_id, ...)
```

---

### Pod killed mid-run; manifest shows `pod_final_state=pod_killed_unexpected`

**Symptom**: the run script started, ran for a while, then was killed. The manifest's `pod_final_state` is `pod_killed_unexpected` rather than `EXITED`.

**Diagnosis**: RunPod-side kill (OOM, host issue, spot-pod preemption, or quota). Telemetry's pod-kill detector caught it and recorded the state.

**Fix**:

- For OOM: scale `pod.container_disk_gb`, scale model precision, or pick a higher-VRAM GPU class.
- For spot preemption: drop `pod.spot: true` (if you opted in).
- For host issues: retry. The sweep recipe pattern handles this with exponential backoff ([`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md)).
- To investigate across many runs: `runpod-deploy events-query --filter event=pod_killed_unexpected --since 30d --json`.

---

## Sweep-driver failures

### Driver appears to "exit cleanly" but only 1 of N shards launched

**Symptom**: `bash driver.sh | tee log` exits 0, but `log` shows the driver died mid-script after launching the first shard. The `tee`-fronted pipeline returned `tee`'s success code, masking the bash failure.

**Diagnosis**: `pipefail` was missing from the `set` line. Without it, the pipeline's exit code is `tee`'s, not `bash`'s.

**Fix**: every sweep-driver header must include `pipefail`:

```sh
set -euo pipefail  # ← all three; -e + -u + -o pipefail
```

See [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md) "Pitfalls" §1.
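
The masking effect is easy to reproduce in isolation. The sketch below is an assumption-laden demo, not part of runpod-deploy: it shells out to `bash` (assumed to be on `PATH`) and compares the exit code of a failing-then-succeeding pipeline with and without `pipefail`. The helper name `pipeline_exit_code` is hypothetical, and `false | cat` stands in for "failing command piped into `tee`".

```python
import subprocess


def pipeline_exit_code(script: str) -> int:
    """Run `script` under bash and return the shell's exit code."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# `false` fails, `cat` succeeds: without pipefail the last command wins.
exit_without = pipeline_exit_code("false | cat")

# With pipefail, the pipeline reports the failing stage instead.
exit_with = pipeline_exit_code("set -o pipefail; false | cat")

print(f"without pipefail: {exit_without}")
print(f"with pipefail:    {exit_with}")
```

The first call reports success even though a stage failed, which is the same shape as a `... | tee log` pipeline hiding a dead command.
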

---

### `wait -n` + `set -e` kills the driver on the first failed shard

**Symptom**: the parallel sweep hits its first failed shard, then the driver dies before subsequent shards run.

**Diagnosis**: `wait -n` returns the exit code of the finishing background job. With `set -e`, a non-zero exit from `wait -n` kills the script immediately.

**Fix**: wrap `wait -n` to suppress its exit code; collect failures per-shard inside the launcher function instead:

```sh
while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
  wait -n 2>/dev/null || true
done
```

See [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md) "Pitfalls" §2 for the full corrected pattern.

---

### `ls -td artifacts/runpod/* | head -1` returns the wrong shard's run-dir

**Symptom**: failure classifier reads a healthy sibling's `events.jsonl` and misclassifies a transient pod-kill as a non-retryable training failure.

**Diagnosis**: at `MAX_PARALLEL > 1`, multiple `runpod-deploy run` invocations write concurrent `artifacts/runpod//` dirs. `ls -td` returns whichever sibling started last — racing the failed shard you wanted to inspect.

**Fix**: use `runpod-deploy run --print-run-dir` (v0.4.0 PR-B) which emits a single `RUN_DIR=` line on stdout. Capture per-attempt stdout via `tee` and grep for the line:

```sh
local stdout_log="/tmp/sweep_attempt_${seed}.log"
if runpod-deploy run --print-run-dir ... 2>&1 | tee "$stdout_log"; then
  return 0
fi
local this_run_dir
this_run_dir=$(grep -oE '^RUN_DIR=.*' "$stdout_log" | head -1 | cut -d= -f2-)
```

See [`recipes/multi-config-sweep.md`](recipes/multi-config-sweep.md) "Pitfalls" §3.

---

## Forensic recovery

When something failed and you want to know what — these are the tools.

### "Which DCs failed over most often this month?"
```sh
runpod-deploy events-query --filter event=datacenter_failover --since 30d --json
```

### "Show me every pod killed unexpectedly in the last week"

```sh
runpod-deploy events-query --filter event=pod_killed_unexpected --since 7d --json
```

### "What did this specific run do, minute by minute?"

```sh
runpod-deploy events artifacts/runpod/20260515T120000Z
```

### "Aggregate cost + failure rate across a multi-shard sweep"

```sh
runpod-deploy manifest-summary --root artifacts/runpod
```

Outputs per-run summaries plus a `== TOTALS ==` footer with the manifest count, failure count, summed wall time, summed estimated cost.

### "Compare two runs side-by-side"

```sh
runpod-deploy compare-runs artifacts/runpod/20260515T120000Z \
  artifacts/runpod/20260515T130000Z
```

Exit 1 if either manifest has `failed: true` — pairs well with CI gating in driver scripts.

### "List recent runs"

```sh
runpod-deploy ls-runs --limit 20
```

Pulled-back-to-local table of recent run-dir manifests with pod_id, GPU, datacenter, wall time, failure flag, estimated cost.

---

## Cost / cleanup

These are the symptoms of the 2026-05-17 leak — and how to recognize recurrences early.

### Stale paused pods are billing indefinitely

**Symptom**: `runpodctl user` shows `currentSpendPerHr > 0` despite no apparent activity. `runpodctl pod list` (no `-a`) returns `[]` (no RUNNING pods), but `runpodctl pod list -a` shows many EXITED entries.

**Diagnosis**: stopped pods retain their volume disk at **~$0.10/GB·month** indefinitely. `runpodctl pod stop` only pauses compute — it does *not* release storage. The leak is silent: no GPU bill, just slow accumulation on the volume side.

**Fix**:

```bash
# Audit (read-only): inventory + estimated daily cost
runpod-deploy ls-stale

# Release every paused pod (irreversible)
runpod-deploy cleanup --all-stopped --yes
```

> **Backstory**: On 2026-05-17 this repo's account had 76 EXITED pods
> totaling 3,930 GB ≈ **$26/day idle burn**.
> Account balance was 12 h
> from negative when caught. Read [`lifecycle.md` §7b](lifecycle.md#7b-cost-discipline-cleaning-up-after-forensics)
> for the post-mortem and the hygiene workflow.

To prevent recurrence: the v0.9 schema defaults to `lifecycle.on_success: delete`, so successful runs release disk automatically. Failed runs still preserve a paused pod for SSH forensics (`on_failure: stop`); the orchestrator emits a multi-line WARNING with the exact release command so the operator is never expected to remember the cleanup syntax.

---

### My failed run preserved a pod and I want to release it

**Symptom**: After a failed `runpod-deploy run`, you see a WARNING like:

```
[lifecycle] pod 'abc123' stopped for forensics.
Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
When done investigating, release with:
    runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
Or audit all stale pods:
    runpod-deploy ls-stale
```

**Diagnosis**: the run failed and the `on_failure: stop` default paused the pod for SSH forensics. You've finished investigating and want the volume disk back.

**Fix**: copy the `runpod-deploy cleanup ...` command from the WARNING and run it. The default `--mode` is `delete` so the disk is released. The state file is unlinked on success.

If you didn't actually need SSH forensics for this workflow, switch to `lifecycle: {on_failure: delete}` in the config so failed runs release disk automatically (skip the manual cleanup step).

---

### I want to keep payload state between runs (avoid re-uploading)

**Symptom**: every `runpod-deploy run` re-pulls the Docker image (~2–5 min), re-runs `setup:` (apt install, uv venv), and re-rsyncs the staging payload (1–5 min for typical repos). Across a 100-job sweep that's hours of wall time.

**Diagnosis**: with `storage.mode: ephemeral`, the volume is destroyed when the pod is destroyed.
Every successful run with `lifecycle.on_success: delete` (the new default) starts over from a fresh image.

**Fix**: switch the workflow to `storage.mode: network_volume` with a named, pre-created volume. The volume persists across pods; rsync becomes incremental (only changed bytes go over); image layer cache and uv venv survive in `/workspace`. See [`recipes/payload-reuse-via-network-volume.md`](recipes/payload-reuse-via-network-volume.md) for the step-by-step.

Trade-off: a 100 GB network volume costs ~$7/month sitting idle; network volumes are pinned to one datacenter.

---

## Predictions discipline (consumer-side gotcha)

This isn't a runpod-deploy bug — it's a recurring pattern in consumer-repo design that costs real money when missed.

**Symptom**: post-hoc you want to recompute a metric, ECE/Brier calibration, or paired-bootstrap delta. The eval pipeline only pulled summary metrics. You have to re-run inference on a fresh pod (~$5, ~30–80 min).

**Diagnosis**: only summary metrics (PR-AUC, ROC-AUC, recall@FPR at fixed pinpoints) were persisted. Per-row predictions / `y_score` were generated, used, and discarded with the pod.

**Fix**: persist per-row predictions alongside summary metrics. The pattern is documented in [`recipes/predictions-only-eval.md`](recipes/predictions-only-eval.md). Pull the parquet via `artifacts:` before pod teardown:

```yaml
artifacts:
  - label: predictions
    remote_path: "{remote_repo}/evals/v5_canonical/predictions/"
    local_path: "{project_root}/evals/v5_canonical/predictions/"
    required: true
```

For trained adapters / LoRA checkpoints, also push to HF Hub before pod teardown — local pod artifacts are destroyed with the pod.

---

## Still stuck?

- Re-run with `--verbose` to see DEBUG output: `runpod-deploy run --verbose --config foo.yaml`
- Use `--offline-dry-run` to walk the lifecycle without provisioning; catches config issues for free.
- Inspect the manifest: `runpod-deploy manifest-summary /...json` for the full reproducibility record.
- Reach the maintainer with the run-dir tarball (`tar czf rundir.tgz artifacts/runpod//`).