Troubleshooting#

What to do when something goes wrong. Each entry follows the same shape: Symptom (what you see) → Diagnosis (what’s happening underneath) → Fix (what to change).

Organized by phase of the lifecycle. Cross-reference lifecycle.md for the full pipeline; this doc is strictly “things have broken, what now.”

Provisioning failures#

`runpodctl pod create` fails with `unknown flag: --min-vcpu-count`#

Symptom: pod create errors with {"error":"unknown flag: --min-vcpu-count"} (or --spot, or --min-memory-in-gb).

Diagnosis: your installed runpodctl version doesn’t support the flag. Any current runpod-deploy release probes runpodctl pod create --help once per process and skips unsupported flags with a WARNING rather than emitting them blindly. If you’re on an older runpod-deploy paired with a fresh runpodctl install, or on a newer runpod-deploy paired with an old runpodctl, the probe narrows down the mismatch, but the underlying constraint is always whatever runpodctl pod create --help advertises.

Fix: check your tooling versions first. runpodctl version shows the locally-installed binary; pip show runpod-deploy shows the Python package. Then either upgrade runpod-deploy to get the auto-skip behavior, or upgrade runpodctl itself to gain the underlying flag.

When the probe skips a flag, the WARNING reads: runpodctl pod create does not support --<flag> in the locally-installed version; skipping ... — the pod still launches without the flag.

If a flag is essential for your workload (e.g., --spot), upgrading runpodctl is the only path; the runpod-deploy probe can detect absence but can’t synthesize the underlying feature.
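The probe-and-skip behavior described above can be sketched roughly as follows. This is an illustration only, not runpod-deploy’s actual implementation: the helper names and the sample help excerpt are hypothetical, and real help text comes from runpodctl pod create --help.

```python
import re


def supported_flags(help_text: str) -> set[str]:
    """Collect every long flag (--like-this) mentioned in a --help dump."""
    return set(re.findall(r"(?<![\w-])--[a-z][a-z0-9-]*", help_text))


def split_flags(wanted: dict[str, str], help_text: str):
    """Partition the wanted flags into (emit, skipped) based on the help text."""
    available = supported_flags(help_text)
    emit = {flag: value for flag, value in wanted.items() if flag in available}
    skipped = sorted(flag for flag in wanted if flag not in available)
    return emit, skipped


# Hypothetical excerpt; the flag set of a real runpodctl build may differ.
SAMPLE_HELP = """
Flags:
      --image string           container image
      --min-memory-in-gb int   minimum RAM
"""

emit, skipped = split_flags(
    {"--image": "x", "--min-vcpu-count": "8", "--spot": ""}, SAMPLE_HELP
)
for flag in skipped:
    print(f"WARNING: pod create does not support {flag}; skipping")
```

Running the same check by hand against your own runpodctl build tells you up front which flags would be skipped.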

`RuntimeError: pod <id> did not become SSH-ready`#

Symptom: runpodctl pod create succeeds, the pod transitions to RUNNING, but the SSH proxy never publishes a host/port within the deadline. The trimmed error message looks like:

RuntimeError: pod abc123 did not become SSH-ready in 900s;
  last={'desiredStatus': 'RUNNING', 'ssh': {'error': 'pod not ready', 'status': 'RUNNING'},
        'uptimeSeconds': 0}

Diagnosis: image pull/extract on a cold host (no local cache) takes longer than budget.ssh_ready_timeout_sec. Common with cudnn-devel pytorch images (~6–12 GB) in datacenters or on GPU classes you don’t use often.

If the wait was longer than 60 s, you should also see periodic heartbeat INFO logs like [pod] abc123 waiting for SSH; T=120s status='RUNNING' ssh.error='pod not ready' uptimeSeconds=0 — that confirms the diagnosis is “still pulling, not stuck”.

Fix (persistent) — bump the timeout in YAML:

budget:
  ssh_ready_timeout_sec: 1500   # 25 min; default is 900

Fix (one-off debugging) — use the CLI flag without editing YAML:

runpod-deploy run --config foo.yaml --ssh-ready-timeout-sec 1500

Safety: when the timeout expires, the orchestrator deletes the orphaned pod before re-raising (see PR #89 / cleanup_pod orphan hook). The longer timeout does not leak billing — it just fails the run later.
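For intuition, a deadline-plus-heartbeat wait loop might look like the sketch below. It is a generic pattern, not the orchestrator’s code: wait_until and its parameters are hypothetical names, and the clock is simulated so the demo finishes instantly.

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool], timeout_sec: float,
               heartbeat_sec: float = 60.0, poll_sec: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep,
               log: Callable[[str], None] = print) -> float:
    """Poll ready() until it is true; return elapsed seconds or raise TimeoutError."""
    start = clock()
    next_beat = heartbeat_sec
    while True:
        elapsed = clock() - start
        if ready():
            return elapsed
        if elapsed >= timeout_sec:
            raise TimeoutError(f"not ready in {timeout_sec:.0f}s")
        if elapsed >= next_beat:
            # Periodic sign of life so a long image pull is not mistaken for a hang.
            log(f"still waiting; T={elapsed:.0f}s")
            next_beat += heartbeat_sec
        sleep(poll_sec)


# Simulated clock: the "pod" becomes ready after 130 simulated seconds.
now = [0.0]
beats: list[str] = []
elapsed = wait_until(
    lambda: now[0] >= 130,
    timeout_sec=900,
    clock=lambda: now[0],
    sleep=lambda s: now.__setitem__(0, now[0] + s),
    log=beats.append,
)
```

Raising the timeout only moves the point at which TimeoutError fires; the heartbeat cadence is unaffected.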

Pod stuck at `uptimeSeconds: 0` forever — phantom image tag#

Symptom: runpodctl pod create succeeds, the pod transitions to desiredStatus: RUNNING, but uptimeSeconds stays at 0 indefinitely and ssh.error is pod not ready for the full timeout window. The trimmed error eventually surfaces as the previous entry’s RuntimeError: pod <id> did not become SSH-ready.

Diagnosis: RunPod’s API accepts any image string at pod-create time without validating against the registry. A typo’d or stale pod.image tag (e.g. a phantom version like runpod/pytorch:2.5.0-py3.13-cuda12.4.1-cudnn9-devel that does not exist on Docker Hub) provisions a pod whose container runtime then sits in image-pull-backoff forever. The SSH proxy never publishes because the container never starts. Consumer evidence (2026-05-17): two pods burned ~$0.62 before diagnosis on a tag that didn’t exist.

Diagnostic command:

runpod-deploy validate --config foo.yaml --check-image-registry

This HEAD-checks pod.image against Docker Hub’s tag API. A 404 prints a loud WARNING with the exact phantom-tag string. Wired into --all so the full pre-flight catches this:

runpod-deploy validate --config foo.yaml --all

Fix: open Docker Hub and confirm the tag exists at https://hub.docker.com/r/<owner>/<image>/tags. Update pod.image to a tag that does exist.

Offline / CI workflows: pass --skip-registry-check to suppress the HEAD lookup when Docker Hub is unreachable.

Limits: only Docker Hub is checked. Non-Docker-Hub registries (ghcr.io, quay.io, private registries with a <host>/ prefix) are detected and skipped silently — we have no portable cross-registry tag API.
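To reason about which images the registry check can cover, here is a minimal sketch of splitting an image reference and building a Docker Hub tag URL. The function names are hypothetical, the URL shape is an assumption about Docker Hub’s public v2 API rather than a statement of what runpod-deploy calls, and digest references (@sha256:...) are not handled.

```python
def parse_image(ref: str):
    """Split an image reference into (registry, repository, tag).

    registry is None for Docker Hub references.
    """
    name, _, tag = ref.rpartition(":")
    if not name or "/" in tag:
        # No tag given; any colon belonged to a host:port prefix.
        name, tag = ref, "latest"
    first, _, rest = name.partition("/")
    # Docker convention: a first component containing "." or ":" (or equal to
    # "localhost") names a registry host rather than a Docker Hub namespace.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first, rest, tag
    if not rest:
        # Official images live under the implicit "library" namespace.
        name = f"library/{name}"
    return None, name, tag


def hub_tag_url(ref: str):
    """Return the assumed Docker Hub tag URL, or None for other registries."""
    registry, repo, tag = parse_image(ref)
    if registry is not None:
        return None
    return f"https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"


url = hub_tag_url("runpod/pytorch:2.5.0-py3.13-cuda12.4.1-cudnn9-devel")
skipped = hub_tag_url("ghcr.io/org/img:v1")
```

A None result corresponds to the "detected and skipped" case for non-Docker-Hub registries.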

`no configured GPU is available` post-provision#

Symptom: the orchestrator emits RuntimeError: no configured GPU is available in EU-RO-1; observed={...} after the pod create call but before SSH waits.

Diagnosis: one of two things:

Name mismatch — your YAML’s gpu_order lists NVIDIA RTX 4090, but the actual RunPod API name is NVIDIA GeForce RTX 4090. The live datacenter dict doesn’t have your key.
Real stock-out — every entry in pod.gpu_order is empty-stock in every datacenter in pod.datacenters.

Fix:

Run runpod-deploy validate --check-availability (or --all) before run — it surfaces the mismatch + stock state upfront.
Use runpod-deploy gpu-list --datacenter EU-RO-1 to see exact names + current stock + per-hour prices.
Widen pod.gpu_order to span more classes (the failover walks them in order); widen pod.datacenters for DC-level stock-out resilience.
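If you want to catch the name-mismatch case yourself, a token-subset comparison is usually enough. The sketch below is a hypothetical helper, not part of runpod-deploy, and the live name list is illustrative; take real names from runpod-deploy gpu-list.

```python
def suggest_gpu_names(configured: str, live_names: list[str]) -> list[str]:
    """Return live names that contain every whitespace-separated token of `configured`."""
    wanted = set(configured.lower().split())
    return [name for name in live_names if wanted <= set(name.lower().split())]


# Illustrative names only.
live = ["NVIDIA GeForce RTX 4090", "NVIDIA RTX A6000", "NVIDIA H100 80GB HBM3"]

for configured in ["NVIDIA RTX 4090", "NVIDIA RTX A6000"]:
    if configured in live:
        print(f"{configured}: exact match")
    else:
        print(f"{configured}: not found; did you mean {suggest_gpu_names(configured, live)}?")
```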

`Permission denied (publickey,password)` on SSH#

Symptom: pod creates, but _wait_for_sshd retries indefinitely or fails with auth errors.

Diagnosis: runpodctl doctor’s ssh_key.synced_to_cloud: true only means some ed25519 key is synced — not necessarily your local ~/.ssh/id_ed25519. The pod’s authorized_keys is populated from the account-wide registered keys, and existing pods don’t pick up newly-added keys.

Fix:

runpodctl ssh list-keys and match the pubkey content against cat ~/.ssh/id_ed25519.pub.
If absent: runpodctl ssh add-key --key-file ~/.ssh/id_ed25519.pub, then stop the current pod and runpod-deploy run again. New pods get the updated keys.
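Matching pubkey content by eye is error-prone because the trailing comment often differs between machines. A small sketch of the comparison, assuming you have pasted the registered key lines into a list (the helper names are hypothetical and the key blobs below are fake placeholders):

```python
def key_identity(pubkey_line: str) -> tuple[str, str]:
    """Return (key type, base64 blob), ignoring the trailing comment."""
    parts = pubkey_line.split()
    if len(parts) < 2:
        raise ValueError("not an OpenSSH public key line")
    return parts[0], parts[1]


def is_registered(local_pub: str, registered: list[str]) -> bool:
    """True if the local key's type and blob appear among the registered keys."""
    target = key_identity(local_pub)
    return any(key_identity(line) == target for line in registered if line.strip())


# Fake blobs for illustration; real ones come from ~/.ssh/id_ed25519.pub
# and from the runpodctl ssh list-keys output.
local = "ssh-ed25519 AAAAC3FAKEBLOBLOCAL me@laptop"
cloud = [
    "ssh-ed25519 AAAAC3FAKEBLOBOTHER old-workstation",
    "ssh-ed25519 AAAAC3FAKEBLOBLOCAL uploaded-2025",
]
print("registered" if is_registered(local, cloud) else "missing: add the key")
```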

Network volume not mountable#

Symptom: pod creation succeeds, but there is no /workspace/ directory on the pod.

Diagnosis: storage.mode: network_volume requires pod.cloud_type: SECURE. Community pods can’t mount network volumes. Also, network volumes pin the pod to one datacenter — failover across pod.datacenters is effectively single-element when you’re using a network volume.

Fix:

Switch to cloud_type: SECURE, OR
Switch to storage.mode: ephemeral and stage your data via staging: instead.
runpod-deploy validate warns when network_volume is paired with len(pod.datacenters) > 1.
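The two constraints above are easy to check on a parsed config. A minimal sketch, assuming the YAML has already been loaded into a dict with the key names used in this doc; the function is hypothetical and the COMMUNITY value is an assumed spelling:

```python
def network_volume_problems(cfg: dict) -> list[str]:
    """Return human-readable problems for a network_volume configuration."""
    storage = cfg.get("storage", {})
    pod = cfg.get("pod", {})
    problems: list[str] = []
    if storage.get("mode") != "network_volume":
        return problems  # the constraints only apply to network volumes
    if pod.get("cloud_type") != "SECURE":
        problems.append("storage.mode: network_volume requires pod.cloud_type: SECURE")
    if len(pod.get("datacenters", [])) > 1:
        problems.append(
            "network volumes pin the pod to one datacenter; "
            "extra pod.datacenters entries give no failover"
        )
    return problems


cfg = {
    "storage": {"mode": "network_volume"},
    "pod": {"cloud_type": "COMMUNITY", "datacenters": ["EU-RO-1", "US-KS-2"]},
}
for problem in network_volume_problems(cfg):
    print("WARNING:", problem)
```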

Staging failures#

`Distribution not found at: file:///workspace/runpod-deploy`#

Symptom: pod-side uv sync errors trying to install runpod-deploy as a dep.

Diagnosis: the consumer pyproject lists runpod-deploy in [project.dependencies] (often with [tool.uv.sources] pointing at a local path). But runpod-deploy is a local-only orchestrator — the pod runs the consumer’s code, not the orchestrator. The pod doesn’t need it.

Fix:

Remove runpod-deploy from [project.dependencies] and any matching [tool.uv.sources] entry in the consumer pyproject.
runpod-deploy validate --scan-consumer (or --all) catches this statically before the pod runs.

`FileNotFoundError: /Users/<name>/...` on pod#

Symptom: pod runs, but immediately fails reading a file at a path like /Users/brandonbehring/foo/bar.yaml.

Diagnosis: consumer code has hardcoded a local absolute path. Works on the dev machine; breaks on every pod.

Fix:

Refactor to use Path(__file__).parent / "..." or an explicit staging: entry that pushes the data file under {remote_repo}.
runpod-deploy validate --scan-consumer (or --all) greps the staged payload for /Users/, /home/, C:\Users\ patterns and WARNs before the pod runs.
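The grep described above amounts to one regular expression. A rough equivalent, with a hypothetical helper name and a made-up source snippet:

```python
import re

# The three home-directory patterns named above.
ABS_HOME = re.compile(r"/Users/|/home/|[A-Za-z]:\\Users\\")


def find_hardcoded_paths(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line with a home-dir absolute path."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if ABS_HOME.search(line)
    ]


source = "\n".join([
    'CFG = "/Users/alice/foo/bar.yaml"',
    'OK = Path(__file__).parent / "bar.yaml"',
    r'WIN = r"C:\Users\alice\data.csv"',
])

hits = find_hardcoded_paths(source)
for number, line in hits:
    print(f"line {number}: {line}")
```

Expect some false positives (for example, a /home/ path inside a comment); the point is to surface candidates before the pod does.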

`project_root resolved to $HOME — this would stage your entire home directory`#

Symptom: validate raises ValueError: project_root resolved to $HOME (...).

Diagnosis: local.project_root is set to ../../.. (one .. too many) while the YAML lives at <consumer>/configs/runpod/<job>.yaml. The correct value is ../.. (one to escape runpod/, one to escape configs/).

Fix:

Set local.project_root: ../.. for the standard <repo>/configs/runpod/<job>.yaml layout.
The guard prevents a catastrophic rsync -a $HOME/ pod:/workspace/repo/ from running.
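To see why one extra .. lands on $HOME, the resolution can be replayed with pure path arithmetic. This sketch assumes project_root is resolved relative to the YAML file’s directory, which is what the layout description above implies; the function names and paths are hypothetical.

```python
import posixpath


def resolve_project_root(config_path: str, project_root: str) -> str:
    """Resolve project_root relative to the directory containing the YAML."""
    base = posixpath.dirname(config_path)
    return posixpath.normpath(posixpath.join(base, project_root))


def check_project_root(config_path: str, project_root: str, home: str) -> str:
    """Return the resolved root, refusing to resolve to the home directory."""
    resolved = resolve_project_root(config_path, project_root)
    if resolved == posixpath.normpath(home):
        raise ValueError(f"project_root resolved to $HOME ({resolved})")
    return resolved


CONFIG = "/home/alice/consumer/configs/runpod/job.yaml"
HOME = "/home/alice"

good = check_project_root(CONFIG, "../..", HOME)
try:
    check_project_root(CONFIG, "../../..", HOME)
    blocked = False
except ValueError as exc:
    blocked = True
    print(exc)
```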

Stock `runpod/pytorch:*` images have no `rsync`#

Symptom: first staging step errors with bash: rsync: command not found.

Diagnosis: RunPod’s stock PyTorch images ship without rsync. runpod-deploy uses rsync --info=progress2 for the staging push; if the binary is missing the SSH command fails.

Fix: install rsync in a setup: command before any staging:

setup:
  - command: |
      which rsync >/dev/null 2>&1 || {
        apt-get update -qq && \
        DEBIAN_FRONTEND=noninteractive apt-get install -y -qq rsync
      }
    timeout_sec: 300

Default rsync excludes silently drop data files#

Symptom: pod-side code looks for data under evals/ or artifacts/ and fails with a file-not-found error.

Diagnosis: staging[].excludes_default: true (v0.4.0) adds the hygiene preset (.git/, .venv/, caches). It does NOT exclude evals/, artifacts/, or data dirs — but if you ALSO set excludes_extra: ["evals/", "artifacts/"] you’ll drop those.

Fix: review the merged exclude list at the entry’s effective_excludes property. Move data dirs out of excludes_extra or use a stricter glob (evals/tmp/ instead of evals/).
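The interaction between the preset and excludes_extra can be illustrated with a simplified model. Everything here is an assumption made for illustration: the preset contents are a guess at the "hygiene" list, the matcher treats each directory pattern as a plain path prefix (real rsync matching is richer), and the function only borrows the effective_excludes name.

```python
# Assumed preset contents; the real hygiene list may differ.
HYGIENE = [".git/", ".venv/", "__pycache__/"]


def effective_excludes(excludes_default: bool, excludes_extra: list[str]) -> list[str]:
    """Merge the preset (if enabled) with the user's extra patterns, keeping order."""
    merged = list(HYGIENE) if excludes_default else []
    merged += [pattern for pattern in excludes_extra if pattern not in merged]
    return merged


def is_dropped(rel_path: str, excludes: list[str]) -> bool:
    """Simplified check: a 'dir/' pattern drops every path under that prefix."""
    return any(rel_path.startswith(p) for p in excludes if p.endswith("/"))


broad = effective_excludes(True, ["evals/", "artifacts/"])
strict = effective_excludes(True, ["evals/tmp/"])

print(is_dropped("evals/v5/preds.parquet", broad))    # data file lost
print(is_dropped("evals/v5/preds.parquet", strict))   # data file kept
```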

Setup failures#

`CUDA initialization: NVIDIA driver too old`#

Symptom: pod runs, nvidia-smi works, but torch.cuda.is_available() returns False with UserWarning: CUDA initialization: The NVIDIA driver on your system is too old.

Diagnosis: uv sync installed a torch wheel that requires a newer CUDA than the pod’s driver provides. As of 2026-05, RunPod ships drivers supporting CUDA up to 12.8; default PyPI torch wheels may require CUDA 12.9+.

Fix: pin torch to a CUDA-specific wheel index in the consumer pyproject.toml:

[tool.uv]
environments = ["sys_platform == 'linux'"]

[tool.uv.sources]
torch = { index = "pytorch-cu128" }

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

When RunPod ships pods with newer drivers, bump the index URL.
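When deciding which index to pin, it can help to compare the index tag against the driver’s maximum CUDA version. A small sketch, assuming the cuNNN naming convention where the last digit is the minor version; both helpers are hypothetical.

```python
def cuda_version_from_tag(tag: str) -> tuple[int, int]:
    """Parse a wheel index tag such as 'cu128' into (12, 8)."""
    digits = tag.removeprefix("cu")
    if len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"unrecognized CUDA tag: {tag!r}")
    return int(digits[:-1]), int(digits[-1])


def wheel_fits_driver(tag: str, driver_max: tuple[int, int]) -> bool:
    """True if the wheel's CUDA version does not exceed what the driver supports."""
    return cuda_version_from_tag(tag) <= driver_max


DRIVER_MAX = (12, 8)  # the "up to 12.8" figure quoted above
for tag in ["cu126", "cu128", "cu129"]:
    print(tag, "ok" if wheel_fits_driver(tag, DRIVER_MAX) else "too new for driver")
```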

`uv python install` fails on the pod#

Symptom: with pod.python_version set (v0.5.0), the auto-injected preflight step exits non-zero. The run aborts before user preflight.

Diagnosis (per the v0.5.0 PR-G “fail-fast” decision): one of

uv itself isn’t installed on the base image (most likely — install it in your setup: before the python_version pin tries to use it).
Network blip during the interpreter download.
The requested version doesn’t exist (e.g., python_version: "3.99").

Fix:

Ensure setup: includes curl -LsSf https://astral.sh/uv/install.sh | sh before staging happens.
Run uv python list locally to confirm the version string exists.
For transient network failures, simply retry the runpod-deploy run.

`uv sync` hangs silently with `.venv` partially populated#

Symptom: pod-side uv sync hangs after starting wheel installation. ps shows the uv PID alive at 0% CPU; /workspace/.venv is frozen at a few MB (not growing); lsof -p <uv_pid> reveals open file descriptors under /workspace/uv_cache/.tmp* unpack dirs. No error message; no stack trace; preflight times out after the configured timeout_sec and the orchestrator aborts.

Diagnosis: RunPod mounts /workspace as a distributed FUSE filesystem (confirm with df -hT /workspace — returns mfs#<dc>.runpod.net:9421 type fuse). uv’s default --link-mode=hardlink triggers Stale file handle (os error 116) errors when installing many wheels onto this FS in tight loops. uv either retries silently or stalls on a stat() call. The hang is indistinguishable from a slow network read in ps/lsof.

Fix: add UV_LINK_MODE=copy to your remote_env.exports:

remote_env:
  exports:
    UV_LINK_MODE: copy    # avoid stale-file-handle on FUSE-mounted /workspace

uv falls back to full-file copy mode (adds ~10-30s to a typical venv populate; eliminates the hardlink hang).

If UV_LINK_MODE=copy alone is insufficient, two FUSE-related failure modes can hit before or after the wheel-install phase that copy-mode does NOT address:

uv sync stalled in git reset --hard during resolution (see “uv sync hangs in git reset --hard” below) — fix: pin UV_CACHE_DIR to /root/uv_cache (overlay disk).
HF Trainer checkpoint save stalled (see “HF Trainer checkpoint save hangs” below) — fix: put output_dir on /root and rsync checkpoints back in your run.body trailer.

Both fixes pin write-heavy directories to the overlay disk (where POSIX locks work normally) rather than fighting FUSE’s F_SETLKW behavior. See uv#17801, MooseFS discussion #380, and the Linux kernel request_wait_answer hang patch (2025-12-23) for upstream context.

For genuinely separate network-stall symptoms (single wheel download stuck mid-stream rather than a stalled stat() / flock()), also add UV_HTTP_TIMEOUT=120 (bounds any single HTTP read at 120s) and optionally UV_CONCURRENT_DOWNLOADS=4 (caps concurrent downloads; default 50 amplifies head-of-line blocking on stalled sockets).

`uv sync` hangs in `git reset --hard` during resolution phase#

Symptom: uv sync hangs BEFORE installing wheels, while resolving git+https://... dependencies. ps shows two stuck processes: a git reset --hard <sha> PID in D (uninterruptible) state with WCHAN request_wait_answer, and the parent uv sync PID in futex_wait_queue. lsof shows the git PID holding /workspace/uv_cache/git-v0/checkouts/.../.git/index.lock. uv has not yet started populating .venv/ — the hang is during resolution, not install.

Diagnosis: with UV_CACHE_DIR: /workspace/uv_cache, uv clones git+https://... deps (e.g. consumer-side [project.optional-dependencies] dev that references your own toolkits via git+https) into the FUSE-backed cache. Each clone runs git reset --hard <sha> to materialize the pinned revision; git acquires an flock() exclusive lock on .git/index.lock via F_SETLKW. MooseFS’s F_SETLKW path is unreliable on FUSE (see MooseFS discussion #380) and the syscall stalls indefinitely in request_wait_answer. This happens BEFORE the wheel-install phase, so UV_LINK_MODE=copy (which only affects the install-phase hardlink path) does not prevent it.

Fix: move UV_CACHE_DIR off /workspace onto the pod’s overlay disk:

remote_env:
  exports:
    UV_CACHE_DIR: /root/uv_cache        # overlay disk; not FUSE
    UV_LINK_MODE: copy                   # still good as defense-in-depth

/root is the container’s overlay disk (verify with df -hT /root — type overlay, NOT fuse). POSIX locks work normally there. uv_cache is ephemeral anyway (re-populated each fire); putting it on /root sacrifices nothing.

HF Trainer checkpoint save hangs on FUSE-backed `output_dir`#

Symptom: Hugging Face Trainer completes a training step successfully, then hangs in model.save_pretrained() or Trainer._save(). tqdm bar shows Writing model shards: 0%|. The main .safetensors shard may write successfully (large file, ~300 MB), but subsequent small files (optimizer.pt, scheduler.pt, tokenizer.json, trainer_state.json, config.json) never appear. ps shows the trainer PID alive at moderate CPU (50-90%) with one thread on WCHAN request_wait_answer. The hang typically resolves within 10 minutes (FUSE eventually grants the lock) or times out the run.

Diagnosis: same MooseFS F_SETLKW class as the git-resolution and install-phase hangs, but here the lock holder is HF Trainer’s atomic-save protocol. Trainer._save() writes each checkpoint file to a tempname then atomically renames into place, with intermediate flock() / POSIX locks for crash-consistency. On FUSE-backed output_dir, the lock acquisition stalls.

Fix: keep checkpoint output_dir on the pod’s overlay disk too. Two options depending on whether you want to ship checkpoints back:

Train on /root, rsync checkpoints back as a run.body trailer — set the Trainer’s output_dir (configurable via your training script or TrainingArguments.output_dir) to e.g. /root/checkpoints/, then in run.body after the training command:
```
uv run python scripts/train.py --output-dir /root/checkpoints
rsync -az /root/checkpoints/ /workspace/<artifact_dir>/
```
Best of both worlds: locks work during training; final checkpoints land on the volume for orchestrator artifact pull.
Disable per-epoch checkpoint save entirely — set save_strategy: "no" in your TrainingArguments. Only viable if predictions parquets are your real analysis input and you don’t need re-runnable checkpoints.

Predictions parquets written via custom callbacks usually don’t trigger this because (a) they’re written as a single pq.write_table() call rather than a multi-file atomic-rename dance, and (b) they’re small enough that any FUSE-write race resolves before the next epoch starts.
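If your training entry point is Python and you would rather not depend on rsync for the copy-back step, the same "train on overlay, ship afterwards" idea can be sketched with shutil. This swaps rsync for a plain recursive copy, so it is not incremental. The function name and directories are hypothetical, and the demo uses temporary directories in place of /root and /workspace.

```python
import shutil
import tempfile
from pathlib import Path


def ship_checkpoints(local_dir, volume_dir) -> list[str]:
    """Copy finished checkpoints from the overlay disk to the volume; list shipped files."""
    src, dst = Path(local_dir), Path(volume_dir)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst, dirs_exist_ok=True)  # Python 3.8+
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())


# Demo with temp dirs standing in for /root/checkpoints and the artifact dir.
with tempfile.TemporaryDirectory() as tmp:
    overlay = Path(tmp, "root", "checkpoints")
    volume = Path(tmp, "workspace", "artifacts")
    (overlay / "checkpoint-1").mkdir(parents=True)
    (overlay / "checkpoint-1" / "config.json").write_text("{}")
    shipped = ship_checkpoints(overlay, volume)
    print(shipped)
```

Call it once after training finishes, so the lock-heavy save path never touches the FUSE mount.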

Run failures#

Secrets unavailable on ephemeral pods#

Symptom: pod runs, code that needs HF_TOKEN or OPENAI_API_KEY exits with auth errors. network_volume configs work; ephemeral configs don’t.

Diagnosis: pre-v0.2.x patterns staged secrets to /workspace/secrets/env on the network volume. With ephemeral storage the volume is fresh every run, so the secret file isn’t there.

Fix: use the explicit secrets: block (shipped in the v0.2.x cycle):

secrets:
  - name: hf
    source_env: HF_TOKEN           # read from your local env var
    destination: /workspace/secrets/env
    var_name: HF_TOKEN
    mode: "0600"

The orchestrator reads the named local env var, writes HF_TOKEN=value to a tempfile, and rsyncs it to the pod with restrictive perms. Never logged. See config-reference.md.
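The local half of that flow (read env var, write a KEY=value file with restrictive perms) can be sketched as below. This is an illustration under assumptions, not the orchestrator’s code: write_env_secret is a hypothetical helper, it is POSIX-only because of os.fchmod, and the rsync step is omitted.

```python
import os
import stat
import tempfile


def write_env_secret(var_name: str, source_env: str, directory: str,
                     environ=os.environ) -> str:
    """Write VAR=value to <directory>/env with mode 0600 and return the path."""
    value = environ.get(source_env)
    if not value:
        raise KeyError(f"local env var {source_env} is not set")
    path = os.path.join(directory, "env")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.fchmod(fd, 0o600)  # enforce perms even if the file already existed
        os.write(fd, f"{var_name}={value}\n".encode())
    finally:
        os.close(fd)
    return path


# Demo with a dummy token; the value is never printed.
with tempfile.TemporaryDirectory() as tmp:
    secret_path = write_env_secret("HF_TOKEN", "HF_TOKEN", tmp,
                                   environ={"HF_TOKEN": "hf_dummy"})
    mode = stat.S_IMODE(os.stat(secret_path).st_mode)
    with open(secret_path) as handle:
        content = handle.read()
    print(oct(mode))
```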

`ValueError: flash_attention_2 is not supported`#

Symptom: transformer scorer loads cleanly on H100; fails on A6000 / RTX A4000 with this error.

Diagnosis: flash_attention_2 isn’t supported across all GPU generations. With pod.gpu_order listing multiple classes (the recommended pattern for stock-out resilience), some shards land on GPUs that don’t support it.

Fix: wrap the model load with a try/except per recipes/flash-attention-fallback.md:

try:
    encoder = AutoModel.from_pretrained(
        model_id, attn_implementation="flash_attention_2", ...
    )
except (ValueError, ImportError):
    encoder = AutoModel.from_pretrained(model_id, ...)

Pod killed mid-run; manifest shows `pod_final_state=pod_killed_unexpected`#

Symptom: the run script started, ran for a while, then was killed. The manifest’s pod_final_state is pod_killed_unexpected rather than EXITED.

Diagnosis: RunPod-side kill (OOM, host issue, spot-pod preemption, or quota). Telemetry’s pod-kill detector caught it and recorded the state.

Fix:

For OOM: scale pod.container_disk_gb, scale model precision, or pick a higher-VRAM GPU class.
For spot preemption: drop pod.spot: true (if you opted in).
For host issues: retry. The sweep recipe pattern handles this with exponential backoff (recipes/multi-config-sweep.md).
To investigate across many runs: runpod-deploy events-query --filter event=pod_killed_unexpected --since 30d --json.

Sweep-driver failures#

Driver appears to “exit cleanly” but only 1 of N shards launched#

Symptom: bash driver.sh | tee log exits 0, but log shows the driver died mid-script after launching the first shard. The tee-fronted pipeline returned tee’s success code, masking the bash failure.

Diagnosis: pipefail was missing from the set line. Without it, the pipeline’s exit code is tee’s, not bash’s.

Fix: every sweep-driver header must include pipefail:

set -euo pipefail   # ← all three; -e + -u + -o pipefail

See recipes/multi-config-sweep.md “Pitfalls” §1.
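The masking effect is easy to reproduce in isolation. This sketch runs two tiny bash pipelines from Python (assuming bash is on PATH) and compares their exit codes; pipeline_status is a hypothetical helper.

```python
import subprocess


def pipeline_status(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script], stdout=subprocess.DEVNULL).returncode


# `false` fails, `tee` succeeds: the pipeline reports tee's status by default.
without_pipefail = pipeline_status("false | tee /dev/null")
with_pipefail = pipeline_status("set -o pipefail; false | tee /dev/null")

print("without pipefail:", without_pipefail)
print("with pipefail:   ", with_pipefail)
```

The option takes effect in the shell that executes the pipeline, so a pipeline typed at an interactive prompt needs it set there as well.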

`wait -n` + `set -e` kills the driver on the first failed shard#

Symptom: in a parallel sweep, the first shard to fail takes the driver down with it; subsequent shards never run.

Diagnosis: wait -n returns the exit code of the finishing background job. With set -e, a non-zero exit from wait -n kills the script immediately.

Fix: wrap wait -n to suppress its exit code; collect failures per-shard inside the launcher function instead:

while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
  wait -n 2>/dev/null || true
done

See recipes/multi-config-sweep.md “Pitfalls” §2 for the full corrected pattern.
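For comparison, the same "bounded parallelism, collect failures per shard" idea looks like this in a Python driver. It is a swapped-in alternative to the bash pattern, not the recipe’s code: run_sweep is hypothetical, and the demo shards are trivial Python subprocesses standing in for runpod-deploy run invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_shard(cmd: list[str]) -> int:
    """Run one shard to completion and return its exit code (never raises on failure)."""
    return subprocess.run(cmd).returncode


def run_sweep(commands: dict[str, list[str]], max_parallel: int) -> dict[str, int]:
    """Run all shards with bounded parallelism; map shard name to exit code."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {name: pool.submit(run_shard, cmd) for name, cmd in commands.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_shard(code: int) -> list[str]:
    """Stand-in command that exits with the given status."""
    return [sys.executable, "-c", f"raise SystemExit({code})"]


results = run_sweep(
    {"seed0": fake_shard(0), "seed1": fake_shard(3), "seed2": fake_shard(0)},
    max_parallel=2,
)
failed = sorted(name for name, code in results.items() if code != 0)
print("failed shards:", failed)
```

A failed shard only shows up in the results map; the remaining shards still run.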

`ls -td artifacts/runpod/* | head -1` returns the wrong shard’s run-dir#

Symptom: failure classifier reads a healthy sibling’s events.jsonl and misclassifies a transient pod-kill as a non-retryable training failure.

Diagnosis: at MAX_PARALLEL > 1, multiple runpod-deploy run invocations write concurrent artifacts/runpod/<ts>/ dirs. ls -td returns whichever sibling started last — racing the failed shard you wanted to inspect.

Fix: use runpod-deploy run --print-run-dir (v0.4.0 PR-B) which emits a single RUN_DIR=<absolute-path> line on stdout. Capture per-attempt stdout via tee and grep for the line:

local stdout_log="/tmp/sweep_attempt_${seed}.log"
if runpod-deploy run --print-run-dir ... 2>&1 | tee "$stdout_log"; then
  return 0
fi
local this_run_dir
this_run_dir=$(grep -oE '^RUN_DIR=.*' "$stdout_log" | head -1 | cut -d= -f2-)

See recipes/multi-config-sweep.md “Pitfalls” §3.

Forensic recovery#

When something failed and you want to know what — these are the tools.

“Which DCs failed over most often this month?”#

runpod-deploy events-query --filter event=datacenter_failover --since 30d --json

“Show me every pod killed unexpectedly in the last week”#

runpod-deploy events-query --filter event=pod_killed_unexpected --since 7d --json

“What did this specific run do, minute by minute?”#

runpod-deploy events artifacts/runpod/20260515T120000Z

“Aggregate cost + failure rate across a multi-shard sweep”#

runpod-deploy manifest-summary --root artifacts/runpod

Outputs per-run summaries plus a == TOTALS == footer with the manifest count, failure count, summed wall time, summed estimated cost.

“Compare two runs side-by-side”#

runpod-deploy compare-runs artifacts/runpod/20260515T120000Z \
                           artifacts/runpod/20260515T130000Z

Exits with status 1 if either manifest has failed: true, which pairs well with CI gating in driver scripts.

“List recent runs”#

runpod-deploy ls-runs --limit 20

Prints a table of recent run-dir manifests (as pulled back to the local machine) with pod_id, GPU, datacenter, wall time, failure flag, and estimated cost.

Cost / cleanup#

These are the symptoms of the 2026-05-17 leak — and how to recognize recurrences early.

Stale paused pods are billing indefinitely#

Symptom: runpodctl user shows currentSpendPerHr > 0 despite no apparent activity. runpodctl pod list (no -a) returns [] (no RUNNING pods), but runpodctl pod list -a shows many EXITED entries.

Diagnosis: stopped pods retain their volume disk at ~$0.10/GB·month indefinitely. runpodctl pod stop only pauses compute — it does not release storage. The leak is silent: no GPU bill, just slow accumulation on the volume side.

Fix:

# Audit (read-only): inventory + estimated daily cost
runpod-deploy ls-stale

# Release every paused pod (irreversible)
runpod-deploy cleanup --all-stopped --yes

Backstory: On 2026-05-17 this repo’s account had 76 EXITED pods totaling 3,930 GB ≈ $26/day idle burn. The account balance was 12 h from going negative when the leak was caught. Read lifecycle.md §7b for the post-mortem and the hygiene workflow.

To prevent recurrence: the v0.9 schema defaults to lifecycle.on_success: delete, so successful runs release disk automatically. Failed runs still preserve a paused pod for SSH forensics (on_failure: stop); the orchestrator emits a multi-line WARNING with the exact release command so the operator is never expected to remember the cleanup syntax.

My failed run preserved a pod and I want to release it#

Symptom: After a failed runpod-deploy run, you see a WARNING like:

[lifecycle] pod 'abc123' stopped for forensics.
  Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
  When done investigating, release with:
      runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
  Or audit all stale pods:
      runpod-deploy ls-stale

Diagnosis: the run failed and the on_failure: stop default paused the pod for SSH forensics. You’ve finished investigating and want the volume disk back.

Fix: copy the runpod-deploy cleanup ... command from the WARNING and run it. The default --mode is delete so the disk is released. The state file is unlinked on success.

If you didn’t actually need SSH forensics for this workflow, switch to lifecycle: {on_failure: delete} in the config so failed runs release disk automatically (skip the manual cleanup step).
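The per-day and per-month figures in the WARNING follow from simple arithmetic. A sketch that reproduces them, assuming the ~$0.10/GB·month rate implied by the 50 GB example and a 30-day month; the function is hypothetical and actual billing rates may differ.

```python
def idle_volume_cost(gb: float, usd_per_gb_month: float = 0.10,
                     days_per_month: float = 30.0) -> tuple[float, float]:
    """Return (cost per day, cost per month) in USD for a paused pod's volume disk."""
    monthly = gb * usd_per_gb_month
    return monthly / days_per_month, monthly


daily, monthly = idle_volume_cost(50)
print(f"~${daily:.2f}/day (~${monthly:.2f}/mo)")
```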

I want to keep payload state between runs (avoid re-uploading)#

Symptom: every runpod-deploy run re-pulls the Docker image (~2–5 min), re-runs setup: (apt install, uv venv), and re-rsyncs the staging payload (1–5 min for typical repos). Across a 100-job sweep that’s hours of wall time.

Diagnosis: with storage.mode: ephemeral, the volume is destroyed when the pod is destroyed. Every successful run with lifecycle.on_success: delete (the new default) starts over from a fresh image.

Fix: switch the workflow to storage.mode: network_volume with a named, pre-created volume. The volume persists across pods; rsync becomes incremental (only changed bytes go over); image layer cache and uv venv survive in /workspace. See recipes/payload-reuse-via-network-volume.md for the step-by-step.

Trade-off: a 100 GB network volume costs ~$7/month sitting idle; network volumes are pinned to one datacenter.

Predictions discipline (consumer-side gotcha)#

This isn’t a runpod-deploy bug — it’s a recurring pattern in consumer-repo design that costs real money when missed.

Symptom: after the fact, you want to recompute a metric, an ECE/Brier calibration, or a paired-bootstrap delta, but the eval pipeline only pulled summary metrics. You have to re-run inference on a fresh pod (~$5, ~30–80 min).

Diagnosis: only summary metrics (PR-AUC, ROC-AUC, recall@FPR at fixed pinpoints) were persisted. Per-row predictions / y_score were generated, used, and discarded with the pod.

Fix: persist per-row predictions alongside summary metrics. The pattern is documented in recipes/predictions-only-eval.md. Pull the parquet via artifacts: before pod teardown:

artifacts:
  - label: predictions
    remote_path: "{remote_repo}/evals/v5_canonical/predictions/"
    local_path: "{project_root}/evals/v5_canonical/predictions/"
    required: true

For trained adapters / LoRA checkpoints, also push to HF Hub before pod teardown — local pod artifacts are destroyed with the pod.
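To make the payoff concrete: once per-row labels and scores are on local disk, a metric such as the Brier score can be recomputed in a few lines with no pod involved. The rows below are made-up values; in practice they would be loaded from the pulled predictions parquet.

```python
def brier_score(y_true: list[int], y_score: list[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    if len(y_true) != len(y_score) or not y_true:
        raise ValueError("need equally sized, non-empty label and score lists")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_score)) / len(y_true)


# Made-up per-row predictions.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4]

score = brier_score(y_true, y_score)
print(f"Brier score: {score:.4f}")
```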

Still stuck?#

Re-run with --verbose to see DEBUG output: runpod-deploy run --verbose --config foo.yaml
Use --offline-dry-run to walk the lifecycle without provisioning; catches config issues for free.
Inspect the manifest: runpod-deploy manifest-summary <run-dir>/...json for the full reproducibility record.
Reach the maintainer with the run-dir tarball (tar czf rundir.tgz artifacts/runpod/<ts>/).
