Troubleshooting#
When something goes wrong. Each entry: Symptom (what you see) → Diagnosis (what’s happening underneath) → Fix (what to change).
Organized by phase of the lifecycle. Cross-reference
lifecycle.md for the full pipeline; this doc is
strictly “things have broken, what now.”
Provisioning failures#
runpodctl pod create fails with unknown flag: --min-vcpu-count#
Symptom: pod create errors with
{"error":"unknown flag: --min-vcpu-count"} (or --spot, or
--min-memory-in-gb).
Diagnosis: your installed runpodctl version doesn’t support the
flag. Modern runpod-deploy (any current release) probes
runpodctl pod create --help once per process and skips unsupported
flags with a WARNING rather than emitting them blindly. If you’re on
an older runpod-deploy paired with a fresh runpodctl install, or
on a newer runpod-deploy paired with an old runpodctl, the probe
narrows down the mismatch — but the underlying constraint is whatever
runpodctl pod create --help advertises.
Fix: check your tooling versions first. runpodctl version shows
the locally-installed binary; pip show runpod-deploy shows the
Python package. Then either upgrade runpod-deploy to get the
auto-skip behavior, or upgrade runpodctl itself to gain the
underlying flag.
When the probe skips a flag, the WARNING reads:
runpodctl pod create does not support --<flag> in the locally-installed version; skipping ... — the pod still launches without the flag.
If a flag is essential for your workload (e.g., --spot), upgrading
runpodctl is the only path; the runpod-deploy probe can detect
absence but can’t synthesize the underlying feature.
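The probe-and-skip behavior described above can be pictured as a filter over the help text. This is a minimal sketch under stated assumptions, not the actual runpod-deploy implementation: `split_supported_flags` and the sample help output are hypothetical, and the real probe may parse the output differently.

```python
import re


def split_supported_flags(help_text: str, wanted: list[str]) -> tuple[list[str], list[str]]:
    """Partition wanted flags into (supported, skipped) based on --help output."""
    # Collect every long-flag token that appears anywhere in the help text.
    advertised = set(re.findall(r"--[a-z0-9][a-z0-9-]*", help_text))
    supported = [flag for flag in wanted if flag in advertised]
    skipped = [flag for flag in wanted if flag not in advertised]
    return supported, skipped


# Hypothetical help output from an older runpodctl build (illustration only).
SAMPLE_HELP = """
Flags:
      --gpu-type string    GPU type to request
      --image string       container image
      --name string        pod name
"""

supported, skipped = split_supported_flags(
    SAMPLE_HELP, ["--image", "--spot", "--min-vcpu-count"]
)
for flag in skipped:
    print(f"WARNING: pod create does not support {flag}; skipping")
```

Matching whole flag tokens rather than substrings keeps a flag such as `--spot` from being treated as supported just because a longer flag with the same prefix exists.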
RuntimeError: pod <id> did not become SSH-ready#
Symptom: runpodctl pod create succeeds, the pod transitions to
RUNNING, but the SSH proxy never publishes a host/port within the
deadline. The trimmed error message looks like:
RuntimeError: pod abc123 did not become SSH-ready in 900s;
last={'desiredStatus': 'RUNNING', 'ssh': {'error': 'pod not ready', 'status': 'RUNNING'},
'uptimeSeconds': 0}
Diagnosis: image pull/extract on a cold host (no local cache)
takes longer than budget.ssh_ready_timeout_sec. Common with
cudnn-devel pytorch images (~6–12 GB) in datacenters or on GPU
classes you don’t use often.
If the wait was longer than 60 s, you should also see periodic
heartbeat INFO logs like
[pod] abc123 waiting for SSH; T=120s status='RUNNING' ssh.error='pod not ready' uptimeSeconds=0 —
that confirms the diagnosis is “still pulling, not stuck”.
Fix (persistent) — bump the timeout in YAML:
budget:
  ssh_ready_timeout_sec: 1500  # 25 min; default is 900
Fix (one-off debugging) — use the CLI flag without editing YAML:
runpod-deploy run --config foo.yaml --ssh-ready-timeout-sec 1500
Safety: when the timeout expires, the orchestrator deletes the
orphaned pod before re-raising (see PR #89 / cleanup_pod orphan
hook). The longer timeout does not leak billing — it just fails the
run later.
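The wait, heartbeat, and delete-before-raise sequence can be sketched as a polling loop. This is an illustrative sketch, not the orchestrator's code: `wait_for_ssh`, the `poll` and `cleanup` callables, and the `host` key in the SSH payload are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_ssh(
    poll: Callable[[], dict],
    cleanup: Callable[[], None],
    timeout_sec: float = 900,
    heartbeat_sec: float = 60,
    interval_sec: float = 5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until SSH details appear; on timeout, clean up and then raise."""
    start = clock()
    next_beat = heartbeat_sec
    last: dict = {}
    while clock() - start < timeout_sec:
        last = poll()
        ssh = last.get("ssh", {})
        if ssh.get("host") and "error" not in ssh:
            return ssh
        elapsed = clock() - start
        if elapsed >= next_beat:
            # Periodic heartbeat so a long image pull is visibly "still waiting".
            print(f"[pod] waiting for SSH; T={int(elapsed)}s last={last}")
            next_beat += heartbeat_sec
        sleep(interval_sec)
    cleanup()  # delete the orphaned pod before surfacing the error
    raise RuntimeError(f"pod did not become SSH-ready in {timeout_sec}s; last={last}")
```

Because cleanup runs before the exception propagates, raising `timeout_sec` only delays the failure; it does not change whether the pod is released.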
Pod stuck at uptimeSeconds: 0 forever — phantom image tag#
Symptom: runpodctl pod create succeeds, the pod transitions to
desiredStatus: RUNNING, but uptimeSeconds stays at 0 indefinitely
and ssh.error is pod not ready for the full timeout window. The
trimmed error eventually surfaces as the previous entry’s
RuntimeError: pod <id> did not become SSH-ready.
Diagnosis: RunPod’s API accepts any image string at pod-create
time without validating against the registry. A typo’d or stale
pod.image tag (e.g. a phantom version like
runpod/pytorch:2.5.0-py3.13-cuda12.4.1-cudnn9-devel that does not
exist on Docker Hub) provisions a pod whose container runtime then
sits in image-pull-backoff forever. The SSH proxy never publishes
because the container never starts. Consumer evidence (2026-05-17):
two pods burned ~$0.62 before diagnosis on a tag that didn’t exist.
Diagnostic command:
runpod-deploy validate --config foo.yaml --check-image-registry
This HEAD-checks pod.image against Docker Hub’s tag API. A 404
prints a loud WARNING with the exact phantom-tag string. Wired into
--all so the full pre-flight catches this:
runpod-deploy validate --config foo.yaml --all
Fix: open Docker Hub and confirm the tag exists at
https://hub.docker.com/r/<owner>/<image>/tags. Update pod.image
to a tag that does exist.
Offline / CI workflows: pass --skip-registry-check to suppress
the HEAD lookup when Docker Hub is unreachable.
Limits: only Docker Hub is checked. Non-Docker-Hub registries
(ghcr.io, quay.io, private registries with a <host>/ prefix) are
detected and skipped silently — we have no portable cross-registry
tag API.
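The registry detection and lookup described above can be approximated as follows. This is a hedged sketch: `docker_hub_tag_url` is a hypothetical helper, the URL shape reflects the public Docker Hub v2 repositories endpoint as commonly documented, and digest references (`@sha256:...`) are not handled.

```python
from typing import Optional


def docker_hub_tag_url(image: str) -> Optional[str]:
    """Return a Docker Hub tag-lookup URL, or None for images on other registries."""
    # The tag separator is a ":" that appears after the last "/".
    last_slash = image.rfind("/")
    colon = image.rfind(":")
    if colon > last_slash:
        name, tag = image[:colon], image[colon + 1:]
    else:
        name, tag = image, "latest"

    parts = name.split("/")
    first = parts[0]
    # A leading component with "." or ":" (or "localhost") names a registry host.
    if len(parts) > 1 and ("." in first or ":" in first or first == "localhost"):
        if first != "docker.io":
            return None  # ghcr.io, quay.io, private registries: skipped
        parts = parts[1:]
    if len(parts) == 1:
        parts = ["library", parts[0]]  # official images live under "library/"
    repo = "/".join(parts)
    return f"https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"
```

A 404 response from the returned URL would indicate a tag that does not exist; a `None` return corresponds to the "detected and skipped" case for non-Docker-Hub registries.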
no configured GPU is available post-provision#
Symptom: the orchestrator emits
RuntimeError: no configured GPU is available in EU-RO-1; observed={...}
after the pod create call but before SSH waits.
Diagnosis: one of two things:
- Name mismatch: your YAML’s gpu_order lists NVIDIA RTX 4090, but the actual RunPod API name is NVIDIA GeForce RTX 4090. The live datacenter dict doesn’t have your key.
- Real stock-out: every entry in pod.gpu_order is empty-stock in every datacenter in pod.datacenters.
Fix:
- Run runpod-deploy validate --check-availability (or --all) before run; it surfaces the mismatch + stock state upfront.
- Use runpod-deploy gpu-list --datacenter EU-RO-1 to see exact names + current stock + per-hour prices.
- Widen pod.gpu_order to span more classes (the failover walks them in order); widen pod.datacenters for DC-level stock-out resilience.
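A name mismatch of the kind described above can often be spotted by checking whether every word of the configured name appears in a live name. The sketch below is an illustration only: `suggest_gpu_names` is hypothetical, and the list of live names stands in for whatever `gpu-list` reports.

```python
def suggest_gpu_names(configured: list[str], live_names: list[str]) -> dict[str, list[str]]:
    """For each configured name missing from live_names, list likely intended names."""
    suggestions: dict[str, list[str]] = {}
    for name in configured:
        if name in live_names:
            continue  # exact match, nothing to suggest
        tokens = set(name.lower().split())
        # A live name is a candidate if it contains every word of the configured name.
        suggestions[name] = [
            live for live in live_names if tokens <= set(live.lower().split())
        ]
    return suggestions


# Hypothetical snapshot of live API names (illustration only).
LIVE = ["NVIDIA GeForce RTX 4090", "NVIDIA RTX A6000", "NVIDIA H100 80GB HBM3"]
hints = suggest_gpu_names(["NVIDIA RTX 4090", "NVIDIA RTX A6000"], LIVE)
```

Word-subset matching is used here instead of a character-level similarity score because short model numbers such as 4090 and A6000 can score deceptively close to each other.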
Permission denied (publickey,password) on SSH#
Symptom: pod creates, but _wait_for_sshd retries indefinitely
or fails with auth errors.
Diagnosis: runpodctl doctor’s ssh_key.synced_to_cloud: true
only means some ed25519 key is synced — not necessarily your local
~/.ssh/id_ed25519. The pod’s authorized_keys is populated from
the account-wide registered keys, and existing pods don’t pick up
newly-added keys.
Fix:
- Run runpodctl ssh list-keys and match the pubkey content against cat ~/.ssh/id_ed25519.pub.
- If absent: runpodctl ssh add-key --key-file ~/.ssh/id_ed25519.pub, then stop the current pod and runpod-deploy run again. New pods get the updated keys.
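When matching keys by eye, note that the trailing comment on a public-key line can differ between the local file and the registered copy. A minimal sketch of a comparison that ignores the comment (the helper names and the key material are hypothetical):

```python
def key_identity(pubkey_line: str) -> tuple[str, str]:
    """Return (key type, base64 body) from an OpenSSH public-key line."""
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public-key line")
    return fields[0], fields[1]


def same_public_key(a: str, b: str) -> bool:
    """Compare two public-key lines while ignoring the trailing comment."""
    return key_identity(a) == key_identity(b)


# Placeholder key bodies for illustration; real keys are much longer.
local_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIexample me@laptop"
cloud_key = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIexample runpod-registered"
```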
Network volume not mountable#
Symptom: pod creation succeeds but no /workspace/ directory.
Diagnosis: storage.mode: network_volume requires
pod.cloud_type: SECURE. Community pods can’t mount network volumes.
Also, network volumes pin the pod to one datacenter — failover
across pod.datacenters is effectively single-element when you’re
using a network volume.
Fix:
- Switch to cloud_type: SECURE, OR
- Switch to storage.mode: ephemeral and stage your data via staging: instead.
runpod-deploy validate warns when network_volume is paired with len(pod.datacenters) > 1.
Staging failures#
Distribution not found at: file:///workspace/runpod-deploy#
Symptom: pod-side uv sync errors trying to install
runpod-deploy as a dep.
Diagnosis: the consumer pyproject lists runpod-deploy in
[project.dependencies] (often with [tool.uv.sources] pointing
at a local path). But runpod-deploy is a local-only orchestrator
— the pod runs the consumer’s code, not the orchestrator. The pod
doesn’t need it.
Fix:
- Remove runpod-deploy from [project.dependencies] and any matching [tool.uv.sources] entry in the consumer pyproject.
- runpod-deploy validate --scan-consumer (or --all) catches this statically before the pod runs.
FileNotFoundError: /Users/<name>/... on pod#
Symptom: pod runs, but immediately fails reading a file at
a path like /Users/brandonbehring/foo/bar.yaml.
Diagnosis: consumer code has hardcoded a local absolute path. Works on the dev machine; breaks on every pod.
Fix:
- Refactor to use Path(__file__).parent / "..." or an explicit staging: entry that pushes the data file under {remote_repo}.
- runpod-deploy validate --scan-consumer (or --all) greps the staged payload for /Users/, /home/, C:\Users\ patterns and WARNs before the pod runs.
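The kind of static scan described above can be approximated with a single regular expression. This is a sketch of the idea, not the validator's implementation; `find_hardcoded_paths` is a hypothetical helper, and the three patterns are the ones named in the text.

```python
import re

# Patterns named above: macOS, Linux, and Windows home-directory prefixes.
HOME_PATH_PATTERN = re.compile(r"/Users/|/home/|C:\\Users\\")


def find_hardcoded_paths(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if HOME_PATH_PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits


SAMPLE_SOURCE = (
    'cfg = "/Users/alice/foo/bar.yaml"\n'
    'rel = Path(__file__).parent / "bar.yaml"\n'
)
hits = find_hardcoded_paths(SAMPLE_SOURCE)
```

A scan like this is deliberately coarse: it will also flag comments and docstrings, which is usually acceptable for a warning-level check.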
project_root resolved to $HOME — this would stage your entire home directory#
Symptom: validate raises ValueError: project_root resolved to $HOME (...).
Diagnosis: local.project_root: ../../.. (one .. too many)
when the YAML lives at <consumer>/configs/runpod/<job>.yaml. The
correct value is ../.. — one to escape runpod/, one to escape
configs/.
Fix:
- Set local.project_root: ../.. for the standard <repo>/configs/runpod/<job>.yaml layout.
- The guard prevents a catastrophic rsync -a $HOME/ pod:/workspace/repo/ from running.
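The off-by-one is easy to see by resolving both values against the YAML's directory. A minimal sketch, assuming project_root is resolved relative to the directory containing the YAML file (which the "one to escape runpod/, one to escape configs/" rule implies); the paths and the function name are hypothetical.

```python
import posixpath


def resolve_project_root(yaml_path: str, project_root: str, home: str) -> str:
    """Resolve project_root relative to the YAML's directory; refuse $HOME."""
    base = posixpath.dirname(yaml_path)
    resolved = posixpath.normpath(posixpath.join(base, project_root))
    if resolved == posixpath.normpath(home):
        raise ValueError(f"project_root resolved to $HOME ({resolved})")
    return resolved


YAML_PATH = "/Users/me/consumer/configs/runpod/job.yaml"  # hypothetical layout
HOME = "/Users/me"

good = resolve_project_root(YAML_PATH, "../..", HOME)  # the consumer repo root
try:
    resolve_project_root(YAML_PATH, "../../..", HOME)  # one ".." too many
    guard_tripped = False
except ValueError:
    guard_tripped = True
```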
Stock runpod/pytorch:* images have no rsync#
Symptom: first staging step errors with bash: rsync: command not found.
Diagnosis: RunPod’s stock PyTorch images ship without rsync.
runpod-deploy uses rsync --info=progress2 for the staging push;
if the binary is missing the SSH command fails.
Fix: install rsync in a setup: command before any staging:
setup:
  - command: |
      which rsync >/dev/null 2>&1 || {
        apt-get update -qq && \
        DEBIAN_FRONTEND=noninteractive apt-get install -y -qq rsync
      }
    timeout_sec: 300
Default rsync excludes silently drop data files#
Symptom: pod-side code looks for evals/ or artifacts/ data,
file not found.
Diagnosis: staging[].excludes_default: true (v0.4.0) adds the
hygiene preset (.git/, .venv/, caches). It does NOT exclude
evals/, artifacts/, or data dirs — but if you ALSO set
excludes_extra: ["evals/", "artifacts/"] you’ll drop those.
Fix: review the merged exclude list at the entry’s
effective_excludes property. Move data dirs out of excludes_extra
or use a stricter glob (evals/tmp/ instead of evals/).
Setup failures#
CUDA initialization: NVIDIA driver too old#
Symptom: pod runs, nvidia-smi works, but
torch.cuda.is_available() returns False with
UserWarning: CUDA initialization: The NVIDIA driver on your system is too old.
Diagnosis: uv sync installed a torch wheel that requires a newer
CUDA than the pod’s driver provides. As of 2026-05, RunPod ships
drivers supporting CUDA up to 12.8; default PyPI torch wheels may
require CUDA 12.9+.
Fix: pin torch to a CUDA-specific wheel index in the consumer
pyproject.toml:
[tool.uv]
environments = ["sys_platform == 'linux'"]
[tool.uv.sources]
torch = { index = "pytorch-cu128" }
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
When RunPod ships pods with newer drivers, bump the index URL.
uv python install fails on the pod#
Symptom: with pod.python_version set (v0.5.0), the auto-injected
preflight step exits non-zero. The run aborts before user preflight.
Diagnosis (per the v0.5.0 PR-G “fail-fast” decision): one of
- uv itself isn’t installed on the base image (the most likely cause; install it in your setup: before the python_version pin tries to use it).
- Network blip during the interpreter download.
- The requested version doesn’t exist (e.g., python_version: "3.99").
Fix:
- Ensure setup: includes curl -LsSf https://astral.sh/uv/install.sh | sh before staging happens.
- Run uv python list locally to confirm the version string exists.
- For transient network failures, simply retry the runpod-deploy run.
uv sync hangs silently with .venv partially populated#
Symptom: pod-side uv sync hangs after starting wheel installation.
ps shows the uv PID alive at 0% CPU; /workspace/.venv is frozen at
a few MB (not growing); lsof -p <uv_pid> reveals open file descriptors
under /workspace/uv_cache/.tmp* unpack dirs. No error message; no
stack trace; preflight times out after the configured timeout_sec
and the orchestrator aborts.
Diagnosis: RunPod mounts /workspace as a distributed FUSE
filesystem (confirm with df -hT /workspace — returns
mfs#<dc>.runpod.net:9421 type fuse). uv’s default
--link-mode=hardlink triggers Stale file handle (os error 116)
errors when installing many wheels onto this FS in tight loops. uv
either retries silently or stalls on a stat() call. The hang is
indistinguishable from a slow network read in ps/lsof.
Fix: add UV_LINK_MODE=copy to your remote_env.exports:
remote_env:
  exports:
    UV_LINK_MODE: copy  # avoid stale-file-handle on FUSE-mounted /workspace
With this set, uv copies each file into the venv instead of hardlinking it. That adds roughly 10–30 s to a typical venv populate and eliminates the hardlink hang.
If UV_LINK_MODE=copy alone is insufficient, two FUSE-related failure
modes can hit before or after the wheel-install phase that copy-mode does
NOT address:
- uv sync stalled in git reset --hard during resolution (see “uv sync hangs in git reset --hard” below). Fix: pin UV_CACHE_DIR to /root/uv_cache (overlay disk).
- HF Trainer checkpoint save stalled (see “HF Trainer checkpoint save hangs” below). Fix: put output_dir on /root and rsync checkpoints back in your run.body trailer.
Both fixes pin write-heavy directories to the overlay disk (where POSIX
locks work normally) rather than fighting FUSE’s F_SETLKW behavior. See
uv#17801,
MooseFS discussion #380,
and the Linux kernel request_wait_answer hang patch (2025-12-23)
for upstream context.
For genuinely separate network-stall symptoms (single wheel download stuck
mid-stream rather than a stalled stat() / flock()), also add
UV_HTTP_TIMEOUT=120 (bounds any single HTTP read at 120s) and optionally
UV_CONCURRENT_DOWNLOADS=4 (caps concurrent downloads; default 50
amplifies head-of-line blocking on stalled sockets).
uv sync hangs in git reset --hard during resolution phase#
Symptom: uv sync hangs BEFORE installing wheels, while resolving
git+https://... dependencies. ps shows two stuck processes: a git reset --hard <sha> PID in D (uninterruptible) state with WCHAN
request_wait_answer, and the parent uv sync PID in futex_wait_queue.
lsof shows the git PID holding
/workspace/uv_cache/git-v0/checkouts/.../.git/index.lock. uv has not
yet started populating .venv/ — the hang is during resolution, not
install.
Diagnosis: with UV_CACHE_DIR: /workspace/uv_cache, uv clones
git+https://... deps (e.g. consumer-side [project.optional-dependencies] dev that references your own toolkits via git+https) into the
FUSE-backed cache. Each clone runs git reset --hard <sha> to materialize
the pinned revision; git acquires an flock() exclusive lock on
.git/index.lock via F_SETLKW. MooseFS’s F_SETLKW path is unreliable
on FUSE (see MooseFS discussion #380)
and the syscall stalls indefinitely in request_wait_answer. This happens
BEFORE the wheel-install phase, so UV_LINK_MODE=copy (which only affects
the install-phase hardlink path) does not prevent it.
Fix: move UV_CACHE_DIR off /workspace onto the pod’s overlay disk:
remote_env:
  exports:
    UV_CACHE_DIR: /root/uv_cache  # overlay disk; not FUSE
    UV_LINK_MODE: copy            # still good as defense-in-depth
/root is the container’s overlay disk (verify with df -hT /root —
type overlay, NOT fuse). POSIX locks work normally there. uv_cache
is ephemeral anyway (re-populated each fire); putting it on /root
sacrifices nothing.
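The df -hT check can also be scripted, for example as a preflight assertion that the cache directory is not on FUSE. This is a rough sketch: `filesystem_type` is a hypothetical helper, the sample mount table is illustrative (the exact type string on a real pod may differ, e.g. a `fuse.`-prefixed name), and escaped spaces in mount points are not handled.

```python
def filesystem_type(path: str, mounts_text: str) -> str:
    """Return the fs type of the longest mount point that contains path."""
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Match on whole path components so /workspace2 is not under /workspace.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Illustrative mount table in /proc/mounts format (not captured from a real pod).
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
mfs#dc.runpod.net:9421 /workspace fuse rw,nosuid,nodev 0 0
"""

cache_fs = filesystem_type("/workspace/uv_cache", SAMPLE_MOUNTS)
root_fs = filesystem_type("/root/uv_cache", SAMPLE_MOUNTS)
```

On a pod, the second argument would be the contents of /proc/mounts; a result starting with `fuse` for the cache or checkpoint directory suggests moving it to /root.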
HF Trainer checkpoint save hangs on FUSE-backed output_dir#
Symptom: Hugging Face Trainer completes a training step successfully,
then hangs in model.save_pretrained() or Trainer._save(). tqdm bar
shows Writing model shards: 0%|. The main .safetensors shard may
write successfully (large file, ~300 MB), but subsequent small files
(optimizer.pt, scheduler.pt, tokenizer.json, trainer_state.json,
config.json) never appear. ps shows the trainer PID alive at moderate
CPU (50-90%) with one thread on WCHAN request_wait_answer. The hang
typically resolves within 10 minutes (FUSE eventually grants the lock)
or times out the run.
Diagnosis: same MooseFS F_SETLKW class as the git-resolution and
install-phase hangs, but here the lock holder is HF Trainer’s atomic-save
protocol. Trainer._save() writes each checkpoint file to a tempname
then atomically renames into place, with intermediate flock() / POSIX
locks for crash-consistency. On FUSE-backed output_dir, the lock
acquisition stalls.
Fix: keep checkpoint output_dir on the pod’s overlay disk too. Two
options depending on whether you want to ship checkpoints back:
- Train on /root and rsync checkpoints back as a run.body trailer. Set the Trainer’s output_dir (configurable via your training script or TrainingArguments.output_dir) to e.g. /root/checkpoints/, then in run.body after the training command:
    uv run python scripts/train.py --output-dir /root/checkpoints
    rsync -az /root/checkpoints/ /workspace/<artifact_dir>/
  Best of both worlds: locks work during training; final checkpoints land on the volume for orchestrator artifact pull.
- Disable per-epoch checkpoint save entirely. Set save_strategy: "no" in your TrainingArguments. Only viable if predictions parquets are your real analysis input and you don’t need re-runnable checkpoints.
Predictions parquets written via custom callbacks usually don’t trigger
this because (a) they’re written as a single pq.write_table() call
rather than a multi-file atomic-rename dance, and (b) they’re small
enough that any FUSE-write race resolves before the next epoch starts.
Run failures#
ValueError: flash_attention_2 is not supported#
Symptom: transformer scorer loads cleanly on H100; fails on A6000 / RTX A4000 with this error.
Diagnosis: flash_attention_2 isn’t supported across all GPU
generations. With pod.gpu_order listing multiple classes (the
recommended pattern for stock-out resilience), some shards land on
GPUs that don’t support it.
Fix: wrap the model load with a try/except per
recipes/flash-attention-fallback.md:
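from transformers import AutoModel

load_kwargs = {}  # any other from_pretrained arguments your scorer needs
try:
    encoder = AutoModel.from_pretrained(
        model_id, attn_implementation="flash_attention_2", **load_kwargs
    )
except (ValueError, ImportError):
    # FlashAttention 2 is unavailable here; use the default attention implementation.
    encoder = AutoModel.from_pretrained(model_id, **load_kwargs)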
Pod killed mid-run; manifest shows pod_final_state=pod_killed_unexpected#
Symptom: the run script started, ran for a while, then was killed.
The manifest’s pod_final_state is pod_killed_unexpected rather
than EXITED.
Diagnosis: RunPod-side kill (OOM, host issue, spot-pod preemption, or quota). Telemetry’s pod-kill detector caught it and recorded the state.
Fix:
- For OOM: scale pod.container_disk_gb, scale model precision, or pick a higher-VRAM GPU class.
- For spot preemption: drop pod.spot: true (if you opted in).
- For host issues: retry. The sweep recipe pattern handles this with exponential backoff (recipes/multi-config-sweep.md).
- To investigate across many runs: runpod-deploy events-query --filter event=pod_killed_unexpected --since 30d --json.
Sweep-driver failures#
Driver appears to “exit cleanly” but only 1 of N shards launched#
Symptom: bash driver.sh | tee log exits 0, but log shows the
driver died mid-script after launching the first shard. The
tee-fronted pipeline returned tee’s success code, masking the
bash failure.
Diagnosis: pipefail was missing from the set line. Without
it, the pipeline’s exit code is tee’s, not bash’s.
Fix: every sweep-driver header must include pipefail:
set -euo pipefail # ← all three; -e + -u + -o pipefail
See recipes/multi-config-sweep.md
“Pitfalls” §1.
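The masking effect can be reproduced in isolation by running the same failing pipeline with and without pipefail. This sketch assumes bash and tee are available on the machine running it; `pipeline_exit_code` is a hypothetical helper, and `false` stands in for a driver that dies.

```python
import subprocess


def pipeline_exit_code(pipefail: bool) -> int:
    """Run `false | tee /dev/null` in bash and return the pipeline's exit code."""
    prefix = "set -o pipefail; " if pipefail else ""
    script = prefix + "false | tee /dev/null"
    return subprocess.run(["bash", "-c", script]).returncode


masked = pipeline_exit_code(pipefail=False)   # tee's exit status wins
surfaced = pipeline_exit_code(pipefail=True)  # the failing stage's status wins
```

Note that pipefail only affects pipelines run by the shell in which it is set, so it needs to be active wherever the `| tee` pipeline itself is executed.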
wait -n + set -e kills the driver on the first failed shard#
Symptom: in a parallel sweep, the first shard failure kills the driver before the remaining shards launch.
Diagnosis: wait -n returns the exit code of the finishing
background job. With set -e, a non-zero exit from wait -n kills
the script immediately.
Fix: wrap wait -n to suppress its exit code; collect failures
per-shard inside the launcher function instead:
while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
  wait -n 2>/dev/null || true
done
See recipes/multi-config-sweep.md
“Pitfalls” §2 for the full corrected pattern.
ls -td artifacts/runpod/* | head -1 returns the wrong shard’s run-dir#
Symptom: failure classifier reads a healthy sibling’s
events.jsonl and misclassifies a transient pod-kill as a
non-retryable training failure.
Diagnosis: at MAX_PARALLEL > 1, multiple runpod-deploy run
invocations write to concurrent artifacts/runpod/<ts>/ dirs. ls -td
sorts by modification time, so it returns whichever sibling’s directory
changed most recently, which is not necessarily the failed shard
you wanted to inspect.
Fix: use runpod-deploy run --print-run-dir (v0.4.0 PR-B) which
emits a single RUN_DIR=<absolute-path> line on stdout. Capture
per-attempt stdout via tee and grep for the line:
local stdout_log="/tmp/sweep_attempt_${seed}.log"
if runpod-deploy run --print-run-dir ... 2>&1 | tee "$stdout_log"; then
  return 0
fi
local this_run_dir
this_run_dir=$(grep -oE '^RUN_DIR=.*' "$stdout_log" | head -1 | cut -d= -f2-)
See recipes/multi-config-sweep.md
“Pitfalls” §3.
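For drivers written in Python rather than bash, the same RUN_DIR extraction can be done on the captured output. A minimal sketch, assuming only the single `RUN_DIR=<absolute-path>` line format described above; the function name and sample log are hypothetical.

```python
from typing import Optional


def extract_run_dir(stdout_text: str) -> Optional[str]:
    """Return the path from the first RUN_DIR=<path> line, or None if absent."""
    for line in stdout_text.splitlines():
        if line.startswith("RUN_DIR="):
            # Split once so a path containing "=" is preserved intact.
            return line.split("=", 1)[1]
    return None


# Hypothetical captured output from one attempt (illustration only).
SAMPLE_LOG = (
    "[pod] abc123 waiting for SSH\n"
    "RUN_DIR=/repo/artifacts/runpod/20260515T120000Z\n"
    "[run] exited non-zero\n"
)
run_dir = extract_run_dir(SAMPLE_LOG)
```

Reading the path from the attempt's own output ties the classifier to that attempt, regardless of what sibling shards are writing at the same time.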
Forensic recovery#
When something failed and you want to know what — these are the tools.
“Which DCs failed over most often this month?”#
runpod-deploy events-query --filter event=datacenter_failover --since 30d --json
“Show me every pod killed unexpectedly in the last week”#
runpod-deploy events-query --filter event=pod_killed_unexpected --since 7d --json
“What did this specific run do, minute by minute?”#
runpod-deploy events artifacts/runpod/20260515T120000Z
“Aggregate cost + failure rate across a multi-shard sweep”#
runpod-deploy manifest-summary --root artifacts/runpod
Outputs per-run summaries plus a == TOTALS == footer with the
manifest count, failure count, summed wall time, summed estimated cost.
“Compare two runs side-by-side”#
runpod-deploy compare-runs artifacts/runpod/20260515T120000Z \
artifacts/runpod/20260515T130000Z
Exit 1 if either manifest has failed: true — pairs well with CI
gating in driver scripts.
“List recent runs”#
runpod-deploy ls-runs --limit 20
Prints a table of recent run-dir manifests that have been pulled back to the local machine, with pod_id, GPU, datacenter, wall time, failure flag, and estimated cost.
Cost / cleanup#
These are the symptoms of the 2026-05-17 leak — and how to recognize recurrences early.
Stale paused pods are billing indefinitely#
Symptom: runpodctl user shows currentSpendPerHr > 0 despite
no apparent activity. runpodctl pod list (no -a) returns [] (no
RUNNING pods), but runpodctl pod list -a shows many EXITED entries.
Diagnosis: stopped pods retain their volume disk at ~$0.10/GB·month
indefinitely. runpodctl pod stop only pauses compute — it does
not release storage. The leak is silent: no GPU bill, just slow
accumulation on the volume side.
Fix:
# Audit (read-only): inventory + estimated daily cost
runpod-deploy ls-stale
# Release every paused pod (irreversible)
runpod-deploy cleanup --all-stopped --yes
Backstory: On 2026-05-17 this repo’s account had 76 EXITED pods totaling 3,930 GB ≈ $26/day idle burn. Account balance was 12 h from negative when caught. Read lifecycle.md §7b for the post-mortem and the hygiene workflow.
To prevent recurrence: the v0.9 schema defaults to lifecycle.on_success: delete,
so successful runs release disk automatically. Failed runs still
preserve a paused pod for SSH forensics (on_failure: stop); the
orchestrator emits a multi-line WARNING with the exact release
command so the operator is never expected to remember the cleanup
syntax.
My failed run preserved a pod and I want to release it#
Symptom: After a failed runpod-deploy run, you see a WARNING
like:
[lifecycle] pod 'abc123' stopped for forensics.
Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
When done investigating, release with:
runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
Or audit all stale pods:
runpod-deploy ls-stale
Diagnosis: the run failed and the on_failure: stop default
paused the pod for SSH forensics. You’ve finished investigating and
want the volume disk back.
Fix: copy the runpod-deploy cleanup ... command from the
WARNING and run it. The default --mode is delete so the disk is
released. The state file is unlinked on success.
If you didn’t actually need SSH forensics for this workflow, switch
to lifecycle: {on_failure: delete} in the config so failed runs
release disk automatically (skip the manual cleanup step).
I want to keep payload state between runs (avoid re-uploading)#
Symptom: every runpod-deploy run re-pulls the Docker image
(~2–5 min), re-runs setup: (apt install, uv venv), and re-rsyncs
the staging payload (1–5 min for typical repos). Across a 100-job
sweep that’s hours of wall time.
Diagnosis: with storage.mode: ephemeral, the volume is
destroyed when the pod is destroyed. Every successful run with
lifecycle.on_success: delete (the new default) starts over from a
fresh image.
Fix: switch the workflow to storage.mode: network_volume with
a named, pre-created volume. The volume persists across pods; rsync
becomes incremental (only changed bytes go over); image layer cache
and uv venv survive in /workspace. See
recipes/payload-reuse-via-network-volume.md
for the step-by-step.
Trade-off: a 100 GB network volume costs ~$7/month sitting idle; network volumes are pinned to one datacenter.
Predictions discipline (consumer-side gotcha)#
This isn’t a runpod-deploy bug — it’s a recurring pattern in consumer-repo design that costs real money when missed.
Symptom: post-hoc you want to recompute a metric, ECE/Brier calibration, or paired-bootstrap delta. The eval pipeline only pulled summary metrics. You have to re-run inference on a fresh pod (~$5, ~30–80 min).
Diagnosis: only summary metrics (PR-AUC, ROC-AUC, recall@FPR at
fixed pinpoints) were persisted. Per-row predictions / y_score were
generated, used, and discarded with the pod.
Fix: persist per-row predictions alongside summary metrics. The
pattern is documented in
recipes/predictions-only-eval.md.
Pull the parquet via artifacts: before pod teardown:
artifacts:
  - label: predictions
    remote_path: "{remote_repo}/evals/v5_canonical/predictions/"
    local_path: "{project_root}/evals/v5_canonical/predictions/"
    required: true
For trained adapters / LoRA checkpoints, also push to HF Hub before pod teardown — local pod artifacts are destroyed with the pod.
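To illustrate why per-row predictions are worth keeping: once labels and scores are on disk, a metric such as the Brier score can be recomputed locally in a few lines, with no pod involved. The sketch uses plain Python lists in place of the parquet columns, and the sample values are made up.

```python
def brier_score(y_true: list[int], y_score: list[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    if not y_true or len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must be non-empty and equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_score)) / len(y_true)


# Made-up per-row predictions standing in for columns read from the parquet.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4]
brier = brier_score(labels, scores)
```

The same persisted rows also support calibration curves or paired-bootstrap comparisons later, none of which can be derived from summary metrics alone.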
Still stuck?#
- Re-run with --verbose to see DEBUG output: runpod-deploy run --verbose --config foo.yaml
- Use --offline-dry-run to walk the lifecycle without provisioning; catches config issues for free.
- Inspect the manifest: runpod-deploy manifest-summary <run-dir>/...json for the full reproducibility record.
- Reach the maintainer with the run-dir tarball (tar czf rundir.tgz artifacts/runpod/<ts>/).