# Recipe: forensics, then cleanup

**Pattern:** when a `runpod-deploy run` fails, the default `lifecycle.on_failure: stop` preserves the pod paused so you can SSH in for post-mortem. After investigation, release the volume disk explicitly so it doesn't bill indefinitely.

## Why this is a recipe, not a schema feature

The two halves, forensics and cleanup, are operator workflows, not configuration. `runpod-deploy` provides the affordances (`ls-stale`, `cleanup --state-file`, `cleanup --all-stopped`) and an actionable WARNING per failed run; chaining them into your team's hygiene practice is a workflow concern.

## The trigger: a per-run WARNING

When a run fails with `lifecycle.on_failure: stop` (the default), the orchestrator emits a multi-line WARNING ending with the exact release command:

```
[lifecycle] pod 'abc123' stopped for forensics.
  Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
  When done investigating, release with:
    runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
  Or audit all stale pods:
    runpod-deploy ls-stale
```

Treat this WARNING as a TODO. Until you act on it, the volume disk is billing.

## Step 1: Inspect offline (no SSH needed for most diagnoses)

Most failure modes are visible in the artifacts already pulled:

```bash
# What just happened?
runpod-deploy manifest-summary artifacts/runpod//runpod_deploy_pull_manifest.json

# Tail the run log
tail -n 200 artifacts/runpod//run.log

# Inspect telemetry (GPU mem over time, dmesg, etc.)
ls artifacts/runpod//telemetry/
```

`runpod-deploy events` and `events-query` walk the event log for structured per-step status. Most "why did this fail" questions are answered without ever logging into the pod.
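
The offline inspection in Step 1 can be given a scripted first pass over the pulled run log. The following is a minimal sketch, not a `runpod-deploy` feature: the function name `triage_run_log` and the list of error patterns are assumptions chosen for illustration, so adjust them to the failures your workload actually produces.

```bash
# Hypothetical helper (not a runpod-deploy command): print the lines of a
# run log that match common failure signatures, prefixed with line numbers.
# The pattern list is an assumption; extend it for your own workload.
triage_run_log() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "triage_run_log: cannot read $log_file" >&2
        return 2
    fi
    # -n: prefix line numbers; -i: case-insensitive; -E: extended regex.
    # grep exits 1 when nothing matches, which callers can treat as "no hits".
    grep -n -i -E 'traceback|out of memory|killed|no space left' "$log_file"
}
```

Point it at the `run.log` pulled for the failed run. An exit status of 1 means none of the patterns matched, which is a hint to move on to the telemetry directory or the event log.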
## Step 2: SSH only if needed

If the offline inspection isn't enough:

```bash
# Resume the paused pod
runpodctl pod start

# Attach
runpodctl pod ssh
```

The pod retains its `/workspace`, so any state inspection (look at a `.pyc` file, re-run a single Python expression in the same venv) is available. When done, exit the SSH session, but **don't forget to delete the pod**; resuming it puts compute billing back on top of the storage billing you've already been paying since the failure.

## Step 3: Release

Once forensics is complete, run the command from the WARNING:

```bash
runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
```

The default `--mode` is `delete`, so `runpod-deploy cleanup --state-file ` is also sufficient. The state file is unlinked on success; the pod is gone; volume disk is released.

## Step 4: Bulk-release after a sweep

If you ran a 50-config sweep and 12 failed, do not chase each one individually. Run a single bulk command after the sweep:

```bash
# What's still around?
runpod-deploy ls-stale

# Release everything (interactive y/N)
runpod-deploy cleanup --all-stopped

# Or non-interactive (for CI / cron)
runpod-deploy cleanup --all-stopped --yes
```

The bulk path collects failures rather than aborting on the first delete error, so one transient API hiccup doesn't leave the rest billing.

## Step 5: Add hygiene to your weekly rhythm

The leak that motivated this recipe (2026-05-17, 76 stale pods, $26/day) accumulated over weeks because no one was looking. Wire `ls-stale` into your regular review:

```bash
# Add to a personal Makefile / Justfile / weekly cron
runpod-deploy-audit:
	runpod-deploy ls-stale
```

Or, for an automated nudge: see [`recipes/stale-pod-audit.md`](stale-pod-audit.md) for the JSON output pattern that feeds a Slack ping or GitHub Action.
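
To put the cost of an unreviewed leak in concrete terms, the $26/day figure cited in this step can be turned into simple arithmetic. This is a rough sketch under stated assumptions: it treats the daily rate as flat (the real leak grew over time) and uses a 30-day month; `leak_cost` is a hypothetical helper name, not a `runpod-deploy` command.

```bash
# Hypothetical helper: total storage cost (USD) of a leak that bills at a
# flat daily rate for a given number of days. awk handles the decimal
# arithmetic, since POSIX shell arithmetic is integer-only.
leak_cost() {
    daily_rate="$1"
    days="$2"
    awk -v r="$daily_rate" -v d="$days" 'BEGIN { printf "%.2f\n", r * d }'
}

# The leak cited in this step billed at $26/day.
leak_cost 26 7     # one unreviewed week: prints 182.00
leak_cost 26 30    # one unreviewed 30-day month: prints 780.00
```

Even with the flat-rate simplification, the numbers make the case for a weekly `ls-stale` review rather than a monthly one.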
## What lives where

| Concern | Owner |
|---|---|
| Emitting the cleanup-required WARNING with the exact `runpod-deploy cleanup` command | `runpod-deploy run` (`provider._log_stop_cleanup_nudge`) |
| Listing EXITED pods + per-pod daily storage cost | `runpod-deploy ls-stale` |
| Resuming a paused pod for SSH forensics | `runpodctl pod start ` + `runpod-deploy logs --config ` |
| Bulk-releasing every EXITED pod | `runpod-deploy cleanup --all-stopped [--yes]` |
| Deciding *which* stale pods to release vs preserve for ongoing analysis | You (operator judgment) |
| Scheduling the audit (cron / GH Action / Slack ping) | Your hygiene rotation (see `stale-pod-audit.md`) |

## Anti-pattern to avoid

Don't leave `lifecycle.on_failure: stop` in place "just in case I want to debug later" if you don't actually do post-mortems. The default is `stop` for the case where you *will* SSH in; if your team never does, set:

```yaml
lifecycle:
  on_success: delete
  on_failure: delete # opt out of forensics, opt out of storage billing
```

Failed runs still pull artifacts and write the manifest before the delete, so `runpod-deploy manifest-summary` and the run log are still available for analysis.