Recipe: forensics, then cleanup

Pattern: when a runpod-deploy run fails, the default lifecycle.on_failure: stop leaves the pod stopped (paused) so you can SSH in for a post-mortem. After investigating, release the volume disk explicitly so it doesn’t bill indefinitely.

Why this is a recipe, not a schema feature

The two halves — forensics and cleanup — are operator workflows, not configuration. runpod-deploy provides the affordances (ls-stale, cleanup --state-file, cleanup --all-stopped) and an actionable WARNING per failed run; chaining them into your team’s hygiene practice is a workflow concern.

The trigger: a per-run WARNING

When a run fails with lifecycle.on_failure: stop (the default), the orchestrator emits a multi-line WARNING ending with the exact release command:

[lifecycle] pod 'abc123' stopped for forensics.
  Volume disk (50 GB) continues billing at ~$0.17/day (~$5.00/mo) until released.
  When done investigating, release with:
      runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete
  Or audit all stale pods:
      runpod-deploy ls-stale

Treat this WARNING as a TODO. Until you act on it, the volume disk is billing.
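The figures in the sample WARNING are consistent with a flat per-GB monthly storage rate. As a rough sketch, assuming $0.10 per GB per month and a 30-day month (both inferred from the sample output above, not taken from a published rate card), the cost of leaving a stopped pod's volume in place can be estimated like this. The function name `volume_cost` is hypothetical.

```python
def volume_cost(size_gb, rate_per_gb_month=0.10, days_per_month=30):
    """Return (daily, monthly) storage cost in dollars for a stopped pod's volume.

    rate_per_gb_month is an assumption inferred from the sample WARNING
    (50 GB -> ~$5.00/mo); check current pricing before relying on it.
    """
    monthly = size_gb * rate_per_gb_month
    daily = monthly / days_per_month
    return daily, monthly


daily, monthly = volume_cost(50)
print(f"~${daily:.2f}/day (~${monthly:.2f}/mo)")  # ~$0.17/day (~$5.00/mo)
```

The per-day number looks small, which is exactly why it is easy to ignore; multiply it by the number of failed runs you expect per month to get a more honest figure.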

Step 1: Inspect offline (no SSH needed for most diagnoses)

Most failure modes are visible in the artifacts already pulled:

# What just happened?
runpod-deploy manifest-summary artifacts/runpod/<ts>/runpod_deploy_pull_manifest.json

# Tail the run log
tail -n 200 artifacts/runpod/<ts>/run.log

# Inspect telemetry (GPU mem over time, dmesg, etc.)
ls artifacts/runpod/<ts>/telemetry/

runpod-deploy events and events-query walk the event log for structured per-step status. Most “why did this fail” questions are answered without ever logging into the pod.
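If you find yourself repeating the tail step, it can be scripted. The following is a minimal sketch, assuming the `<ts>` directory names under artifacts/runpod/ sort chronologically as plain strings (for example ISO-style timestamps); that naming is an assumption, and `tail_latest_run_log` is a hypothetical helper, not part of runpod-deploy.

```python
from collections import deque
from pathlib import Path


def tail_latest_run_log(artifacts_root="artifacts/runpod", n=200):
    """Return the last n lines of run.log from the newest <ts> directory.

    Assumes <ts> directory names sort chronologically as strings.
    Returns an empty list if no run.log is found.
    """
    root = Path(artifacts_root)
    if not root.is_dir():
        return []
    # Only consider run directories that actually contain a run.log
    run_dirs = sorted(p for p in root.iterdir() if (p / "run.log").is_file())
    if not run_dirs:
        return []
    with (run_dirs[-1] / "run.log").open(encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final n lines in memory
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


if __name__ == "__main__":
    for line in tail_latest_run_log():
        print(line)
```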

Step 2: SSH only if needed

If the offline inspection isn’t enough:

# Resume the paused pod
runpodctl pod start <pod-id>

# Attach
runpodctl pod ssh <pod-id>

The pod retains its /workspace, so state inspection (looking at a .pyc file, re-running a single Python expression in the same venv) is still possible. When done, exit the SSH session and then delete the pod (Step 3): a resumed pod bills for compute on top of the storage you have been paying for since the failure.

Step 3: Release

Once forensics is complete, run the command from the WARNING:

runpod-deploy cleanup --state-file ~/.runpod-deploy-current --mode delete

The default --mode is delete, so runpod-deploy cleanup --state-file <path> is also sufficient. On success the state file is unlinked, the pod is deleted, and the volume disk is released.
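When the release step is driven from a script, the state file gives a cheap post-condition to check. The sketch below wraps the cleanup command shown above and treats a surviving state file as "not released". It assumes the CLI exits with status 0 on success, which this recipe does not state; `release_pod` and its injectable `run` parameter are hypothetical and exist only so the helper can be exercised without the real CLI.

```python
import subprocess
from pathlib import Path


def release_pod(state_file, run=subprocess.run):
    """Run the cleanup command and confirm the state file is gone afterwards.

    Assumption: a zero exit status means success. The recipe only states
    that the state file is unlinked on success, so both are checked.
    """
    path = Path(state_file).expanduser()
    cmd = ["runpod-deploy", "cleanup", "--state-file", str(path), "--mode", "delete"]
    result = run(cmd, check=False)
    # A leftover state file is treated as "cleanup did not complete"
    return result.returncode == 0 and not path.exists()


if __name__ == "__main__":
    ok = release_pod("~/.runpod-deploy-current")
    print("released" if ok else "NOT released; volume may still be billing")
```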

Step 4: Bulk-release after a sweep

If you ran a 50-config sweep and 12 failed, do not chase each one individually. Run a single bulk command after the sweep:

# What's still around?
runpod-deploy ls-stale

# Release everything (interactive y/N)
runpod-deploy cleanup --all-stopped

# Or non-interactive (for CI / cron)
runpod-deploy cleanup --all-stopped --yes

The bulk path collects failures rather than aborting on the first delete error, so one transient API hiccup doesn’t leave the rest billing.
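The collect-rather-than-abort behaviour described above is a general pattern worth reusing in your own tooling. The following is an illustrative sketch of that pattern only, not runpod-deploy's actual implementation; `delete_all` and the `delete_pod` callable are hypothetical names.

```python
def delete_all(pod_ids, delete_pod):
    """Try to delete every pod; return (deleted, failures) instead of raising early."""
    deleted, failures = [], []
    for pod_id in pod_ids:
        try:
            delete_pod(pod_id)
        except Exception as exc:
            # Keep going: one failed delete must not leave the rest billing
            failures.append((pod_id, str(exc)))
        else:
            deleted.append(pod_id)
    return deleted, failures


def flaky_delete(pod_id):
    # Stand-in for a real delete call; fails for one pod to simulate an API hiccup
    if pod_id == "pod-b":
        raise RuntimeError("transient API error")


deleted, failures = delete_all(["pod-a", "pod-b", "pod-c"], flaky_delete)
print(deleted)   # ['pod-a', 'pod-c']
print(failures)  # [('pod-b', 'transient API error')]
```

Returning the failures as data lets the caller decide whether to retry, alert, or exit non-zero once every pod has been attempted.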

Step 5: Add hygiene to your weekly rhythm

The leak that motivated this recipe (2026-05-17, 76 stale pods, $26/day) accumulated over weeks because no one was looking. Wire ls-stale into your regular review:

# Add to a personal Makefile / Justfile / weekly cron
runpod-deploy-audit:
	runpod-deploy ls-stale

Or, for an automated nudge: see recipes/stale-pod-audit.md for the JSON output pattern that feeds a Slack ping or GitHub Action.
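To put the leak figures quoted at the start of this step in perspective, a few lines of arithmetic show what each unreviewed week costs. This is a back-of-the-envelope sketch that assumes the pod count and daily rate stay constant over the period; `leak_summary` is a hypothetical helper.

```python
def leak_summary(daily_total_usd, pod_count, days):
    """Return (average daily cost per pod, total cost over `days`) in dollars.

    Assumes the number of stale pods and their rates do not change.
    """
    per_pod_daily = daily_total_usd / pod_count
    total = daily_total_usd * days
    return per_pod_daily, total


per_pod, week = leak_summary(26, 76, 7)
print(f"~${per_pod:.2f}/pod/day, ${week}/week if left untouched")
```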

What lives where

| Concern | Owner |
| --- | --- |
| Emitting the cleanup-required WARNING with the exact `runpod-deploy cleanup` command | `runpod-deploy run` (`provider._log_stop_cleanup_nudge`) |
| Listing EXITED pods + per-pod daily storage cost | `runpod-deploy ls-stale` |
| Resuming a paused pod for SSH forensics | `runpodctl pod start <id>` + `runpod-deploy logs --config <yaml>` |
| Bulk-releasing every EXITED pod | `runpod-deploy cleanup --all-stopped [--yes]` |
| Deciding which stale pods to release vs preserve for ongoing analysis | You (operator judgment) |
| Scheduling the audit (cron / GH Action / Slack ping) | Your hygiene rotation (see `stale-pod-audit.md`) |

Anti-pattern to avoid

Don’t keep lifecycle.on_failure: stop “just in case I want to debug later” if you don’t actually do post-mortems. stop is the default because it serves the case where you will SSH in; if your team never does, set:

lifecycle:
  on_success: delete
  on_failure: delete       # opt out of forensics, opt out of storage billing

Failed runs still pull artifacts and write the manifest before the delete, so runpod-deploy manifest-summary and the run log are still available for analysis.