# Recipe: weekly stale-pod audit

**Pattern:** wire `runpod-deploy ls-stale` into your regular hygiene rotation so storage drift is caught early — before it becomes a leak like the 2026-05-17 incident (76 stale pods, 3,930 GB, $26/day).

## Why this is a recipe, not a schema feature

`runpod-deploy ls-stale` is read-only and idempotent. Scheduling it is an operator concern; the SDK exposes the JSON output so that any delivery mechanism (cron, GitHub Action, Slack ping, terminal alias) composes on top.

## Option 1 — Terminal alias / Makefile target

The minimum-friction version. Read the table, decide whether to release.

```bash
# ~/.zshrc
alias stale='runpod-deploy ls-stale'
```

Or, as a target in your project Makefile:

```makefile
.PHONY: audit
audit:
	runpod-deploy ls-stale
```

## Option 2 — Weekly cron with email

For solo developers who want a passive nudge:

```cron
# Every Monday 09:00 (UTC if the host clock is UTC) — email the inventory if non-empty.
# Keep the entry on one line: crontab does not support backslash line continuations.
0 9 * * 1 cd ~/projects/runpod-deploy && out=$(.venv/bin/runpod-deploy ls-stale) && [ "$out" != "No stale (EXITED) pods." ] && echo "$out" | mail -s "[runpod-deploy] stale pods" you@example.com
```

The `--json` mode is suitable for parsing if you want to add a threshold ("only alert if total > $5/day"):

```bash
runpod-deploy ls-stale --json | jq '
  [.[].estimated_daily_cost_usd] | add
  | if . > 5 then "ALERT: $\(.)/day" else "ok" end
'
```

## Option 3 — GitHub Action (CI-native)

If you'd rather the audit live in version control:

```yaml
# .github/workflows/runpod-stale-audit.yml
name: runpod-stale-audit
on:
  schedule:
    - cron: '0 9 * * 1'   # Mondays 09:00 UTC
  workflow_dispatch: {}
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      # Create .venv first; `uv pip install` needs an environment to install into.
      - run: uv venv && uv pip install -e '.[dev]' runpodctl
      - env:
          RUNPOD_API_KEY: ${{ secrets.RUNPOD_API_KEY }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          out=$(.venv/bin/runpod-deploy ls-stale)
          echo "$out"
          if [ "$out" != "No stale (EXITED) pods." ]; then
            echo "::warning::Stale pods detected"
            # Optional: post to Slack. jq builds the payload so that newlines
            # and quotes in $out are JSON-escaped.
            payload=$(jq -n --arg out "$out" \
              '{text: ("runpod-deploy stale-pod alert:\n```\n" + $out + "\n```")}')
            curl -X POST -H 'Content-type: application/json' \
              --data "$payload" "$SLACK_WEBHOOK_URL"
          fi
```

The job is read-only: `ls-stale` never deletes anything. Deciding *whether* to release is still a human judgment call (a pod kept around deliberately for SSH forensics should not be auto-deleted).

## Option 4 — Pair with a release threshold (auto-cleanup)

If you don't need to preserve stopped pods beyond a fixed window (for example, you know any pod that has been EXITED for more than 7 days is safe to release), you can pair `ls-stale --json` with `runpodctl pod delete`:

```bash
# Delete everything that's been stopped >7 days (168 hours)
stale_old=$(runpod-deploy ls-stale --json | jq '[.[] | select(.age_hours > 168)]')
count=$(echo "$stale_old" | jq 'length')
if [ "$count" -gt 0 ]; then
  echo "Releasing $count pods older than 7 days"
  echo "$stale_old" | jq -r '.[].pod_id' | xargs -I{} runpodctl pod delete {}
fi
```

This pattern is intentionally not a CLI subcommand: the threshold and policy are too consumer-specific to ship in the SDK. The ingredients (`ls-stale --json`, `pod delete`) compose into whatever hygiene rule fits your team.

## What "stale" means

`ls-stale` reports every pod in the `EXITED` status, regardless of how it got there:

- A successful run with `lifecycle.on_success: stop` → EXITED.
- A failed run with `lifecycle.on_failure: stop` (the default) → EXITED.
- A manually-stopped pod (`runpodctl pod stop`) → EXITED.

All three cases share the same cost characteristic: the volume disk keeps billing until you call `runpodctl pod delete`. The audit treats them uniformly because the cost concern is uniform.
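To make that cost characteristic concrete, here is a minimal sketch that estimates the daily storage bill for an idle volume. The helper name `estimate_daily_cost_usd` is hypothetical and not part of `runpod-deploy`. The default rate of $0.20 per GB per month and the 30-day month are assumptions, inferred from the incident figures quoted at the top of this recipe (3,930 GB at roughly $26/day); substitute the price on your own invoice.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of runpod-deploy.
# Usage: estimate_daily_cost_usd <volume_gb> [usd_per_gb_month]
estimate_daily_cost_usd() {
  local gb="$1"
  local rate="${2:-0.20}"   # ASSUMED rate in USD per GB per month
  # ASSUMED 30-day month; awk does the floating-point arithmetic.
  awk -v gb="$gb" -v rate="$rate" 'BEGIN { printf "%.2f\n", gb * rate / 30 }'
}

# Total stale volume from the 2026-05-17 incident
estimate_daily_cost_usd 3930   # prints 26.20
```

Under these assumptions the estimate lands on the incident's reported daily figure, which is why a pod's path to `EXITED` does not matter to the audit: only the attached volume size does.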
## What lives where

| Concern | Owner |
|---|---|
| Listing EXITED pods + per-pod daily storage cost | `runpod-deploy ls-stale` (read-only; idempotent) |
| Emitting machine-readable JSON for scripting | `runpod-deploy ls-stale --json` |
| Releasing N stale pods in one bulk call | `runpod-deploy cleanup --all-stopped [--yes]` |
| Scheduling the audit | Your cron / GH Action / Slack ping (consumer-side) |
| Deciding the staleness threshold (release if `age_hours > N`) | You (project-specific; see Option 4) |
| Alerting on the inventory (Slack / email / PagerDuty) | Your alerting glue (consumer-side) |

## Anti-pattern to avoid

Don't auto-release pods inside `runpod-deploy run` based on age — the operator should always be the one deciding whether a stale pod is forgotten waste vs forensic state mid-investigation. The CLI gives you `ls-stale` (visibility) + `cleanup --all-stopped` (action) as separate primitives precisely so that automation can audit without deleting.

Don't pipe `ls-stale --json` directly into `cleanup` without an age filter; you'll race against just-stopped pods that the operator hasn't yet inspected. Use Option 4's `jq '.[] | select(.age_hours > 168)'` pattern (or your team's threshold) to scope the cleanup.

## See also

- [`forensics-then-cleanup.md`](forensics-then-cleanup.md) — the one-failed-pod workflow.
- [`lifecycle.md` §7b](../lifecycle.md#7b-cost-discipline-cleaning-up-after-forensics) — the cost-discipline narrative and the 2026-05-17 backstory.