Recipe: weekly stale-pod audit
Pattern: wire runpod-deploy ls-stale into your regular hygiene
rotation so storage drift is caught early — before it becomes a leak
like the 2026-05-17 incident (76 stale pods, 3,930 GB,
$26/day).
Why this is a recipe, not a schema feature
runpod-deploy ls-stale is read-only and idempotent. Scheduling it
is an operator concern; the SDK exposes the JSON output so that any
scheduling or alerting layer (cron, GitHub Action, Slack ping,
terminal alias) composes on top.
Option 1 — Terminal alias / Makefile target
The minimum-friction version. Read the table, decide whether to release.
# ~/.zshrc
alias stale='runpod-deploy ls-stale'
Or, as a target in your project Makefile (the recipe line must start with a tab):
.PHONY: audit
audit:
	runpod-deploy ls-stale
Option 2 — Weekly cron with email
For solo developers who want a passive nudge:
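# Every Monday 09:00 (cron uses the host's timezone, so this is UTC only on a UTC host).
# Emails the inventory if non-empty. A crontab entry must be a single line;
# cron does not support backslash continuations.
0 9 * * 1 cd ~/projects/runpod-deploy && out=$(.venv/bin/runpod-deploy ls-stale) && [ "$out" != "No stale (EXITED) pods." ] && echo "$out" | mail -s "[runpod-deploy] stale pods" you@example.com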
The --json mode is suitable for parsing if you want to add a
threshold (“only alert if total > $5/day”):
runpod-deploy ls-stale --json | jq '
[.[].estimated_daily_cost_usd] | add
| if . > 5 then "ALERT: $\(.)/day" else "ok" end
'
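If the alerting glue already lives in Python, the same threshold check can be sketched there. This is a minimal sketch, assuming ls-stale --json emits a JSON array of objects that each carry estimated_daily_cost_usd (the field the jq filter above reads). The function name daily_cost_alert and the sample payload are hypothetical, not part of the SDK.

```python
import json


def daily_cost_alert(ls_stale_json: str, threshold_usd: float = 5.0) -> str:
    """Return an alert line if the summed daily cost exceeds the threshold."""
    pods = json.loads(ls_stale_json)
    # Treat a missing or null cost as 0 so one odd record doesn't break the audit.
    total = sum(pod.get("estimated_daily_cost_usd") or 0.0 for pod in pods)
    if total > threshold_usd:
        return f"ALERT: ${total:.2f}/day"
    return "ok"


# Hypothetical sample shaped like the fields the jq filter reads.
sample = json.dumps([
    {"pod_id": "pod-a", "estimated_daily_cost_usd": 3.5},
    {"pod_id": "pod-b", "estimated_daily_cost_usd": 2.25},
])
print(daily_cost_alert(sample))
```

In real use the input would come from the CLI itself, for example the stdout of subprocess.run(["runpod-deploy", "ls-stale", "--json"], capture_output=True, text=True, check=True).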
Option 3 — GitHub Action (CI-native)
If you’d rather the audit live in version control:
# .github/workflows/runpod-stale-audit.yml
name: runpod-stale-audit
on:
  schedule:
    - cron: '0 9 * * 1'  # Mondays 09:00 UTC
  workflow_dispatch: {}
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      # uv pip needs a virtualenv; the audit step calls .venv/bin/runpod-deploy
      - run: uv venv && uv pip install -e '.[dev]' runpodctl
      - env:
          RUNPOD_API_KEY: ${{ secrets.RUNPOD_API_KEY }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          out=$(.venv/bin/runpod-deploy ls-stale)
          echo "$out"
          if [ "$out" != "No stale (EXITED) pods." ]; then
            echo "::warning::Stale pods detected"
            # Optional: post to Slack. jq builds the payload so that newlines
            # and quotes in the table are JSON-escaped.
            jq -n --arg out "$out" \
              '{text: ("runpod-deploy stale-pod alert:\n```\n" + $out + "\n```")}' |
              curl -X POST -H 'Content-type: application/json' --data @- "$SLACK_WEBHOOK_URL"
          fi
The job is read-only — ls-stale never deletes anything. Deciding
whether to release is still a human judgment call (a pod kept
around deliberately for SSH-forensics should not be auto-deleted).
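Whatever builds the Slack message has to JSON-escape the inventory, because the table spans several lines and may contain quotes. If your alerting glue is in Python, a minimal sketch of the payload step could look like this. The helper name slack_payload and the sample inventory are hypothetical, and the message text simply mirrors the wording used in the workflow.

```python
import json

FENCE = "`" * 3  # Slack code-block delimiter


def slack_payload(inventory: str) -> str:
    """Build a Slack incoming-webhook body for a (possibly multi-line) inventory."""
    text = f"runpod-deploy stale-pod alert:\n{FENCE}\n{inventory}\n{FENCE}"
    # json.dumps escapes newlines and quotes, so the body is always valid JSON.
    return json.dumps({"text": text})


# Hypothetical two-line inventory containing a double quote.
body = slack_payload('pod-a  EXITED  "train-1"\npod-b  EXITED  eval')
print(body)
```

The sketch only builds the body; posting it to the webhook URL is left to whichever HTTP client you already use.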
Option 4 — Pair with a release threshold (auto-cleanup)
If you don’t care about preserving paused pods after a window (say,
you know any pod that has been EXITED for more than 7 days is safe to
release), you can pair ls-stale --json with runpodctl pod delete:
# Delete everything that's been stopped >7 days
stale_old=$(runpod-deploy ls-stale --json | jq '[.[] | select(.age_hours > 168)]')
count=$(echo "$stale_old" | jq 'length')
if [ "$count" -gt 0 ]; then
  echo "Releasing $count pods older than 7 days"
  echo "$stale_old" | jq -r '.[].pod_id' | xargs -I{} runpodctl pod delete {}
fi
This pattern is intentionally not a CLI subcommand — the threshold
and policy are too consumer-specific to ship in the SDK. The
ingredients (ls-stale --json, pod delete) compose into whatever
hygiene rule fits your team.
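The age filter is the part most worth testing before it is allowed to delete anything. Below is a minimal dry-run sketch, assuming the JSON records carry the pod_id and age_hours fields used by the jq filter above. The function name pods_to_release and the sample records are hypothetical.

```python
import json


def pods_to_release(ls_stale_json: str, min_age_hours: float = 168.0) -> list[str]:
    """Return the IDs of pods whose age strictly exceeds the threshold."""
    pods = json.loads(ls_stale_json)
    return [p["pod_id"] for p in pods if p.get("age_hours", 0) > min_age_hours]


# Hypothetical records: one old pod, one fresh pod, one exactly at the threshold.
sample = json.dumps([
    {"pod_id": "pod-old", "age_hours": 400},
    {"pod_id": "pod-new", "age_hours": 3},
    {"pod_id": "pod-edge", "age_hours": 168},
])
for pod_id in pods_to_release(sample):
    # Dry run: print the command instead of executing it.
    print(f"runpodctl pod delete {pod_id}")
```

Printing the commands first lets a human review the list. The comparison is strict, matching the jq filter, so a pod at exactly 168 hours is kept.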
What “stale” means
ls-stale reports every pod in the EXITED status, regardless of
how it got there:
- A successful run with lifecycle.on_success: stop → EXITED.
- A failed run with lifecycle.on_failure: stop (the default) → EXITED.
- A manually-stopped pod (runpodctl pod stop) → EXITED.
All three cases share the same cost characteristic: the volume disk
keeps billing until you call runpodctl pod delete. The audit treats
them uniformly because the cost concern is uniform.
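For a rough sense of scale, the figures quoted for the 2026-05-17 incident (76 stale pods, 3,930 GB, $26/day) imply a per-GB storage rate. The sketch below only back-computes from those numbers. The 30-day month is an assumption, and because the $26 figure is rounded, the results are approximate rather than a published price.

```python
# Figures quoted for the 2026-05-17 incident.
stale_pods = 76
total_gb = 3930
daily_cost_usd = 26.0  # rounded in the source, so results are approximate

rate_per_gb_day = daily_cost_usd / total_gb
rate_per_gb_month = rate_per_gb_day * 30  # assumes a 30-day month
avg_gb_per_pod = total_gb / stale_pods

print(f"~${rate_per_gb_day:.4f}/GB/day, ~${rate_per_gb_month:.2f}/GB/month")
print(f"~{avg_gb_per_pod:.0f} GB per pod, ~${daily_cost_usd * 30:.0f} per 30 days")
```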
What lives where
| Concern | Owner |
|---|---|
| Listing EXITED pods + per-pod daily storage cost | |
| Emitting machine-readable JSON for scripting | |
| Releasing N stale pods in one bulk call | |
| Scheduling the audit | Your cron / GH Action / Slack ping (consumer-side) |
| Deciding the staleness threshold (release if …) | You (project-specific; see Option 4) |
| Alerting on the inventory (Slack / email / PagerDuty) | Your alerting glue (consumer-side) |
Anti-pattern to avoid
Don’t auto-release pods inside runpod-deploy run based on age — the
operator should always be the one deciding whether a stale pod is
forgotten waste vs forensic state mid-investigation. The CLI gives
you ls-stale (visibility) + cleanup --all-stopped (action) as
separate primitives precisely so that automation can audit without
deleting.
Don’t pipe ls-stale --json directly into cleanup without an
age filter; you’ll race against just-stopped pods that the operator
hasn’t yet inspected. Use Option 4’s jq '.[] | select(.age_hours > 168)'
pattern (or your team’s threshold) to scope the cleanup.
See also
- forensics-then-cleanup.md: the one-failed-pod workflow.
- lifecycle.md §7b: the cost-discipline narrative and the 2026-05-17 backstory.