Recipe: weekly stale-pod audit

Pattern: wire runpod-deploy ls-stale into your regular hygiene rotation so storage drift is caught early — before it becomes a leak like the 2026-05-17 incident (76 stale pods, 3,930 GB, $26/day).

Why this is a recipe, not a schema feature

runpod-deploy ls-stale is read-only and idempotent. Scheduling it is an operator concern; the SDK exposes JSON output so that any wrapper (cron, GitHub Action, Slack ping, terminal alias) can compose on top.
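As a rough illustration of composing on top of the JSON output, the sketch below shells out to `runpod-deploy ls-stale --json` and parses the result. It assumes the output is a top-level JSON array (which the jq snippets in this recipe imply) and that empty stdout means "no stale pods"; `load_stale_pods` and its injectable `run` argument are hypothetical names for this example, not part of the SDK.

```python
import json
import subprocess
from typing import Callable, List, Optional, Sequence


def _run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def load_stale_pods(run: Optional[Callable[[Sequence[str]], str]] = None) -> List[dict]:
    """Return the `ls-stale --json` inventory as a list of dicts.

    `run` is injectable so the parsing can be exercised without a RunPod account.
    """
    raw = (run or _run_cli)(["runpod-deploy", "ls-stale", "--json"])
    # Assumption: empty stdout means "no stale pods"; adjust if the CLI prints `[]`.
    pods = json.loads(raw.strip() or "[]")
    if not isinstance(pods, list):
        raise ValueError("expected a top-level JSON array of pods")
    return pods
```

Because the command is read-only, a wrapper like this can be called from any scheduler without side effects; everything downstream (thresholds, alerts) then works on plain Python dicts.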

Option 1 — Terminal alias / Makefile target

The minimum-friction version. Read the table, decide whether to release.

# ~/.zshrc
alias stale='runpod-deploy ls-stale'

Or:

.PHONY: audit
audit:
	runpod-deploy ls-stale

Option 2 — Weekly cron with email

For solo developers who want a passive nudge:

# Every Monday 09:00 (UTC if the host clock is UTC): email the inventory if non-empty.
# A crontab entry must fit on one line; cron does not support backslash continuation.
0 9 * * 1 cd ~/projects/runpod-deploy && out=$(.venv/bin/runpod-deploy ls-stale) && [ "$out" != "No stale (EXITED) pods." ] && echo "$out" | mail -s "[runpod-deploy] stale pods" you@example.com

The --json mode is suitable for parsing if you want to add a threshold (“only alert if total > $5/day”):

runpod-deploy ls-stale --json | jq '
  [.[].estimated_daily_cost_usd] | add
  | if . > 5 then "ALERT: $\(.)/day" else "ok" end
'
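If the alerting glue is written in Python rather than jq, the same threshold check might look like the sketch below. It assumes each inventory entry carries the `estimated_daily_cost_usd` field used in the jq filter above; `daily_cost_alert` is a hypothetical helper name, and a missing or null cost is treated as zero.

```python
from typing import List


def daily_cost_alert(pods: List[dict], threshold_usd: float = 5.0) -> str:
    """Sum per-pod daily cost and return an alert string above the threshold."""
    # Missing or null costs count as 0.0 (an assumption for this sketch).
    total = sum(p.get("estimated_daily_cost_usd") or 0.0 for p in pods)
    if total > threshold_usd:
        return f"ALERT: ${total:.2f}/day"
    return "ok"
```

Like the jq version, the comparison is strict: a total of exactly the threshold does not alert.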

Option 3 — GitHub Action (CI-native)

If you’d rather the audit live in version control:

# .github/workflows/runpod-stale-audit.yml
name: runpod-stale-audit
on:
  schedule:
    - cron: '0 9 * * 1'   # Mondays 09:00 UTC
  workflow_dispatch: {}

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
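      # uv pip needs a virtualenv; the audit step calls .venv/bin/runpod-deploy
      - run: uv venv && uv pip install -e '.[dev]' runpodctl
      - env:
          RUNPOD_API_KEY: ${{ secrets.RUNPOD_API_KEY }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          out=$(.venv/bin/runpod-deploy ls-stale)
          echo "$out"
          if [ "$out" != "No stale (EXITED) pods." ]; then
            echo "::warning::Stale pods detected"
            # Optional: post to Slack. jq builds the payload so that newlines
            # and quotes in the inventory are JSON-escaped.
            jq -n --arg out "$out" \
              '{text: ("runpod-deploy stale-pod alert:\n```\n" + $out + "\n```")}' |
              curl -sS -X POST -H 'Content-type: application/json' \
                --data @- "$SLACK_WEBHOOK_URL"
          fi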

The job is read-only — ls-stale never deletes anything. Deciding whether to release is still a human judgment call (a pod kept around deliberately for SSH-forensics should not be auto-deleted).
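One detail worth getting right in any Slack step is escaping: the inventory is multi-line text that may contain quotes, so it has to be JSON-encoded rather than spliced into a string by hand. A minimal Python sketch of building the payload follows; `slack_payload` is a hypothetical helper, and the `{"text": ...}` shape is the standard incoming-webhook body already used in the workflow above.

```python
import json


def slack_payload(inventory: str) -> str:
    """Wrap the ls-stale table in a code block and JSON-encode it for a webhook."""
    fence = "`" * 3  # Slack renders triple-backtick spans as preformatted text
    text = f"runpod-deploy stale-pod alert:\n{fence}\n{inventory}\n{fence}"
    # json.dumps escapes newlines and quotes, so the body is always valid JSON.
    return json.dumps({"text": text})
```

The returned string can be sent as the request body with any HTTP client, using the `Content-type: application/json` header.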

Option 4 — Pair with a release threshold (auto-cleanup)

If you don’t need to preserve paused pods beyond a fixed window (e.g., you know any pod that has been EXITED for more than 7 days is safe to release), you can pair ls-stale --json with runpodctl pod delete:

# Delete everything that's been stopped >7 days
stale_old=$(runpod-deploy ls-stale --json | jq '[.[] | select(.age_hours > 168)]')
count=$(echo "$stale_old" | jq 'length')
if [ "$count" -gt 0 ]; then
  echo "Releasing $count pods older than 7 days"
  echo "$stale_old" | jq -r '.[].pod_id' | xargs -I{} runpodctl pod delete {}
fi
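The same age filter can be sketched in Python for teams that keep their hygiene scripts there. This assumes the `pod_id` and `age_hours` fields used in the jq filter above; `releasable_pod_ids` and `delete_commands` are hypothetical helper names. The second function only builds the `runpodctl pod delete` argument lists, so the result can be printed as a dry run before anything is executed.

```python
from typing import List


def releasable_pod_ids(pods: List[dict], min_age_hours: float = 168.0) -> List[str]:
    """Return IDs of pods that have been EXITED for longer than the threshold."""
    # Entries without an age are treated as brand new and therefore kept.
    return [p["pod_id"] for p in pods if p.get("age_hours", 0) > min_age_hours]


def delete_commands(pod_ids: List[str]) -> List[List[str]]:
    """Build one `runpodctl pod delete <id>` argv per pod, without running it."""
    return [["runpodctl", "pod", "delete", pod_id] for pod_id in pod_ids]
```

Keeping selection and execution separate means the threshold logic can be reviewed or logged on its own; passing each argv to `subprocess.run(..., check=True)` is the step that actually releases pods.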

This pattern is intentionally not a CLI subcommand — the threshold and policy are too consumer-specific to ship in the SDK. The ingredients (ls-stale --json, pod delete) compose into whatever hygiene rule fits your team.

What “stale” means

ls-stale reports every pod in the EXITED status, regardless of how it got there:

  • A successful run with lifecycle.on_success: stop → EXITED.

  • A failed run with lifecycle.on_failure: stop (the default) → EXITED.

  • A manually-stopped pod (runpodctl pod stop) → EXITED.

All three cases share the same cost characteristic: the volume disk keeps billing until you call runpodctl pod delete. The audit treats them uniformly because the cost concern is uniform.
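To make the cost concern concrete, the arithmetic below back-derives a per-GB rate from the incident figures quoted at the top of this recipe (3,930 GB costing $26/day). This is only an implied, rounded rate from those two numbers, not an official RunPod price; the function names and the 50 GB example volume are hypothetical.

```python
def implied_rate_per_gb_day(total_gb: float, daily_cost_usd: float) -> float:
    """Back out a storage rate in USD per GB per day from an observed bill."""
    return daily_cost_usd / total_gb


def projected_cost_usd(volume_gb: float, days: float, rate_per_gb_day: float) -> float:
    """Cost of leaving one EXITED pod's volume undeleted for `days` days."""
    return volume_gb * days * rate_per_gb_day


# Incident figures from this recipe: 3,930 GB of stale volumes at $26/day.
INCIDENT_RATE = implied_rate_per_gb_day(3930, 26)  # roughly $0.0066 per GB per day

# Hypothetical example: one forgotten 50 GB volume left for 30 days.
forgotten_month = projected_cost_usd(50, 30, INCIDENT_RATE)  # a little under $10
```

A single forgotten pod is cheap; the leak comes from the cost scaling linearly with both volume size and time, which is why a recurring audit catches it early.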

What lives where

| Concern | Owner |
| --- | --- |
| Listing EXITED pods + per-pod daily storage cost | runpod-deploy ls-stale (read-only; idempotent) |
| Emitting machine-readable JSON for scripting | runpod-deploy ls-stale --json |
| Releasing N stale pods in one bulk call | runpod-deploy cleanup --all-stopped [--yes] |
| Scheduling the audit | Your cron / GH Action / Slack ping (consumer-side) |
| Deciding the staleness threshold (release if age_hours > N) | You (project-specific; see Option 4) |
| Alerting on the inventory (Slack / email / PagerDuty) | Your alerting glue (consumer-side) |

Anti-pattern to avoid

Don’t auto-release pods inside runpod-deploy run based on age — the operator should always be the one deciding whether a stale pod is forgotten waste vs forensic state mid-investigation. The CLI gives you ls-stale (visibility) + cleanup --all-stopped (action) as separate primitives precisely so that automation can audit without deleting.

Don’t pipe ls-stale --json directly into cleanup without an age filter; you’ll race against just-stopped pods that the operator hasn’t yet inspected. Use Option 4’s jq '.[] | select(.age_hours > 168)' pattern (or your team’s threshold) to scope the cleanup.

See also