Recipe: recycle a pod for fast iteration

Pattern: for workflows you run repeatedly with the same image and GPU class — daily benchmarks, debug-rerun loops, sweeps — set lifecycle.on_success: recycle so successful runs pause the pod and preserve the state-file. The next runpod-deploy run with the same state_file: finds the paused pod, validates compatibility, and calls runpodctl pod start <id> instead of pod create. Image pull and cold-boot are skipped; setup commands re-run (idempotent) and rsync is incremental.

When to use this

Scenario	Use recycle?
One-shot run, occasional sweeps	No: the default `delete` is fine, since an occasional cold start costs less than paying for idle storage.
Same config, multiple times per week	Yes — 3–5 min × N runs saved per week.
Iterating on `run.body:` against a heavy image	Yes — image cache survives between attempts.
Two configs sharing one paused pod	No — `state_file:` is per-config; share via network volume instead.
`lifecycle.on_failure`	No — `recycle` is success-path only; failed pods have potentially corrupted state.

Why this is a recipe, not a default

The trade-off is real: a 50 GB pod sitting paused costs ~$0.17/day (~$5/mo). For one-shot configs that’s waste. For a 5×/week workflow that recovers roughly 15–25 minutes of cold-start time per week (3–5 min × 5 runs), it pays off. Recycle is opt-in.
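The arithmetic behind that trade-off can be sketched in a few lines. This is an illustrative calculation, not part of runpod-deploy: the $0.10/GB/month rate is an assumption inferred from the figures on this page (50 GB at ~$5/mo), and both function names are hypothetical.

```python
# Assumed storage rate, inferred from this page's figures (50 GB -> ~$5/mo).
# Check your actual pricing before relying on it.
RATE_USD_PER_GB_MONTH = 0.10


def idle_cost_per_month(volume_gb: float, rate: float = RATE_USD_PER_GB_MONTH) -> float:
    """Monthly cost of keeping a paused pod's disk around."""
    return volume_gb * rate


def minutes_saved_per_week(runs_per_week: int, cold_min: float, warm_min: float) -> float:
    """Start-up minutes avoided per week if every run resumes a warm pod."""
    return runs_per_week * (cold_min - warm_min)


cost = idle_cost_per_month(50)
# Cold start 3-5 min, warm start 30-60 s (figures quoted on this page).
low = minutes_saved_per_week(5, cold_min=3.0, warm_min=1.0)
high = minutes_saved_per_week(5, cold_min=5.0, warm_min=0.5)
print(f"idle cost: ${cost:.2f}/mo, time saved: {low:.1f} to {high:.1f} min/week")
```

The sketch nets out the warm-start time, so it lands slightly below a plain "cold-start minutes × runs" estimate. Whether the saved minutes are worth the monthly storage cost remains a judgment call per workflow.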

Minimum config

# foo.yaml
name: my-benchmark
run_id_prefix: my-benchmark
state_file: ~/.runpod-my-benchmark-current   # IMPORTANT: per-config path

pod:
  image: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EUR-IS-1]
  gpu_order: [NVIDIA L4]

storage:
  mode: ephemeral
  volume_gb: 50

run:
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-benchmark] DONE"
  body: |
    echo "[my-benchmark] DONE"

lifecycle:
  on_success: recycle    # ← the only change vs. a regular config
  on_failure: stop

The state_file: MUST be a path unique to this config. The default ~/.runpod-deploy-current is shared by every config that doesn’t override it, so one config’s resume pointer would clobber another’s. Point this field at a per-workflow path like ~/.runpod-<workflow-name>-current.

Lifecycle walkthrough

First run (cold)

runpod-deploy run --config foo.yaml

What happens:

State-file ~/.runpod-my-benchmark-current doesn’t exist yet → no resume attempt.
runpodctl pod create → image pull (~3 min on cold host) → SSH-ready.
setup: runs → rsync staging → run script → artifact pull.
lifecycle.on_success: recycle → runpodctl pod stop <id>.
State-file written: {pod_id, gpu_id, image, datacenter_id}.
Manifest shows pod_resumed: false.
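To make the role of the state-file concrete, here is a minimal sketch of writing and reading a resume pointer with the four fields listed above. The on-disk format is not specified on this page; JSON is an assumption here, and `write_state` / `read_state` are hypothetical helpers, not runpod-deploy functions.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Fields named in the walkthrough above.
STATE_FIELDS = ("pod_id", "gpu_id", "image", "datacenter_id")


def write_state(path: Path, state: dict) -> None:
    """Persist the resume pointer (JSON is an assumed format)."""
    missing = [f for f in STATE_FIELDS if f not in state]
    if missing:
        raise ValueError(f"state is missing fields: {missing}")
    path.write_text(json.dumps({f: state[f] for f in STATE_FIELDS}, indent=2))


def read_state(path: Path) -> Optional[dict]:
    """Return the stored pointer, or None when there is nothing usable to resume."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "runpod-my-benchmark-current"
    first = read_state(state_path)  # first run: no file yet, so no resume attempt
    write_state(state_path, {
        "pod_id": "abc123",
        "gpu_id": "NVIDIA L4",
        "image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04",
        "datacenter_id": "EUR-IS-1",
    })
    second = read_state(state_path)  # second run: pointer is available

print(first, second["pod_id"])
```

Treating a missing or unreadable file as "nothing to resume" mirrors the cold-run path described above, where the absence of a state-file simply means a fresh create.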

Second run (warm)

runpod-deploy run --config foo.yaml

What happens:

State-file exists → try_resume_pod reads it.
runpodctl pod get <pod_id> confirms the pod is EXITED and the stored image/GPU/datacenter match the current spec.
runpodctl pod start <pod_id> (no image pull — cache is warm).
SSH-ready typically in 30–60 s (vs 3–5 min cold).
setup: re-runs (idempotent — apt skip, uv venv skip).
rsync re-pushes (incremental — only changed bytes).
Run script → artifact pull → pod stop again.
Manifest shows pod_resumed: true.
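The cold and warm paths above boil down to one decision: create, start, or delete-and-recreate. The sketch below is a hypothetical reconstruction of that decision, not the actual `provider.try_resume_pod` code. In particular, how the tool handles a pod that is missing or not EXITED is not documented here, so those two branches are assumptions.

```python
from typing import Optional, Tuple

# Fields compared between the state-file and the current spec.
DRIFT_FIELDS = ("image", "gpu_id", "datacenter_id")


def plan_pod_action(
    state: Optional[dict],
    pod_status: Optional[str],
    spec: dict,
    force_fresh: bool = False,
) -> Tuple[str, str]:
    """Return (action, reason); action is 'create', 'start', or 'recreate'.

    'recreate' means: delete the stale pod, unlink the state-file, then create.
    """
    if state is None:
        return ("create", "no state-file")
    if force_fresh:
        return ("recreate", "--force-fresh")
    if pod_status is None:
        # Assumption: a pod that no longer exists is treated as a cold run.
        return ("create", "pod not found")
    if pod_status != "EXITED":
        # Assumption: only paused (EXITED) pods are considered resumable.
        return ("create", f"pod is {pod_status}, not EXITED")
    drifted = [f for f in DRIFT_FIELDS if state.get(f) != spec.get(f)]
    if drifted:
        return ("recreate", "drift in " + ", ".join(drifted))
    return ("start", "paused pod matches spec")


spec = {"image": "img:1", "gpu_id": "NVIDIA L4", "datacenter_id": "EUR-IS-1"}
state = dict(spec, pod_id="abc123")

print(plan_pod_action(None, None, spec))
print(plan_pod_action(state, "EXITED", spec))
print(plan_pod_action(state, "EXITED", dict(spec, image="img:2")))
```

Returning a reason alongside the action makes it easy to log why a run went cold, which helps when a supposedly warm workflow keeps fresh-creating.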

Image drift

If you bump pod.image in the YAML between runs:

[recycle] drift detected for pod 'abc123': image: stored='runpod/pytorch:OLD' current='runpod/pytorch:NEW'; deleting stale pod and fresh-creating

The stale pod is deleted, state-file is unlinked, and a fresh runpodctl pod create fires. The next run is cold again. Same for GPU class or datacenter changes.
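A drift check of this kind amounts to a field-by-field comparison between the state-file and the current spec. The following sketch reproduces the shape of the log line above; `find_drift` and `format_drift_warning` are hypothetical names, and joining several drifted fields with commas is an assumption, since the page only shows a single-field example.

```python
from typing import List

DRIFT_FIELDS = ("image", "gpu_id", "datacenter_id")


def find_drift(stored: dict, current: dict) -> List[str]:
    """Describe every field whose stored value differs from the current spec."""
    return [
        f"{field}: stored={stored.get(field)!r} current={current.get(field)!r}"
        for field in DRIFT_FIELDS
        if stored.get(field) != current.get(field)
    ]


def format_drift_warning(pod_id: str, drift: List[str]) -> str:
    """Build a message shaped like the example log line (multi-field join is assumed)."""
    return (
        f"[recycle] drift detected for pod {pod_id!r}: "
        + ", ".join(drift)
        + "; deleting stale pod and fresh-creating"
    )


stored = {"image": "runpod/pytorch:OLD", "gpu_id": "NVIDIA L4", "datacenter_id": "EUR-IS-1"}
current = dict(stored, image="runpod/pytorch:NEW")

drift = find_drift(stored, current)
if drift:
    print(format_drift_warning("abc123", drift))
```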

Forcing a fresh run on demand

runpod-deploy run --config foo.yaml --force-fresh

Skips the resume attempt for this run only:

Deletes any stale paused pod referenced by the state-file.
Unlinks the state-file.
Provisions a fresh pod via runpodctl pod create.

Useful for “did I actually pull the new image?” debugging without editing YAML.

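Partial manual equivalent: rm ~/.runpod-my-benchmark-current && runpod-deploy run --config foo.yaml. Removing the state-file only skips the resume attempt; unlike --force-fresh, it does not delete the paused pod, which keeps accruing storage cost until you delete it yourself.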

Hygiene: pair with `ls-stale`

A paused recycle pod shows up in runpod-deploy ls-stale like any other EXITED pod, and that’s intentional. The audit listing gives you visibility into what you’re recycling:

runpod-deploy ls-stale

POD_ID           NAME                                         GB    $/day     $/mo
----------------------------------------------------------------------------------
abc12345         my-benchmark-20260517T120000Z                 50     0.17     5.00
def67890         my-other-workflow-20260517T103000Z           100     0.33    10.00
----------------------------------------------------------------------------------
TOTAL: 2 pods, $0.50/day (~$15.00/mo)

If a recycle pod has been idle for weeks and you’ve moved on, delete it via runpod-deploy cleanup --state-file <path> --mode delete (or by hand with runpodctl pod delete <id>).
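For a weekly audit it can help to turn the ls-stale listing into numbers per workflow. The sketch below parses text in the column layout shown above; it assumes that layout is stable (five whitespace-separated columns per data row), which this page does not guarantee, and `parse_ls_stale` is a hypothetical helper rather than a runpod-deploy API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StalePod:
    pod_id: str
    name: str
    gb: int
    usd_per_day: float
    usd_per_month: float


def parse_ls_stale(text: str) -> List[StalePod]:
    """Extract data rows; header, separator, and TOTAL lines are skipped."""
    pods = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # separator lines and blanks
        try:
            pods.append(StalePod(parts[0], parts[1], int(parts[2]),
                                 float(parts[3]), float(parts[4])))
        except ValueError:
            continue  # header and TOTAL rows have non-numeric columns
    return pods


SAMPLE = """\
POD_ID           NAME                                         GB    $/day     $/mo
----------------------------------------------------------------------------------
abc12345         my-benchmark-20260517T120000Z                 50     0.17     5.00
def67890         my-other-workflow-20260517T103000Z           100     0.33    10.00
----------------------------------------------------------------------------------
TOTAL: 2 pods, $0.50/day (~$15.00/mo)
"""

pods = parse_ls_stale(SAMPLE)
total_month = sum(p.usd_per_month for p in pods)
benchmark = [p for p in pods if p.name.startswith("my-benchmark-")]
print(len(pods), total_month, [p.pod_id for p in benchmark])
```

Filtering on the run_id_prefix ties each paused pod back to the config that owns it, so the monthly figure can be weighed against how often that config actually runs.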

When recycle stops paying off

Three signals that a workflow shouldn’t be on recycle anymore:

Image drift WARNING fires every run — your image is rotating too fast for the cache to help.
ls-stale shows a recycle pod sitting > 30 days unused — you’ve moved off this workflow; switch to on_success: delete and rm the state-file.
GPU class changes per run — sweeps that rotate gpu_order force fresh-create every time anyway.

What lives where

Concern	Owner
Issuing `runpodctl pod stop` at end-of-run + preserving state-file	`runpod-deploy run` (`provider._cleanup_recycle`)
Storing the resume pointer (pod_id + image + GPU + DC)	The state-file at `state_file:` (per-config)
Validating that the paused pod still matches the spec (image / GPU / DC drift)	`runpod-deploy run` (`provider.try_resume_pod`)
Calling `runpodctl pod start <id>` when validation passes	`runpod-deploy run` (`provider.try_resume_pod`)
Forcing a fresh-create regardless of state-file	`runpod-deploy run --force-fresh`
Deciding when recycle stops paying off (volume cost vs run frequency)	You (the trade-off above)
Pairing with weekly `ls-stale` to catch forgotten paused pods	Your hygiene rotation

Anti-patterns to avoid

Don’t use lifecycle.on_success: recycle on configs you run fewer than ~3×/week. A 50 GB paused pod costs ~$5/mo idle; the recycle benefit (saved cold-start time × run frequency) needs to clear that floor. For low-frequency configs, the default lifecycle.on_success: delete is cheaper.

Don’t share one state_file: across multiple configs. The state-file is the resume pointer; if two configs both write to ~/.runpod-deploy-current and the second config tries to resume a pod provisioned by the first, any image / GPU / DC difference trips drift detection, which deletes the first config’s paused pod and falls through to fresh-create (with a WARNING). That defeats the purpose for both configs. Always give recycled configs unique state_file: paths.
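A quick pre-flight check can catch shared state-files before they cause a surprise fresh-create. This sketch works on already-parsed config dictionaries (loading the YAML is left out to keep it dependency-free) and assumes that a config without state_file: falls back to the default path named on this page. `find_shared_state_files` is a hypothetical helper.

```python
import os
from collections import defaultdict
from typing import Dict, List

# Default path named on this page for configs that do not set state_file:.
DEFAULT_STATE_FILE = "~/.runpod-deploy-current"


def find_shared_state_files(configs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each resolved state-file path to the configs using it, keeping only clashes."""
    by_path = defaultdict(list)
    for config_name, cfg in configs.items():
        raw = cfg.get("state_file") or DEFAULT_STATE_FILE
        path = os.path.normpath(os.path.expanduser(raw))
        by_path[path].append(config_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


configs = {
    "foo.yaml": {"name": "my-benchmark", "state_file": "~/.runpod-my-benchmark-current"},
    "bar.yaml": {"name": "my-other-workflow"},  # no state_file: uses the default
    "baz.yaml": {"name": "third-workflow"},     # also uses the default, so it clashes
}

for path, names in find_shared_state_files(configs).items():
    print(f"{path} is shared by: {', '.join(names)}")
```

Paths are expanded and normalized before comparison so that `~/x` and an equivalent absolute path count as the same file.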