Recipe: recycle a pod for fast iteration
Pattern: for workflows you run repeatedly with the same image and
GPU class — daily benchmarks, debug-rerun loops, sweeps — set
lifecycle.on_success: recycle so successful runs pause the pod
and preserve the state-file. The next runpod-deploy run with the
same state_file: finds the paused pod, validates compatibility,
and calls runpodctl pod start <id> instead of pod create. Image
pull and cold-boot are skipped; setup commands re-run (idempotent)
and rsync is incremental.
When to use this
| Scenario | Use recycle? |
|---|---|
| One-shot run, occasional sweeps | No: fresh |
| Same config, multiple times per week | Yes: 3–5 min × N runs saved per week. |
| Iterating on | Yes: image cache survives between attempts. |
| Two configs sharing one paused pod | No: |
| | No: |
Why this is a recipe, not a default
The trade-off is real: a 50 GB pod sitting paused costs ~$0.17/day (~$5/mo). For one-shot configs that’s waste. For a 5×/week workflow that recovers 30+ minutes of cold-start time per week, it pays off. Recycle is opt-in.
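To put numbers on that trade-off for your own workflow, a rough sketch like the following may help. It is not part of runpod-deploy: the function name is hypothetical, the storage rate is simply back-derived from the ~$5/mo for 50 GB quoted above, and the 6 minutes saved per run is an illustrative figure implied by "30+ minutes" over 5 runs.

```python
# Hypothetical break-even helper. The rate is an assumption read off this
# page (50 GB paused costs about $5/mo), not an official price list.
STORAGE_USD_PER_GB_MONTH = 0.10


def recycle_tradeoff(volume_gb, runs_per_week, minutes_saved_per_run):
    """Return (idle_cost_usd_per_month, minutes_saved_per_week)."""
    idle_cost = volume_gb * STORAGE_USD_PER_GB_MONTH
    minutes_saved = runs_per_week * minutes_saved_per_run
    return idle_cost, minutes_saved


# Illustrative inputs: 50 GB volume, 5 runs per week, 6 minutes saved per run.
cost, saved = recycle_tradeoff(volume_gb=50, runs_per_week=5, minutes_saved_per_run=6.0)
print(f"idle cost ~${cost:.2f}/mo, ~{saved:.0f} min saved per week")
```

Whether the saved minutes are worth the idle cost is still your call; the sketch only makes both sides of the comparison explicit.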
Minimum config
# foo.yaml
name: my-benchmark
run_id_prefix: my-benchmark
state_file: ~/.runpod-my-benchmark-current # IMPORTANT: per-config path
pod:
  image: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EUR-IS-1]
  gpu_order: [NVIDIA L4]
storage:
  mode: ephemeral
  volume_gb: 50
run:
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-benchmark] DONE"
  body: |
    echo "[my-benchmark] DONE"
lifecycle:
  on_success: recycle # ← the only change vs. a regular config
  on_failure: stop
The state_file: MUST be a path unique to this config. With the default
~/.runpod-deploy-current, configs would overwrite each other's resume
pointer, so point this field at a per-workflow path like
~/.runpod-<workflow-name>-current.
Lifecycle walkthrough
First run (cold)
runpod-deploy run --config foo.yaml
What happens:
1. State-file ~/.runpod-my-benchmark-current doesn’t exist yet → no resume attempt.
2. runpodctl pod create → image pull (~3 min on cold host) → SSH-ready.
3. setup: runs → rsync staging → run script → artifact pull.
4. lifecycle.on_success: recycle → runpodctl pod stop <id>.
5. State-file written: {pod_id, gpu_id, image, datacenter_id}.
6. Manifest shows pod_resumed: false.
Second run (warm)
runpod-deploy run --config foo.yaml
What happens:
1. State-file exists → try_resume_pod reads it.
2. runpodctl pod get <pod_id> confirms the pod is EXITED and the stored image/GPU/datacenter match the current spec.
3. runpodctl pod start <pod_id> (no image pull: the cache is warm).
4. SSH-ready typically in 30–60 s (vs 3–5 min cold).
5. setup: re-runs (idempotent: apt skip, uv venv skip).
6. rsync re-pushes (incremental: only changed bytes).
7. Run script → artifact pull → pod stop again.
8. Manifest shows pod_resumed: true.
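The cold and warm paths above boil down to one decision at the start of each run. The following is a minimal Python sketch of that decision, not runpod-deploy's actual implementation: plan_run and the dict shapes are hypothetical, the field names are taken from the state-file contents listed in the first-run walkthrough, and treating a pod in any state other than EXITED as stale is an assumption.

```python
from typing import Optional

# Fields the walkthrough says are stored in the state-file and checked on resume.
COMPARED_FIELDS = ("image", "gpu_id", "datacenter_id")


def plan_run(state: Optional[dict], pod_status: Optional[str], spec: dict) -> str:
    """Return "create", "start", or "delete_and_create" for the next run."""
    if state is None:
        return "create"  # no state-file yet: cold run
    matches = all(state.get(f) == spec.get(f) for f in COMPARED_FIELDS)
    if pod_status == "EXITED" and matches:
        return "start"  # warm resume of the paused pod
    # Spec drift, or (assumption) a pod that is not in the EXITED state.
    return "delete_and_create"


spec = {"image": "runpod/pytorch:NEW", "gpu_id": "NVIDIA L4", "datacenter_id": "EUR-IS-1"}
print(plan_run(None, None, spec))                                         # create
print(plan_run({**spec, "pod_id": "abc123"}, "EXITED", spec))             # start
print(plan_run({**spec, "image": "runpod/pytorch:OLD"}, "EXITED", spec))  # delete_and_create
```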
Image drift
If you bump pod.image in the YAML between runs:
[recycle] drift detected for pod 'abc123': image: stored='runpod/pytorch:OLD' current='runpod/pytorch:NEW'; deleting stale pod and fresh-creating
The stale pod is deleted, state-file is unlinked, and a fresh
runpodctl pod create fires. The next run is cold again. Same for
GPU class or datacenter changes.
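As an illustration of what a drift check like this compares, here is a small Python sketch. The function name and dict layout are hypothetical and only mirror the fields and message format shown above; the real check lives inside runpod-deploy and may differ.

```python
def describe_drift(stored: dict, current: dict,
                   fields=("image", "gpu_id", "datacenter_id")) -> list[str]:
    """List every compared field whose stored value differs from the current spec."""
    return [
        f"{name}: stored={stored.get(name)!r} current={current.get(name)!r}"
        for name in fields
        if stored.get(name) != current.get(name)
    ]


stored = {"image": "runpod/pytorch:OLD", "gpu_id": "NVIDIA L4", "datacenter_id": "EUR-IS-1"}
current = {"image": "runpod/pytorch:NEW", "gpu_id": "NVIDIA L4", "datacenter_id": "EUR-IS-1"}
for line in describe_drift(stored, current):
    print(line)  # image: stored='runpod/pytorch:OLD' current='runpod/pytorch:NEW'
```

An empty list means the paused pod still matches the spec and can be resumed.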
Forcing a fresh run on demand
runpod-deploy run --config foo.yaml --force-fresh
Skips the resume attempt for this run only:
1. Deletes any stale paused pod referenced by the state-file.
2. Unlinks the state-file.
3. Provisions a fresh pod via runpodctl pod create.
Useful for “did I actually pull the new image?” debugging without editing YAML.
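Near-equivalent manual recipe: rm ~/.runpod-my-benchmark-current && runpod-deploy run --config foo.yaml. Note that rm only removes the resume pointer; it does not delete the paused pod itself, so remove that pod separately (for example with runpodctl pod delete <id>) to stop paying for its storage.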
Hygiene: pair with ls-stale
A paused recycle pod shows up in runpod-deploy ls-stale like any
other EXITED pod — that’s intentional. The audit lists give you
visibility into what you’re recycling:
runpod-deploy ls-stale
POD_ID     NAME                                  GB   $/day    $/mo
----------------------------------------------------------------------------------
abc12345   my-benchmark-20260517T120000Z         50    0.17    5.00
def67890   my-other-workflow-20260517T103000Z   100    0.33   10.00
----------------------------------------------------------------------------------
TOTAL: 2 pods, $0.50/day (~$15.00/mo)
If a recycle pod has been idle for weeks and you’ve moved on, delete it
via runpod-deploy cleanup --state-file <path> --mode delete (or by hand with runpodctl pod delete <id>).
When recycle stops paying off
Three signals that a workflow shouldn’t be on recycle anymore:
- Image drift WARNING fires every run: your image is rotating too fast for the cache to help.
- ls-stale shows a recycle pod sitting > 30 days unused: you’ve moved off this workflow; switch to on_success: delete and rm the state-file.
- GPU class changes per run: sweeps that rotate gpu_order force fresh-create every time anyway.
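For the 30-day signal, one rough way to flag candidates is to read the timestamp embedded in the pod name, as in the ls-stale sample above. The Python sketch below is hypothetical and rests on two assumptions: that pod names end in a UTC stamp of the form YYYYMMDDTHHMMSSZ, and that this stamp marks when the pod was created rather than when it last ran. The result is therefore an upper bound on idle time, not a measurement of it.

```python
from datetime import datetime, timezone


def days_since_name_timestamp(pod_name: str, now: datetime) -> float:
    """Days between the UTC stamp at the end of pod_name and `now`."""
    stamp = pod_name.rsplit("-", 1)[-1]  # e.g. "20260517T120000Z"
    created = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return (now - created).total_seconds() / 86400


now = datetime(2026, 6, 20, 12, 0, 0, tzinfo=timezone.utc)
age = days_since_name_timestamp("my-benchmark-20260517T120000Z", now)
print(f"{age:.1f} days")  # 34.0 days
if age > 30:
    print("candidate for on_success: delete")
```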
What lives where
| Concern | Owner |
|---|---|
| Issuing | |
| Storing the resume pointer (pod_id + image + GPU + DC) | The state-file at |
| Validating that the paused pod still matches the spec (image / GPU / DC drift) | |
| Calling | |
| Forcing a fresh-create regardless of state-file | |
| Deciding when recycle stops paying off (volume cost vs run frequency) | You (the trade-off above) |
| Pairing with weekly | Your hygiene rotation |
Anti-patterns to avoid
Don’t use lifecycle.on_success: recycle on configs you run less than
~3×/week. A 50 GB paused pod costs ~$5/mo idle; the recycle benefit
(saved cold-start time × run frequency) needs to clear that floor.
For low-frequency configs, the default lifecycle.on_success: delete
is cheaper.
Don’t share one state_file: across multiple configs. The state-file
is the resume pointer; if two configs both write to
~/.runpod-deploy-current and the second config recycles a pod
provisioned by the first, image / GPU / DC drift detection will fall
through to fresh-create (with a WARNING) — defeating the purpose.
Always give recycled configs unique state_file: paths.
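A quick way to catch shared state-files before they bite is to compare the state_file: values of all your configs. The Python sketch below is a hypothetical helper, not part of runpod-deploy; it assumes you have already collected each config's state_file: value into a dict (loading the YAML is left out to keep the example dependency-free).

```python
from collections import defaultdict
from os.path import expanduser


def shared_state_files(configs: dict[str, str]) -> dict[str, list[str]]:
    """Map each state_file path used by more than one config to those config names."""
    by_path = defaultdict(list)
    for config_name, state_file in configs.items():
        # Expand "~" so "~/x" and an absolute home path compare as equal.
        by_path[expanduser(state_file)].append(config_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


configs = {
    "foo.yaml": "~/.runpod-deploy-current",
    "bar.yaml": "~/.runpod-deploy-current",  # collides with foo.yaml
    "baz.yaml": "~/.runpod-baz-current",
}
for path, names in shared_state_files(configs).items():
    print(f"{path} is shared by: {', '.join(names)}")
```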
See also
- lifecycle.md §7: the full action table including recycle.
- payload-reuse-via-network-volume.md: orthogonal. Use a network volume to share state ACROSS pods (e.g., for parallel sweeps). Recycle reuses ONE pod across runs.
- troubleshooting.md: diagnosing drift WARNINGs and unexpected fresh-creates.