# Recipe: recycle a pod for fast iteration

**Pattern:** for workflows you run repeatedly with the same image and GPU class (daily benchmarks, debug-rerun loops, sweeps), set `lifecycle.on_success: recycle` so successful runs pause the pod *and preserve the state-file*. The next `runpod-deploy run` with the same `state_file:` finds the paused pod, validates compatibility, and calls `runpodctl pod start ` instead of `pod create`. Image pull and cold-boot are skipped; setup commands re-run (idempotent) and rsync is incremental.

## When to use this

| Scenario | Use recycle? |
|---|---|
| One-shot run, occasional sweeps | **No.** The fresh `delete` default is fine; the storage cost is negligible. |
| Same config, multiple times per week | **Yes.** Saves 3–5 min × N runs per week. |
| Iterating on `run.body:` against a heavy image | **Yes.** The image cache survives between attempts. |
| Two configs sharing one paused pod | **No.** `state_file:` is per-config; share via network volume instead. |
| `lifecycle.on_failure` | **No.** `recycle` is success-path only; failed pods have potentially corrupted state. |

## Why this is a recipe, not a default

The trade-off is real: a 50 GB pod sitting paused costs ~$0.17/day (~$5/mo). For one-shot configs that's waste. For a 5×/week workflow that recovers roughly 15–25 minutes of cold-start time per week (3–5 min per run), it pays off. Recycle is opt-in.

## Minimum config

```yaml
# foo.yaml
name: my-benchmark
run_id_prefix: my-benchmark
state_file: ~/.runpod-my-benchmark-current   # IMPORTANT: per-config path

pod:
  image: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EUR-IS-1]
  gpu_order: [NVIDIA L4]

storage:
  mode: ephemeral
  volume_gb: 50

run:
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-benchmark] DONE"
  body: |
    echo "[my-benchmark] DONE"

lifecycle:
  on_success: recycle      # ← the only change vs. a regular config
  on_failure: stop
```

The `state_file:` MUST be a path unique to this config.
The default `~/.runpod-deploy-current` is shared, so configs would clobber each other's state; point this field at a per-workflow path like `~/.runpod--current`.

## Lifecycle walkthrough

### First run (cold)

```bash
runpod-deploy run --config foo.yaml
```

What happens:

1. State-file `~/.runpod-my-benchmark-current` doesn't exist yet → no resume attempt.
2. `runpodctl pod create` → image pull (~3 min on cold host) → SSH-ready.
3. `setup:` runs → rsync staging → run script → artifact pull.
4. `lifecycle.on_success: recycle` → `runpodctl pod stop `.
5. State-file written: `{pod_id, gpu_id, image, datacenter_id}`.
6. Manifest shows `pod_resumed: false`.

### Second run (warm)

```bash
runpod-deploy run --config foo.yaml
```

What happens:

1. State-file exists → `try_resume_pod` reads it.
2. `runpodctl pod get ` confirms the pod is EXITED and the stored image/GPU/datacenter match the current spec.
3. `runpodctl pod start ` (no image pull; the cache is warm).
4. SSH-ready typically in 30–60 s (vs 3–5 min cold).
5. `setup:` re-runs (idempotent: apt skip, uv venv skip).
6. rsync re-pushes (incremental: only changed bytes).
7. Run script → artifact pull → `pod stop` again.
8. Manifest shows `pod_resumed: true`.

### Image drift

If you bump `pod.image` in the YAML between runs, the next run logs:

```
[recycle] drift detected for pod 'abc123': image: stored='runpod/pytorch:OLD' current='runpod/pytorch:NEW'; deleting stale pod and fresh-creating
```

The stale pod is deleted, the state-file is unlinked, and a fresh `runpodctl pod create` fires, so the run after the bump is cold again. The same applies to GPU class or datacenter changes.

### Forcing a fresh run on demand

```bash
runpod-deploy run --config foo.yaml --force-fresh
```

Skips the resume attempt for this run only:

- Deletes any stale paused pod referenced by the state-file.
- Unlinks the state-file.
- Provisions a fresh pod via `runpodctl pod create`.

Useful for "did I actually pull the new image?" debugging without editing YAML.
Equivalent manual recipe: `rm ~/.runpod-my-benchmark-current && runpod-deploy run --config foo.yaml`. Note that removing the state-file by hand leaves any paused pod in place; delete it separately so it does not keep accruing storage cost.

## Hygiene: pair with `ls-stale`

A paused recycle pod shows up in `runpod-deploy ls-stale` like any other EXITED pod; that's intentional. The audit list gives you visibility into what you're recycling:

```bash
runpod-deploy ls-stale
```

```
POD_ID     NAME                                  GB   $/day    $/mo
----------------------------------------------------------------------------------
abc12345   my-benchmark-20260517T120000Z         50    0.17    5.00
def67890   my-other-workflow-20260517T103000Z   100    0.33   10.00
----------------------------------------------------------------------------------
TOTAL: 2 pods, $0.50/day (~$15.00/mo)
```

If a recycle pod has been idle for weeks and you've moved on, delete it via `runpod-deploy cleanup --state-file --mode delete` (or by hand with `runpodctl pod delete `).

## When recycle stops paying off

Three signals that a workflow shouldn't be on recycle anymore:

1. **Image drift WARNING fires every run**: your image is rotating too fast for the cache to help.
2. **`ls-stale` shows a recycle pod sitting > 30 days unused**: you've moved off this workflow; switch to `on_success: delete` and `rm` the state-file.
3. **GPU class changes per run**: sweeps that rotate `gpu_order` force fresh-create every time anyway.

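
The cost side of this trade-off can be put in rough numbers. The following is a minimal sketch and not part of `runpod-deploy`: the function name `recycle_tradeoff` and its parameters are hypothetical, the per-GB rate is simply back-derived from the ~$0.17/day figure for a 50 GB paused pod quoted in this recipe, and the 4-minute default is an assumed midpoint of the 3–5 min cold-start saving. Check both against your own pricing and timings.

```python
def recycle_tradeoff(runs_per_week, minutes_saved_per_run=4.0,
                     volume_gb=50, usd_per_gb_day=0.17 / 50):
    """Compare the weekly idle cost of a paused pod with the cold-start time it saves.

    All defaults are assumptions taken from this recipe's rough figures:
    ~$0.17/day for a 50 GB paused pod and 3-5 minutes saved per warm start.
    """
    idle_cost_usd = volume_gb * usd_per_gb_day * 7          # storage is billed all week
    minutes_saved = runs_per_week * minutes_saved_per_run   # assumes every run resumes warm
    # Cost of each saved minute; infinite when the pod is never reused.
    usd_per_minute = idle_cost_usd / minutes_saved if minutes_saved else float("inf")
    return idle_cost_usd, minutes_saved, usd_per_minute


if __name__ == "__main__":
    for runs in (1, 3, 5):
        cost, saved, per_min = recycle_tradeoff(runs)
        print(f"{runs} runs/week: ${cost:.2f} idle, {saved:.0f} min saved, "
              f"${per_min:.3f} per minute saved")
```

Whether a given cost per saved minute is acceptable is a judgment call; the sketch only makes visible that the figure falls as run frequency rises and grows without bound for a pod that is never resumed.
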
## What lives where

| Concern | Owner |
|---|---|
| Issuing `runpodctl pod stop` at end-of-run + preserving state-file | `runpod-deploy run` (`provider._cleanup_recycle`) |
| Storing the resume pointer (pod_id + image + GPU + DC) | The state-file at `state_file:` (per-config) |
| Validating that the paused pod still matches the spec (image / GPU / DC drift) | `runpod-deploy run` (`provider.try_resume_pod`) |
| Calling `runpodctl pod start ` when validation passes | `runpod-deploy run` (`provider.try_resume_pod`) |
| Forcing a fresh-create regardless of state-file | `runpod-deploy run --force-fresh` |
| Deciding when recycle stops paying off (volume cost vs run frequency) | You (the trade-off above) |
| Pairing with weekly `ls-stale` to catch forgotten paused pods | Your hygiene rotation |

## Anti-pattern to avoid

Don't use `lifecycle.on_success: recycle` on configs you run less than ~3×/week. A 50 GB paused pod costs ~$5/mo idle; the recycle benefit (saved cold-start time × run frequency) needs to clear that floor. For low-frequency configs, the default `lifecycle.on_success: delete` is cheaper.

Don't share one `state_file:` across multiple configs. The state-file is the resume pointer; if two configs both write to `~/.runpod-deploy-current` and the second config recycles a pod provisioned by the first, image / GPU / DC drift detection will fall through to fresh-create (with a WARNING), defeating the purpose. Always give recycled configs unique `state_file:` paths.

## See also

- [`lifecycle.md` §7](../lifecycle.md#7-lifecycle-action-cleanup): the full action table including `recycle`.
- [`payload-reuse-via-network-volume.md`](payload-reuse-via-network-volume.md): orthogonal. Use a network volume to share state ACROSS pods (e.g., for parallel sweeps). Recycle reuses ONE pod across runs.
- [`troubleshooting.md`](../troubleshooting.md): diagnosing drift WARNINGs and unexpected fresh-creates.