Recipe: recycle a pod for fast iteration

Pattern: for workflows you run repeatedly with the same image and GPU class — daily benchmarks, debug-rerun loops, sweeps — set lifecycle.on_success: recycle so successful runs pause the pod and preserve the state-file. The next runpod-deploy run with the same state_file: finds the paused pod, validates compatibility, and calls runpodctl pod start <id> instead of pod create. Image pull and cold-boot are skipped; setup commands re-run (idempotent) and rsync is incremental.

When to use this

| Scenario | Use recycle? |
| --- | --- |
| One-shot run, occasional sweeps | No: the fresh-delete default is fine; the storage cost is negligible. |
| Same config, multiple times per week | Yes: 3–5 min × N runs saved per week. |
| Iterating on run.body: against a heavy image | Yes: the image cache survives between attempts. |
| Two configs sharing one paused pod | No: state_file: is per-config; share via a network volume instead. |
| lifecycle.on_failure | No: recycle is success-path only; failed pods have potentially corrupted state. |

Why this is a recipe, not a default

The trade-off is real: a 50 GB pod sitting paused costs ~$0.17/day (~$5/mo). For one-shot configs that’s waste. For a 5×/week workflow that recovers 30+ minutes of cold-start time per week, it pays off. Recycle is opt-in.
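The trade-off can be put into numbers with a rough sketch like the one below. The per-GB rate is only what the 50 GB ≈ $5/mo figure above implies, not a quoted price, and the function names and the cold/warm minutes you pass in are hypothetical inputs for illustration.

```python
# Rough break-even sketch for a paused (recycled) pod.
# ASSUMPTION: storage rate is back-derived from "50 GB ~ $5/mo" in the text.
GB_MONTH_USD = 5.00 / 50

WEEKS_PER_MONTH = 52 / 12


def idle_cost_per_month(volume_gb: float) -> float:
    """Monthly storage cost of keeping a paused pod of this size around."""
    return volume_gb * GB_MONTH_USD


def minutes_saved_per_month(runs_per_week: float, cold_min: float,
                            warm_min: float,
                            weeks_per_month: float = WEEKS_PER_MONTH) -> float:
    """Start-up minutes recovered per month by resuming instead of cold-booting."""
    return runs_per_week * (cold_min - warm_min) * weeks_per_month


def cost_per_saved_minute(volume_gb: float, runs_per_week: float,
                          cold_min: float, warm_min: float) -> float:
    """Dollars paid in idle storage per minute of start-up time recovered."""
    saved = minutes_saved_per_month(runs_per_week, cold_min, warm_min)
    if saved <= 0:
        return float("inf")  # nothing recovered: recycle is pure cost
    return idle_cost_per_month(volume_gb) / saved
```

Comparing cost_per_saved_minute(...) against what a minute of waiting is worth to you gives a quick yes/no for a given workflow; it ignores any GPU billing during start-up, which would only strengthen the case for recycle.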

Minimum config

# foo.yaml
name: my-benchmark
run_id_prefix: my-benchmark
state_file: ~/.runpod-my-benchmark-current   # IMPORTANT: per-config path

pod:
  image: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
  datacenters: [EUR-IS-1]
  gpu_order: [NVIDIA L4]

storage:
  mode: ephemeral
  volume_gb: 50

run:
  script_path: /workspace/run.sh
  log_path: /workspace/run.log
  success_marker: "[my-benchmark] DONE"
  body: |
    echo "[my-benchmark] DONE"

lifecycle:
  on_success: recycle    # ← the only change vs. a regular config
  on_failure: stop

The state_file: MUST be a path unique to this config. The default ~/.runpod-deploy-current would clobber across configs — point this field at a per-workflow path like ~/.runpod-<workflow-name>-current.

Lifecycle walkthrough

First run (cold)

runpod-deploy run --config foo.yaml

What happens:

  1. State-file ~/.runpod-my-benchmark-current doesn’t exist yet → no resume attempt.

  2. runpodctl pod create → image pull (~3 min on cold host) → SSH-ready.

  3. setup: runs → rsync staging → run script → artifact pull.

  4. lifecycle.on_success: recycle → runpodctl pod stop <id>.

  5. State-file written: {pod_id, gpu_id, image, datacenter_id}.

  6. Manifest shows pod_resumed: false.
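Step 5 can be pictured as a small round trip through the state-file. This is a minimal sketch, not the tool's code: the four field names come from the walkthrough above, but the JSON encoding and the helper names write_state / read_state are assumptions made for illustration.

```python
import json
from pathlib import Path

# Field names as listed in step 5; the on-disk format (JSON) is an ASSUMPTION.
STATE_FIELDS = ("pod_id", "gpu_id", "image", "datacenter_id")


def write_state(path: Path, state: dict) -> None:
    """Persist the resume pointer after a successful, recycled run."""
    missing = [field for field in STATE_FIELDS if field not in state]
    if missing:
        raise ValueError(f"state is missing fields: {missing}")
    # Store only the known fields so stray keys never leak into the pointer.
    path.write_text(json.dumps({field: state[field] for field in STATE_FIELDS}))


def read_state(path: Path):
    """Return the stored resume pointer, or None when there is nothing to resume."""
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

A missing file maps to None, which corresponds to step 1 of the cold run: no state-file, so no resume attempt.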

Second run (warm)

runpod-deploy run --config foo.yaml

What happens:

  1. State-file exists → try_resume_pod reads it.

  2. runpodctl pod get <pod_id> confirms the pod is EXITED and the stored image/GPU/datacenter match the current spec.

  3. runpodctl pod start <pod_id> (no image pull — cache is warm).

  4. SSH-ready typically in 30–60 s (vs 3–5 min cold).

  5. setup: re-runs (idempotent — apt skip, uv venv skip).

  6. rsync re-pushes (incremental — only changed bytes).

  7. Run script → artifact pull → pod stop again.

  8. Manifest shows pod_resumed: true.
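The decision made in steps 1–3 can be summarised as a small pure function. This is a hedged sketch of the logic as described, not the actual try_resume_pod: the function name plan_resume, the returned action strings, and the handling of a pod that is not EXITED are all assumptions.

```python
# Fields that must match between the stored pointer and the current spec.
COMPARED_FIELDS = ("image", "gpu_id", "datacenter_id")


def plan_resume(state, pod_status, spec) -> str:
    """Pick the next action for a run.

    state      -- dict read from the state-file, or None if it does not exist
    pod_status -- status string reported for the stored pod (e.g. "EXITED")
    spec       -- dict with the image / gpu_id / datacenter_id of the current config
    """
    if state is None:
        return "create"  # cold run: nothing to resume
    if pod_status != "EXITED":
        # ASSUMPTION: only a paused (EXITED) pod is a resume candidate.
        return "create"
    if any(state.get(f) != spec.get(f) for f in COMPARED_FIELDS):
        return "delete_and_create"  # drift: stale pod is replaced
    return "start"  # warm run: resume the paused pod
```

Keeping the decision free of side effects makes it easy to test each branch without touching a real pod.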

Image drift

If you bump pod.image in the YAML between runs:

[recycle] drift detected for pod 'abc123': image: stored='runpod/pytorch:OLD' current='runpod/pytorch:NEW'; deleting stale pod and fresh-creating

The stale pod is deleted, state-file is unlinked, and a fresh runpodctl pod create fires. The next run is cold again. Same for GPU class or datacenter changes.
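To make the drift message concrete, here is a minimal sketch that builds a line in the same shape as the log above. Only the image case is confirmed by that log; the labels used for GPU and datacenter drift, the "; " separator between several drifted fields, and the name describe_drift are assumptions.

```python
# ASSUMPTION: GPU and datacenter drift are reported under these field labels.
DRIFT_FIELDS = ("image", "gpu_id", "datacenter_id")


def describe_drift(pod_id: str, stored: dict, current: dict):
    """Return a drift log line, or None when stored and current specs agree."""
    parts = [
        f"{field}: stored={stored.get(field)!r} current={current.get(field)!r}"
        for field in DRIFT_FIELDS
        if stored.get(field) != current.get(field)
    ]
    if not parts:
        return None  # no drift: the paused pod can be resumed
    return (
        f"[recycle] drift detected for pod {pod_id!r}: "
        + "; ".join(parts)
        + "; deleting stale pod and fresh-creating"
    )
```

Returning None for the no-drift case lets the caller treat any non-empty message as the signal to delete the stale pod and unlink the state-file.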

Forcing a fresh run on demand

runpod-deploy run --config foo.yaml --force-fresh

Skips the resume attempt for this run only:

  • Deletes any stale paused pod referenced by the state-file.

  • Unlinks the state-file.

  • Provisions a fresh pod via runpodctl pod create.

Useful for “did I actually pull the new image?” debugging without editing YAML.

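Near-equivalent manual recipe: rm ~/.runpod-my-benchmark-current && runpod-deploy run --config foo.yaml. Unlike --force-fresh, removing the state-file by hand does not delete the paused pod it pointed to; delete that pod separately (runpodctl pod delete <id>), or it keeps accruing storage cost until you do.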

Hygiene: pair with ls-stale

A paused recycle pod shows up in runpod-deploy ls-stale like any other EXITED pod; that’s intentional. The audit list gives you visibility into what you’re recycling:

runpod-deploy ls-stale
POD_ID           NAME                                         GB    $/day     $/mo
----------------------------------------------------------------------------------
abc12345         my-benchmark-20260517T120000Z                 50     0.17     5.00
def67890         my-other-workflow-20260517T103000Z           100     0.33    10.00
----------------------------------------------------------------------------------
TOTAL: 2 pods, $0.50/day (~$15.00/mo)

If a recycle pod has been idle for weeks and you’ve moved on, delete it via runpod-deploy cleanup --state-file <path> --mode delete (or by hand with runpodctl pod delete <id>).

When recycle stops paying off

Three signals that a workflow shouldn’t be on recycle anymore:

  1. Image drift WARNING fires every run — your image is rotating too fast for the cache to help.

  2. ls-stale shows a recycle pod sitting > 30 days unused — you’ve moved off this workflow; switch to on_success: delete and rm the state-file.

  3. GPU class changes per run — sweeps that rotate gpu_order force fresh-create every time anyway.

What lives where

| Concern | Owner |
| --- | --- |
| Issuing runpodctl pod stop at end-of-run + preserving state-file | runpod-deploy run (provider._cleanup_recycle) |
| Storing the resume pointer (pod_id + image + GPU + DC) | The state-file at state_file: (per-config) |
| Validating that the paused pod still matches the spec (image / GPU / DC drift) | runpod-deploy run (provider.try_resume_pod) |
| Calling runpodctl pod start <id> when validation passes | runpod-deploy run (provider.try_resume_pod) |
| Forcing a fresh-create regardless of state-file | runpod-deploy run --force-fresh |
| Deciding when recycle stops paying off (volume cost vs run frequency) | You (the trade-off above) |
| Pairing with weekly ls-stale to catch forgotten paused pods | Your hygiene rotation |

Anti-pattern to avoid

Don’t use lifecycle.on_success: recycle on configs you run less than ~3×/week. A 50 GB paused pod costs ~$5/mo idle; the recycle benefit (saved cold-start time × run frequency) needs to clear that floor. For low-frequency configs, the default lifecycle.on_success: delete is cheaper.

Don’t share one state_file: across multiple configs. The state-file is the resume pointer; if two configs both write to ~/.runpod-deploy-current and the second config recycles a pod provisioned by the first, image / GPU / DC drift detection will fall through to fresh-create (with a WARNING) — defeating the purpose. Always give recycled configs unique state_file: paths.
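A quick pre-flight check can catch shared state-file paths before they cause drift churn. The sketch below assumes your configs are already loaded as dicts keyed by a name of your choosing; find_shared_state_files is a hypothetical helper, not part of runpod-deploy, and the fallback path is the default named above.

```python
from collections import defaultdict
from pathlib import Path

# Default taken from the text above; configs without state_file: fall back to it.
DEFAULT_STATE_FILE = "~/.runpod-deploy-current"


def find_shared_state_files(configs: dict) -> dict:
    """Map each state-file path used by more than one config to those configs.

    configs -- {config_name: parsed config dict}; how they are loaded is up to you.
    """
    users_by_path = defaultdict(list)
    for config_name, cfg in configs.items():
        raw = cfg.get("state_file", DEFAULT_STATE_FILE)
        # Expand "~" so "~/x" and its absolute form count as the same file.
        resolved = str(Path(raw).expanduser())
        users_by_path[resolved].append(config_name)
    return {
        path: sorted(names)
        for path, names in users_by_path.items()
        if len(names) > 1
    }
```

An empty result means every config has its own resume pointer; any entry in the result names the configs that would clobber each other.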

See also