# Recipe: reuse the staging payload via a network volume

**Pattern:** when you run the same workflow repeatedly (a daily benchmark, a parameter sweep over the same repo), most of each `runpod-deploy run` is *re-paying* fixed costs: image pull (~2–5 min), `setup:` re-runs (apt install, uv venv), full rsync of the staging payload (1–5 min for typical repos). Switching the workflow to `storage.mode: network_volume` lets the setup and staging costs amortize across runs: rsync becomes incremental, and the venv and other cached setup output survive on the volume. The image pull remains a per-pod cost (see "What this does *not* solve").

## Why this is a recipe, not a default

The trade-off is real:

- **Network volume**: $0.07/GB·month for the first 1 TB. A 100 GB volume sitting idle is ~$7/month. Pinned to one datacenter: if the GPU you want is unavailable there, the run waits.
- **Ephemeral**: free when no pod exists. Full rebuild every run. Free GPU choice across all of your `pod.datacenters:`.

For a hello-world or smoke config that runs occasionally, ephemeral wins. For a benchmark you run 5× a week against the same 2 GB repo with 8 GB of cached venv state, network_volume wins by hours of wall time per month.

## Step 1 — Create the volume

```bash
# Pick a datacenter where your GPU type is consistently available.
runpodctl network-volume create \
  --name pid-workspace-100gb \
  --datacenter EU-RO-1 \
  --size 100
```

The new volume starts empty. RunPod returns an ID; remember the name (`pid-workspace-100gb`), because your config references the volume by name, not by ID.

Audit volumes anytime:

```bash
runpodctl network-volume list
```

## Step 2 — Update the YAML

```yaml
storage:
  mode: network_volume
  volume_name: pid-workspace-100gb     # must match the name above
  volume_mount: /workspace             # default; what your scripts assume

pod:
  # Network volume is pinned to one datacenter; constrain pod placement to match.
  datacenters: [EU-RO-1]
  gpu_order:
    - NVIDIA H100 80GB HBM3
    - NVIDIA A100-SXM4-80GB

lifecycle:
  on_success: delete    # release compute; volume persists
  on_failure: stop      # preserve compute too for forensics
```

The `lifecycle.on_success: delete` action releases the *pod's* compute and its container disk, but the network volume is a separate, named resource that survives the pod. Across runs, your `/workspace` is the same filesystem.

## Step 3 — Make the staging incremental

`runpod-deploy` already uses `rsync` for staging, so the second run of the same workflow only pushes changed bytes. You don't need to change anything in the `staging:` block; rsync's incremental behavior comes for free against a persistent destination.

A few hygiene patterns help:

- Keep `staging.excludes_default: true` (project default) so that hygiene exclusions such as `**/__pycache__/`, `.git/`, `.venv/`, and `.pytest_cache/` stay out of the payload and don't churn on every rsync.
- Add project-specific large-but-stable directories to `staging.excludes_extra` if they're populated on the volume some other way (e.g., a Hugging Face cache).

## Step 4 — Cache slow setup output on the volume

Your `setup:` commands run on every pod, but if they write to `/workspace`, their output persists across pods. Two common idioms:

```yaml
setup:
  - command: |
      # Create the venv only once; on later runs the guard skips straight to activation
      if [ ! -d /workspace/.venv ]; then
        uv venv /workspace/.venv --python 3.11
      fi
      source /workspace/.venv/bin/activate
      uv pip install --quiet -e /workspace/repo
    timeout_sec: 300
```

```yaml
setup:
  - command: |
      # Cache the HF model downloads on the volume
      export HF_HOME=/workspace/.cache/huggingface
      mkdir -p "$HF_HOME"
    timeout_sec: 30
```

After the first run, the venv and the HF cache are warm. Subsequent runs spend most of `setup:` re-establishing environment variables, not pulling bytes.

## What this does *not* solve

- **Docker image pull is still per-pod**: the image isn't on the volume; it's on the pod's container disk.
  Image pull takes ~30 s for a cached image and ~2–5 min for an uncached one. RunPod caches popular images at the datacenter level; the `runpod/pytorch:...` base used in this repo's examples is usually already cached.
- **`setup:` commands still run** on every pod, even if their outputs are cached. They must be idempotent; the remaining cost is just the short-circuit branches, which is small.

If image-pull + setup re-execution is still the dominant cost, the right next step is to leave a pod *paused* between runs and resume it directly. That's `lifecycle.on_success: recycle`; see [`recipes/recycle-pod-for-fast-iteration.md`](recycle-pod-for-fast-iteration.md). Recycle reuses ONE pod across runs (saves image-pull + cold-boot); the network volume here reuses one `/workspace` ACROSS pods (saves rsync + venv state). They're orthogonal and compose: ephemeral storage + recycle is the most common combination for "fast iteration on one workflow"; network volume + delete is the right combo for "share state across parallel sweep workers".

## What this DOES solve

For a typical benchmark workflow (5 runs/week, 2 GB repo, 8 GB cached venv, 12 GB HF model):

- First run: ~3 min image pull + ~2 min setup + ~3 min staging + run.
- Second run onward: ~30 s image pull (cached at DC) + ~10 s setup (everything cached) + ~5 s staging (rsync says "nothing to do") + run.

That's ~7 minutes saved per run, or ~35 minutes/week (about 2.6 hours/month). The image-pull part of that saving comes from RunPod's datacenter-level cache rather than from the volume; setup and staging account for ~5 of the 7 minutes. At those figures, the ~$7/month volume cost is recovered if the time saved is worth more than roughly $3–4/h.
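
The break-even arithmetic can be checked with a short script. This is a minimal sketch that only restates this recipe's own estimates (per-phase timings, 5 runs/week, a 100 GB volume at $0.07/GB·month); the constant and function names are illustrative and not part of `runpod-deploy`, and a month is assumed to be 52/12 weeks.

```python
# Per-phase timings in seconds, taken from the estimates in this recipe.
COLD_RUN_SEC = {"image_pull": 180, "setup": 120, "staging": 180}
WARM_RUN_SEC = {"image_pull": 30, "setup": 10, "staging": 5}

RUNS_PER_WEEK = 5
VOLUME_GB = 100
PRICE_PER_GB_MONTH = 0.07   # USD, first 1 TB tier
WEEKS_PER_MONTH = 52 / 12   # assumption: average month length


def saved_seconds(phases):
    """Seconds saved per run across the given phases (cold minus warm)."""
    return sum(COLD_RUN_SEC[p] - WARM_RUN_SEC[p] for p in phases)


def hours_saved_per_month(saved_sec_per_run):
    """Wall-clock hours saved per month at RUNS_PER_WEEK."""
    return saved_sec_per_run * RUNS_PER_WEEK * WEEKS_PER_MONTH / 3600


def break_even_rate(saved_sec_per_run):
    """Hourly value of time (USD/h) at which the idle volume pays for itself."""
    monthly_cost = VOLUME_GB * PRICE_PER_GB_MONTH
    return monthly_cost / hours_saved_per_month(saved_sec_per_run)


all_phases = saved_seconds(COLD_RUN_SEC)            # includes the DC-cached image pull
volume_only = saved_seconds(["setup", "staging"])   # only what the volume itself provides

for label, sec in [("all phases", all_phases), ("volume only", volume_only)]:
    print(
        f"{label}: {sec / 60:.2f} min/run, "
        f"{hours_saved_per_month(sec):.2f} h/month, "
        f"break-even ${break_even_rate(sec):.2f}/h"
    )
```

Counting every phase, the saving is 7.25 min per run and the break-even is about $2.67/h; counting only setup and staging (the parts the volume is responsible for), it is 4.75 min per run and about $4.08/h. Swap in your own timings and volume size to see where your workflow lands.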

## What lives where

| Concern | Owner |
|---|---|
| Creating / sizing / pinning the network volume to a datacenter | `runpodctl network-volume create` (one-shot, manual) |
| Resolving the volume by name on every run | `runpod-deploy run` (`provider.resolve_volume`) |
| Mounting the volume at `/workspace` on the pod | `runpod-deploy run` (`storage.volume_mount`) |
| Making rsync incremental (delete-aware) | `staging[].delete: false` in YAML |
| Caching slow setup output (venv, model weights, apt caches) | Your `setup:` commands writing under `/workspace/...` |
| Deciding when the cost of an idle volume is worth the speed | You (trade-off list under "Why this is a recipe, not a default") |
| Migrating a volume to a different datacenter | Manual: RunPod doesn't move volumes; recreate + rsync |

## Anti-pattern to avoid

Don't pin every project to its own 1 TB volume "just in case". A 1 TB volume costs ~$70/month idle. Size to the workflow: typical Python project + cached venv + model weights = 50–100 GB.

Don't share one volume across unrelated workflows by mounting it at `/workspace` and relying on each project to stay inside its own subdirectory by convention. `storage.volume_mount` is shared by every consumer that mounts the same volume; concurrent runs from different projects will corrupt each other's setup state. Either give each workflow its own volume, or pair the share with strict per-run subdirectory isolation enforced in `run.body:`.

## See also

- [`lifecycle.md` §7b](../lifecycle.md#7b-cost-discipline-cleaning-up-after-forensics): why `lifecycle.on_success: delete` is the default even when the volume persists.
- [`config-reference.md`](../config-reference.md): full `storage.mode: network_volume` schema reference.