Recipe: reuse the staging payload via a network volume#
Pattern: when you run the same workflow repeatedly (a daily
benchmark, a parameter sweep over the same repo), most of each
runpod-deploy run is re-paying fixed costs: image pull (~2–5
min), setup: re-runs (apt install, uv venv), and a full rsync of the
staging payload (1–5 min for typical repos). Switching the workflow
to storage.mode: network_volume amortizes the staging and setup costs
across runs: rsync becomes incremental and the venv survives on the
volume. Image pull remains a per-pod cost (see "What this does not
solve" below).
Why this is a recipe, not a default#
The trade-off is real:
Network volume: $0.07/GB·month for the first 1 TB. A 100 GB volume sitting idle is ~$7/month. Pinned to one datacenter: if the GPU you want is unavailable there, the run waits.
Ephemeral: free when no pod exists. Full rebuild every run. Free GPU choice across all of your
pod.datacenters:.
For a hello-world or smoke config that runs occasionally, ephemeral wins. For a benchmark you run 5× a week against the same 2 GB repo with 8 GB of cached venv state, network_volume wins by hours of wall time.
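To put numbers on the idle-volume side of the trade-off, here is a small sketch that computes the monthly cost from the volume size. It assumes the $0.07/GB·month rate quoted above applies flat to volumes up to 1 TB; `volume_cost` and `RATE_PER_GB` are names made up for this example.

```shell
#!/bin/sh
# Monthly idle cost of a network volume at the rate quoted above.
# Assumption: flat $0.07/GB-month, only meaningful for volumes up to 1 TB.
RATE_PER_GB=0.07

# volume_cost SIZE_GB -> prints the monthly cost in dollars (two decimals)
volume_cost() {
  awk -v gb="$1" -v rate="$RATE_PER_GB" 'BEGIN { printf "%.2f\n", gb * rate }'
}

for size in 50 100 1000; do
  echo "${size} GB: \$$(volume_cost "$size")/month"
done
```

Swapping in your own sizes shows quickly whether the idle cost is small next to the wall time you expect to save.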
Step 1 — Create the volume#
# Pick a datacenter where your GPU type is consistently available.
runpodctl network-volume create \
  --name pid-workspace-100gb \
  --datacenter EU-RO-1 \
  --size 100
The volume is empty. RunPod returns an ID; remember the name
(pid-workspace-100gb) — your config references it by name, not by
ID.
Audit volumes anytime:
runpodctl network-volume list
Step 2 — Update the YAML#
storage:
  mode: network_volume
  volume_name: pid-workspace-100gb   # must match the name above
  volume_mount: /workspace           # default; what your scripts assume
pod:
  # Network volume is pinned to one datacenter; constrain pod placement to match.
  datacenters: [EU-RO-1]
  gpu_order:
    - NVIDIA H100 80GB HBM3
    - NVIDIA A100-SXM4-80GB
lifecycle:
  on_success: delete   # release compute; volume persists
  on_failure: stop     # preserve compute too for forensics
The lifecycle.on_success: delete action releases the pod’s
compute and its container disk — but the network volume is a
separate, named resource that survives the pod. Across runs, your
/workspace is the same filesystem.
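One way to confirm that the mount really is the same filesystem across pods is to leave a marker on it and read it back on the next run. A minimal sketch, not a runpod-deploy feature: the `check_persistence` helper and the `.last_run` marker file are invented for this example.

```shell
#!/bin/sh
# Sketch: detect whether a previous run left state on the mounted volume.
# check_persistence and the .last_run marker are hypothetical names.

# check_persistence DIR -> reports warm/cold, then records the current time
check_persistence() {
  marker="$1/.last_run"
  if [ -f "$marker" ]; then
    echo "warm volume: previous run at $(cat "$marker")"
  else
    echo "cold volume: no previous run recorded"
  fi
  # Leave a timestamp for the next pod to find.
  date -u +%Y-%m-%dT%H:%M:%SZ > "$marker"
}
```

Called as `check_persistence /workspace` from a setup: command, the first run on a new volume should report a cold volume and later runs a warm one. A cold report on a later run would suggest the pod did not get the volume mounted.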
Step 3 — Make the staging incremental#
runpod-deploy already uses rsync for staging, so the second run
of the same workflow only pushes changed bytes. You don’t need to
change anything in the staging: block — rsync’s incremental
behavior happens for free against a persistent destination.
A few hygiene patterns help:
Keep staging.excludes_default: true (project default) so the hygiene exclusions like **/__pycache__/, .git/, .venv/, .pytest_cache/ don’t fight rsync.
Add project-specific large-but-stable directories to staging.excludes_extra if they’re populated on the volume some other way (e.g., a Hugging Face cache).
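The incremental behavior can be observed locally, without a pod. The sketch below syncs a scratch directory three times with plain rsync and counts the files sent each time. This recipe does not show the flags runpod-deploy actually passes to rsync, so `-a --itemize-changes` is an assumption chosen for the demonstration.

```shell
#!/bin/sh
# Demonstrates rsync's incremental behavior against a persistent destination.
# Requires rsync on PATH; touches scratch directories only.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'print("hello")\n' > "$src/main.py"
printf 'numpy\n' > "$src/requirements.txt"

# count_transfers: sync src into dst and print how many regular files were sent.
# In --itemize-changes output, transferred files are the lines starting with ">f".
count_transfers() {
  rsync -a --itemize-changes "$src/" "$dst/" | grep -c '^>f' || true
}

first=$(count_transfers)    # destination empty: every file is sent
second=$(count_transfers)   # nothing changed: nothing is sent

# Change one file (new size, so rsync's quick check notices it).
printf 'print("hello, volume")\n' > "$src/main.py"
third=$(count_transfers)    # only the changed file is sent

echo "first sync:  $first file(s) sent"
echo "second sync: $second file(s) sent"
echo "third sync:  $third file(s) sent"
```

The same logic is what makes a network volume pay off: the destination outlives the pod, so each run looks like the second or third sync rather than the first.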
Step 4 — Cache slow setup output on the volume#
Your setup: commands run on every pod, but if they write to
/workspace, their output persists across pods. Two common idioms:
setup:
  - command: |
      # The directory check makes venv creation idempotent: later runs skip it
      if [ ! -d /workspace/.venv ]; then
        uv venv /workspace/.venv --python 3.11
      fi
      source /workspace/.venv/bin/activate
      uv pip install --quiet -e /workspace/repo
    timeout_sec: 300

setup:
  - command: |
      # Cache the HF model downloads on the volume
      export HF_HOME=/workspace/.cache/huggingface
      mkdir -p "$HF_HOME"
    timeout_sec: 30
After the first run, the venv and the HF cache are warm. Subsequent
runs spend most of setup: re-establishing environment variables,
not pulling bytes.
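For setup steps that have no natural "already done" test like the directory check above, a stamp file on the volume can serve as the guard. A minimal sketch: `run_once`, `STAMP_DIR`, and the step names are invented here and are not runpod-deploy features.

```shell
#!/bin/sh
# Sketch: run a setup step once per volume by recording a stamp file.
# run_once and STAMP_DIR are hypothetical names for illustration.
# Only guard steps whose output lives on the volume; anything installed into
# the container itself (e.g. apt packages) is gone once the pod is deleted.
STAMP_DIR="${STAMP_DIR:-/workspace/.setup-stamps}"

# run_once NAME COMMAND [ARGS...]
# Runs COMMAND only if no stamp for NAME exists; stamps only on success.
run_once() {
  name=$1
  shift
  if [ -f "$STAMP_DIR/$name" ]; then
    echo "skip: $name (already done)"
    return 0
  fi
  mkdir -p "$STAMP_DIR" || return 1
  "$@" || return 1
  touch "$STAMP_DIR/$name"
  echo "done: $name"
}
```

A step that fails leaves no stamp, so the next pod retries it. Deleting a stamp file by hand forces that one step to re-run without rebuilding the rest of the volume.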
What this does not solve#
Docker image pull is still per-pod: the image isn’t on the volume; it’s on the pod’s container disk. Image pull is ~30 s for a cached image / ~2–5 min for an uncached image. RunPod caches popular images at the datacenter level; the runpod/pytorch:... base used in this repo’s examples is usually already cached.
setup: commands still run every pod, even if the outputs are cached. They must be idempotent; the cost is just the short-circuit branches, which is small.
If image-pull + setup re-execution is still the dominant cost, the
right next step is to leave a pod paused between runs and resume
it directly. That’s lifecycle.on_success: recycle — see
recipes/recycle-pod-for-fast-iteration.md.
Recycle reuses ONE pod across runs (saves image-pull + cold-boot);
network volume here reuses a /workspace ACROSS pods (saves rsync +
venv state). They’re orthogonal and compose: ephemeral storage +
recycle is the most common combination for “fast iteration on one
workflow”; network volume + delete is the right combo for “share
state across parallel sweep workers”.
What this DOES solve#
For a typical benchmark workflow (5 runs/week, 2 GB repo, 8 GB cached venv, 12 GB HF model):
First run: ~3 min image pull + ~2 min setup + ~3 min staging + run.
Second run onward: ~30 s image pull (cached at DC) + ~10 s setup (everything cached) + ~5 s staging (rsync says “nothing to do”) + run.
That’s ~7 minutes saved per run, or ~30 minutes/week (about 2 hours/month). The $7/month volume cost is recovered if your time is worth more than roughly $3/h.
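The estimate can be re-derived from the per-phase timings. The sketch below uses only the figures quoted in this section, treats every one of the 5 weekly runs as warm, and assumes 52/12 weeks per month. Take the result as an order-of-magnitude check, not a measurement.

```shell
#!/bin/sh
# Re-derives the savings estimate from the timings quoted above (in seconds).
cold=$(( 3*60 + 2*60 + 3*60 ))     # image pull + setup + staging, first run
warm=$(( 30 + 10 + 5 ))            # the same phases against a warm volume
saved_per_run=$(( cold - warm ))   # 435 s, a little over 7 minutes

runs_per_week=5
volume_cost=7                      # dollars per month for a 100 GB volume

# Hours saved per month, assuming 52/12 weeks per month.
hours_per_month=$(awk -v s="$saved_per_run" -v r="$runs_per_week" \
  'BEGIN { printf "%.2f", s * r * (52 / 12) / 3600 }')

# Hourly rate at which the time saved pays for the idle volume.
break_even=$(awk -v c="$volume_cost" -v h="$hours_per_month" \
  'BEGIN { printf "%.2f", c / h }')

echo "saved per run:   ${saved_per_run} s"
echo "saved per month: ${hours_per_month} h"
echo "break-even rate: \$${break_even}/h"
```

Changing `runs_per_week` shows how fast the break-even rate climbs for workflows that run only occasionally, which is the case where ephemeral storage wins.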
What lives where#
| Concern | Owner |
|---|---|
| Creating / sizing / pinning the network volume to a datacenter | |
| Resolving the volume by name on every run | |
| Mounting the volume at | |
| Making rsync incremental (delete-aware) | |
| Caching slow setup output (venv, model weights, apt caches) | Your |
| Deciding when the cost of an idle volume is worth the speed | You (recipe trade-off table above) |
| Migrating a volume to a different datacenter | Manual: RunPod doesn’t move volumes; recreate + rsync |
Anti-pattern to avoid#
Don’t pin every project to its own 1 TB volume “just in case”. A 1 TB volume costs ~$70/month idle. Size to the workflow: typical Python project + cached venv + model weights = 50–100 GB.
Don’t share one volume across unrelated workflows by mounting at
/workspace and writing into a project-specific subdirectory.
storage.volume_mount is shared by every consumer that mounts the
same volume; concurrent runs from different projects will corrupt
each other’s setup state. Either give each workflow its own volume,
or pair the share with strict per-run subdirectory isolation enforced
in run.body:.
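If a volume does have to be shared, per-run isolation could look like the sketch below: each run claims its own directory keyed by project and run ID, and the claim fails instead of reusing a directory another run already owns. The `claim_run_dir` helper, the `runs/` layout, and the project and run ID values are assumptions for illustration; this recipe does not document runpod-deploy as providing them.

```shell
#!/bin/sh
# Sketch: claim an exclusive per-run directory on a shared volume.
# claim_run_dir and the runs/<project>/<run_id> layout are hypothetical.
# mkdir without -p fails if the directory already exists, so for a given
# path only one caller can succeed.

# claim_run_dir ROOT PROJECT RUN_ID -> prints the claimed directory
claim_run_dir() {
  root=$1
  project=$2
  run_id=$3
  mkdir -p "$root/runs/$project" || return 1
  dir="$root/runs/$project/$run_id"
  if mkdir "$dir" 2>/dev/null; then
    echo "$dir"
  else
    echo "run directory already claimed: $dir" >&2
    return 1
  fi
}
```

Inside run.body: this might be used as `RUN_DIR=$(claim_run_dir /workspace my-project "$RUN_ID") || exit 1`, with all run output written under `$RUN_DIR`. Shared caches such as a venv still need their own coordination, which is why a separate volume per workflow is the simpler option.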
See also#
lifecycle.md §7b: why lifecycle.on_success: delete is the default even when the volume persists.
config-reference.md: full storage.mode: network_volume schema reference.