Recipes

Composition patterns for runpod-deploy. None of these are schema features — they’re the conventions consumers use to wire runpod-deploy run into a larger pipeline.

Why recipes instead of schema features

runpod-deploy is a deployment-primitives library. Its single responsibility is the pod lifecycle: GPU/DC selection (with failover), staging, remote execution, telemetry capture, artifact pull, and the manifest that records what happened.

It does not orchestrate consumer-domain steps (audit, plotting, aggregation). Those stay in your project’s Makefile / shell scripts / Python entry points and call runpod-deploy run from inside their own flow. This keeps runpod-deploy decoupled from any one consumer’s domain logic.
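A minimal Python sketch of that composition, in which the consumer's entry point owns the flow and the deploy step is one subprocess call in the middle. The function name `run_pipeline` and its callbacks are hypothetical. The sketch also assumes the deploy command exits nonzero on failure, which the text above does not state.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_pipeline(
    preflight: Callable[[], bool],
    deploy_cmd: Sequence[str],
    postprocess: Callable[[], None],
) -> int:
    """Run a local check, then the deploy command, then local post-processing.

    Returns a process-style exit code: 0 on success, nonzero otherwise.
    """
    if not preflight():
        # Stop before the deploy command runs, so a failed audit launches nothing.
        return 1

    # Assumption: the deploy command signals failure with a nonzero exit code.
    result = subprocess.run(list(deploy_cmd), check=False)
    if result.returncode != 0:
        # Skip post-processing when the deploy step reports failure.
        return result.returncode

    postprocess()
    return 0


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; a real project would
    # pass its own runpod-deploy invocation here instead.
    stand_in = [sys.executable, "-c", "pass"]
    code = run_pipeline(lambda: True, stand_in, lambda: print("postprocess ran"))
    sys.exit(code)
```

In a real project, `deploy_cmd` would be `runpod-deploy run` followed by whatever arguments your configuration needs. Those arguments are deliberately not shown here. The domain steps (audit, plotting, aggregation) stay in the callbacks, outside the deploy tool.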

Index

local-preflight-then-run.md — Makefile pattern: run a local audit (leakage check, fixture re-render, dataset validation) then invoke runpod-deploy run.
local-postprocess-after-run.md — pull artifacts via runpod-deploy run, then run plotting / aggregation locally over the pulled artifacts/runpod/<ts>/ directory.
embed-deploy-metadata.md — pipe runpod-deploy capture-env into your own evals manifest. Replaces hand-rolled git rev-parse HEAD injection in Makefile targets.
multi-config-sweep.md — bash for-loop over a set of configs that share a local.project_root.
cost-reconciliation.md — read wall_time_sec and estimated_cost_usd from past manifests to validate assumed_hourly_rate_usd settings and detect drift.
predictions-only-eval.md — GPU pod emits only predictions_full.parquet + adapters; metrics / bootstrap CIs / paired tests run locally on CPU. Decouples the cost of running the model from the cost of evaluating it.
flash-attention-fallback.md — transformer scorers degrade gracefully when the GPU class doesn’t support flash_attention_2 (portability across the GPU-failover pool).
reproducibility.md — pod.python_version YAML field + uv python install/pin auto-injection to lock the CPython interpreter version across sweep runs.
python-api-for-forensics.md — multi-manifest forensics via runpod_deploy.walk_run_dirs + load_manifest + load_events; the strongest Python-API use case.
stock-out-diagnostic.md — pre-launch probe + four-action decision menu (wait / widen / switch / defer) for when the configured gpu_order × datacenters matrix is empty in your target cloud_type.
forensics-then-cleanup.md — workflow for handling a failed run when lifecycle.on_failure is set to stop: inspect the pulled manifest, SSH in if needed, then release the pod with runpod-deploy cleanup. Pairs the per-run WARNING with the cleanup / ls-stale CLI.
stale-pod-audit.md — wire runpod-deploy ls-stale into a weekly cron / GitHub Action / Slack ping so storage drift is detected early. Prevents the silent-leak failure mode (76 stale pods, $26/day) that motivated the lifecycle redesign.
payload-reuse-via-network-volume.md — when you run the same workflow repeatedly, switch to storage.mode: network_volume so rsync is incremental and the venv / HF cache survive between pods. Trades $7/mo for hours of wall time.
recycle-pod-for-fast-iteration.md — set lifecycle.on_success: recycle so successful runs pause the pod and the next run resumes it directly. Skips image-pull + cold-boot per recurring run; ~$0.17/day per paused pod. Orthogonal to network-volume; the two compose.
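
As one illustration of the cost-reconciliation idea in the index above, the following sketch derives an implied hourly rate from each past manifest and flags runs that stray from a reference rate. It is a stdlib-only approximation, not the recipe's actual code. The field names `wall_time_sec` and `estimated_cost_usd` come from the index entry. The file name `manifest.json`, the JSON format, the helper names, and the 10% tolerance are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def implied_hourly_rates(artifacts_root: Path) -> Iterator[tuple[str, float]]:
    """Yield (run directory name, implied USD per hour) for each manifest found."""
    # Assumption: every run directory holds a JSON file named manifest.json.
    for manifest_path in sorted(artifacts_root.glob("*/manifest.json")):
        manifest = json.loads(manifest_path.read_text())
        wall_time_sec = manifest.get("wall_time_sec")
        cost_usd = manifest.get("estimated_cost_usd")
        if not wall_time_sec or cost_usd is None:
            continue  # skip manifests that lack either field or have zero wall time
        yield manifest_path.parent.name, cost_usd / (wall_time_sec / 3600.0)


def flag_drift(
    rates: Iterable[tuple[str, float]],
    reference_rate_usd: float,
    tolerance: float = 0.10,
) -> list[tuple[str, float]]:
    """Return the runs whose implied rate is off by more than `tolerance` (a fraction)."""
    limit = tolerance * reference_rate_usd
    return [(run, rate) for run, rate in rates if abs(rate - reference_rate_usd) > limit]


if __name__ == "__main__":
    # Build two synthetic run directories so the sketch runs without real data.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, wall, cost in [("run-a", 3600, 2.0), ("run-b", 1800, 2.0)]:
            (root / name).mkdir()
            payload = {"wall_time_sec": wall, "estimated_cost_usd": cost}
            (root / name / "manifest.json").write_text(json.dumps(payload))
        print(flag_drift(implied_hourly_rates(root), reference_rate_usd=2.0))
```

The reference rate would typically come from an independent source such as your invoice. Comparing the manifests only against the configured assumed_hourly_rate_usd may simply reproduce the assumption they were computed from.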

By use case

When you know what you’re trying to accomplish, this table maps the goal to the recipes worth reading. Recipes compose — most non-trivial workflows pull from 3–4 of these.

| Use case | Recipes |
| --- | --- |
| Hyperparameter sweep over seeds / backbones | `multi-config-sweep.md` + `predictions-only-eval.md` + `reproducibility.md` + `cost-reconciliation.md` |
| Paper-grade canonical eval | `predictions-only-eval.md` + `reproducibility.md` + `embed-deploy-metadata.md` |
| Save money on big sweeps | `local-preflight-then-run.md` + `predictions-only-eval.md` + `cost-reconciliation.md` |
| Portability across GPU classes | `flash-attention-fallback.md` + `reproducibility.md` |
| Post-mortem a failed sweep | `cost-reconciliation.md` + `local-postprocess-after-run.md` + `forensics-then-cleanup.md` + `troubleshooting.md` (Forensic recovery) |
| Keep storage costs low across many runs | `forensics-then-cleanup.md` + `stale-pod-audit.md` + `payload-reuse-via-network-volume.md` |
| Fast-iterate on one workflow (skip cold-start every run) | `recycle-pod-for-fast-iteration.md` + `payload-reuse-via-network-volume.md` |
| Stitch deploy provenance into your own evals manifest | `embed-deploy-metadata.md` + `local-postprocess-after-run.md` |
| First-time consumer setup | `local-preflight-then-run.md` + the parent `quickstart.md` |
| Multi-manifest forensics in Python | `python-api-for-forensics.md` + `../python-api-vs-cli.md` |
| Diagnose a stock-out before burning retry budget | `stock-out-diagnostic.md` + `multi-config-sweep.md` |