Recipes
Composition patterns for runpod-deploy. None of these are schema
features; they are the conventions consumers use to wire `runpod-deploy run` into a larger pipeline.
Why recipes instead of schema features
runpod-deploy is a deployment-primitives library. Its single
responsibility is the pod lifecycle: GPU/DC selection (with failover),
staging, remote execution, telemetry capture, artifact pull, and the
manifest that records what happened.
It does not orchestrate consumer-domain steps (audit, plotting,
aggregation). Those stay in your project’s Makefile / shell scripts /
Python entry points, which call `runpod-deploy run` from inside their own
flow. This keeps runpod-deploy decoupled from any one consumer’s
domain logic.
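As a rough illustration of that division of labour, here is a minimal sketch of a Python entry point that runs a local preflight, then `runpod-deploy run`, then a local postprocess, stopping at the first failure. The script paths (`scripts/audit.py`, `scripts/plot.py`) and the helper names are hypothetical, and any arguments that `runpod-deploy run` needs in your setup are left for you to append.

```python
import subprocess
import sys
from typing import Callable, Sequence

Command = Sequence[str]

# Hypothetical pipeline: the two script paths are placeholders for your own
# consumer-domain steps. Only the "runpod-deploy run" step comes from the tool.
PIPELINE = [
    [sys.executable, "scripts/audit.py"],  # local preflight (hypothetical path)
    ["runpod-deploy", "run"],              # append the arguments your setup needs
    [sys.executable, "scripts/plot.py"],   # local postprocess (hypothetical path)
]


def run_steps(steps: Sequence[Command], runner: Callable[[Command], int]) -> int:
    """Run commands in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        code = runner(cmd)
        if code != 0:
            # A failed audit never reaches the pod; a failed deploy never plots.
            return code
    return 0


def shell_runner(cmd: Command) -> int:
    """Execute one command as a child process and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def main() -> int:
    return run_steps(PIPELINE, shell_runner)
```

Calling `sys.exit(main())` from your own entry point gives the same fail-fast behaviour as a Makefile target with ordered prerequisites. The `runner` parameter exists only so the ordering logic can be exercised without launching anything.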
Index
- `local-preflight-then-run.md`: Makefile pattern that runs a local audit (leakage check, fixture re-render, dataset validation), then invokes `runpod-deploy run`.
- `local-postprocess-after-run.md`: pull artifacts via `runpod-deploy run`, then run plotting / aggregation locally over the pulled `artifacts/runpod/<ts>/` directory.
- `embed-deploy-metadata.md`: pipe `runpod-deploy capture-env` into your own evals manifest. Replaces hand-rolled `git rev-parse HEAD` injection in Makefile targets.
- `multi-config-sweep.md`: bash for-loop over a set of configs that share a `local.project_root`.
- `cost-reconciliation.md`: read `wall_time_sec` and `estimated_cost_usd` from past manifests to validate `assumed_hourly_rate_usd` settings and detect drift.
- `predictions-only-eval.md`: GPU pod emits only `predictions_full.parquet` + adapters; metrics / bootstrap CIs / paired tests run locally on CPU. Decouples the cost of running the model from the cost of evaluating it.
- `flash-attention-fallback.md`: transformer scorers degrade gracefully when the GPU class doesn’t support `flash_attention_2` (portability across the GPU-failover pool).
- `reproducibility.md`: `pod.python_version` YAML field + `uv python install` / `pin` auto-injection to lock the CPython interpreter version across sweep runs.
- `python-api-for-forensics.md`: multi-manifest forensics via `runpod_deploy.walk_run_dirs` + `load_manifest` + `load_events`; the strongest Python-API use case.
- `stock-out-diagnostic.md`: pre-launch probe + four-action decision menu (wait / widen / switch / defer) for when the configured `gpu_order` × `datacenters` matrix is empty in your target `cloud_type`.
- `forensics-then-cleanup.md`: workflow for handling a failed run with `lifecycle.on_failure: stop`. Inspect the pulled manifest, SSH if needed, then release the pod with `runpod-deploy cleanup`. Pairs the per-run WARNING with the `cleanup` / `ls-stale` CLI.
- `stale-pod-audit.md`: wire `runpod-deploy ls-stale` into a weekly cron / GitHub Action / Slack ping so storage drift is detected early. Prevents the silent-leak failure mode (76 stale pods, $26/day) that motivated the lifecycle redesign.
- `payload-reuse-via-network-volume.md`: when you run the same workflow repeatedly, switch to `storage.mode: network_volume` so rsync is incremental and the venv / HF cache survive between pods. Trades $7/mo for hours of wall time.
- `recycle-pod-for-fast-iteration.md`: set `lifecycle.on_success: recycle` so successful runs pause the pod and the next run resumes it directly. Skips image-pull + cold-boot per recurring run; ~$0.17/day per paused pod. Orthogonal to network-volume; the two compose.
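To make the cost-reconciliation idea concrete, the sketch below derives an implied hourly rate from `wall_time_sec` and `estimated_cost_usd` and flags runs that stray from a reference rate. It assumes each manifest has already been parsed into a dict with those two fields at the top level. The real manifest layout is not specified here, and the function names, the reference rate, and the 10% tolerance are illustrative choices.

```python
from typing import Iterable, List, Mapping, Tuple


def implied_hourly_rate(manifest: Mapping[str, float]) -> float:
    """Hourly rate implied by one run: estimated cost divided by hours of wall time."""
    hours = manifest["wall_time_sec"] / 3600.0
    if hours <= 0:
        raise ValueError("wall_time_sec must be positive")
    return manifest["estimated_cost_usd"] / hours


def find_rate_drift(
    manifests: Iterable[Mapping[str, float]],
    reference_rate_usd: float,
    tolerance: float = 0.10,  # relative tolerance; 0.10 means 10% (assumed default)
) -> List[Tuple[int, float]]:
    """Return (index, implied_rate) for runs whose rate is off by more than tolerance."""
    drifted = []
    for index, manifest in enumerate(manifests):
        rate = implied_hourly_rate(manifest)
        if abs(rate - reference_rate_usd) > tolerance * reference_rate_usd:
            drifted.append((index, rate))
    return drifted


# Made-up numbers for illustration: the second run implies $3.00/h against a $2.00/h reference.
runs = [
    {"wall_time_sec": 3600, "estimated_cost_usd": 2.0},
    {"wall_time_sec": 1800, "estimated_cost_usd": 1.5},
]
print(find_rate_drift(runs, reference_rate_usd=2.0))
```

A natural reference rate is the one you currently pay for that GPU class; a run that lands outside the tolerance suggests the `assumed_hourly_rate_usd` setting is worth revisiting.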
By use case
When you know what you’re trying to accomplish, this table maps the goal to the recipes worth reading. Recipes compose — most non-trivial workflows pull from 3–4 of these.
| Use case | Recipes |
|---|---|
| Hyperparameter sweep over seeds / backbones | |
| Paper-grade canonical eval | |
| Save money on big sweeps | |
| Portability across GPU classes | |
| Post-mortem a failed sweep | |
| Keep storage costs low across many runs | |
| Fast-iterate on one workflow (skip cold-start every run) | |
| Stitch deploy provenance into your own evals manifest | |
| First-time consumer setup | |
| Multi-manifest forensics in Python | |
| Diagnose a stock-out before burning retry budget | |