# Recipes

Composition patterns for `runpod-deploy`. None of these are schema features; they are the conventions consumers use to wire `runpod-deploy run` into a larger pipeline.

## Why recipes instead of schema features

`runpod-deploy` is a **deployment-primitives library**. Its single responsibility is the pod lifecycle: GPU/DC selection (with failover), staging, remote execution, telemetry capture, artifact pull, and the manifest that records what happened.

It does **not** orchestrate consumer-domain steps (audit, plotting, aggregation). Those stay in your project's Makefile / shell scripts / Python entry points and call `runpod-deploy run` from inside their own flow. This keeps `runpod-deploy` decoupled from any one consumer's domain logic.

## Index

- [`local-preflight-then-run.md`](local-preflight-then-run.md): Makefile pattern that runs a local audit (leakage check, fixture re-render, dataset validation) and then invokes `runpod-deploy run`.
- [`local-postprocess-after-run.md`](local-postprocess-after-run.md): pull artifacts via `runpod-deploy run`, then run plotting / aggregation locally over the pulled `artifacts/runpod//` directory.
- [`embed-deploy-metadata.md`](embed-deploy-metadata.md): pipe `runpod-deploy capture-env` into your own evals manifest. Replaces hand-rolled `git rev-parse HEAD` injection in Makefile targets.
- [`multi-config-sweep.md`](multi-config-sweep.md): bash for-loop over a set of configs that share a `local.project_root`.
- [`cost-reconciliation.md`](cost-reconciliation.md): read `wall_time_sec` and `estimated_cost_usd` from past manifests to validate `assumed_hourly_rate_usd` settings and detect drift.
- [`predictions-only-eval.md`](predictions-only-eval.md): GPU pod emits only `predictions_full.parquet` + adapters; metrics / bootstrap CIs / paired tests run locally on CPU. Decouples the cost of *running* the model from the cost of *evaluating* it.
- [`flash-attention-fallback.md`](flash-attention-fallback.md): transformer scorers degrade gracefully when the GPU class doesn't support `flash_attention_2` (portability across the GPU-failover pool).
- [`reproducibility.md`](reproducibility.md): `pod.python_version` YAML field + `uv python install/pin` auto-injection to lock the CPython interpreter version across sweep runs.
- [`python-api-for-forensics.md`](python-api-for-forensics.md): multi-manifest forensics via `runpod_deploy.walk_run_dirs` + `load_manifest` + `load_events`; the strongest Python-API use case.
- [`stock-out-diagnostic.md`](stock-out-diagnostic.md): pre-launch probe + four-action decision menu (wait / widen / switch / defer) for when the configured `gpu_order` × `datacenters` matrix is empty in your target `cloud_type`.
- [`forensics-then-cleanup.md`](forensics-then-cleanup.md): workflow for handling a failed run with `lifecycle.on_failure: stop`. Inspect the pulled manifest, SSH if needed, then release the pod with `runpod-deploy cleanup`. Pairs the per-run WARNING with the `cleanup` / `ls-stale` CLI.
- [`stale-pod-audit.md`](stale-pod-audit.md): wire `runpod-deploy ls-stale` into a weekly cron / GitHub Action / Slack ping so storage drift is detected early. Prevents the silent-leak failure mode (76 stale pods, $26/day) that motivated the lifecycle redesign.
- [`payload-reuse-via-network-volume.md`](payload-reuse-via-network-volume.md): when you run the same workflow repeatedly, switch to `storage.mode: network_volume` so rsync is incremental and the venv / HF cache survive between pods. Costs $7/mo and saves hours of wall time.
- [`recycle-pod-for-fast-iteration.md`](recycle-pod-for-fast-iteration.md): set `lifecycle.on_success: recycle` so successful runs pause the pod and the next run resumes it directly. Skips image-pull + cold-boot per recurring run; ~$0.17/day per paused pod. Orthogonal to network-volume; the two compose.
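
As a rough illustration of the cost-reconciliation idea listed above, the sketch below derives an implied hourly rate from the `wall_time_sec` and `estimated_cost_usd` fields of past manifests and flags runs that stray from a reference rate. It is not the recipe's own code. The file name `manifest.json`, the JSON encoding, the directory layout, and the `reference_rate_usd` / `tolerance` parameters are all assumptions made for this example; the recipe itself is the authority on the real manifest format.

```python
import json
from pathlib import Path


def implied_hourly_rates(root):
    """Yield (manifest_path, implied USD/hour) for each manifest under root.

    Assumption: every run directory holds a JSON file named manifest.json
    containing wall_time_sec and estimated_cost_usd (hypothetical layout).
    """
    for path in sorted(Path(root).rglob("manifest.json")):
        manifest = json.loads(path.read_text())
        wall = manifest.get("wall_time_sec")
        cost = manifest.get("estimated_cost_usd")
        if not wall or cost is None:
            continue  # skip runs with missing fields or zero wall time
        yield path, cost / (wall / 3600.0)


def drifted(root, reference_rate_usd, tolerance=0.10):
    """Return (path, rate) pairs whose implied rate deviates from the
    reference rate by more than the given relative tolerance."""
    flagged = []
    for path, rate in implied_hourly_rates(root):
        if abs(rate - reference_rate_usd) > tolerance * reference_rate_usd:
            flagged.append((path, rate))
    return flagged
```

Calling `drifted("artifacts/runpod", reference_rate_usd=...)` with a rate taken from your own billing data would list the runs worth a closer look; the 10% default tolerance is an arbitrary starting point.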
## By use case

When you know what you're trying to accomplish, this table maps the goal to the recipes worth reading. Recipes compose; most non-trivial workflows pull from 3–4 of these.

| Use case | Recipes |
|---|---|
| **Hyperparameter sweep over seeds / backbones** | [`multi-config-sweep.md`](multi-config-sweep.md) + [`predictions-only-eval.md`](predictions-only-eval.md) + [`reproducibility.md`](reproducibility.md) + [`cost-reconciliation.md`](cost-reconciliation.md) |
| **Paper-grade canonical eval** | [`predictions-only-eval.md`](predictions-only-eval.md) + [`reproducibility.md`](reproducibility.md) + [`embed-deploy-metadata.md`](embed-deploy-metadata.md) |
| **Save money on big sweeps** | [`local-preflight-then-run.md`](local-preflight-then-run.md) + [`predictions-only-eval.md`](predictions-only-eval.md) + [`cost-reconciliation.md`](cost-reconciliation.md) |
| **Portability across GPU classes** | [`flash-attention-fallback.md`](flash-attention-fallback.md) + [`reproducibility.md`](reproducibility.md) |
| **Post-mortem a failed sweep** | [`cost-reconciliation.md`](cost-reconciliation.md) + [`local-postprocess-after-run.md`](local-postprocess-after-run.md) + [`forensics-then-cleanup.md`](forensics-then-cleanup.md) + [`troubleshooting.md`](../troubleshooting.md) (Forensic recovery) |
| **Keep storage costs low across many runs** | [`forensics-then-cleanup.md`](forensics-then-cleanup.md) + [`stale-pod-audit.md`](stale-pod-audit.md) + [`payload-reuse-via-network-volume.md`](payload-reuse-via-network-volume.md) |
| **Fast-iterate on one workflow (skip cold-start every run)** | [`recycle-pod-for-fast-iteration.md`](recycle-pod-for-fast-iteration.md) + [`payload-reuse-via-network-volume.md`](payload-reuse-via-network-volume.md) |
| **Stitch deploy provenance into your own evals manifest** | [`embed-deploy-metadata.md`](embed-deploy-metadata.md) + [`local-postprocess-after-run.md`](local-postprocess-after-run.md) |
| **First-time consumer setup** | [`local-preflight-then-run.md`](local-preflight-then-run.md) + the parent [`quickstart.md`](../quickstart.md) |
| **Multi-manifest forensics in Python** | [`python-api-for-forensics.md`](python-api-for-forensics.md) + [`../python-api-vs-cli.md`](../python-api-vs-cli.md) |
| **Diagnose a stock-out before burning retry budget** | [`stock-out-diagnostic.md`](stock-out-diagnostic.md) + [`multi-config-sweep.md`](multi-config-sweep.md) |
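
Because the recipes compose on the consumer side, a small driver in your own project can chain them. The sketch below shows one possible shape: a local preflight, then `runpod-deploy run`, then local postprocessing, stopping at the first failing step. The script paths, the config path, and passing the config as a positional argument to `runpod-deploy run` are hypothetical placeholders for illustration; check the CLI help and the individual recipes for the real invocation.

```python
import subprocess


def run_pipeline(steps, runner=subprocess.run):
    """Run each (name, command) pair in order.

    Returns the name of the first step that exits non-zero, or None if
    every step succeeds. The runner is injectable so the flow can be
    exercised without launching real processes.
    """
    for name, cmd in steps:
        result = runner(cmd, check=False)
        if result.returncode != 0:
            return name
    return None


# Hypothetical commands: script names, config path, and the positional
# config argument are assumptions made for this sketch.
STEPS = [
    ("preflight", ["python", "scripts/audit.py"]),
    ("deploy", ["runpod-deploy", "run", "configs/example.yaml"]),
    ("postprocess", ["python", "scripts/plot.py"]),
]
```

Keeping the ordering in consumer code, rather than in the `runpod-deploy` config, matches the stated design: the library handles the pod lifecycle and the surrounding flow stays in your project.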