Recipes

Composition patterns for runpod-deploy. None of these are schema features — they’re the conventions consumers use to wire runpod-deploy run into a larger pipeline.

Why recipes instead of schema features

runpod-deploy is a deployment-primitives library. Its single responsibility is the pod lifecycle: GPU/DC selection (with failover), staging, remote execution, telemetry capture, artifact pull, and the manifest that records what happened.

It does not orchestrate consumer-domain steps (audit, plotting, aggregation). Those stay in your project’s Makefile / shell scripts / Python entry points and call runpod-deploy run from inside their own flow. This keeps runpod-deploy decoupled from any one consumer’s domain logic.
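As a minimal sketch of that division of labor, the consumer-side flow below runs a local check first and only then shells out to runpod-deploy run. The function name preflight_then_run, the preflight callable, and the injectable runner are hypothetical names for illustration. Only the bare subcommand comes from this page, so any config arguments your setup needs are left for the caller to supply.

```python
import subprocess
from typing import Callable, Sequence

# Only the bare subcommand named on this page; config arguments are not shown
# here and are an assumption the caller must fill in.
DEPLOY_CMD = ("runpod-deploy", "run")


def preflight_then_run(
    preflight: Callable[[], bool],
    deploy_cmd: Sequence[str] = DEPLOY_CMD,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run a consumer-side audit, then invoke the deploy command if it passes."""
    if not preflight():
        # Stop before the deploy command is invoked, so a failed audit
        # never starts a pod.
        return 1
    # The deploy step is an opaque subprocess; domain logic stays out of it.
    result = runner(list(deploy_cmd), check=False)
    return result.returncode
```

The runner parameter exists only so the flow can be exercised without the CLI installed. In a Makefile-driven project the same gate would typically be a target dependency rather than a Python function.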

Index

  • local-preflight-then-run.md — Makefile pattern: run a local audit (leakage check, fixture re-render, dataset validation) then invoke runpod-deploy run.

  • local-postprocess-after-run.md — pull artifacts via runpod-deploy run, then run plotting / aggregation locally over the pulled artifacts/runpod/<ts>/ directory.

  • embed-deploy-metadata.md — pipe runpod-deploy capture-env into your own evals manifest. Replaces hand-rolled git rev-parse HEAD injection in Makefile targets.

  • multi-config-sweep.md — bash for-loop over a set of configs that share a local.project_root.

  • cost-reconciliation.md — read wall_time_sec and estimated_cost_usd from past manifests to validate assumed_hourly_rate_usd settings and detect drift.

  • predictions-only-eval.md — GPU pod emits only predictions_full.parquet + adapters; metrics / bootstrap CIs / paired tests run locally on CPU. Decouples the cost of running the model from the cost of evaluating it.

  • flash-attention-fallback.md — transformer scorers degrade gracefully when the GPU class doesn’t support flash_attention_2 (portability across the GPU-failover pool).

  • reproducibility.md — pod.python_version YAML field + uv python install/pin auto-injection to lock the CPython interpreter version across sweep runs.

  • python-api-for-forensics.md — multi-manifest forensics via runpod_deploy.walk_run_dirs + load_manifest + load_events; the strongest Python-API use case.

  • stock-out-diagnostic.md — pre-launch probe + four-action decision menu (wait / widen / switch / defer) for when the configured gpu_order × datacenters matrix is empty in your target cloud_type.

  • forensics-then-cleanup.md — workflow for handling a failed run when lifecycle.on_failure is set to stop: inspect the pulled manifest, SSH in if needed, then release the pod with runpod-deploy cleanup. Pairs the per-run WARNING with the cleanup / ls-stale CLI.

  • stale-pod-audit.md — wire runpod-deploy ls-stale into a weekly cron / GitHub Action / Slack ping so storage drift is detected early. Prevents the silent-leak failure mode (76 stale pods, $26/day) that motivated the lifecycle redesign.

  • payload-reuse-via-network-volume.md — when you run the same workflow repeatedly, switch to storage.mode: network_volume so rsync is incremental and the venv / HF cache survive between pods. Trades $7/mo for hours of wall time.

  • recycle-pod-for-fast-iteration.md — set lifecycle.on_success: recycle so successful runs pause the pod and the next run resumes it directly. Skips image-pull + cold-boot per recurring run; ~$0.17/day per paused pod. Orthogonal to network-volume; the two compose.
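To make the cost-reconciliation entry above concrete, here is a rough sketch that sums wall_time_sec and estimated_cost_usd across past manifests and compares the total with an amount you were actually billed. Several things here are assumptions rather than documented behavior: the manifest file name manifest.json, the idea that manifests are plain JSON with those two fields at the top level, and the billed_usd input taken from your own invoice. The library's own walk_run_dirs and load_manifest helpers are not used because their signatures are not shown on this page.

```python
import json
from pathlib import Path
from typing import Iterable


def load_manifests(root: Path, filename: str = "manifest.json") -> list[dict]:
    """Load every manifest found under root (the file name is an assumption)."""
    return [json.loads(p.read_text()) for p in sorted(root.rglob(filename))]


def reconcile(manifests: Iterable[dict], billed_usd: float) -> dict:
    """Compare summed manifest estimates against an actual billed amount."""
    hours = 0.0
    estimated = 0.0
    for m in manifests:
        hours += m["wall_time_sec"] / 3600.0
        estimated += m["estimated_cost_usd"]
    return {
        "hours": hours,
        "estimated_usd": estimated,
        # Hourly rate implied by the bill; compare it with the
        # assumed_hourly_rate_usd setting in your config.
        "implied_hourly_rate_usd": billed_usd / hours if hours else None,
        # Above 1.0 means the estimates undercount what was billed.
        "drift_ratio": billed_usd / estimated if estimated else None,
    }
```

A typical call would be reconcile(load_manifests(Path("artifacts/runpod")), billed_usd=...), with the billed figure copied from your provider's invoice for the same period. A drift ratio that stays far from 1.0 suggests the assumed hourly rate needs updating.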

By use case

When you know what you’re trying to accomplish, this table maps the goal to the recipes worth reading. Recipes compose — most non-trivial workflows pull from 3–4 of these.

| Use case | Recipes |
| --- | --- |
| Hyperparameter sweep over seeds / backbones | multi-config-sweep.md + predictions-only-eval.md + reproducibility.md + cost-reconciliation.md |
| Paper-grade canonical eval | predictions-only-eval.md + reproducibility.md + embed-deploy-metadata.md |
| Save money on big sweeps | local-preflight-then-run.md + predictions-only-eval.md + cost-reconciliation.md |
| Portability across GPU classes | flash-attention-fallback.md + reproducibility.md |
| Post-mortem a failed sweep | cost-reconciliation.md + local-postprocess-after-run.md + forensics-then-cleanup.md + troubleshooting.md (Forensic recovery) |
| Keep storage costs low across many runs | forensics-then-cleanup.md + stale-pod-audit.md + payload-reuse-via-network-volume.md |
| Fast-iterate on one workflow (skip cold-start every run) | recycle-pod-for-fast-iteration.md + payload-reuse-via-network-volume.md |
| Stitch deploy provenance into your own evals manifest | embed-deploy-metadata.md + local-postprocess-after-run.md |
| First-time consumer setup | local-preflight-then-run.md + the parent quickstart.md |
| Multi-manifest forensics in Python | python-api-for-forensics.md + ../python-api-vs-cli.md |
| Diagnose a stock-out before burning retry budget | stock-out-diagnostic.md + multi-config-sweep.md |