# Recipe: predictions-only-eval **Pattern:** the GPU pod emits ONLY the per-row predictions (`predictions_full.parquet` + any trained adapters); all metrics, bootstrap CIs, paired tests, and calibration fits run **locally on CPU** after `runpod-deploy run` pulls the parquet back. ## Why this is a recipe, not a schema feature The split between "what runs on GPU" and "what runs on CPU" is *consumer-domain*: which metrics you care about, what bootstrap N is appropriate for your sample size, whether your calibration fit is GPU-tractable, and whether you need paired tests between scorer variants all depend on your evaluation protocol. Baking a metrics runner into `runpod-deploy` would force one philosophy on every consumer. What `runpod-deploy` owns is the *pull*: `artifacts[*]` with `required: true` ensures the predictions parquet lands locally even on partial failures, so post-run CPU analysis has a deterministic input. The CPU-side workflow itself is yours. Bootstrap N=10K–100K across a multi-slice × multi-scorer matrix is ~minutes of billed GPU time per shard but ~seconds on a beefy local CPU at higher N. Keeping the GPU pod's job tight (predict + checkpoint only) shrinks the billed window and decouples the cost of *running the model* from the cost of *evaluating it*. As a bonus, all metrics become deterministic re-runs from the parquet without re-spending on GPU. Validated end-to-end in `prompt-injection-v5`'s canonical sweep (`configs/runpod/v5_canonical_combined.yaml`), which is the working reference. ## Pattern ```yaml # In the YAML config the pod runs: run: body: | cd {remote_repo} set -euo pipefail uv run python -m piv5.cli.predict \ --config configs/canonical_{backbone}.yaml \ --seed {seed} \ --out evals/v5_canonical_{family}_{backbone}/seed{seed}/predictions_full.parquet artifacts: - label: predictions remote_path: "{remote_repo}/evals/v5_canonical_{family}_{backbone}/seed{seed}/predictions_full.parquet" local_path: "{project_root}/evals/v5_canonical_{family}_{backbone}/seed{seed}/" required: true ``` Then post-run locally (driver-side): ```sh uv run python -m piv5.cli.merge \ --root evals/v5_canonical_${family}_${backbone} \ --bootstrap-resamples 10000 \ --seed 42 ``` ## Enforcing the contract Consumers preventing CPU-on-pod regressions ship a pod-contract lint test that greps configs for forbidden invocations: ```python notest # tests/unit/test_pod_contract.py def test_pod_does_not_run_bootstrap(): for config in CONFIGS: body = yaml.safe_load(config.read_text())["run"]["body"] assert "bootstrap" not in body.lower(), ( f"{config}: bootstrap belongs on the local CPU, not the billed pod" ) ``` Cheap to maintain; catches accidental regressions during config-template refactors. ## What lives where | Concern | Owner | |---|---| | Running the model and emitting per-row predictions | Pod (GPU) — your `run.body` invokes your `predict` entry point | | Pulling `predictions_full.parquet` back to the local run-dir | `runpod-deploy run` (`artifacts[*]` with `required: true`) | | Bootstrap CIs, paired tests, calibration fits | Your local post-processing (CPU; consumer-domain) | | Enforcing the "no bootstrap on the pod" contract | Your project's lint / CI (see "Enforcing the contract" above) | | Deciding which slices / scorers / metrics to compute | Your evaluation protocol | ## Anti-pattern to avoid **Do not compute bootstrap CIs / paired tests / calibration fits on the pod.** They're CPU-cheap; making the GPU pod do them shrinks your billed window for no upside, and (worse) couples re-evaluation cost to re-provisioning cost. If you find your config doing this during a template refactor, the lint test above catches it. **Do not skip the `required: true` flag on the predictions artifact.** If the pull fails silently (network blip, rsync race), your post-run metrics target will hit a missing file and you won't know whether the run succeeded. Marking the predictions parquet `required: true` turns a missed-pull into a loud failure that re-triggers the run; the few seconds of extra strictness saves the cost of an unnoticed silent regression. ## See also - [`multi-config-sweep.md`](multi-config-sweep.md) — the canonical invocation pattern (per-shard pod runs prediction-only; aggregation happens locally after all shards complete). - [`local-postprocess-after-run.md`](local-postprocess-after-run.md) — walks the pulled `predictions_full.parquet` for metrics, bootstrap CIs, paired tests. - [`reproducibility.md`](reproducibility.md) — pair with `pod.python_version` to lock the interpreter version across the sweep; otherwise per-row predictions could shift between shards on a 3.13 → 3.14 minor-version bump. - [`troubleshooting.md`](../troubleshooting.md) "Predictions discipline" — the failure mode this pattern prevents (recovering per-row scores after pod teardown costs real money).