Recipe: predictions-only-eval#
Pattern: the GPU pod emits ONLY the per-row predictions
(predictions_full.parquet + any trained adapters); all metrics,
bootstrap CIs, paired tests, and calibration fits run locally on CPU
after runpod-deploy run pulls the parquet back.
Why this is a recipe, not a schema feature#
The split between “what runs on GPU” and “what runs on CPU” is
consumer-domain: which metrics you care about, what bootstrap N is
appropriate for your sample size, whether your calibration fit is
GPU-tractable, and whether you need paired tests between scorer
variants all depend on your evaluation protocol. Baking a metrics
runner into runpod-deploy would force one philosophy on every
consumer.
What runpod-deploy owns is the pull: artifacts[*] with
required: true ensures the predictions parquet lands locally even on
partial failures, so post-run CPU analysis has a deterministic input.
The CPU-side workflow itself is yours.
Bootstrap N=10K–100K across a multi-slice × multi-scorer matrix costs roughly minutes of billed GPU time per shard when it runs on the pod, but roughly seconds on a beefy local CPU, even at higher N. Keeping the GPU pod’s job tight (predict + checkpoint only) shrinks the billed window and decouples the cost of running the model from the cost of evaluating it. As a bonus, every metric becomes a deterministic re-run from the parquet, with no further GPU spend.
Validated end-to-end in prompt-injection-v5’s canonical sweep
(configs/runpod/v5_canonical_combined.yaml), which is the working
reference.
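As an illustration of the CPU-side work described above, here is a minimal sketch of a seeded percentile bootstrap over per-row 0/1 correctness values. It is not the piv5 implementation: `bootstrap_ci` and its parameters are hypothetical names, and the correctness values are assumed to have been read from predictions_full.parquet already (the parquet schema is not specified in this recipe).

```python
import random


def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-row 0/1 outcomes.

    Hypothetical helper: `correct` is a list of 0/1 values, one per row.
    """
    rng = random.Random(seed)  # fixed seed, so re-runs give identical intervals
    n = len(correct)
    # Resample rows with replacement and record the mean of each resample.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi


if __name__ == "__main__":
    rows = [1] * 80 + [0] * 20  # toy data, not real predictions
    print(bootstrap_ci(rows, n_resamples=2_000))
```

Because the generator is seeded, repeated calls on the same inputs return the same interval, which is the property that makes re-evaluation from the parquet a deterministic re-run.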
Pattern#
```yaml
# In the YAML config the pod runs:
run:
  body: |
    cd {remote_repo}
    set -euo pipefail
    uv run python -m piv5.cli.predict \
      --config configs/canonical_{backbone}.yaml \
      --seed {seed} \
      --out evals/v5_canonical_{family}_{backbone}/seed{seed}/predictions_full.parquet
artifacts:
  - label: predictions
    remote_path: "{remote_repo}/evals/v5_canonical_{family}_{backbone}/seed{seed}/predictions_full.parquet"
    local_path: "{project_root}/evals/v5_canonical_{family}_{backbone}/seed{seed}/"
    required: true
```
Then post-run locally (driver-side):
```shell
uv run python -m piv5.cli.merge \
  --root evals/v5_canonical_${family}_${backbone} \
  --bootstrap-resamples 10000 \
  --seed 42
```
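Before running the merge step, it can help to confirm that every shard's parquet actually landed. The sketch below expands the path layout from the artifacts entry above over a family × backbone × seed matrix and reports any missing files. `missing_predictions` is a hypothetical helper, not part of runpod-deploy or piv5, and the matrix values are placeholders you would supply yourself.

```python
from itertools import product
from pathlib import Path

# Mirrors the local_path layout shown above, plus the parquet file name.
TEMPLATE = "evals/v5_canonical_{family}_{backbone}/seed{seed}/predictions_full.parquet"


def missing_predictions(project_root, families, backbones, seeds):
    """Return the expected parquet paths that are absent under project_root."""
    root = Path(project_root)
    expected = [
        root / TEMPLATE.format(family=f, backbone=b, seed=s)
        for f, b, s in product(families, backbones, seeds)
    ]
    return [p for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Placeholder matrix values; substitute the ones from your sweep.
    for path in missing_predictions(".", ["family_a"], ["backbone_a"], [0, 1, 2]):
        print(f"missing: {path}")
```

An empty result means the merge step has a complete input set; a non-empty one points at the exact shard to re-run.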
Enforcing the contract#
Consumers who want to prevent CPU-work-on-pod regressions can ship a pod-contract lint test that greps their configs for forbidden invocations:
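```python
# tests/unit/test_pod_contract.py
from pathlib import Path
import yaml

# Assumption: pod configs live under configs/runpod/; adjust to your repo layout.
CONFIGS = sorted(Path("configs/runpod").glob("*.yaml"))

def test_pod_does_not_run_bootstrap():
    for config in CONFIGS:
        body = yaml.safe_load(config.read_text())["run"]["body"]
        assert "bootstrap" not in body.lower(), (
            f"{config}: bootstrap belongs on the local CPU, not the billed pod"
        )
```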
Cheap to maintain; catches accidental regressions during config-template refactors.
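If more than one invocation should be kept off the pod, the same check generalizes to a small deny-list. The following is a sketch under assumptions: `FORBIDDEN` and `forbidden_hits` are hypothetical names, and the patterns are examples drawn from this recipe (the bootstrap flag, the local merge module, calibration work) rather than a list any tool prescribes.

```python
import re

# Hypothetical deny-list; extend it with whatever your project treats as CPU-only work.
FORBIDDEN = (r"bootstrap", r"piv5\.cli\.merge", r"calibrat")


def forbidden_hits(body, patterns=FORBIDDEN):
    """Return the deny-list patterns that match anywhere in a pod run body."""
    return [p for p in patterns if re.search(p, body, flags=re.IGNORECASE)]


if __name__ == "__main__":
    body = "uv run python -m piv5.cli.merge --bootstrap-resamples 10000"
    print(forbidden_hits(body))
```

Inside the lint test, `assert not forbidden_hits(body)` then replaces the single substring check, and the failure message can list which patterns matched.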
What lives where#
| Concern | Owner |
|---|---|
| Running the model and emitting per-row predictions | Pod (GPU), running your prediction command |
| Pulling the predictions parquet back to the local machine | runpod-deploy (artifacts[*] with required: true) |
| Bootstrap CIs, paired tests, calibration fits | Your local post-processing (CPU; consumer-domain) |
| Enforcing the “no bootstrap on the pod” contract | Your project’s lint / CI (see “Enforcing the contract” above) |
| Deciding which slices / scorers / metrics to compute | Your evaluation protocol |
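The paired tests in the table need nothing beyond the pulled per-row predictions. As one possible example, the sketch below runs a two-sided sign-flip permutation test on the per-row difference between two scorer variants evaluated on the same rows. This recipe does not say which paired test any project uses, so treat the technique and the name `sign_flip_pvalue` as assumptions made for illustration.

```python
import random


def sign_flip_pvalue(correct_a, correct_b, n_permutations=10_000, seed=42):
    """Two-sided sign-flip permutation p-value for paired per-row outcomes.

    Hypothetical helper: both inputs are 0/1 lists aligned on the same rows.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must cover the same rows")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # seeded, so the p-value is reproducible
    hits = 0
    for _ in range(n_permutations):
        # Under the null, each row's difference is equally likely to flip sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Since both scorers are evaluated from the same parquet rows, swapping in a different paired test later is a local change with no pod involvement.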
Anti-pattern to avoid#
Do not compute bootstrap CIs, paired tests, or calibration fits on the pod. They are CPU-cheap; making the GPU pod do them lengthens your billed window for no upside and, worse, couples re-evaluation cost to re-provisioning cost. If a template refactor reintroduces this, the lint test under “Enforcing the contract” catches it.
Do not skip the required: true flag on the predictions artifact.
If the pull fails silently (network blip, rsync race), your post-run
metrics target will hit a missing file and you won’t know whether the
run succeeded. Marking the predictions parquet required: true turns
a missed pull into a loud failure that re-triggers the run; the few
seconds of extra strictness cost far less than a regression that goes
unnoticed.
See also#
- multi-config-sweep.md: the canonical invocation pattern (per-shard pod runs prediction-only; aggregation happens locally after all shards complete).
- local-postprocess-after-run.md: walks the pulled predictions_full.parquet for metrics, bootstrap CIs, paired tests.
- reproducibility.md: pair with pod.python_version to lock the interpreter version across the sweep; otherwise per-row predictions could shift between shards on a 3.13 → 3.14 minor-version bump.
- troubleshooting.md “Predictions discipline”: the failure mode this pattern prevents (recovering per-row scores after pod teardown costs real money).