Reproducibility — three reproduction tiers
Deep-dive reference for the methodology in WRITEUP_PAPER.md (academic) and WRITEUP_NARRATIVE.md (narrative). Pick a guide for the cover narrative; this spoke goes deeper.
How to read this spoke: For a fast skim, focus on the bolded Result subsections + the final §Summary if present. For a full audit, read the methodology paragraphs + the ADR references in headers.
- T0 — eval-from-hub (laptop, $0, ~10-30 min):
make eval-from-hub RUNG=frozen-probe(orlora) downloads the published HF Hub checkpoint and score-matches againstevals/results.jsonwithin 1e-4 absolute tolerance per ADR-034. Fully wired at v1.0.9 per ADR-058. - T1 — smoke (laptop, $0, <10 min):
make smokeverifies code health via fixture-data E2E pass; does NOT verify math correctness or headline numbers. - T3 — headline-cloud (cloud GPU, ~$125+, hours):
make headline-cloudretrains every rung from scratch on RunPod; full retraining from scratch reproduces headline numbers within seed variance. Hard cost-cap at $125 per job per ADR-020. - T2 (test-integration) is a developer-tool tier (requires local GPU); not promoted to reviewer-facing because T0+T3 cover the verification dimensions.
- ACM artifact-badging: T0 + T1 supply Available + Functional + Reusable; T3 supplies Reproducible (deepest).
This submission documents three tiers of reproduction, ordered by cost. Pick the tier matching your audit depth + budget.
This spoke is referenced by index.qmd’s “Deep-dive path — reproduce the numbers” section.
Tier coverage matrix
| Tier | Command | Cost | Time | Verifies | Does NOT verify |
|---|---|---|---|---|---|
| T0 — eval-from-hub | make eval-from-hub RUNG=<name> |
$0 (HF Hub bandwidth only) | ~10-30 min | Headline scores reproduce on the published checkpoint per ADR-032 | Data pipeline (no re-training); ablation rungs not in the published set |
| T1 — smoke | make smoke |
$0 | <10 min | Code health; fixture pipeline runs end-to-end per ADR-027 | Math correctness; real-data scoring; headline numbers |
| T3 — headline-cloud | make headline-cloud |
~$125+ per ADR-020 cost cap | hours | Full retraining from scratch reproduces headline numbers within seed variance | (verifies everything in scope) |
T0 — eval-from-hub (laptop, ~$0, ~10-30 min)
Status (v1.0.1): published rungs frozen-probe + lora per ADR-032 + Q10 lock (canonical fold0/seed42 checkpoint per rung); HF Hub repos go live at v1.0.1 close. Full-FT skipped per ADR-050 (FUSE EIO crash; weights missing locally).
Result (Maintainer note v1.0.9): the scripts/eval_from_hub.py non-dry-run body is fully wired per ADR-058 (narrow supersession of ADR-051 Block A). make eval-from-hub RUNG=frozen-probe + RUNG=lora exit 0 with score-match summary within 1e-4 absolute tolerance per ADR-034 §Tier T0. Per-row strict mode: exit 1 on any row exceeding tolerance (no silent failures); per-row delta diagnostics printed to stderr on fail. ADR-051 Block B (38 invariant scaffolds) remains carryforward to v1.1.x.
make eval-from-hub RUNG=frozen-probe
make eval-from-hub RUNG=loraWhat this does (per ADR-034):
- Calls
huggingface_hub.snapshot_downloadforBBehring/prompt-injection-<rung>. - Loads the model via
AutoModelForSequenceClassification.from_pretrained(CPU; bf16 if CUDA available; otherwise fp32). - Runs scoring against the local val slate (
evals/predictions/<rung>__fold0__seed42__*.parquet). - Emits per-row predictions to
evals/predictions/t0_eval_from_hub.parquet. - Score-matches against
evals/results.jsonheadline metrics within 1e-4 absolute tolerance (ADR-034 contract). - Exits non-zero if any headline metric drifts beyond tolerance.
Full-FT note: make eval-from-hub RUNG=full-ft is intentionally unsupported — the canonical full-FT checkpoint was lost in a Phase 5 X11 FUSE EIO crash per ADR-050. The 3-rung LODO ladder narrative survives via the 24 surviving Phase 2 LODO predictions in evals/predictions/full-ft__*.parquet; OOD inference for full-FT is methodologically impossible without a fresh 6-12 hour A100 80GB re-fire and is out of scope for v1.0.0.
Publishing: maintainers refresh the published checkpoints via make publish-hub (requires HF_TOKEN with write scope). See scripts/publish_to_hub.py for the upload-folder pattern.
T1 — smoke (laptop, ~$0, ~10 min)
Status: working at Phase 0-07 close (per ADR-027 — Makefile target exists).
make smokeWhat this does: - Runs pytest -m smoke + a fixture-data E2E pass through scripts/run_metrics_battery.py with configs/profiles/fixtures.yaml. - Verifies code health — every import resolves, every entry-point runs, no schema mismatches. - Does NOT verify math correctness (uses fixture data, not real data). - Wall-clock < 10 minutes; no GPU; no network.
T3 — headline-cloud (cloud-GPU, ~$125+, ~hours)
Status (v1.0.1): complete — verbatim commands below; RunPod account + secrets bootstrap documented in docs/REPRODUCIBILITY.md. Cost-capped per ADR-020.
# Pre-flight (free; no provisioning):
make headline-dry-run
# Canonical eval (billed; cost-capped per ADR-020):
make headline-cloudWhat this does: - Wraps runpod-deploy validate --all + runpod-deploy run --dry-run + runpod-deploy run --config configs/runpod/headline.yaml. - Provisions H100 (or fallback per ADR-020 gpu_order ladder). - Trains every rung from scratch + runs eval + emits per-row predictions + computes headline metrics. - Hard cost-cap at $125 per job per ADR-020 budget.cost_cap_usd + project-wide $200 hard cap via scripts/cost_rollup.py. - Wall-clock: hours.
Why T2 (test-integration) is not a reviewer-facing tier
The Makefile per ADR-027 exposes a make test-integration target that uses pytest.importorskip("torch") + pytest.mark.skipif(not torch.cuda.is_available()) for GPU-conditional skipping. T2 stays a developer-tool tier — not promoted to reviewer-facing — because:
- Requires a local GPU the reviewer may not have.
- Adds friction without enabling a new verification dimension — T0 covers eval; T3 covers full retraining; T2 only adds “subset eval on local GPU” which is strictly weaker than T3.
If a reviewer with a local GPU wants a mid-tier check, they may invoke make test-integration directly; the tier is documented for completeness but not part of the reviewer-facing ladder.
ACM artifact-badging mapping
The tier ladder maps approximately onto ACM Artifact Review and Badging (v1.1) conventions:
- T0 + T1 supply Artifacts Available + Functional + Reusable badge levels.
- T3 supplies the Reproducible badge level (deepest).
Result: the tier ladder is recognizable to reviewers familiar with ACM artifact policies.
Limitations + extension conditions
- T0 reproduces headline scores only on the published rungs (per ADR-032 Option C). Ablation-rung reproduction requires T3.
- T0 verifies score-match against the published checkpoint, not data-pipeline correctness — a reviewer who wants to verify the data preparation pipeline produces the same training tensors needs T3.
- T3 is the only paid tier; reviewer self-funds the RunPod cost.
- HF Hub checkpoint SHA pinning: if model card edits invalidate cached SHAs at Phase 5, pin SHAs in
make eval-from-hubviahuggingface_hub.snapshot_download(revision=<SHA>). The HF dataset SHA-pinning discipline from ADR-016 carries over to model SHA-pinning.
Cross-references
- Methodology guarantees (eval-toolkit + runpod-deploy + SDD discipline that backs T0/T1/T3) →
methodology-guarantees.md - Headline results consumed by T0 score-match (interpretation): WRITEUP_PAPER §4 (academic) or WRITEUP_NARRATIVE Act 3 (narrative)
- Headline tables (data): RESULTS §1
- Evaluation primitives + statistical apparatus →
eval-design.md - Model-rung publication scope (canonical fold0/seed42 per ADR-032) →
model-rungs.md - Aggregator docs (Makefile targets + manifest schema) →
../docs/REPRODUCIBILITY.md
Linked ADRs: ADR-027 (Makefile execution-context stratification), ADR-020 (cost cap), ADR-032 (HF Hub publication), ADR-034 (this spoke’s source-of-truth ADR).