scripts/eval_from_hub.py non-dry-run body wired — narrow supersession of ADR-051 Block A (T0 score-match carryforward); Block B (38 invariant scaffolds) remains carryforward
ADR-058 — eval_from_hub non-dry-run body wired (narrow supersession of ADR-051 Block A)
Status
Accepted (2026-05-19; landed in v1.0.9 patch alongside eval-toolkit v0.40.0 → v0.42.0 bump + fit_isotonic_binary_local adapter removal per library-first invariant + eval-toolkit#44 closure).
Context
ADR-051 was written at v1.0.2 (2026-05-18) to formally document the two governance gaps that the v1.0.0 / v1.0.1 submission tags did not close:
- Block A —
scripts/eval_from_hub.pynon-dry-run body was a ~7-line stub that printed a carryforward message and exited 2. ADR-034 §Tier T0 §Score-match contract was unmet from the consumer side (publish half closed cleanly at v1.0.1). - Block B — 38 stubs in
tests/test_invariants.pycarried@pytest.mark.skip(reason="v1.0.0 carryforward stub"). ADR-039 §gate 3 (“all stubs unskipped + green”) was unmet.
ADR-051 narrowly superseded ADR-034 (T0 axis only; T1 + T3 unchanged) + ADR-039 (gate 3 axis only; gates 1+2+4+5+6 unchanged) and codified both as v1.1.x carryforwards with explicit landing conditions.
v1.0.9 closes Block A. Block B remains a v1.1.x carryforward.
Decision
Implement the full ADR-034 §Tier T0 score-match contract in scripts/eval_from_hub.py and narrowly supersede ADR-051 — Block A only. ADR-051 Block B (38 invariant scaffolds) remains carryforward to v1.1.x; this ADR does not affect it.
Block A — implementation shape
Per the /exploring-options 2026-05-19 6-question execution-lock:
- Q1 lock — score-match strictness: exit 1 on any row exceeding 1e-4 absolute tolerance (no silent failures discipline per project Python standards). Per-row delta diagnostics printed to stderr on fail (top-5 exceedances by |delta|).
- Q6 lock — smoke test fetch strategy: mocked-only for CI hermeticity. Real
huggingface_hub.snapshot_download+ CPU inference are NOT exercised in CI (would require network + ~500MB checkpoint download per slice; ~30s CPU inference per slice). The reviewer- facing path (make eval-from-hub RUNG=frozen-probe) exercises the full body live.
Architecture-dispatched model loading
scripts/eval_from_hub.py::_load_model_and_tokenizer dispatches by rung:
frozen-probe/full-ft→AutoModelForSequenceClassification.from_pretrained(snapshot_path)with co-located tokenizer.lora→ baseModernBERT-base(pinned revision8949b909ec900327062f0ebf497f51aef5e6f0c8per ADR-015) +PeftModel.from_pretrained(snapshot_path); tokenizer loaded from the pinned backbone (LoRA adapter folders do not ship a tokenizer).
CPU only; torch.float32; library-first reuse of src.training.train_modernbert._predict_proba for the inference loop (softmax_fp32 cast per ADR-019 numerical-stability discipline).
Score-match reference parquet
Per-row comparison against evals/predictions/<rung_underscore>__fold0__seed42__<slice>.parquet (committed reference; rung name uses underscore form per file-system convention: frozen_probe, lora, full_ft). The HF Hub repo_id uses kebab-case per ADR-032: BBehring/prompt-injection-<rung>. ADR-058’s _rung_to_underscore helper bridges the two naming conventions.
The per-row delta is computed as abs(new_proba - reference_proba) where:
new_proba= re-derivedpredicted_proba_class1from the freshly- downloaded HF Hub checkpoint.reference_proba= committedpredicted_proba_class1from the canonical fold0/seed42 training run.
A passing run requires every row’s |delta| ≤ 1e-4. ADR-034 §Tier T0 tolerance is per-row, not aggregated; the per-row formulation is strictly conservative (per-row pass implies aggregated AUPRC/AUROC pass; the converse is not true).
Eval slice + sample size
The --eval-slice CLI argument (default bipia) selects which OOD slice to score-match. The --n-rows argument (default 100) samples the first N rows of the reference parquet (deterministic; matches the committed run’s row ordering). If the reference parquet has fewer than N rows, all rows are used.
For reviewer-facing T0 verification, make eval-from-hub RUNG=<rung> runs with defaults; n=100 is sufficient to detect any non-trivial checkpoint divergence (CPU vs GPU floating-point differences at fp32 are typically < 1e-5; the 1e-4 tolerance has a 10× safety margin).
Library-first reuse
This implementation reuses existing project + library primitives rather than hand-rolling:
huggingface_hub.snapshot_download— upstream library, not hand-rolled fetching.transformers.AutoModelForSequenceClassification+AutoTokenizer— upstream library.peft.PeftModel.from_pretrained— upstream library; lora architecture-aware loading.src.training.load_modernbert.load_modernbert— internal helper for base ModernBERT loading at pinned revision.src.training.train_modernbert._predict_proba— internal inference loop with softmax_fp32 cast per ADR-019.
No new methodology primitives are introduced; all components are upstream library calls or existing internal helpers.
Consequences
- Reviewer experience:
make eval-from-hub RUNG=frozen-probe+RUNG=loranow exit 0 with score-match summary; the auto-generated HF Hub model cards (live since v1.0.1) become executable from the consumer side. The T0 reproducibility contract is sealed. - ADR-051 supersession scope: this ADR narrowly supersedes ADR-051 Block A only. ADR-051 Block B (38 invariant scaffolds) remains carryforward to v1.1.x. ADR-051’s frontmatter gains
superseded_by: ["058"]per ADR-029 in-place convention; ADR-051’s status remains Accepted (the body documents both blocks; only one is now closed). - No methodology changes: the implementation reuses existing primitives; no new ADR for inference loop / tokenization / etc. is needed. The only governance addition is this ADR.
- Smoke coverage: 9 tests added at
tests/smoke/test_eval_from_hub_smoke.pycovering (a)--dry-runsubprocess invocation for both published rungs, (b) score-match tolerance logic with synthetic numpy arrays (within-tol, exceeds-tol, length-mismatch), (c) published-rungs validator againstevals/results.json, (d) reference parquet loader, (e) kebab↔︎underscore rung-name translation. Per Q6 lock: no real HF Hub fetch; CI hermetic. - CHANGELOG entry:
[1.0.9]documents the closure narrative + the eval-toolkit v0.40.0 → v0.42.0 bump that lands in the same commit (consumes eval-toolkit#44 /fit_isotonic_binaryupstream per library-first invariant; removesfit_isotonic_binary_localadapter from v1.0.8).
Linked ADRs
- Superseded (narrow, Block A only): ADR-051.
- Referenced: ADR-034 (the T0 contract this ADR closes); ADR-032 (HF Hub publication; v1.0.1 deliverable that closed the publish half); ADR-019 (softmax_fp32 numerical stability discipline reused in inference); ADR-039 (gate 3 — Block B of ADR-051 unchanged by this ADR).
- Source: REPO_AUDIT_2026-05-18 §P0 “HF Hub / T0 reproducibility is unfinished” (audit finding that motivated ADR-051; this ADR closes the Block A half).
Transcript
Decisions surfaced during the 2026-05-19 v1.0.9 planning session (/exploring-options 6-question execution-lock; transcript at transcripts/2026-05-19__v1-0-9-adr-051-block-a-close.md).