v1.0.x carryforward of ADR-034 T0 score-match wiring + ADR-039 gate-3 invariant-scaffold unskip; narrow supersession with explicit v1.1.x landing condition

Published

May 18, 2026

Superseded on one or more axes by ADR-058. The body below retains its original prose per the ADR-073 immutability rule; the corrected position lives in the superseding ADR. See the Decisions index to navigate.

ADR-051: v1.0.x carryforward of ADR-034 T0 + ADR-039 gate 3; narrow supersession

Status

Accepted (2026-05-18). Block A closed at v1.0.9 via ADR-058. Block B (38 invariant-test stubs unskip per ADR-039 gate 3) remains carryforward to v1.1.x.

Context

REPO_AUDIT_2026-05-18 (committed at 9ed7dd9 on the morning of 2026-05-18) declared the repo “not submission-ready” against ADR-039 / Phase 5 gates. The audit’s P0 + P1 sections flagged 8 blocking surfaces. The v1.0.0 + v1.0.1 tags closed 6 of those surfaces by direct implementation:

  • WRITEUP/EVIDENCE/HYPERPARAMETER_DISCLOSURE/SPEC_SHEET/NEXT_STEPS/THREAT_MODEL/REPRODUCIBILITY placeholder sweep.
  • Quarto 7-spoke split + render allowlist + secrets-free render.
  • CI Python 3.13 bump.
  • Single-class slice filter at source (per ADR-005 + WRITEUP §Methodology caveats).
  • HF Hub publish (canonical fold0/seed42 per rung; ADR-032 model card discipline; live at v1.0.1).
  • ADR-049/050 frontmatter closure metadata + SPEC_SHEET rung language alignment.
  • Push + green CI + green Publish + v1.0.0 tag + reviewer URLs at HTTP 200 + v0.9.0-rc1/rc2/rc3 rehearsal-tag trail.

The two remaining gaps are governance-grade: the audit explicitly said “implement OR supersede via ADR”. v1.0.0 + v1.0.1 documented carryforward intent in WRITEUP/reproducibility.md + tests/test_invariants.py module docstrings but did not write the supersession ADR. This ADR closes that loop.

Decision

Narrow supersession of ADR-034 (T0 tier) and ADR-039 (gate 3) with explicit v1.1.x landing conditions.

Block A — ADR-034 T0 score-match wiring → v1.1.x

The HF Hub publish half of T0 lands at v1.0.1 (per ADR-032 deliverable). The script body of scripts/eval_from_hub.py remains a scaffold that exits 2 with a clear message pointing at this ADR + the WRITEUP/reproducibility.md T0 maintainer note.
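The scaffold behaviour described above can be pictured with a minimal sketch. Only the exit code 2 and the two pointers (this ADR and the T0 maintainer note) come from the text; the message wording and the function name main are assumptions made for illustration, not the actual body of scripts/eval_from_hub.py.

```python
import sys

# Hypothetical message text; the real script's wording is not reproduced here.
SCAFFOLD_MESSAGE = (
    "eval_from_hub: T0 score-match wiring is not implemented at v1.0.x. "
    "See ADR-051 and the T0 maintainer note in WRITEUP/reproducibility.md."
)


def main() -> int:
    """Print the carryforward pointer to stderr and return exit code 2."""
    print(SCAFFOLD_MESSAGE, file=sys.stderr)
    return 2

# In a script entry point this would typically be wired as: sys.exit(main())
```

Returning the code from main rather than calling sys.exit inside it keeps the scaffold importable from a smoke test.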

v1.1.x landing condition: make eval-from-hub exits 0 for both RUNG=frozen-probe and RUNG=lora, each with a score-match summary within 1e-4 absolute tolerance per ADR-034 §Tier T0 §Score-match contract. The wiring is ~100 LOC across scripts/eval_from_hub.py (load + inference + score-match) and tests under tests/smoke/.
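A minimal sketch of what a 1e-4 absolute-tolerance score-match check could look like follows. The exact contract lives in ADR-034 §Tier T0 and is not reproduced here, so the function names and the assumption that scores arrive as two equal-length lists of floats are illustrative only.

```python
from typing import Sequence

TOLERANCE = 1e-4  # absolute tolerance named in the landing condition


def max_abs_diff(published: Sequence[float], recomputed: Sequence[float]) -> float:
    """Largest absolute difference between paired scores."""
    if len(published) != len(recomputed):
        raise ValueError("score lists differ in length")
    return max((abs(p - r) for p, r in zip(published, recomputed)), default=0.0)


def scores_match(
    published: Sequence[float],
    recomputed: Sequence[float],
    tol: float = TOLERANCE,
) -> bool:
    """True when every recomputed score is within tol of its published value."""
    return max_abs_diff(published, recomputed) <= tol
```

An absolute (rather than relative) comparison is used because that is what the landing condition names; a script built on this would exit 0 only when scores_match returns True for the requested rung.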

T1 + T3 tiers of ADR-034 unchanged. make test-smoke (T1; laptop, no GPU, no network, ~1 min) and make headline-cloud (T3; A100 80GB; ~$28; full LODO matrix re-train + re-eval) ship unchanged at v1.0.x.

Block B — ADR-039 gate 3 invariant scaffolds → v1.1.x

tests/test_invariants.py ships at v1.0.1 with:

  • 10 implemented invariants (all green; sourced from Phase 1-4 artifacts: data balance, source disjointness, dedup calibration, leakage report cleanness, contamination scan cleanness, reference-scorer schema uniformity, calibration battery output shape, etc.).
  • 38 scaffold stubs marked @pytest.mark.skip(reason="v1.0.0 carryforward stub — see module docstring; deferred to v1.1.x").

The module docstring catalogues the 38 stubs into 3 buckets:

  1. Spec-invariant scaffolds (~20 stubs). The invariant is true by code construction (e.g., test_hyperparameter_immutability would assert the config hash matches the committed value; src/utils/config_hash.py already enforces this at runtime). The executable assertion is the better-discipline form but the underlying invariant is already enforced.
  2. Reporting invariants (~5 stubs). e.g., test_reporting_completeness_assumptions_in_caveats would assert that every assumption of severity ≥ medium appears in WRITEUP §Methodology caveats. A manual review at v1.0.0 confirmed this; the pre-commit hook for no_emoji_check.py handles the related “no emoji” rule. Executable test deferred.
  3. ADR-050-orphaned invariants (~13 stubs). e.g., test_full_ft_ood_predictions_complete would assert full-FT OOD predictions exist for every (fold, seed, slice) cell — but full-FT OOD was dropped per ADR-050 X11 FUSE crash. The invariant describes a target that no longer exists; in v1.1.x these stubs either get superseded by ADR-050-aware versions or removed.
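To make the first bucket concrete, here is a hedged sketch of what an executable form of a config-hash invariant might look like. The actual scheme in src/utils/config_hash.py is not shown in this ADR; SHA-256 over a canonical JSON encoding, and the names config_hash and check_hyperparameter_immutability, are assumptions chosen for illustration.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_hyperparameter_immutability(config: dict, committed_hash: str) -> None:
    """Raise AssertionError if the live config no longer matches the committed hash."""
    actual = config_hash(config)
    assert actual == committed_hash, (
        f"config hash drift: {actual} != {committed_hash}"
    )
```

A test body for a stub such as test_hyperparameter_immutability could then reduce to loading the config and the committed hash and calling the check; sorting keys makes the hash independent of dictionary ordering.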

v1.1.x landing condition: pytest -m unit tests/test_invariants.py returns 48 passed / 0 skipped (or N skipped, each with an explicit ADR-numbered exemption reason that survives audit).
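The landing condition above can be approximated mechanically. The sketch below assumes the pass count and the list of skip reasons have already been extracted from a pytest run; the helper name landing_condition_met and the three-digit ADR-NNN pattern are assumptions for illustration. Whether a cited exemption actually "survives audit" remains a human judgment that this check does not attempt.

```python
import re

# Matches references such as "ADR-050"; three-digit numbering is assumed.
ADR_REF = re.compile(r"\bADR-\d{3}\b")


def landing_condition_met(
    passed: int, skip_reasons: list[str], total: int = 48
) -> bool:
    """True when all tests are accounted for and each skip cites an ADR number."""
    if passed + len(skip_reasons) != total:
        # Some tests failed, errored, or were not collected.
        return False
    return all(ADR_REF.search(reason) for reason in skip_reasons)
```

Under this check the current v1.0.x skip reason ("v1.0.0 carryforward stub ...") would not qualify, because it names no ADR.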

Gates 1 + 2 + 4 + 5 + 6 of ADR-039 unchanged. All five remain valid acceptance criteria for the v1.0.x submission tag (zero [OPEN] in SPEC_SHEET; zero open rows in SPEC_GREENFIELD ledger; SUBMISSION_AUDIT.md regenerates clean; v0.9.0-rc rehearsal tag fired before v1.0.0; all three reviewer URLs return 200 — all confirmed at v1.0.0 + v1.0.1 close).

Consequences

  • Governance: explicit, immutable record of the two carryforwards. ADR-034 + ADR-039 are not “violated” — they are narrowly superseded on the two specific axes with explicit landing conditions. The rest of both ADRs is unchanged.
  • Reviewer-facing: WRITEUP/reproducibility.md T0 maintainer note (already drafted at v1.0.1) cross-references this ADR by name. A reviewer who runs make eval-from-hub and gets exit 2 sees the script’s stderr message pointing at this ADR + the maintainer note.
  • Implementation: zero code or methodology changes ship with ADR-051. It is governance-only. The v1.0.2 tag = ADR-051 + decisions/README.md index update + CHANGELOG + the two ADR-034 + ADR-039 frontmatter superseded_by field additions.
  • Audit-trail: SUBMISSION_AUDIT.md regenerates via scripts/regenerate_audit.py with ADR-051 included; the closure metadata fields (closing_commit, supersedes, superseded_by) are populated post-tag in a v1.0.2 final commit (or here in this ADR if the closing_commit is the ADR-051 commit itself).

Alternatives Considered

Retroactively documented per ADR-072 (2026-05-20 frontmatter + structural backfill). The three alternatives surfaced at 2026-05-18 lock time:

  1. Drop both T0 + invariant commitments outright. Rejected: ADR-034 + ADR-039 are methodology contracts; silently dropping them violates the SDD discipline that ADR-005 + AGENTS.md establish. The contracts need an explicit superseding record even if the implementation defers.
  2. Fix-forward inline before v1.0.0 tag. Rejected: the implementation surface for both blocks was substantial — Block A is ~100 LOC of T0 score-match wiring (HF Hub download + AutoModelForSequenceClassification load + score-match within 1e-4 tolerance per ADR-034 §Tier T0); Block B is 38 invariant test bodies covering class-balance + source-disjoint splits + leakage + calibration + reporting-completeness invariants per ADR-039 gate 3. Both implementations exceeded the v1.0.0 rehearsal-tag window’s available time.
  3. No carryforward ADR; let ADR-034 + ADR-039 stand unmet. Rejected: violates the immutability discipline ADR-005 + ADR-067 codify. An unmet methodology contract without an explicit supersession record creates an undocumented “methodology debt” that’s hard to audit. ADR-051’s existence makes the debt explicit + tracked.

The chosen path (narrow supersession + explicit v1.1.x landing conditions) preserves the methodology contract by binding it to a future close while keeping the audit trail honest.

Linked ADRs

  • Superseded (narrow): ADR-034 (T0 score-match wiring axis only); ADR-039 (gate 3 axis only).
  • Referenced: ADR-032 (HF Hub publication; v1.0.1 deliverable that closes the publish half of T0); ADR-046 (Phase 4 analysis bundle; defined many of the invariants now scaffolded).
  • Source: decisions/audits/REPO_AUDIT_2026-05-18.md (the explicit audit finding that invited the supersession-or-implementation decision); WRITEUP/reproducibility.md (T0 maintainer note); tests/test_invariants.py (module docstring with the v1.0.0-carryforward catalog).

Transcript

Decisions surfaced during the 2026-05-18 post-v1.0.1 audit re-examination conversation. Two /exploring-options batches (4 questions each) locked the supersession scope (single ADR covering both axes; immediate v1.0.2 tag rather than v1.1.0 defer). No transcript file required — the conversation history in the v1.0.1 → v1.0.2 commit-message bodies is the audit trail.